retrieval augmented generation

  • Article type: Journal Article
    BACKGROUND: Diagnostic errors pose significant health risks and contribute to patient mortality. With the growing accessibility of electronic health records, machine learning models offer a promising avenue for enhancing diagnosis quality. Current research has primarily focused on a limited set of diseases with ample training data, neglecting diagnostic scenarios with limited data availability.
    OBJECTIVE: This study aims to develop an information retrieval (IR)-based framework that accommodates data sparsity to facilitate broader diagnostic decision support.
    METHODS: We introduced an IR-based diagnostic decision support framework called CliniqIR. It uses clinical text records, the Unified Medical Language System Metathesaurus, and 33 million PubMed abstracts to classify a broad spectrum of diagnoses independent of training data availability. CliniqIR is designed to be compatible with any IR framework. Therefore, we implemented it using both dense and sparse retrieval approaches. We compared CliniqIR's performance to that of pretrained clinical transformer models such as Clinical Bidirectional Encoder Representations from Transformers (ClinicalBERT) in supervised and zero-shot settings. Subsequently, we combined the strength of supervised fine-tuned ClinicalBERT and CliniqIR to build an ensemble framework that delivers state-of-the-art diagnostic predictions.
    RESULTS: On a complex diagnosis data set (DC3) without any training data, CliniqIR models returned the correct diagnosis within their top 3 predictions. On the Medical Information Mart for Intensive Care III data set, CliniqIR models surpassed ClinicalBERT in predicting diagnoses with <5 training samples by an average difference in mean reciprocal rank of 0.10. In a zero-shot setting where models received no disease-specific training, CliniqIR still outperformed the pretrained transformer models by at least 0.10 in mean reciprocal rank. Furthermore, in most conditions, our ensemble framework surpassed the performance of its individual components, demonstrating its enhanced ability to make precise diagnostic predictions.
    CONCLUSIONS: Our experiments highlight the importance of IR in leveraging unstructured knowledge resources to identify infrequently encountered diagnoses. In addition, our ensemble framework benefits from combining the complementary strengths of the supervised and retrieval-based models to diagnose a broad spectrum of diseases.
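
The results above are reported as mean reciprocal rank (MRR) and top-3 hits. As a minimal sketch of how those two measures are usually computed, the snippet below scores ranked diagnosis lists against a known answer. The function names and the toy diagnosis labels are hypothetical and are not taken from CliniqIR.

```python
from typing import Sequence


def reciprocal_rank(ranked: Sequence[str], truth: str) -> float:
    """Return 1/rank of the true diagnosis, or 0.0 if it was not retrieved."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == truth:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         truths: Sequence[str]) -> float:
    """Average the reciprocal rank over all cases."""
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not truths:
        return 0.0
    total = sum(reciprocal_rank(r, t) for r, t in zip(rankings, truths))
    return total / len(truths)


def hit_at_k(ranked: Sequence[str], truth: str, k: int = 3) -> bool:
    """True if the correct diagnosis appears in the first k predictions."""
    return truth in ranked[:k]


# Toy data (hypothetical): three cases, each with a ranked list of predictions.
rankings = [
    ["sepsis", "pneumonia", "asthma"],       # truth at rank 1
    ["gout", "lupus", "sarcoidosis"],        # truth at rank 2
    ["copd", "asthma", "heart failure"],     # truth not retrieved
]
truths = ["sepsis", "lupus", "anemia"]

mrr = mean_reciprocal_rank(rankings, truths)
```

A difference of 0.10 in MRR between two systems, as reported above, is the difference between two such averages computed on the same cases.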

  • Article type: Journal Article
    Large Language Models (LLMs) are transformer-based neural networks with billions of parameters trained on very large text corpora from diverse sources. LLMs have the potential to improve healthcare due to their capability to parse complex concepts and generate context-based responses. The interest in LLMs has not spared digestive disease academics, who have mainly investigated foundational LLM accuracy, which ranges from 25% to 90% and is influenced by the lack of standardized rules to report methodologies and results for LLM-oriented research. In addition, a critical issue is the absence of a universally accepted definition of accuracy, varying from binary to scalar interpretations, often tied to grader expertise without reference to clinical guidelines. We address strategies and challenges to increase accuracy. In particular, LLMs can be infused with domain knowledge using Retrieval Augmented Generation (RAG) or Supervised Fine-Tuning (SFT) with reinforcement learning from human feedback (RLHF). RAG faces challenges with context window limits and accurate information retrieval from the provided context. SFT, a deeper adaptation method, is computationally demanding and requires specialized knowledge. LLMs may increase the quality of patient care across the field of digestive diseases, where physicians are often engaged in screening, treatment and surveillance for a broad range of pathologies for which in-context learning or SFT with RLHF could improve clinical decision-making and patient outcomes. However, despite their potential, the safe deployment of LLMs in healthcare still needs to overcome hurdles in accuracy, suggesting a need for strategies that integrate human feedback with advanced model training.
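
The two RAG challenges named above, retrieval quality and the context window limit, can be illustrated with a toy sketch. It ranks passages by word overlap with the question and then packs them into a prompt until a word budget is exhausted. Everything here is an assumption made for illustration: the overlap scorer stands in for a real retriever, the word budget stands in for a token limit, the passages are placeholder strings, and no model is called.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, passage: str) -> float:
    """Fraction of query tokens that also occur in the passage."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize(passage)) / len(query_tokens)


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Return the k passages with the highest overlap score."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str], max_context_words: int = 40) -> str:
    """Pack passages in rank order until the (hypothetical) word budget is hit."""
    selected: List[str] = []
    used = 0
    for passage in passages:
        length = len(passage.split())
        if used + length > max_context_words:
            break  # the context window is full; lower-ranked passages are dropped
        selected.append(passage)
        used += length
    context = "\n".join(f"- {p}" for p in selected)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."


# Placeholder passages (not real guideline text).
passages = [
    "toy note on barrett esophagus surveillance endoscopy",
    "toy note on colorectal cancer screening colonoscopy",
    "toy note on hepatitis treatment monitoring",
]
query = "barrett esophagus surveillance"
top = retrieve(query, passages, k=2)
prompt = build_prompt(query, top, max_context_words=10)
```

With the budget set to 10 words, only the best-matching passage fits, which shows how a tight context window can silently discard retrieved evidence.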

  • Article type: Journal Article
    Medicine is entering a new era in which artificial intelligence (AI) and deep learning have a measurable impact on patient care. This impact is especially evident in cardiovascular medicine. While the purpose of this short opinion paper is not to provide an in-depth review of the many applications of AI in cardiovascular medicine, we summarize some of the important advances that have taken place in this domain.

  • Article type: Journal Article
    We deploy a prompt-augmented GPT-4 model to distill comprehensive datasets on the global application of debt-for-nature swaps (DNS), a pivotal financial tool for environmental conservation. Our analysis includes 195 nations and identifies 21 countries that have not yet used DNS as prime candidates for DNS. A significant proportion demonstrates consistent commitments to conservation finance (0.86 accuracy as compared to historical swaps records). Conversely, 35 countries that were active in DNS before 2010 have since been identified as unsuitable. Notably, Argentina, grappling with soaring inflation and a substantial sovereign debt crisis, and Poland, which has achieved economic stability and gained access to alternative EU conservation funds, exemplify the shifting suitability landscape. The study's outcomes illuminate the fragility of DNS as a conservation strategy amid economic and political volatility.

  • Article type: Preprint
    Subject screening is a key aspect of all clinical trials; however, traditionally, it is a labor-intensive and error-prone task, demanding significant time and resources. With the advent of large language models (LLMs) and related technologies, a paradigm shift in natural language processing capabilities offers a promising avenue for increasing both quality and efficiency of screening efforts. This study aimed to test the ability of a Retrieval-Augmented Generation (RAG) process combined with Generative Pretrained Transformer Version 4 (GPT-4) to accurately identify and report on inclusion and exclusion criteria for a clinical trial.
    The Co-Operative Program for Implementation of Optimal Therapy in Heart Failure (COPILOT-HF) trial aims to recruit patients with symptomatic heart failure. As part of the screening process, a list of potentially eligible patients is created through an electronic health record (EHR) query. Currently, structured data in the EHR can only be used to determine 5 out of 6 inclusion and 5 out of 17 exclusion criteria. Trained, but non-licensed, study staff complete manual chart review to determine patient eligibility and record their assessment of the inclusion and exclusion criteria. We obtained the structured assessments completed by the study staff and clinical notes for the past two years and developed a workflow for a clinical note-based question answering system powered by RAG architecture and GPT-4 that we named RECTIFIER (RAG-Enabled Clinical Trial Infrastructure for Inclusion Exclusion Review). We used notes from 100 patients as a development dataset, 282 patients as a validation dataset, and 1894 patients as a test set. An expert clinician completed a blinded review of patients' charts to answer the eligibility questions and determine the "gold standard" answers. We calculated the sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) for each question and screening method. We also performed bootstrapping to calculate the confidence intervals for each statistic.
    Both RECTIFIER and study staff answers closely aligned with the expert clinician answers across criteria, with accuracy ranging between 97.9% and 100% (MCC 0.837 and 1) for RECTIFIER and between 91.7% and 100% (MCC 0.644 and 1) for study staff. RECTIFIER performed better than study staff in determining the inclusion criterion of "symptomatic heart failure", with an accuracy of 97.9% vs 91.7% and an MCC of 0.924 vs 0.721, respectively. Overall, the sensitivity and specificity of determining eligibility were 92.3% (CI) and 93.9% (CI) for RECTIFIER and 90.1% (CI) and 83.6% (CI) for study staff, respectively.
    GPT-4 based solutions have the potential to improve efficiency and reduce costs in clinical trial screening. When incorporating new tools such as RECTIFIER, it is important to consider the potential hazards of automating the screening process and set up appropriate mitigation strategies such as final clinician review before patient engagement.
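
The evaluation described above relies on sensitivity, specificity, accuracy, MCC, and bootstrapped confidence intervals. The sketch below shows one common way to compute them for yes/no eligibility answers. The abstract does not say which bootstrap variant was used, so the percentile bootstrap here is an assumption, and the function names and toy labels are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple


def confusion_counts(truth: List[bool], predicted: List[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for paired gold-standard and predicted answers."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return tp, tn, fp, fn


def safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def screening_metrics(truth: List[bool], predicted: List[bool]) -> Dict[str, float]:
    """Compute sensitivity, specificity, accuracy, and MCC."""
    tp, tn, fp, fn = confusion_counts(truth, predicted)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "accuracy": safe_div(tp + tn, len(truth)),
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


def bootstrap_ci(truth: List[bool], predicted: List[bool], metric: str,
                 n_resamples: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval (an assumed variant) for one metric."""
    rng = random.Random(seed)
    n = len(truth)
    values = []
    for _ in range(n_resamples):
        # Resample patients with replacement, keeping truth/prediction pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        sample = screening_metrics([truth[i] for i in idx], [predicted[i] for i in idx])
        values.append(sample[metric])
    values.sort()
    lower = values[int((alpha / 2) * n_resamples)]
    upper = values[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy labels (hypothetical): True means the criterion is met.
truth = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, False, False, False, True]

metrics = screening_metrics(truth, predicted)
ci_low, ci_high = bootstrap_ci(truth, predicted, "sensitivity", n_resamples=200)
```

MCC is reported alongside accuracy because it stays informative when eligible and ineligible patients are imbalanced, a situation in which accuracy alone can look high for a screener that rarely flags anyone.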