Large language models

  • Article Type: Journal Article
    OBJECTIVE: Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs could reduce the need for large-scale data annotations.
    METHODS: We curated a dataset of 769 breast cancer pathology reports, manually labeled with 12 categories, to compare zero-shot classification capability of the following LLMs: GPT-4, GPT-3.5, Starling, and ClinicalCamel, with task-specific supervised classification performance of 3 models: random forests, long short-term memory networks with attention (LSTM-Att), and the UCSF-BERT model.
    RESULTS: Across all 12 tasks, the GPT-4 model performed either significantly better than or as well as the best supervised model, LSTM-Att (average macro F1-score of 0.86 vs 0.75), with an advantage on tasks with high label imbalance. The other LLMs demonstrated poor performance. Frequent GPT-4 error categories included incorrect inferences from multiple samples and from history, as well as complex task design; several LSTM-Att errors were related to poor generalization to the test set.
    DISCUSSION: On tasks where large annotated datasets cannot be easily collected, LLMs can reduce the burden of data labeling. However, if the use of LLMs is prohibitive, simpler models trained on large annotated datasets can provide comparable results.
    CONCLUSIONS: GPT-4 demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for large annotated datasets. This may increase the utilization of NLP-based variables and outcomes in clinical studies.
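The comparison above hinges on the macro F1-score, which averages per-class F1 and therefore weights rare labels equally — the property that favored GPT-4 on the highly imbalanced tasks. A minimal sketch of the metric (the toy labels and predictions below are invented, not the study's data):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted average of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy example with an imbalanced label distribution: one missed
# "positive" drags the macro average down sharply.
truth = ["negative"] * 8 + ["positive"] * 2
pred = ["negative"] * 8 + ["positive", "negative"]
print(round(macro_f1(truth, pred, ["negative", "positive"]), 3))  # → 0.804
```

Because each class contributes equally regardless of frequency, a classifier that ignores a rare class is penalized far more than accuracy or micro-averaged F1 would suggest.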

  • Article Type: Journal Article
    BACKGROUND: Systematically screening published literature to determine the relevant publications to synthesize in a review is a time-consuming and difficult task. Large language models (LLMs) are an emerging technology with promising capabilities for the automation of language-related tasks that may be useful for such a purpose.
    METHODS: LLMs were used as part of an automated system to evaluate the relevance of publications to a certain topic based on defined criteria and based on the title and abstract of each publication. A Python script was created to generate structured prompts consisting of text strings for instruction, title, abstract, and relevant criteria to be provided to an LLM. The relevance of a publication was evaluated by the LLM on a Likert scale (low relevance to high relevance). By specifying a threshold, different classifiers for inclusion/exclusion of publications could then be defined. The approach was used with four different openly available LLMs on ten published data sets of biomedical literature reviews and on a newly human-created data set for a hypothetical new systematic literature review.
    RESULTS: The performance of the classifiers varied depending on the LLM being used and on the data set analyzed. Regarding sensitivity/specificity, the classifiers yielded 94.48%/31.78% for the FlanT5 model, 97.58%/19.12% for the OpenHermes-NeuralChat model, 81.93%/75.19% for the Mixtral model, and 97.58%/38.34% for the Platypus 2 model on the ten published data sets. The same classifiers yielded 100% sensitivity at specificities of 12.58%, 4.54%, 62.47%, and 24.74% on the newly created data set. Changing the standard settings of the approach (minor adaptation of the instruction prompt and/or changing the range of the Likert scale from 1-5 to 1-10) had a considerable impact on performance.
    CONCLUSIONS: LLMs can be used to evaluate the relevance of scientific publications to a certain review topic and classifiers based on such an approach show some promising results. To date, little is known about how well such systems would perform if used prospectively when conducting systematic literature reviews and what further implications this might have. However, it is likely that in the future researchers will increasingly use LLMs for evaluating and classifying scientific publications.
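The screening pipeline above — structured prompt, Likert rating, threshold classifier, sensitivity/specificity — can be sketched in a few lines. The prompt wording, the example ratings, and the function names (`build_prompt`, `classify`) are illustrative assumptions, not the authors' actual script:

```python
def build_prompt(instruction, title, abstract, criteria):
    """Assemble the structured prompt sent to the LLM for one publication."""
    return (f"{instruction}\n\nTitle: {title}\nAbstract: {abstract}\n"
            f"Criteria: {criteria}\nRate relevance on a scale of 1 (low) to 5 (high).")

def classify(likert_scores, threshold):
    """Include a publication when its Likert rating meets the threshold."""
    return [score >= threshold for score in likert_scores]

def sensitivity_specificity(y_true, y_pred):
    """Screening metrics: sensitivity over relevant, specificity over irrelevant."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: LLM ratings for 6 publications, 3 of which are truly relevant.
ratings = [5, 4, 2, 3, 1, 5]
relevant = [True, True, False, True, False, False]
sens, spec = sensitivity_specificity(relevant, classify(ratings, threshold=3))
```

Varying `threshold` trades sensitivity against specificity, which is how the paper derives multiple classifiers from one set of Likert ratings.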

  • Article Type: Journal Article
    BACKGROUND: Large language models (LLMs) have demonstrated impressive performances in various medical domains, prompting an exploration of their potential utility within the high-demand setting of emergency department (ED) triage. This study evaluated the triage proficiency of different LLMs and ChatGPT, an LLM-based chatbot, compared to professionally trained ED staff and untrained personnel. We further explored whether LLM responses could guide untrained staff in effective triage.
    OBJECTIVE: This study aimed to assess the efficacy of LLMs and the associated product ChatGPT in ED triage compared to personnel of varying training status and to investigate if the models' responses can enhance the triage proficiency of untrained personnel.
    METHODS: A total of 124 anonymized case vignettes were triaged by untrained doctors; different versions of currently available LLMs; ChatGPT; and professionally trained raters, who subsequently agreed on a consensus set according to the Manchester Triage System (MTS). The prototypical vignettes were adapted from cases at a tertiary ED in Germany. The main outcome was the level of agreement between raters' MTS level assignments, measured via quadratic-weighted Cohen κ. The extent of over- and undertriage was also determined. Notably, instances of ChatGPT were prompted using zero-shot approaches without extensive background information on the MTS. The tested LLMs included raw GPT-4, Llama 3 70B, Gemini 1.5, and Mixtral 8x7b.
    RESULTS: GPT-4-based ChatGPT and untrained doctors showed substantial agreement with the consensus triage of professional raters (κ=mean 0.67, SD 0.037 and κ=mean 0.68, SD 0.056, respectively), significantly exceeding the performance of GPT-3.5-based ChatGPT (κ=mean 0.54, SD 0.024; P<.001). When untrained doctors used this LLM for second-opinion triage, there was a slight but statistically insignificant performance increase (κ=mean 0.70, SD 0.047; P=.97). Other tested LLMs performed similarly to or worse than GPT-4-based ChatGPT or showed odd triaging behavior with the used parameters. LLMs and ChatGPT models tended toward overtriage, whereas untrained doctors undertriaged.
    CONCLUSIONS: While LLMs and the LLM-based product ChatGPT do not yet match professionally trained raters, their best models' triage proficiency equals that of untrained ED doctors. In their current form, LLMs and ChatGPT thus did not demonstrate gold-standard performance in ED triage and, in the setting of this study, failed to significantly improve untrained doctors' triage when used as decision support. Notable performance enhancements in newer LLM versions over older ones hint at future improvements with further technological development and specific training.
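Agreement in this study is measured with quadratic-weighted Cohen κ, which penalizes a disagreement more heavily the further apart the two ordinal ratings are — a natural fit for the five MTS levels. A self-contained sketch of the statistic (the example ratings are invented, not the study's data):

```python
def quadratic_weighted_kappa(r1, r2, n_levels=5):
    """Cohen's kappa with quadratic weights over ordinal levels 1..n_levels.

    Returns 1.0 for perfect agreement; disagreements are weighted by the
    squared distance between the two assigned levels.
    """
    n = len(r1)
    # Observed joint distribution of rating pairs.
    obs = [[0.0] * n_levels for _ in range(n_levels)]
    for a, b in zip(r1, r2):
        obs[a - 1][b - 1] += 1 / n
    # Marginal distributions of each rater.
    p1 = [sum(1 for a in r1 if a == k + 1) / n for k in range(n_levels)]
    p2 = [sum(1 for b in r2 if b == k + 1) / n for k in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = ((i - j) ** 2) / ((n_levels - 1) ** 2)
            num += w * obs[i][j]          # weighted observed disagreement
            den += w * p1[i] * p2[j]      # weighted chance disagreement
    return 1 - num / den

# Invented example: consensus MTS levels vs a model's assignments.
consensus = [1, 2, 3, 4, 5, 3, 2]
model = [1, 2, 3, 5, 5, 3, 1]
kappa = quadratic_weighted_kappa(consensus, model)
```

Because of the squared-distance weighting, confusing MTS level 1 with level 5 costs sixteen times as much as confusing adjacent levels, which is why this variant suits ordinal triage scales better than unweighted κ.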

  • Article Type: Journal Article
    BACKGROUND: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.
    OBJECTIVE: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.
    METHODS: A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.
    RESULTS: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38.
    CONCLUSIONS: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
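Entity density (EPS), the second predictor in the regressions above, is simply the number of annotated entities divided by the number of sentences; the thresholds of 1.36-1.38 EPS are expressed in this unit. A minimal sketch using an invented annotation format, not the study's actual data:

```python
def entities_per_sentence(annotated_sentences):
    """EPS = total annotated entities / number of sentences.

    Each item pairs a sentence with its list of annotated entity spans.
    """
    total_entities = sum(len(entities) for _, entities in annotated_sentences)
    return total_entities / len(annotated_sentences)

# Invented disclosure-statement annotations with PERSON/ORG-style entities.
corpus = [
    ("Dr. Smith reports funding from Acme Pharma.", ["Smith", "Acme Pharma"]),
    ("No conflicts were declared.", []),
    ("GlobalBio paid consulting fees to the author.", ["GlobalBio"]),
]
density = entities_per_sentence(corpus)
```

Comparing this statistic between a training sample and production data is the practical check the conclusion recommends: a training set whose EPS is far from production EPS misrepresents how often the fine-tuned model will encounter entities.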

  • Article Type: Case Reports
    UNASSIGNED: The generation of structured documents for clinical trials is a promising application of large language models (LLMs). We share opportunities, insights, and challenges from a competitive challenge that used LLMs for automating clinical trial documentation.
    UNASSIGNED: As part of a challenge initiated by Pfizer (organizer), several teams (participant) created a pilot for generating summaries of safety tables for clinical study reports (CSRs). Our evaluation framework used automated metrics and expert reviews to assess the quality of AI-generated documents.
    UNASSIGNED: The comparative analysis revealed differences in performance across solutions, particularly in factual accuracy and lean writing. Most participants employed prompt engineering with generative pre-trained transformer (GPT) models.
    UNASSIGNED: We discuss areas for improvement, including better ingestion of tables, addition of context and fine-tuning.
    UNASSIGNED: The challenge results demonstrate the potential of LLMs in automating table summarization in CSRs while also revealing the importance of human involvement and continued research to optimize this technology.

  • Article Type: Journal Article
    BACKGROUND: Dermatologic patient education materials (PEMs) are often written above the national average seventh- to eighth-grade reading level. ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT are large language models (LLMs) that are responsive to user prompts. Our project assesses their use in generating dermatologic PEMs at specified reading levels.
    OBJECTIVE: This study aims to assess the ability of select LLMs to generate PEMs for common and rare dermatologic conditions at unspecified and specified reading levels. Further, the study aims to assess the preservation of meaning across such LLM-generated PEMs, as assessed by dermatology resident trainees.
    METHODS: The Flesch-Kincaid reading level (FKRL) of current American Academy of Dermatology PEMs was evaluated for 4 common (atopic dermatitis, acne vulgaris, psoriasis, and herpes zoster) and 4 rare (epidermolysis bullosa, bullous pemphigoid, lamellar ichthyosis, and lichen planus) dermatologic conditions. We prompted ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT to "Create a patient education handout about [condition] at a [FKRL]" to iteratively generate 10 PEMs per condition at unspecified, fifth-, and seventh-grade FKRLs, evaluated with Microsoft Word readability statistics. The preservation of meaning across LLMs was assessed by 2 dermatology resident trainees.
    RESULTS: The current American Academy of Dermatology PEMs had an average (SD) FKRL of 9.35 (1.26) and 9.50 (2.3) for common and rare diseases, respectively. For common diseases, the FKRLs of LLM-produced PEMs ranged between 9.8 and 11.21 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). For rare diseases, the FKRLs of LLM-produced PEMs ranged between 9.85 and 11.45 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). At the fifth-grade reading level, GPT-4 was better at producing PEMs for both common and rare conditions than ChatGPT-3.5 (P=.001 and P=.01, respectively), DermGPT (P<.001 and P=.03, respectively), and DocsGPT (P<.001 and P=.02, respectively). At the seventh-grade reading level, no significant difference was found between ChatGPT-3.5, GPT-4, DocsGPT, or DermGPT in producing PEMs for common conditions (all P>.05); however, for rare conditions, ChatGPT-3.5 and DocsGPT outperformed GPT-4 (P=.003 and P<.001, respectively). The preservation of meaning analysis revealed that for common conditions, DermGPT ranked the highest for overall ease of reading, patient understandability, and accuracy (14.75/15, 98%); for rare conditions, handouts generated by GPT-4 ranked the highest (14.5/15, 97%).
    CONCLUSIONS: GPT-4 appeared to outperform ChatGPT-3.5, DocsGPT, and DermGPT at the fifth-grade FKRL for both common and rare conditions, although both ChatGPT-3.5 and DocsGPT performed better than GPT-4 at the seventh-grade FKRL for rare conditions. LLM-produced PEMs may reliably meet seventh-grade FKRLs for select common and rare dermatologic conditions and are easy to read, understandable for patients, and mostly accurate. LLMs may play a role in enhancing health literacy and disseminating accessible, understandable PEMs in dermatology.
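The FKRL metric used throughout this study can be approximated in a few lines. This is a rough sketch with a simple vowel-group syllable heuristic; Microsoft Word's readability statistics, which the study actually used, count syllables somewhat differently and may give slightly different grades:

```python
import re

def count_syllables(word):
    """Crude heuristic: one syllable per contiguous vowel group, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade_level(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Short sentences of monosyllables can drive the formula below zero, which is why very simple handouts score "below" fifth grade; longer sentences and polysyllabic medical terms push the grade upward.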

  • Article Type: Journal Article
    A large language model (LLM) is a machine learning model inferred from text data that captures subtle patterns of language use in context. Modern LLMs are based on neural network architectures that incorporate transformer methods. They allow the model to relate words together through attention to multiple words in a text sequence. LLMs have been shown to be highly effective for a range of tasks in natural language processing (NLP), including classification and information extraction tasks and generative applications.
    The aim of this adapted Delphi study was to collect researchers' opinions on how LLMs might influence health care and on the strengths, weaknesses, opportunities, and threats of LLM use in health care.
    We invited researchers in the fields of health informatics, nursing informatics, and medical NLP to share their opinions on LLM use in health care. We started the first round with open questions based on our strengths, weaknesses, opportunities, and threats framework. In the second and third rounds, the participants scored these items.
    The first, second, and third rounds had 28, 23, and 21 participants, respectively. Almost all participants (26/28, 93% in round 1 and 20/21, 95% in round 3) were affiliated with academic institutions. Agreement was reached on 103 items related to use cases, benefits, risks, reliability, adoption aspects, and the future of LLMs in health care. Participants offered several use cases, including supporting clinical tasks, documentation tasks, and medical research and education, and agreed that LLM-based systems will act as health assistants for patient education. The agreed-upon benefits included increased efficiency in data handling and extraction, improved automation of processes, improved quality of health care services and overall health outcomes, provision of personalized care, accelerated diagnosis and treatment processes, and improved interaction between patients and health care professionals. In total, 5 risks to health care in general were identified: cybersecurity breaches, the potential for patient misinformation, ethical concerns, the likelihood of biased decision-making, and the risk associated with inaccurate communication. Overconfidence in LLM-based systems was recognized as a risk to the medical profession. The 6 agreed-upon privacy risks included the use of unregulated cloud services that compromise data security, exposure of sensitive patient data, breaches of confidentiality, fraudulent use of information, vulnerabilities in data storage and communication, and inappropriate access or use of patient data.
    Future research related to LLMs should not only focus on testing their possibilities for NLP-related tasks but also consider the workflows the models could contribute to and the requirements regarding quality, integration, and regulations needed for successful implementation in practice.

  • Article Type: Journal Article
    Background: Large language models (LLMs), such as ChatGPT-4, Gemini, and Microsoft Copilot, have been instrumental in various domains, including healthcare, where they enhance health literacy and aid in patient decision-making. Given the complexities involved in breast imaging procedures, accurate and comprehensible information is vital for patient engagement and compliance. This study aims to evaluate the readability and accuracy of the information provided by three prominent LLMs, ChatGPT-4, Gemini, and Microsoft Copilot, in response to frequently asked questions in breast imaging, assessing their potential to improve patient understanding and facilitate healthcare communication.
    Methodology: We collected the most common questions on breast imaging from clinical practice and posed them to the LLMs. We then evaluated the responses in terms of readability and accuracy. Responses from the LLMs were analyzed for readability using the Flesch Reading Ease and Flesch-Kincaid Grade Level tests and for accuracy through a radiologist-developed Likert-type scale.
    Results: The study found significant variations among the LLMs. Gemini and Microsoft Copilot scored higher on readability scales (p < 0.001), indicating their responses were easier to understand. In contrast, ChatGPT-4 demonstrated greater accuracy in its responses (p < 0.001).
    Conclusions: While LLMs such as ChatGPT-4 show promise in providing accurate responses, readability issues may limit their utility in patient education. Conversely, Gemini and Microsoft Copilot, despite being less accurate, are more accessible to a broader patient audience. Ongoing adjustments and evaluations of these models are essential to ensure they meet the diverse needs of patients, emphasizing the need for continuous improvement and oversight in the deployment of artificial intelligence technologies in healthcare.
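The Flesch Reading Ease test used here works in the opposite direction from a grade level: shorter sentences and shorter words earn a higher (easier) score, with 60-70 often treated as plain English. A rough sketch with a vowel-group syllable heuristic, which will not exactly match commercial readability tools:

```python
import re

def count_syllables(word):
    """Crude heuristic: one syllable per contiguous vowel group, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
    Higher scores indicate easier text.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))
```

Running both this score and the grade-level formula over the same LLM answers is essentially the readability half of the study's evaluation; the accuracy half requires expert (radiologist) rating and cannot be automated this way.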

  • Article Type: Journal Article
    BACKGROUND: Large language models (LLMs) are computational artificial intelligence systems with advanced natural language processing capabilities that have recently been popularized among health care students and educators due to their ability to provide real-time access to a vast amount of medical knowledge. The adoption of LLM technology into medical education and training has varied, and little empirical evidence exists to support its use in clinical teaching environments.
    OBJECTIVE: The aim of the study is to identify and qualitatively evaluate potential use cases and limitations of LLM technology for real-time ward-based educational contexts.
    METHODS: A brief, single-site exploratory evaluation of the publicly available ChatGPT-3.5 (OpenAI) was conducted by implementing the tool into the daily attending rounds of a general internal medicine inpatient service at a large urban academic medical center. ChatGPT was integrated into rounds via both structured and organic use, using the web-based "chatbot" style interface to interact with the LLM through conversational free-text and discrete queries. A qualitative approach using phenomenological inquiry was used to identify key insights related to the use of ChatGPT through analysis of ChatGPT conversation logs and associated shorthand notes from the clinical sessions.
    RESULTS: Identified use cases for ChatGPT integration included addressing medical knowledge gaps through discrete medical knowledge inquiries, building differential diagnoses and engaging dual-process thinking, challenging medical axioms, using cognitive aids to support acute care decision-making, and improving complex care management by facilitating conversations with subspecialties. Potential additional uses included engaging in difficult conversations with patients, exploring ethical challenges and general medical ethics teaching, personal continuing medical education resources, developing ward-based teaching tools, supporting and automating clinical documentation, and supporting productivity and task management. LLM biases, misinformation, ethics, and health equity were identified as areas of concern and potential limitations to clinical and training use. A code of conduct on ethical and appropriate use was also developed to guide team usage on the wards.
    CONCLUSIONS: Overall, ChatGPT offers a novel tool to enhance ward-based learning through rapid information querying, second-order content exploration, and engaged team discussion regarding generated responses. More research is needed to fully understand contexts for educational use, particularly regarding the risks and limitations of the tool in clinical settings and its impacts on trainee development.

  • Article Type: Journal Article
    OBJECTIVE: Automated identification of eligible patients is a bottleneck of clinical research. We propose Criteria2Query (C2Q) 3.0, a system that leverages GPT-4 for the semi-automatic transformation of clinical trial eligibility criteria text into executable clinical database queries.
    METHODS: C2Q 3.0 integrated three GPT-4 prompts for concept extraction, SQL query generation, and reasoning. Each prompt was designed and evaluated separately. The concept extraction prompt was benchmarked against manual annotations from 20 clinical trials by two evaluators, who later also measured SQL generation accuracy and identified errors in GPT-generated SQL queries from 5 clinical trials. The reasoning prompt was assessed by three evaluators on four metrics: readability, correctness, coherence, and usefulness, using corrected SQL queries and an open-ended feedback questionnaire.
    RESULTS: Out of 518 concepts from 20 clinical trials, GPT-4 achieved an F1-score of 0.891 in concept extraction. For SQL generation, 29 errors spanning seven categories were detected, with logic errors being the most common (n=10; 34.48%). Reasoning evaluations yielded a high coherence rating (mean score 4.70) but relatively lower readability (mean 3.95). Mean scores for correctness and usefulness were 3.97 and 4.37, respectively.
    CONCLUSIONS: GPT-4 significantly improves the accuracy of extracting clinical trial eligibility criteria concepts in C2Q 3.0. Continued research is warranted to ensure the reliability of large language models.
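The concept-extraction F1 reported above can be computed by comparing the set of GPT-extracted concepts against the manual annotations for each criterion. A minimal set-based sketch; the eligibility criterion and concept lists below are invented examples, not C2Q's data:

```python
def extraction_f1(gold, predicted):
    """Set-based F1 over extracted concepts (exact-match comparison)."""
    gold, predicted = set(gold), set(predicted)
    if not gold or not predicted:
        return 0.0
    tp = len(gold & predicted)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented criterion: "Adults aged 18-65 with type 2 diabetes, no insulin use."
gold = {"age 18-65", "type 2 diabetes", "insulin use"}
predicted = {"age 18-65", "type 2 diabetes", "insulin"}  # near-miss on one span
score = extraction_f1(gold, predicted)
```

Exact-match comparison is strict: the near-miss "insulin" vs "insulin use" counts as both a false positive and a false negative, which is one reason published pipelines sometimes also report partial-match scores.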