GPT

  • Article Type: Journal Article
    Purpose: To evaluate the accuracy of GPT-3.5, GPT-4, and a fine-tuned GPT-3.5 model in applying Fleischner Society recommendations to lung nodules. Methods: We generated 10 lung nodule descriptions for each of the 12 nodule categories from the Fleischner Society guidelines, incorporating each into a fictitious report (n = 120). GPT-3.5 and GPT-4 were prompted to make follow-up recommendations based on the reports. We then incorporated the full guidelines into the prompts and re-submitted them. Finally, we re-submitted the prompts to a fine-tuned GPT-3.5 model. Results were analyzed using binary accuracy analysis in R. Results: GPT-3.5 accuracy in applying the Fleischner Society guidelines was 0.058 (95% CI: 0.02, 0.12). GPT-4 accuracy was higher at 0.15 (95% CI: 0.09, 0.23; P = .02 for the accuracy comparison). In recommending PET-CT and/or biopsy, both GPT-3.5 and GPT-4 had an F-score of 0.00. After explicitly including the Fleischner Society guidelines in the prompt, GPT-3.5 and GPT-4 improved significantly, to accuracies of 0.42 (95% CI: 0.33, 0.51; P < .001) and 0.66 (95% CI: 0.57, 0.74; P < .001), respectively. GPT-4 remained significantly better than GPT-3.5 (P < .001). The fine-tuned GPT-3.5 model's accuracy was 0.46 (95% CI: 0.37, 0.55), not significantly different from that of the GPT-3.5 model with the guidelines included (P = .53). Conclusion: GPT-3.5 and GPT-4 performed poorly in applying widely known guidelines and never correctly recommended biopsy. Flawed knowledge and flawed reasoning both contributed to their poor performance. While GPT-4 was more accurate than GPT-3.5, its inaccuracy rate is unacceptable for clinical practice. These results underscore the limitations of large language models for knowledge- and reasoning-based tasks.
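    The abstract does not include the authors' code, but the two prompting conditions it describes (report alone vs. guidelines prepended to the prompt) are straightforward to sketch. The following is a minimal illustration, assuming the OpenAI chat-completions Python client; the prompt wording and the FLEISCHNER_GUIDELINES placeholder are hypothetical stand-ins, not the study's actual prompts.

```python
# Minimal sketch of the two prompting conditions described above.
# Assumes the openai Python client (>= 1.0) and an OPENAI_API_KEY in
# the environment; all prompt wording here is hypothetical.
from openai import OpenAI

client = OpenAI()

FLEISCHNER_GUIDELINES = "..."  # the full guideline text would be pasted here

def recommend(report: str, model: str = "gpt-4",
              include_guidelines: bool = False) -> str:
    """Ask the model for a follow-up recommendation for one nodule report."""
    prompt = f"Radiology report:\n{report}\n\nWhat follow-up do you recommend?"
    if include_guidelines:
        # Second condition: the guidelines are included verbatim in the prompt.
        prompt = (f"Fleischner Society guidelines:\n{FLEISCHNER_GUIDELINES}"
                  f"\n\n{prompt}")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

    Each of the 120 responses is then scored correct or incorrect against its guideline category, so the reported accuracies are binomial proportions with 95% confidence intervals; per the abstract, that analysis was done in R.
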
  • Article Type: Journal Article
    OBJECTIVE: Colonoscopy is commonly used in screening and surveillance for colorectal cancer. Multiple different guidelines provide recommendations on the interval between colonoscopies. This can be challenging for non-specialist healthcare providers to navigate. Large language models like ChatGPT are a potential tool for parsing patient histories and providing advice. However, the standard GPT model is not designed for medical use and can hallucinate. One way to overcome these challenges is to provide contextual information with medical guidelines to help the model respond accurately to queries. Our study compares the standard GPT4 against a contextualized model provided with relevant screening guidelines. We evaluated whether the models could provide correct advice for screening and surveillance intervals for colonoscopy.
    METHODS: Relevant guidelines pertaining to colorectal cancer screening and surveillance were formulated into a knowledge base for GPT. We tested 62 example case scenarios (three times each) on standard GPT4 and on a contextualized model with the knowledge base.
    RESULTS: The contextualized GPT4 model outperformed the standard GPT4 in all domains. No high-risk features were missed, and only two cases had hallucination of additional high-risk features. A correct interval to colonoscopy was provided in the majority of cases. Guidelines were appropriately cited in almost all cases.
    CONCLUSIONS: A contextualized GPT4 model could identify high-risk features and quote appropriate guidelines without significant hallucination. It gave a correct interval to the next colonoscopy in the majority of cases. This provides proof of concept that ChatGPT with appropriate refinement can serve as an accurate physician assistant.
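    The "contextualized model" here amounts to grounding GPT4 in a curated guideline knowledge base before it answers, so that its recommendations cite the supplied guidelines rather than hallucinate. Below is a minimal sketch, again assuming the OpenAI chat-completions Python client; the system prompt and the KNOWLEDGE_BASE entries are hypothetical placeholders, not the authors' actual knowledge base.

```python
# Minimal sketch of a contextualized model: curated guideline text is
# supplied as grounding context, and the model is told to cite it.
# Assumes the openai Python client; guideline entries are placeholders.
from openai import OpenAI

client = OpenAI()

KNOWLEDGE_BASE = [
    # In the study, relevant screening/surveillance guidelines were
    # formulated into entries along these lines (contents hypothetical).
    "Guideline A: average-risk screening colonoscopy every 10 years ...",
    "Guideline B: surveillance interval after adenoma removal ...",
]

def advise(case: str) -> str:
    """Return an interval recommendation grounded in the knowledge base."""
    context = "\n\n".join(KNOWLEDGE_BASE)
    messages = [
        {"role": "system",
         "content": "Answer using ONLY the guidelines below, and cite the "
                    "guideline you relied on.\n\n" + context},
        {"role": "user", "content": case},
    ]
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

# Each of the 62 case scenarios was submitted three times, e.g.:
# advise("55-year-old, two 8 mm tubular adenomas removed; next colonoscopy?")
```
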