prompt engineering

  • Article Type: Journal Article
    INTRODUCTION: Large language models such as OpenAI's (San Francisco, CA) ChatGPT-3.5 hold immense potential to augment self-directed learning in medicine, but concerns have arisen regarding their accuracy in specialized fields. This study compares ChatGPT-3.5 with an internet search engine in their ability to define the Randleman criteria and its five parameters within a self-directed learning environment.
    METHODS: Twenty-three medical students gathered information on the Randleman criteria. Each student was allocated 10 minutes to interact with ChatGPT-3.5, followed by 10 minutes to search the internet independently. Each ChatGPT-3.5 conversation, student summary, and internet reference was subsequently analyzed for accuracy, efficiency, and reliability.
    RESULTS: ChatGPT-3.5 provided the correct definition for 26.1% of students (6/23, 95% CI: 12.3% to 46.8%), while an independent internet search resulted in sources containing the correct definition for 100% of students (23/23, 95% CI: 87.5% to 100%, p = 0.0001). ChatGPT-3.5 incorrectly identified the Randleman criteria as a corneal ectasia staging system for 17.4% of students (4/23), fabricated a "Randleman syndrome" for 4.3% of students (1/23), and gave no definition for 52.2% of students (12/23). When a definition was given (47.8%, 11/23), a median of two of the five correct parameters was provided, along with a median of two additional fabricated parameters.
    CONCLUSIONS: The internet search engine outperformed ChatGPT-3.5 in providing accurate and reliable information on the Randleman criteria. ChatGPT-3.5 gave false information, required excessive prompting, and propagated misunderstandings. Learners should exercise discernment when using ChatGPT-3.5. Future initiatives should evaluate the implementation of prompt engineering and updated large language models.

  • Article Type: Journal Article
    OBJECTIVE: This study aimed to develop a prompt engineering procedure for test question mapping and then determine the effectiveness of test question mapping using ChatGPT compared to human faculty mapping.
    METHODS: We conducted a cross-sectional study to compare ChatGPT and human mapping using a sample of 139 test questions from modules within an integrated pharmacotherapeutics course series. The test questions were mapped by three faculty members to both module objectives and the Accreditation Council for Pharmacy Education Standards 2016 (Standards 2016) to create the "correct answer". Prompt engineering procedures were created to facilitate mapping with ChatGPT, and ChatGPT mapping results were compared with human mapping.
    RESULTS: ChatGPT mapped test questions directly to the "correct answer" based on human consensus in 68.0% of cases, and the program matched at least one individual human response in another 20.1% of cases, for a total of 88.1% agreement with human mappers. When humans fully agreed on the mapping decision, ChatGPT was more likely to map correctly.
    CONCLUSIONS: This study presents a practical use case with prompt engineering tailored for college assessment or curriculum committees to facilitate efficient test question and educational outcomes mapping.
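    A minimal sketch of this kind of mapping prompt, assuming the OpenAI Python SDK v1 (client.chat.completions.create); the objectives, question, model name, and answer format are illustrative placeholders, not the study's published procedure:

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        # Hypothetical module objectives; a committee would substitute its own.
        objectives = [
            "1. Recommend evidence-based drug therapy for hypertension.",
            "2. Identify clinically significant drug-drug interactions.",
            "3. Design monitoring plans for anticoagulation therapy.",
        ]

        question = (
            "A patient stabilized on warfarin starts trimethoprim-"
            "sulfamethoxazole. Which laboratory parameter should be "
            "monitored most closely?"
        )

        prompt = (
            "You are mapping exam questions to course objectives.\n"
            "Objectives:\n" + "\n".join(objectives) + "\n\n"
            "Question: " + question + "\n\n"
            "Reply with the single objective number that best matches."
        )

        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # deterministic output for reproducible mapping
            messages=[{"role": "user", "content": prompt}],
        )
        print(response.choices[0].message.content)  # e.g. "2"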

  • Article Type: Journal Article
    BACKGROUND: The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers.
    OBJECTIVE: We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract-screening process for systematic reviews. Our goal is to develop a screening method that maximizes sensitivity for identifying relevant records.
    METHODS: We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. The 3-layer screening was conducted using prompts tailored to each study. During this process, information extraction according to each study's inclusion criteria and optimization for screening were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and those meeting the inclusion criteria at all layers were judged as included.
    RESULTS: At each layer, both GPT-3.5 and GPT-4 were able to process about 110 records per minute, and the total time required for screening the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively, and screenings by both models judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.958/0.116 and 0.875/0.855, respectively, and screenings by both models judged all 9 records used for the meta-analysis as included. The sensitivities for the relevant records were in line with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the second. After accounting for records justifiably excluded by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second. Further investigation indicated that the cases incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, while those incorrectly excluded by GPT-4 were due to misinterpretations of the inclusion criteria.
    CONCLUSIONS: Our 3-layer screening method with GPT-4 demonstrated an acceptable level of sensitivity and specificity, supporting its practical application in systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.
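    A minimal sketch of the layered inclusion/exclusion flow described above, assuming the OpenAI Python SDK v1; the three criteria below are hypothetical stand-ins for the study-specific prompts:

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        # One criterion per layer, checked in order; a record is excluded at
        # the first layer whose criterion is not met.
        LAYERS = [
            ("research design", "Is this a randomized controlled trial?"),
            ("target patients", "Does it enroll adults with bipolar disorder?"),
            ("interventions and controls", "Does it compare an active "
             "treatment against placebo or usual care?"),
        ]

        def screen(title, abstract, model="gpt-4-0125-preview"):
            for layer, criterion in LAYERS:
                reply = client.chat.completions.create(
                    model=model,
                    temperature=0,
                    messages=[{"role": "user", "content": (
                        "Screening layer: " + layer + "\n"
                        "Criterion: " + criterion + "\n\n"
                        "Title: " + title + "\nAbstract: " + abstract + "\n\n"
                        "Answer strictly YES or NO."
                    )}],
                ).choices[0].message.content
                if "YES" not in reply.upper():
                    return False  # excluded at this layer
            return True  # met the inclusion criteria at all three layers

    Checking the cheapest-to-fail layer first keeps per-record cost low, since most irrelevant records never reach the later layers.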

  • Article Type: Journal Article
    Recent advancements in natural language processing, computational linguistics, and Artificial Intelligence (AI) have propelled the use of Large Language Models (LLMs) in Automated Essay Scoring (AES), offering efficient and unbiased writing assessment. This study assesses the reliability of LLMs in AES tasks, focusing on scoring consistency and alignment with human raters. We explore the impact of prompt engineering, temperature settings, and multi-level rating dimensions on the scoring performance of LLMs. Results indicate that prompt engineering significantly affects the reliability of LLMs, with GPT-4 showing marked improvement over GPT-3.5 and Claude 2, achieving 112% and 114% increases in scoring accuracy under the criteria- and sample-referenced justification prompt. Temperature settings also influence the output consistency of LLMs, with lower temperatures producing scores more in line with human evaluations, which is essential for maintaining fairness in large-scale assessment. Regarding multi-dimensional writing assessment, results indicate that GPT-4 performs well on the Ideas (QWK = 0.551) and Organization (QWK = 0.584) dimensions under well-crafted prompt engineering. These findings pave the way for a comprehensive exploration of LLMs' broader educational implications, offering insights into their capability to refine and potentially transform writing instruction, assessment, and the delivery of diagnostic and personalized feedback in the AI-powered educational age. While this study focused on the reliability and alignment of LLM-powered multi-dimensional AES, future research should broaden its scope to encompass diverse writing genres and a more extensive sample from varied backgrounds.
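    A minimal sketch of criteria-referenced essay scoring at a low temperature, assuming the OpenAI Python SDK v1; the rubric text and dimensions are illustrative, not the study's instrument:

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        # Hypothetical rubric; a criteria-and-sample-referenced prompt would
        # also include scored anchor essays as reference examples.
        RUBRIC = (
            "Score the essay from 1 (poor) to 6 (excellent) on each "
            "dimension: Ideas and Organization. For each dimension, give "
            "the score and a one-sentence justification citing the criteria."
        )

        essay = "..."  # the essay under assessment

        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0.1,  # low temperature for more consistent scores
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": essay},
            ],
        )
        print(response.choices[0].message.content)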

  • Article Type: Journal Article
    This study was designed to assess how different prompt engineering techniques, specifically direct prompts, Chain of Thought (CoT), and a modified CoT approach, influence the ability of GPT-3.5 to answer clinical and calculation-based medical questions, particularly those styled like the USMLE Step 1 exams. To achieve this, we analyzed the responses of GPT-3.5 to two distinct sets of questions: a batch of 1000 questions generated by GPT-4, and another set comprising 95 real USMLE Step 1 questions. These questions spanned a range of medical calculations and clinical scenarios across various fields and difficulty levels. Our analysis revealed no significant differences in the accuracy of GPT-3.5's responses when using direct prompts, CoT, or modified CoT methods. For instance, in the USMLE sample, the success rates were 61.7% for direct prompts, 62.8% for CoT, and 57.4% for modified CoT, with a p-value of 0.734. Similar trends were observed in the responses to the GPT-4-generated questions, both clinical and calculation-based, with p-values above 0.05 indicating no significant difference between the prompt types. The conclusion drawn from this study is that the use of CoT prompt engineering does not significantly alter GPT-3.5's effectiveness in handling medical calculations or clinical scenario questions styled like those in USMLE exams. This finding is crucial as it suggests that the performance of ChatGPT remains consistent whether a CoT technique or a direct prompt is used. This consistency could be instrumental in simplifying the integration of AI tools like ChatGPT into medical education, enabling healthcare professionals to utilize these tools with ease, without the need for complex prompt engineering.
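    A minimal sketch contrasting a direct prompt with a chain-of-thought prompt on a calculation-style item, assuming the OpenAI Python SDK v1; the question is an illustrative pharmacokinetics calculation, not a USMLE item:

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        question = (
            "A drug has a volume of distribution of 40 L and a target plasma "
            "concentration of 5 mg/L. What loading dose is required?"
        )

        # The two conditions differ only in the instruction appended to the
        # question (expected answer: 40 L x 5 mg/L = 200 mg).
        prompts = {
            "direct": question + "\nAnswer with the final value only.",
            "chain-of-thought": question + "\nLet's think step by step, "
            "then state the final value.",
        }

        for label, prompt in prompts.items():
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                temperature=0,
                messages=[{"role": "user", "content": prompt}],
            )
            print(label, "->", response.choices[0].message.content)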

  • Article Type: Journal Article
    OBJECTIVE: Integrating large language models (LLMs) such as GPT-4 Turbo into diagnostic imaging faces a significant challenge, with current misdiagnosis rates ranging from 30% to 50%. This study evaluates how prompt engineering and confidence thresholds can improve diagnostic accuracy in neuroradiology.
    METHODS: We analyzed 751 neuroradiology cases from the American Journal of Neuroradiology using GPT-4 Turbo with customized prompts to improve diagnostic precision.
    RESULTS: GPT-4 Turbo achieved a baseline diagnostic accuracy of 55.1%. By reformatting responses to list five diagnostic candidates and applying a 90% confidence threshold, top-diagnosis precision increased to 72.9%, the candidate list contained the correct diagnosis in 85.9% of cases, and the misdiagnosis rate fell to 14.1%. However, this threshold reduced the number of cases for which a response was given.
    CONCLUSIONS: Strategic prompt engineering and high confidence thresholds significantly reduce misdiagnoses and improve the precision of LLM diagnostics in neuroradiology. More research is needed to optimize these approaches for broader clinical implementation, balancing accuracy and utility.
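    A minimal sketch of the candidate-list-plus-threshold idea, assuming the OpenAI Python SDK v1 and assuming the model returns well-formed JSON (a production version would validate the output); the case text and prompt wording are illustrative:

        import json
        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        case = "..."  # history and imaging findings, rendered as text

        response = client.chat.completions.create(
            model="gpt-4-turbo",
            temperature=0,
            messages=[{"role": "user", "content": (
                "Case: " + case + "\n\n"
                "Return the five most likely diagnoses as a JSON array of "
                'objects {"diagnosis": string, "confidence": number in '
                "[0, 1]}, ordered from most to least likely. "
                "Return only the JSON."
            )}],
        )
        candidates = json.loads(response.choices[0].message.content)

        top = candidates[0]
        if top["confidence"] >= 0.90:  # answer only above the threshold
            print("Primary diagnosis:", top["diagnosis"])
        else:
            print("Below threshold; defer to the radiologist.")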

  • Article Type: Journal Article
    BACKGROUND: ChatGPT and other chatbots have emerged as tools for interacting with information in a manner resembling natural human speech. Consequently, the technology is used across various disciplines, including business, education, and even the biomedical sciences. There is a need to better understand how ChatGPT can be used to advance gerontology research. Therefore, we evaluated ChatGPT responses to questions on specific topics in gerontology research and brainstormed recommendations for its use in the field.
    METHODS: We conducted semi-structured brainstorming sessions to identify uses of ChatGPT in gerontology research. We divided a team of multidisciplinary researchers into four topical groups: a) gero-clinical science, b) basic geroscience, c) informatics as it relates to electronic health records (EHR), and d) gero-technology. Each group prompted ChatGPT on a theory-, methods-, and interpretation-based question and rated responses for accuracy and completeness based on standardized scales.
    RESULTS: ChatGPT responses were rated by all groups as generally accurate. However, the completeness of responses was rated lower, except by members of the informatics group, who rated responses as highly comprehensive.
    CONCLUSIONS: ChatGPT accurately depicts some major concepts in gerontological research. However, researchers have an important role in critically appraising the completeness of its responses. Having a single generalized resource like ChatGPT may help summarize the preponderance of evidence in the field to identify gaps in knowledge and promote cross-disciplinary collaboration.

  • Article Type: Journal Article
    OBJECTIVE: This study examines the application of Large Language Models (LLMs) in diagnosing jaw deformities, aiming to overcome the limitations of various diagnostic methods by harnessing the advanced capabilities of LLMs for enhanced data interpretation. The goal is to provide tools that simplify complex data analysis and make diagnostic processes more accessible and intuitive for clinical practitioners.
    METHODS: An experiment involving patients with jaw deformities was conducted, in which cephalometric measurements (SNB Angle, Facial Angle, Mandibular Unit Length) were converted into text for LLM analysis. Multiple LLMs, including LLAMA-2 variants, GPT models, and the Gemini-Pro model, were evaluated against various methods (threshold-based, machine learning models) using balanced accuracy and F1-score.
    RESULTS: Our research demonstrates that larger LLMs efficiently adapt to diagnostic tasks, showing rapid performance saturation with minimal training examples and reduced ambiguous classification, which highlights their robust in-context learning abilities. The conversion of complex cephalometric measurements into intuitive text formats not only broadens the accessibility of the information but also enhances its interpretability, providing clinicians with clear and actionable insights.
    CONCLUSIONS: Integrating LLMs into the diagnosis of jaw deformities marks a significant advancement in making diagnostic processes more accessible and reducing reliance on specialized training. These models serve as valuable auxiliary tools, offering clear, understandable outputs that facilitate easier decision-making for clinicians, particularly those with less experience or in settings with limited access to specialized expertise. Future refinements and adaptations to include more comprehensive and medically specific datasets are expected to enhance the precision and utility of LLMs, potentially transforming the landscape of medical diagnostics.
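    A minimal sketch of converting numeric cephalometric measurements into a text prompt for classification, assuming the OpenAI Python SDK v1; the values, units, and prompt wording are illustrative, not the study's protocol:

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        # Hypothetical measurements for one patient (degrees / millimeters).
        measurements = {
            "SNB Angle": 74.5,
            "Facial Angle": 82.0,
            "Mandibular Unit Length": 112.3,
        }

        # Serialize the numeric features as plain text, the core idea above.
        prompt = (
            "Cephalometric measurements:\n"
            + "\n".join(f"- {name}: {value}"
                        for name, value in measurements.items())
            + "\n\nClassify the jaw relationship as skeletal Class I, II, "
            "or III, and briefly justify the classification."
        )

        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        print(response.choices[0].message.content)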

  • Article Type: Journal Article
    BACKGROUND: Understanding the multifaceted nature of health outcomes requires a comprehensive examination of the social, economic, and environmental determinants that shape individual well-being. Among these determinants, behavioral factors play a crucial role, particularly the consumption patterns of psychoactive substances, which have important implications for public health. The Global Burden of Disease Study shows a growing impact, in disability-adjusted life years, due to substance use. Successful identification of patients' substance use information equips clinical care teams to address substance-related issues more effectively, enabling targeted support and ultimately improving patient outcomes.
    OBJECTIVE: Traditional natural language processing methods face limitations in accurately parsing the diverse clinical language associated with substance use. Large language models offer promise in overcoming these challenges by adapting to diverse language patterns. This study investigates the application of the generative pretrained transformer (GPT) model, specifically GPT-3.5, for extracting tobacco, alcohol, and substance use information from patient discharge summaries in zero-shot and few-shot learning settings. The study contributes to the evolving landscape of health care informatics by showcasing the potential of advanced language models to extract nuanced information critical for enhancing patient care.
    METHODS: The main data source for this analysis is the Medical Information Mart for Intensive Care III (MIMIC-III) data set. Among all notes in the data set, we focused on discharge summaries. Prompt engineering was undertaken, involving an iterative exploration of diverse prompts. Leveraging carefully curated examples and refined prompts, we investigated the model's proficiency under zero-shot as well as few-shot prompting strategies.
    RESULTS: The results show GPT's varying effectiveness in identifying mentions of tobacco, alcohol, and substance use across learning scenarios. Zero-shot learning showed high accuracy in identifying substance use, whereas few-shot learning reduced accuracy but improved the identification of substance use status, enhancing recall and F1-score at the expense of lower precision.
    CONCLUSIONS: The excellence of zero-shot learning in precisely extracting text spans mentioning substance use demonstrates its effectiveness in situations in which comprehensive recall is important. Conversely, few-shot learning offers advantages when accurately determining the status of substance use is the primary focus, even if it involves a trade-off in precision. The results contribute to enhancing early detection and intervention strategies, tailoring treatment plans with greater precision, and, ultimately, building a holistic understanding of patient health profiles. By integrating these artificial intelligence-driven methods into electronic health record systems, clinicians can gain immediate, comprehensive insights into substance use, shaping interventions that are not only timely but also more personalized and effective.
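    A minimal sketch of the zero-shot versus few-shot prompting contrast, assuming the OpenAI Python SDK v1; the summary text and the single few-shot example are fabricated stand-ins, since MIMIC-III notes cannot be reproduced here:

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        TASK = (
            "From the discharge summary, extract each text span that "
            "mentions tobacco, alcohol, or other substance use, and label "
            "its status as current, former, or never."
        )

        # Hypothetical worked example, supplied only in the few-shot run.
        FEW_SHOT = [
            {"role": "user",
             "content": TASK + "\n\nSummary: Pt quit smoking 10 years ago."},
            {"role": "assistant",
             "content": 'tobacco: "quit smoking 10 years ago" -> former'},
        ]

        summary = "Reports drinking 2-3 beers nightly; denies illicit drug use."

        for label, shots in [("zero-shot", []), ("few-shot", FEW_SHOT)]:
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                temperature=0,
                messages=shots + [{"role": "user",
                                   "content": TASK + "\n\nSummary: " + summary}],
            )
            print(label, "->", response.choices[0].message.content)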

  • Article Type: Journal Article
    OBJECTIVE: Large Language Models (LLMs) have been proposed as a solution to address high volumes of Patient Medical Advice Requests (PMARs). This study addresses whether LLMs, with prompt engineering, can generate high-quality draft responses to PMARs that satisfy both patients and clinicians.
    METHODS: We designed a novel human-involved iterative process to train and validate prompts to the LLM for creating appropriate responses to PMARs. GPT-4 was used to generate responses to the messages. We updated the prompts and evaluated both clinician and patient acceptance of LLM-generated draft responses at each iteration, then tested the optimized prompt on independent validation data sets. The optimized prompt was implemented in the electronic health record production environment and tested by 69 primary care clinicians.
    RESULTS: After 3 iterations of prompt engineering, physician acceptance of draft suitability increased from 62% to 84% (P < .001) in the validation dataset (N = 200), and 74% of drafts in the test dataset were rated as "helpful." Patients also noted significantly increased favorability of message tone (78%) and overall quality (80%) for the optimized prompt compared with the original prompt in the training dataset. Patients were unable to differentiate human- and LLM-generated draft PMAR responses for 76% of the messages, in contrast to an earlier preference for human-generated responses. A majority (72%) of clinicians believed the tool could reduce cognitive load in dealing with InBasket messages.
    CONCLUSIONS: Informed synergistically by clinician and patient feedback, tuning the LLM prompt alone can be effective in creating clinically relevant and useful draft responses to PMARs.
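    A minimal sketch of the draft-generation step with an iteratively revised system prompt, assuming the OpenAI Python SDK v1; the instructions and feedback string are hypothetical, and in the study the revision was driven by structured clinician and patient ratings rather than a single string:

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        system_prompt = (
            "You draft replies to patient portal messages for a primary "
            "care clinician to review. Be empathetic, do not offer new "
            "diagnoses, and recommend a visit when clinical judgment is "
            "required."
        )

        def draft_reply(patient_message):
            response = client.chat.completions.create(
                model="gpt-4",
                temperature=0.3,
                messages=[{"role": "system", "content": system_prompt},
                          {"role": "user", "content": patient_message}],
            )
            return response.choices[0].message.content

        # One refinement iteration: fold reviewer feedback into the system
        # prompt, then regenerate drafts and collect ratings again.
        feedback = "Drafts are too long and too clinical; shorten and warm the tone."
        system_prompt += "\nReviewer feedback to incorporate: " + feedback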
