GPT

  • Article Type: Journal Article
    To evaluate the response capabilities of ChatGPT 3.5 and an internet-connected GPT-4 engine (Microsoft Copilot) in a public healthcare system otolaryngology job competition examination, with the real scores of otolaryngology specialists as the control group. In September 2023, 135 questions divided into theoretical and practical parts were input into ChatGPT 3.5 and an internet-connected GPT-4. The accuracy of AI responses was compared with the official results from otolaryngologists who took the exam, and statistical analysis was conducted using Stata 14.2. Copilot (GPT-4) outperformed ChatGPT 3.5. Copilot achieved a score of 88.5 points, while ChatGPT scored 60 points. Both AIs had discrepancies in their incorrect answers. Despite ChatGPT's proficiency, Copilot displayed superior performance, ranking as the second-best score among the 108 otolaryngologists who took the exam, while ChatGPT was placed 83rd. A chat powered by GPT-4 with internet access (Copilot) demonstrates superior performance in responding to multiple-choice medical questions compared to ChatGPT 3.5.

  • Article Type: Journal Article
    BACKGROUND: In the United States, 1 in 5 adults currently serves as a family caregiver for an individual with a serious illness or disability. Unlike professional caregivers, family caregivers often assume this role without formal preparation or training. Thus, there is an urgent need to enhance the capacity of family caregivers to provide quality care. Leveraging technology as an educational tool or an adjunct to care is a promising approach that has the potential to enhance the learning and caregiving capabilities of family caregivers. Large language models (LLMs) can potentially be used as a foundation technology for supporting caregivers. An LLM can be categorized as a foundation model (FM), which is a large-scale model trained on a broad data set that can be adapted to a range of different domain tasks. Despite their potential, FMs have the critical weakness of "hallucination," where the models generate information that can be misleading or inaccurate. Information reliability is essential when language models are deployed as front-line help tools for caregivers.
    OBJECTIVE: This study aimed to (1) develop a reliable caregiving language model (CaLM) by using FMs and a caregiving knowledge base, (2) develop an accessible CaLM using a small FM that requires fewer computing resources, and (3) evaluate the model's performance compared with a large FM.
    METHODS: We developed a CaLM using the retrieval augmented generation (RAG) framework combined with FM fine-tuning for improving the quality of FM answers by grounding the model on a caregiving knowledge base. The key components of the CaLM are the caregiving knowledge base, a fine-tuned FM, and a retriever module. We used 2 small FMs as candidates for the foundation of the CaLM (LLaMA [large language model Meta AI] 2 and Falcon with 7 billion parameters) and adopted a large FM (GPT-3.5 with an estimated 175 billion parameters) as a benchmark. We developed the caregiving knowledge base by gathering various types of documents from the internet. We focused on caregivers of individuals with Alzheimer disease and related dementias. We evaluated the models' performances using the benchmark metrics commonly used in evaluating language models and their reliability for providing accurate references with their answers.
    RESULTS: The RAG framework improved the performance of all FMs used in this study across all measures. As expected, the large FM performed better than the small FMs across all metrics. Interestingly, the small fine-tuned FMs with RAG performed significantly better than GPT 3.5 across all metrics. The fine-tuned LLaMA 2 with a small FM performed better than GPT 3.5 (even with RAG) in returning references with the answers.
    CONCLUSIONS: The study shows that a reliable and accessible CaLM can be developed using small FMs with a knowledge base specific to the caregiving domain.
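
    The METHODS paragraph above names the three components of the CaLM: a caregiving knowledge base, a retriever module, and a fine-tuned foundation model combined through retrieval augmented generation (RAG). The paper's own implementation is not reproduced here; the following is a minimal sketch of that general RAG pattern using a TF-IDF retriever, where the knowledge-base snippets and prompt wording are illustrative assumptions. The grounded prompt would then be passed to whichever FM is being evaluated (LLaMA 2, Falcon, or GPT-3.5).

        # Minimal RAG sketch: retrieve knowledge-base passages, then ground the answer prompt on them.
        # Knowledge-base snippets and prompt wording are illustrative assumptions, not the authors' data.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        knowledge_base = [
            "Caregivers of people with dementia should keep a consistent daily routine.",
            "Respite care services give family caregivers short-term relief.",
            "Wandering can be reduced by securing exits and using door alarms.",
        ]

        vectorizer = TfidfVectorizer().fit(knowledge_base)
        kb_vectors = vectorizer.transform(knowledge_base)

        def retrieve(question: str, k: int = 2) -> list[str]:
            """Return the k knowledge-base passages most similar to the question."""
            scores = cosine_similarity(vectorizer.transform([question]), kb_vectors)[0]
            return [knowledge_base[i] for i in scores.argsort()[::-1][:k]]

        def build_prompt(question: str) -> str:
            """Assemble a grounded prompt: retrieved passages first, then the caregiver's question."""
            context = "\n".join(f"- {p}" for p in retrieve(question))
            return ("Answer using only the reference passages below and cite the ones you used.\n"
                    f"References:\n{context}\n\nQuestion: {question}\nAnswer:")

        # The grounded prompt is then sent to the fine-tuned FM (e.g., LLaMA 2 7B or Falcon 7B).
        print(build_prompt("How can I keep my father with dementia from wandering at night?"))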

  • Article Type: Journal Article
    Psoriasis is an immune-mediated skin disease affecting approximately 3% of the global population. Proper management of this condition necessitates the assessment of the Body Surface Area (BSA) and the involvement of nails and joints. Recently, the integration of Natural Language Processing (NLP) with Electronic Medical Records (EMRs) has shown promise in advancing disease classification and research. This study evaluates the performance of ChatGPT-4, a commercial AI platform, in analyzing unstructured EMR data of psoriasis patients, particularly in identifying affected body areas.

  • Article Type: Journal Article
    The integration of artificial intelligence (AI), particularly deep learning models, has transformed the landscape of medical technology, especially in the field of diagnosis using imaging and physiological data. In otolaryngology, AI has shown promise in image classification for middle ear diseases. However, existing models often lack patient-specific data and clinical context, limiting their universal applicability. The emergence of GPT-4 Vision (GPT-4V) has enabled a multimodal diagnostic approach, integrating language processing with image analysis.
    In this study, we investigated the effectiveness of GPT-4V in diagnosing middle ear diseases by integrating patient-specific data with otoscopic images of the tympanic membrane.
    The design of this study was divided into two phases: (1) establishing a model with appropriate prompts and (2) validating the ability of the optimal prompt model to classify images. In total, 305 otoscopic images of 4 middle ear diseases (acute otitis media, middle ear cholesteatoma, chronic otitis media, and otitis media with effusion) were obtained from patients who visited Shinshu University or Jichi Medical University between April 2010 and December 2023. The optimized GPT-4V settings were established using prompts and patients' data, and the model created with the optimal prompt was used to verify the diagnostic accuracy of GPT-4V on 190 images. To compare the diagnostic accuracy of GPT-4V with that of physicians, 30 clinicians completed a web-based questionnaire consisting of 190 images.
    The multimodal AI approach achieved an accuracy of 82.1%, which is superior to that of certified pediatricians at 70.6%, but trailing behind that of otolaryngologists at more than 95%. The model's disease-specific accuracy rates were 89.2% for acute otitis media, 76.5% for chronic otitis media, 79.3% for middle ear cholesteatoma, and 85.7% for otitis media with effusion, which highlights the need for disease-specific optimization. Comparisons with physicians revealed promising results, suggesting the potential of GPT-4V to augment clinical decision-making.
    Despite its advantages, challenges such as data privacy and ethical considerations must be addressed. Overall, this study underscores the potential of multimodal AI for enhancing diagnostic accuracy and improving patient care in otolaryngology. Further research is warranted to optimize and validate this approach in diverse clinical settings.
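
    The method above hinges on prompting GPT-4V with patient-specific details alongside each otoscopic image. The study's actual prompts are not given in the abstract; the snippet below is a minimal sketch of how such a text-plus-image request is commonly assembled with the OpenAI Python client, where the patient fields, instruction wording, model name, and file name are assumptions.

        # Sketch of a multimodal (text + image) diagnostic prompt in the spirit of the GPT-4V setup above.
        # Patient fields, instruction text, model name, and file name are assumptions, not the study's prompt.
        import base64
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        def classify_otoscopic_image(image_path: str, age: int, symptoms: str) -> str:
            with open(image_path, "rb") as f:
                image_b64 = base64.b64encode(f.read()).decode()
            prompt = (
                f"Patient: {age}-year-old with {symptoms}. "
                "Classify the tympanic membrane image as acute otitis media, chronic otitis media, "
                "middle ear cholesteatoma, or otitis media with effusion. Answer with one diagnosis."
            )
            response = client.chat.completions.create(
                model="gpt-4-vision-preview",  # assumed vision-capable model name
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url",
                         "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                    ],
                }],
            )
            return response.choices[0].message.content

        # Hypothetical example call:
        # print(classify_otoscopic_image("otoscope_001.jpg", age=6, symptoms="ear pain and fever"))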

  • Article Type: Journal Article
    In this cross-sectional study, we evaluated the completeness, readability, and syntactic complexity of cardiovascular disease prevention information produced by GPT-4 in response to 4 kinds of prompts.

  • Article Type: Journal Article
    BACKGROUND: Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches.
    OBJECTIVE: The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types (heuristic and ensemble prompts), for zero-shot and few-shot clinical information extraction using pretrained language models.
    METHODS: This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches.
    RESULTS: The study revealed that task-specific prompt tailoring is vital for the high performance of LLMs for zero-shot clinical NLP. In clinical sense disambiguation, GPT-3.5 achieved an accuracy of 0.96 with heuristic prompts and 0.94 in biomedical evidence extraction. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on multiple prompt strengths. GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types.
    CONCLUSIONS: This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research, facilitating engagement by non-NLP experts in clinical NLP advancements. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative artificial intelligence, and we hope that it will inspire and inform future research in this area.
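
    As a concrete illustration of the prompt types compared above, the sketch below shows a zero-shot prompt, a few-shot prompt, and a simple majority-vote ensemble for one of the listed tasks (medication status extraction). The templates and example sentences are assumptions made for illustration, not the prompts used in the study; llm() stands in for any of the evaluated models.

        # Illustrative zero-shot, few-shot, and majority-vote ensemble prompting for medication status
        # extraction. Templates and examples are assumptions, not the study's prompts.
        from collections import Counter

        def zero_shot_prompt(sentence: str) -> str:
            return ("Extract each medication in the sentence and label its status as active, "
                    f"discontinued, or neither.\nSentence: {sentence}\nAnswer:")

        def few_shot_prompt(sentence: str) -> str:
            examples = ("Sentence: The patient stopped lisinopril last month.\n"
                        "Answer: lisinopril - discontinued\n"
                        "Sentence: She continues metformin 500 mg twice daily.\n"
                        "Answer: metformin - active\n")
            return examples + f"Sentence: {sentence}\nAnswer:"

        def ensemble_extract(sentence: str, llm) -> str:
            """Query the model with several prompt variants and keep the majority answer."""
            prompts = [zero_shot_prompt(sentence),
                       few_shot_prompt(sentence),
                       "Think step by step, then answer.\n" + zero_shot_prompt(sentence)]
            answers = [llm(p).strip() for p in prompts]
            return Counter(answers).most_common(1)[0][0]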

  • Article Type: Journal Article
    BACKGROUND: Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival human experts. However, challenges remain in the analysis of complex data containing images and diagrams.
    OBJECTIVE: This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) for a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination.
    METHODS: Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the presence of images, clinical area of the questions, and variations in the answer content were examined.
    RESULTS: The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate. For all content types, the addition of translation and prompts increased the accuracy rate. As for the performance based on image-based questions, the average of correct answer rate with text-only input was 30.4%, and that with text-plus-image input was 41.3% (P=.02).
    CONCLUSIONS: Examination of artificial intelligence's answering capabilities for the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although improvement was noted with the addition of translation and prompts, the accuracy rate for image-based questions was lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input achieved a higher correct answer rate on image-based questions. Our findings imply the usefulness and potential of GPT-4V in medicine; however, future consideration of safe use methods is needed.

  • Article Type: Journal Article
    The current paradigm in mental health care focuses on clinical recovery and symptom remission. This model's efficacy is influenced by therapist trust in patient recovery potential and the depth of the therapeutic relationship. Schizophrenia is a chronic illness with severe symptoms where the possibility of recovery is a matter of debate. As artificial intelligence (AI) becomes integrated into the health care field, it is important to examine its ability to assess recovery potential in major psychiatric disorders such as schizophrenia.
    This study aimed to evaluate the ability of large language models (LLMs) in comparison to mental health professionals to assess the prognosis of schizophrenia with and without professional treatment and the long-term positive and negative outcomes.
    Vignettes were inputted into LLM interfaces and assessed 10 times by 4 AI platforms: ChatGPT-3.5, ChatGPT-4, Google Bard, and Claude. A total of 80 evaluations were collected and benchmarked against existing norms to analyze what mental health professionals (general practitioners, psychiatrists, clinical psychologists, and mental health nurses) and the general public think about schizophrenia prognosis with and without professional treatment and the positive and negative long-term outcomes of schizophrenia interventions.
    For the prognosis of schizophrenia with professional treatment, ChatGPT-3.5 was notably pessimistic, whereas ChatGPT-4, Claude, and Bard aligned with professional views but differed from the general public. All LLMs believed untreated schizophrenia would remain static or worsen without professional treatment. For long-term outcomes, ChatGPT-4 and Claude predicted more negative outcomes than Bard and ChatGPT-3.5. For positive outcomes, ChatGPT-3.5 and Claude were more pessimistic than Bard and ChatGPT-4.
    The finding that 3 out of the 4 LLMs aligned closely with the predictions of mental health professionals when considering the "with treatment" condition is a demonstration of the potential of this technology in providing professional clinical prognosis. The pessimistic assessment of ChatGPT-3.5 is a disturbing finding since it may reduce the motivation of patients to start or persist with treatment for schizophrenia. Overall, although LLMs hold promise in augmenting health care, their application necessitates rigorous validation and a harmonious blend with human expertise.

  • Article Type: Journal Article
    BACKGROUND: Communication is a core competency of medical professionals and of utmost importance for patient safety. Although medical curricula emphasize communication training, traditional formats, such as real or simulated patient interactions, can present psychological stress and are limited in repetition. The recent emergence of large language models (LLMs), such as generative pretrained transformer (GPT), offers an opportunity to overcome these restrictions.
    OBJECTIVE: The aim of this study was to explore the feasibility of a GPT-driven chatbot to practice history taking, one of the core competencies of communication.
    METHODS: We developed an interactive chatbot interface using GPT-3.5 and a specific prompt including a chatbot-optimized illness script and a behavioral component. Following a mixed methods approach, we invited medical students to voluntarily practice history taking. To determine whether GPT provides suitable answers as a simulated patient, the conversations were recorded and analyzed using quantitative and qualitative approaches. We analyzed the extent to which the questions and answers aligned with the provided script, as well as the medical plausibility of the answers. Finally, the students filled out the Chatbot Usability Questionnaire (CUQ).
    RESULTS: A total of 28 students practiced with our chatbot (mean age 23.4, SD 2.9 years). We recorded a total of 826 question-answer pairs (QAPs), with a median of 27.5 QAPs per conversation and 94.7% (n=782) pertaining to history taking. When questions were explicitly covered by the script (n=502, 60.3%), the GPT-provided answers were mostly based on explicit script information (n=471, 94.4%). For questions not covered by the script (n=195, 23.4%), the GPT answers used 56.4% (n=110) fictitious information. Regarding plausibility, 842 (97.9%) of 860 QAPs were rated as plausible. Of the 14 (2.1%) implausible answers, GPT provided answers rated as socially desirable, leaving role identity, ignoring script information, illogical reasoning, and calculation error. Despite these results, the CUQ revealed an overall positive user experience (77/100 points).
    CONCLUSIONS: Our data showed that LLMs, such as GPT, can provide a simulated patient experience and yield a good user experience and a majority of plausible answers. Our analysis revealed that GPT-provided answers use either explicit script information or are based on available information, which can be understood as abductive reasoning. Although rare, the GPT-based chatbot provides implausible information in some instances, with the major tendency being socially desirable instead of medically plausible information.
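
    The setup above, a GPT-3.5 chatbot driven by a chatbot-optimized illness script and a behavioral component, can be pictured as a system prompt plus a conversation loop. The sketch below shows that general shape with the OpenAI Python client; the illness-script content, behavioral rules, and wording are assumptions for illustration and not the study's actual prompt.

        # Sketch of a simulated-patient chat loop: a system prompt carrying an illness script and
        # behavioral rules, served by GPT-3.5. The script content is an illustrative assumption.
        from openai import OpenAI

        client = OpenAI()

        SYSTEM_PROMPT = (
            "You are role-playing a patient for history-taking practice. Stay in character.\n"
            "Illness script (answer only from this; if asked something not covered, say you do not know):\n"
            "- 54-year-old teacher, chest pressure on exertion for 2 weeks, relieved by rest\n"
            "- Smoker, 20 pack-years; father had a myocardial infarction at age 60\n"
            "Behavior: answer briefly, do not volunteer information, never give medical advice."
        )

        def chat() -> None:
            messages = [{"role": "system", "content": SYSTEM_PROMPT}]
            while True:
                question = input("Student: ")
                if not question:
                    break
                messages.append({"role": "user", "content": question})
                reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
                answer = reply.choices[0].message.content
                messages.append({"role": "assistant", "content": answer})
                print("Patient:", answer)

        if __name__ == "__main__":
            chat()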

  • Article Type: Journal Article
    BACKGROUND: The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and subsequent health care decisions. Traditional methods rely heavily on human reviewers, often requiring a significant investment of time and resources.
    OBJECTIVE: This study aims to assess the performance of the OpenAI generative pretrained transformer (GPT) and GPT-4 application programming interfaces (APIs) in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review data sets and comparing their performance against ground truth labeling by 2 independent human reviewers.
    METHODS: We introduce a novel workflow using the ChatGPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to make calls to the API with the screening criteria in natural language and a corpus of title and abstract data sets filtered by a minimum of 2 human reviewers. We compared the performance of our model against human-reviewed papers across 6 review papers, screening over 24,000 titles and abstracts.
    RESULTS: Our results show an accuracy of 0.91, a macro F1-score of 0.60, a sensitivity of excluded papers of 0.91, and a sensitivity of included papers of 0.76. The interrater variability between 2 independent human screeners was κ=0.46, and the prevalence and bias-adjusted κ between our proposed methods and the consensus-based human decisions was κ=0.96. On a randomly selected subset of papers, the GPT models demonstrated the ability to provide reasoning for their decisions and corrected their initial decisions upon being asked to explain their reasoning for incorrect classifications.
    CONCLUSIONS: Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research.
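
    The METHODS paragraph above describes a Python script that sends natural-language screening criteria together with each title and abstract to the API. The snippet below is a minimal sketch of that kind of workflow; the criteria text, model name, CSV column names, and file names are assumptions for illustration rather than the authors' code.

        # Sketch of title/abstract screening via the GPT-4 API: the screening criteria are stated in
        # natural language and each record receives an INCLUDE/EXCLUDE decision. Criteria, model name,
        # and CSV layout are assumptions.
        import csv
        from openai import OpenAI

        client = OpenAI()

        CRITERIA = ("Include studies that evaluate a large language model on a clinical task. "
                    "Exclude animal studies, editorials, and study protocols.")

        def screen(title: str, abstract: str) -> str:
            prompt = (f"Screening criteria: {CRITERIA}\n\nTitle: {title}\nAbstract: {abstract}\n\n"
                      "Should this record be included? Answer INCLUDE or EXCLUDE, then give one sentence of reasoning.")
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            return response.choices[0].message.content

        # Hypothetical input file with columns: title, abstract
        with open("records.csv", newline="") as f, open("decisions.csv", "w", newline="") as out:
            writer = csv.writer(out)
            for row in csv.DictReader(f):
                writer.writerow([row["title"], screen(row["title"], row["abstract"])])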