GPT
  • Article type: Journal Article
    This study evaluated the response capabilities of ChatGPT 3.5 and an internet-connected GPT-4 engine (Microsoft Copilot) on a public healthcare system otolaryngology job competition examination, using the real scores of otolaryngology specialists as the control group. In September 2023, 135 questions divided into theoretical and practical parts were input into ChatGPT 3.5 and the internet-connected GPT-4. The accuracy of the AI responses was compared with the official results of the otolaryngologists who took the exam, and statistical analysis was conducted using Stata 14.2. Copilot (GPT-4) outperformed ChatGPT 3.5, scoring 88.5 points versus ChatGPT's 60 points. The two AIs differed in which questions they answered incorrectly. Despite ChatGPT's proficiency, Copilot displayed superior performance, ranking second among the 108 otolaryngologists who took the exam, while ChatGPT placed 83rd. A chat powered by GPT-4 with internet access (Copilot) demonstrates superior performance in answering multiple-choice medical questions compared with ChatGPT 3.5.

  • Article type: Journal Article
    Demographics, social determinants of health, and family history documented in the unstructured text of electronic health records are increasingly being studied to understand how this information can be combined with structured data to improve healthcare outcomes. Since the release of the GPT models, many studies have applied them to extract this information from narrative clinical notes. Unlike existing work, our research investigates zero-shot learning for extracting all of this information jointly while providing minimal information to the GPT model. We utilize de-identified real-world clinical notes annotated for demographics, various social determinants, and family history information. Given that the GPT model might return text that differs from the text in the original data, we explore two sets of evaluation metrics, traditional NER evaluation metrics and semantic similarity evaluation metrics, to fully characterize performance. Our results show that the GPT-3.5 method achieved an average of 0.975 F1 on demographics extraction, 0.615 F1 on social determinants extraction, and 0.722 F1 on family history extraction. We believe these results can be further improved through model fine-tuning or few-shot learning. Through case studies, we also identified limitations of the GPT models, which need to be addressed in future research.
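
    To make the two metric families concrete, the sketch below (not the authors' code; the gold and predicted values are hypothetical) shows entity-level exact-match scoring, the traditional NER view that the semantic-similarity metrics complement:

```python
# A minimal sketch of entity-level exact-match precision/recall/F1.
def entity_f1(predicted: set, gold: set) -> dict:
    """Exact-match metrics over extracted entity strings."""
    tp = len(predicted & gold)                      # correctly extracted
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical gold annotations vs. text returned by GPT-3.5:
gold = {"age: 67", "sex: female", "housing: unstable"}
pred = {"age: 67", "sex: female", "housing: homeless"}   # paraphrased span
print(entity_f1(pred, gold))
# Exact match penalizes the paraphrase ("homeless" vs. "unstable"), which
# is why semantic-similarity metrics are evaluated alongside NER metrics.
```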

  • Article type: Journal Article
    BACKGROUND: In the United States, 1 in 5 adults currently serves as a family caregiver for an individual with a serious illness or disability. Unlike professional caregivers, family caregivers often assume this role without formal preparation or training. Thus, there is an urgent need to enhance the capacity of family caregivers to provide quality care. Leveraging technology as an educational tool or an adjunct to care is a promising approach that has the potential to enhance the learning and caregiving capabilities of family caregivers. Large language models (LLMs) can potentially be used as a foundation technology for supporting caregivers. An LLM can be categorized as a foundation model (FM), which is a large-scale model trained on a broad data set that can be adapted to a range of different domain tasks. Despite their potential, FMs have the critical weakness of "hallucination," where the models generate information that can be misleading or inaccurate. Information reliability is essential when language models are deployed as front-line help tools for caregivers.
    OBJECTIVE: This study aimed to (1) develop a reliable caregiving language model (CaLM) by using FMs and a caregiving knowledge base, (2) develop an accessible CaLM using a small FM that requires fewer computing resources, and (3) evaluate the model's performance compared with a large FM.
    METHODS: We developed a CaLM using the retrieval-augmented generation (RAG) framework combined with FM fine-tuning, improving the quality of FM answers by grounding the model on a caregiving knowledge base. The key components of the CaLM are the caregiving knowledge base, a fine-tuned FM, and a retriever module. We used 2 small FMs as candidates for the foundation of the CaLM (LLaMA [large language model Meta AI] 2 and Falcon, each with 7 billion parameters) and adopted a large FM (GPT-3.5, with an estimated 175 billion parameters) as a benchmark. We developed the caregiving knowledge base by gathering various types of documents from the internet. We focused on caregivers of individuals with Alzheimer disease and related dementias. We evaluated the models' performance using benchmark metrics commonly used in evaluating language models, as well as their reliability in providing accurate references with their answers.
    RESULTS: The RAG framework improved the performance of all FMs used in this study across all measures. As expected, the large FM performed better than the small FMs across all metrics. Interestingly, the small fine-tuned FMs with RAG performed significantly better than GPT-3.5 across all metrics. The fine-tuned small LLaMA 2 FM also performed better than GPT-3.5 (even with RAG) at returning references with its answers.
    CONCLUSIONS: The study shows that a reliable and accessible CaLM can be developed using small FMs with a knowledge base specific to the caregiving domain.
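
    As a rough illustration of the retrieve-then-generate pattern behind the CaLM (a sketch under assumptions, not the study's implementation, which pairs a retriever module with a fine-tuned LLaMA 2 or Falcon FM), the snippet below substitutes a TF-IDF retriever and a plain prompt template for the FM call:

```python
# Minimal retrieve-then-generate sketch of the RAG pattern.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [  # stand-in for the caregiving knowledge base
    "Agitation in dementia often responds to a calm routine and familiar cues.",
    "Caregivers should secure medications and remove tripping hazards at home.",
]

vectorizer = TfidfVectorizer().fit(knowledge_base)
doc_vectors = vectorizer.transform(knowledge_base)

def retrieve(question: str, k: int = 1) -> list:
    """Return the k knowledge-base passages most similar to the question."""
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    return [knowledge_base[i] for i in scores.argsort()[::-1][:k]]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    # The fine-tuned FM would be called on this prompt; its shape is assumed.
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(answer("How can I reduce agitation in my mother with Alzheimer's?"))
```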

  • Article type: Journal Article
    Previous studies have evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology.
    This study assesses four LLMs - GPT-4, GPT-4 Turbo, GPT-4 omni (GPT-4o) (OpenAI), and Gemini (Google Inc.) - in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.
    We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared correct response proportions.
    GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4%-58.0%), 40.5% (39.9%-42.9%), 60.7% (59.9%-61.3%), and 63.1% (62.5%-64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (p=0.007 vs. GPT-4 Turbo; p<0.001 vs. GPT-4 and Gemini). GPT-4o excelled on text-only questions compared with GPT-4, Gemini, and GPT-4 Turbo (p<0.001, p<0.001, and p=0.001), while Gemini performed worse on image-based questions (p<0.001 for all).
    GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
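
    The paired comparison of correct-response proportions rests on McNemar's test, which looks only at discordant pairs (questions one model got right and the other got wrong). A hedged sketch with hypothetical counts, not the study's data:

```python
# McNemar's test on two models' answers to the same question set.
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical 2x2 table: rows = model A correct/incorrect,
# columns = model B correct/incorrect, over the same questions.
table = [[90, 16],   # both correct | only A correct
         [38, 24]]   # only B correct | both incorrect

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
```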

  • Article type: Journal Article
    BACKGROUND: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education.
    OBJECTIVE: This study aimed to evaluate ChatGPT's performance in a pulmonology examination through a comparative analysis with that of third-year medical students.
    METHODS: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution's 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students.
    RESULTS: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across the various question categories, regardless of specialization. ChatGPT exhibited suboptimal performance on multiple-choice questions compared with medical students. V2 excelled at responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showed greater proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 achieved examination success, outperforming 139 (62.1%) of the medical students.
    CONCLUSIONS: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources.
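
    The V1/V2 distinction amounts to omitting or adding context before the question. A sketch assuming the OpenAI Python SDK v1 and invented prompt wording; the study's actual prompts are not given in this abstract:

```python
# Same exam question sent without (V1) and with (V2) contextualization.
from openai import OpenAI

client = OpenAI()
question = "Citez trois causes d'hémoptysie."  # the exam was in French

def ask(question: str, contextualized: bool) -> str:
    messages = []
    if contextualized:  # V2: role and exam context set up front (assumed text)
        messages.append({"role": "system", "content":
            "You are a third-year medical student taking a French-language "
            "pulmonology examination. Answer concisely in French."})
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-3.5-turbo",
                                           messages=messages)
    return reply.choices[0].message.content

print(ask(question, contextualized=False))  # V1
print(ask(question, contextualized=True))   # V2
```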

  • Article type: Journal Article
    UNASSIGNED: This study evaluates the diagnostic accuracy of a multimodal large language model (LLM), ChatGPT-4, in recognizing glaucoma using color fundus photographs (CFPs) with a benchmark dataset and without prior training or fine tuning.
    UNASSIGNED: The publicly accessible Retinal Fundus Glaucoma Challenge \"REFUGE\" dataset was utilized for analyses. The input data consisted of the entire 400 image testing set. The task involved classifying fundus images into either \'Likely Glaucomatous\' or \'Likely Non-Glaucomatous\'. We constructed a confusion matrix to visualize the results of predictions from ChatGPT-4, focusing on accuracy of binary classifications (glaucoma vs non-glaucoma).
    UNASSIGNED: ChatGPT-4 demonstrated an accuracy of 90% with a 95% confidence interval (CI) of 87.06%-92.94%. The sensitivity was found to be 50% (95% CI: 34.51%-65.49%), while the specificity was 94.44% (95% CI: 92.08%-96.81%). The precision was recorded at 50% (95% CI: 34.51%-65.49%), and the F1 Score was 0.50.
    UNASSIGNED: ChatGPT-4 achieved relatively high diagnostic accuracy without prior fine tuning on CFPs. Considering the scarcity of data in specialized medical fields, including ophthalmology, the use of advanced AI techniques, such as LLMs, might require less data for training compared to other forms of AI with potential savings in time and financial resources. It may also pave the way for the development of innovative tools to support specialized medical care, particularly those dependent on multimodal data for diagnosis and follow-up, irrespective of resource constraints.
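
    The reported metrics are mutually consistent with a simple confusion matrix over the 400-image test set; the counts below are inferred from the reported percentages, not quoted from the paper:

```python
# Confusion-matrix metrics reconstructed from the reported figures.
tp, fn = 20, 20    # glaucoma eyes called correctly / missed
tn, fp = 340, 20   # non-glaucoma eyes called correctly / false alarms

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 0.90
sensitivity = tp / (tp + fn)                    # 0.50
specificity = tn / (tn + fp)                    # 0.9444
precision   = tp / (tp + fp)                    # 0.50
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # 0.50

print(f"acc={accuracy:.2%} sens={sensitivity:.2%} "
      f"spec={specificity:.2%} prec={precision:.2%} F1={f1:.2f}")
```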

  • Article type: Journal Article
    Symbolic task planning is a widely used approach to enforce robot autonomy due to its ease of understanding and deployment in engineered robot architectures. However, techniques for symbolic task planning are difficult to scale in real-world, highly dynamic, human-robot collaboration scenarios because they perform poorly in planning domains where action effects may not be immediate, or when frequent re-planning is needed due to changed circumstances in the robot workspace. The long-term validity of plans, plan length, and planning time can hinder the robot's efficiency and negatively affect the fluency of the overall human-robot interaction. We present a framework, which we refer to as Teriyaki, specifically aimed at bridging the gap between symbolic task planning and machine learning approaches. The rationale is to train large language models (LLMs), namely GPT-3, into a neurosymbolic task planner compatible with the Planning Domain Definition Language (PDDL), and then leverage their generative capabilities to overcome a number of limitations inherent to symbolic task planners. Potential benefits include (i) better scalability as planning domain complexity increases, since an LLM's response time scales linearly with the combined length of the input and the output, rather than super-linearly as in the case of symbolic task planners, and (ii) the ability to synthesize a plan action-by-action instead of end-to-end, making each action available for execution as soon as it is generated instead of waiting for the whole plan to be available, which in turn enables concurrent planning and execution. In the past year, the research community has devoted significant effort to evaluating the overall cognitive capabilities of LLMs, with mixed success. With Teriyaki, by contrast, we aim to provide overall planning performance comparable to traditional planners in specific planning domains, while leveraging LLM capabilities on other metrics, specifically those related to their short- and mid-term generative capabilities, which are used to build a look-ahead predictive planning model. Preliminary results in selected domains show that our method can: (i) solve 95.5% of problems in a test data set of 1,000 samples; (ii) produce plans up to 13.5% shorter than those of a traditional symbolic planner; and (iii) reduce the average overall waiting time for plan availability by up to 61.4%.
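
    The action-by-action synthesis described in (ii) can be pictured as a loop that appends each generated action to the prompt and yields it immediately for execution. A sketch with a placeholder LLM call and an assumed end-of-plan marker, not Teriyaki's actual interface:

```python
# Incremental plan generation: emit each action as soon as it is decoded.
from typing import Iterator

def llm_complete(prompt: str, stop: str = "\n") -> str:
    """Placeholder for a GPT-3 completion call returning one plan line."""
    raise NotImplementedError

def plan_actions(problem_pddl: str) -> Iterator[str]:
    prompt = problem_pddl + "\nPlan:\n"
    while True:
        action = llm_complete(prompt).strip()   # e.g. "(pick block1 table)"
        if action == "(done)":                  # assumed end-of-plan marker
            return
        yield action                            # dispatch for execution now
        prompt += action + "\n"                 # condition on emitted prefix

# for act in plan_actions(problem): robot.execute(act)  # plan and act concurrently
```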

  • Article type: Journal Article
    BACKGROUND: The integration of artificial intelligence (AI), particularly deep learning models, has transformed the landscape of medical technology, especially in the field of diagnosis using imaging and physiological data. In otolaryngology, AI has shown promise in image classification for middle ear diseases. However, existing models often lack patient-specific data and clinical context, limiting their universal applicability. The emergence of GPT-4 Vision (GPT-4V) has enabled a multimodal diagnostic approach, integrating language processing with image analysis.
    OBJECTIVE: In this study, we investigated the effectiveness of GPT-4V in diagnosing middle ear diseases by integrating patient-specific data with otoscopic images of the tympanic membrane.
    METHODS: The design of this study was divided into two phases: (1) establishing a model with appropriate prompts and (2) validating the ability of the optimal prompt model to classify images. In total, 305 otoscopic images of 4 middle ear diseases (acute otitis media, middle ear cholesteatoma, chronic otitis media, and otitis media with effusion) were obtained from patients who visited Shinshu University or Jichi Medical University between April 2010 and December 2023. The optimized GPT-4V settings were established using prompts and patients' data, and the model created with the optimal prompt was used to verify the diagnostic accuracy of GPT-4V on 190 images. To compare the diagnostic accuracy of GPT-4V with that of physicians, 30 clinicians completed a web-based questionnaire consisting of the same 190 images.
    RESULTS: The multimodal AI approach achieved an accuracy of 82.1%, which is superior to that of certified pediatricians at 70.6%, but trails behind that of otolaryngologists at more than 95%. The model's disease-specific accuracy rates were 89.2% for acute otitis media, 76.5% for chronic otitis media, 79.3% for middle ear cholesteatoma, and 85.7% for otitis media with effusion, which highlights the need for disease-specific optimization. Comparisons with physicians revealed promising results, suggesting the potential of GPT-4V to augment clinical decision-making.
    CONCLUSIONS: Despite its advantages, challenges such as data privacy and ethical considerations must be addressed. Overall, this study underscores the potential of multimodal AI for enhancing diagnostic accuracy and improving patient care in otolaryngology. Further research is warranted to optimize and validate this approach in diverse clinical settings.
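
    A multimodal query of this kind combines patient context and an otoscopic image in one prompt. A sketch assuming the OpenAI Python SDK v1 vision format, a hypothetical image path, and invented prompt wording; the study's optimized prompt is not reproduced here:

```python
# One multimodal request: patient data plus an otoscopic image.
import base64
from openai import OpenAI

client = OpenAI()
with open("tympanic_membrane.jpg", "rb") as f:   # hypothetical image path
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # a vision-capable model; the study used GPT-4V
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text":
             "Patient: 6-year-old, 3 days of ear pain and fever. Classify this "
             "otoscopic image as acute otitis media, chronic otitis media, "
             "middle ear cholesteatoma, or otitis media with effusion."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```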

  • Article type: Journal Article
    Large language models (LLMs) play a crucial role in clinical information processing, showcasing robust generalization across diverse language tasks. However, existing LLMs, despite their significance, are not optimized for clinical applications, presenting challenges in terms of hallucination and interpretability. The retrieval-augmented generation (RAG) model addresses these issues by providing sources for answer generation, thereby reducing errors. This study explores the application of RAG technology in clinical gastroenterology to enhance knowledge generation on gastrointestinal diseases.
    We fine-tuned the embedding model using a corpus consisting of 25 guidelines on gastrointestinal diseases. The fine-tuned model exhibited an 18% improvement in hit rate over its base model, gte-base-zh. Moreover, it outperformed OpenAI's embedding model by 20%. Employing the RAG framework with llama-index, we developed a Chinese gastroenterology chatbot named "GastroBot," which significantly improves answer accuracy and contextual relevance, minimizing errors and the risk of disseminating misleading information.
    When evaluating GastroBot using the RAGAS framework, we observed a context recall rate of 95%. Faithfulness to the source stands at 93.73%, and answer relevance reaches 92.28%. These findings highlight the effectiveness of GastroBot in providing accurate and contextually relevant information about gastrointestinal diseases. During manual assessment, compared with other models, GastroBot delivered a substantial amount of valuable knowledge while ensuring the completeness and consistency of the results.
    The research findings suggest that incorporating the RAG method into clinical gastroenterology can enhance the accuracy and reliability of large language models. As a practical implementation of this method, GastroBot has demonstrated significant enhancements in contextual comprehension and response quality. Continued exploration and refinement of the model are poised to drive forward clinical information processing and decision support in the gastroenterology field.
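
    The GastroBot pipeline can be approximated with llama-index's high-level API. A sketch assuming a recent llama-index release and a local folder of guideline documents; the study additionally swaps in its fine-tuned gte-base-zh embedder, which is omitted here:

```python
# Minimal llama-index RAG loop: index guidelines, retrieve, answer with sources.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("guidelines/").load_data()  # 25 GI guidelines
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=3)  # retrieve 3 chunks
response = query_engine.query("How should H. pylori eradication be confirmed?")
print(response)                       # answer grounded in retrieved passages
print(response.source_nodes[0].text)  # the cited source chunk
```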

  • Article type: Journal Article
    Medical data have unique specificity and professionalism, requiring substantial domain expertise for annotation. Precise data annotation is essential for anomaly-detection tasks, making the training process complex. Domain generalization (DG) is an important approach to enhancing medical image anomaly detection (AD). This paper introduces a novel multimodal anomaly-detection framework called MedicalCLIP. MedicalCLIP utilizes multimodal data in anomaly-detection tasks and establishes irregular constraints within the image and text modalities. The key to MedicalCLIP lies in learning detailed intramodal representations, which are combined with text-semantic-guided cross-modal contrastive learning, allowing the model to focus on semantic information while capturing more detailed information, thus achieving finer-grained anomaly detection. MedicalCLIP relies on GPT prompts to generate text, reducing the demand for professional descriptions of medical data. Text construction for medical data helps to improve the generalization ability of multimodal models for anomaly-detection tasks. Additionally, during the text-image contrast-enhancement process, the model's ability to select and extract information from image data is improved. Through a hierarchical contrastive loss, fine-grained representations are achieved in the image-representation process. MedicalCLIP has been validated on various medical datasets, showing commendable domain generalization performance in medical-data anomaly detection. Improvements were observed in both anomaly classification and segmentation metrics. In the anomaly classification (AC) task involving brain data, the method demonstrated an improvement of 2.81 in performance over the best existing approach.
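
    Cross-modal contrastive learning of the kind CLIP-style frameworks such as MedicalCLIP build on optimizes a symmetric InfoNCE objective over paired image and text embeddings; the paper's hierarchical variant applies such losses at several feature levels. A minimal PyTorch sketch of the base objective, not the paper's exact loss:

```python
# Symmetric image-text contrastive (InfoNCE) loss over a batch of pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over paired image/text embeddings of shape (N, D)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(img.size(0))           # matched pairs on diagonal
    return (F.cross_entropy(logits, targets) +    # image -> text
            F.cross_entropy(logits.t(), targets)) / 2  # text -> image

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss)
```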
