Large language model

  • Article type: Letter
    This letter evaluates the article by Gravina et al on ChatGPT's potential to provide medical information for patients with inflammatory bowel disease. While acknowledging that potential, it highlights the need for advanced techniques such as reasoning + action and retrieval-augmented generation to improve accuracy and reliability. Emphasizing that simple question-and-answer testing is insufficient, it calls for more nuanced evaluation methods to truly gauge large language models' capabilities in clinical applications.
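
    As a rough illustration of the retrieval-augmented generation approach the letter advocates, the sketch below retrieves guideline-like passages relevant to a patient question and builds a prompt grounded in them. The corpus snippets, the sentence-transformers model name, and the helper functions are illustrative assumptions, not content from the letter.

```python
# Minimal retrieval-augmented generation (RAG) sketch for IBD patient questions.
# The corpus and helper names are assumptions for illustration only.
from sentence_transformers import SentenceTransformer, util

# Hypothetical guideline snippets standing in for a curated IBD knowledge base.
CORPUS = [
    "Mesalamine is a first-line therapy for mild to moderate ulcerative colitis.",
    "Patients starting anti-TNF agents should be screened for latent tuberculosis.",
    "Smoking cessation is strongly recommended in Crohn's disease.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = encoder.encode(CORPUS, convert_to_tensor=True)

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Return the top_k corpus passages most similar to the question."""
    query_embedding = encoder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
    return [CORPUS[hit["corpus_id"]] for hit in hits]

def build_grounded_prompt(question: str) -> str:
    """Compose an LLM prompt that cites retrieved evidence instead of relying on recall alone."""
    evidence = "\n".join(f"- {passage}" for passage in retrieve(question))
    return (
        "Answer the patient question using ONLY the evidence below, and say so if it is insufficient.\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}"
    )

print(build_grounded_prompt("Do I need any screening before starting anti-TNF therapy?"))
```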

  • Article type: Journal Article
    BACKGROUND: Chatbots, which are based on large language models, are increasingly being used in public health. However, the effectiveness of chatbot responses has been debated, and their performance in myopia prevention and control has not been fully explored. This study aimed to evaluate the effectiveness of three well-known chatbots (ChatGPT, Claude, and Bard) in responding to public health questions about myopia.
    METHODS: Nineteen public health questions about myopia (covering three topics: policy, basics, and measures) were answered individually by the three chatbots. After the order was shuffled, each chatbot response was independently rated by 4 raters for comprehensiveness, accuracy, and relevance.
    RESULTS: The study's questions underwent reliability testing. There was a significant difference in response word count among the 3 chatbots; from most to least, the order was ChatGPT, Bard, and Claude. All 3 chatbots had a composite score above 4 out of 5. ChatGPT scored the highest in all aspects of the assessment. However, all chatbots exhibited shortcomings, such as giving fabricated responses.
    CONCLUSIONS: Chatbots have shown great potential in public health, with ChatGPT performing best. The future use of chatbots as a public health tool will require rapid development of standards for their use and monitoring, as well as continued research, evaluation, and improvement of chatbots.
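
    The scoring design described above (three chatbots, 19 questions, 4 independent raters, three dimensions on a 5-point scale) lends itself to a simple aggregation. The sketch below shows one way composite scores per chatbot might be computed; the column names and example rows are hypothetical placeholders, not the study's data.

```python
# Minimal sketch: aggregating rater scores into per-chatbot composite scores.
# The records below are made-up placeholders, not the study's data.
import pandas as pd

ratings = pd.DataFrame(
    [
        # chatbot, question_id, rater, comprehensiveness, accuracy, relevance
        ("ChatGPT", 1, "R1", 5, 5, 5),
        ("Claude", 1, "R1", 4, 4, 5),
        ("Bard", 1, "R1", 4, 3, 4),
        ("ChatGPT", 1, "R2", 5, 4, 5),
    ],
    columns=["chatbot", "question_id", "rater", "comprehensiveness", "accuracy", "relevance"],
)

# Composite score = mean of the three dimensions, averaged over questions and raters.
ratings["composite"] = ratings[["comprehensiveness", "accuracy", "relevance"]].mean(axis=1)
summary = ratings.groupby("chatbot")["composite"].agg(["mean", "std"]).round(2)
print(summary)
```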

  • Article type: Journal Article
    The definition of service has evolved from a focus on material value in manufacturing before the 2000s to a customer-centric value based on the significant growth of the service industry. Digital transformation has become essential for companies in the service industry due to the incorporation of digital technology through the Fourth Industrial Revolution and COVID-19. This study utilised Bidirectional Encoder Representations from Transformers (BERT) to analyse 3029 international patents related to the customer service industry and digital transformation registered between 2000 and 2022. Through topic modelling, this study identified 10 major topics in the customer service industry and analysed their yearly trends. Our findings show that as of 2022, the trend with the highest frequency is user-centric network service design, while cloud computing has experienced the steepest increase in the last five years. User-centric network services have been steadily developing since the inception of the Internet. Cloud computing is one of the key technologies being developed intensively in 2023 for the digital transformation of customer service. This study identifies time-series trends in customer service industry patents and suggests the effectiveness of using BERTopic to predict future technology trends.
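
    As a rough illustration of the BERTopic workflow the study describes (topic modelling of patent texts followed by yearly trend analysis), the sketch below uses the library's standard API. The patent corpus is not public here, so a public text dataset and random year stamps stand in as placeholders; the study's preprocessing and model settings are not reproduced.

```python
# Minimal BERTopic sketch: fit ~10 topics on a document collection, then inspect yearly trends.
# Placeholder documents and years are used; the patent corpus is not reproduced here.
import random
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]
# Placeholder "registration years"; in the study each patent has a registration year (2000-2022).
years = [random.randint(2000, 2022) for _ in docs]

topic_model = BERTopic(nr_topics=10)          # reduce to roughly 10 topics, as in the paper
topics, _ = topic_model.fit_transform(docs)

print(topic_model.get_topic_info())           # topic sizes and representative keywords

# Yearly topic frequencies, analogous to the paper's time-series trend analysis.
topics_over_time = topic_model.topics_over_time(docs, years)
print(topics_over_time.head())
```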

  • Article type: Journal Article
    Large language models (LLMs) are generating interest in medical settings. For example, LLMs can respond coherently to medical queries by providing plausible differential diagnoses based on clinical notes. However, there are many questions to explore, such as evaluating differences between open- and closed-source LLMs as well as LLM performance on queries from both medical and non-medical users. In this study, we assessed multiple LLMs, including Llama-2-chat, Vicuna, Medllama2, Bard/Gemini, Claude, ChatGPT-3.5, and ChatGPT-4, as well as non-LLM approaches (Google search and Phenomizer), regarding their ability to identify genetic conditions from textbook-like clinician questions and their corresponding layperson translations related to 63 genetic conditions. For open-source LLMs, larger models were more accurate than smaller ones: models with 7b, 13b, and more than 33b parameters obtained accuracies of 21%-49%, 41%-51%, and 54%-68%, respectively. Closed-source LLMs outperformed open-source LLMs, with ChatGPT-4 performing best (89%-90%). Three of the 11 LLMs and Google search had significant performance gaps between clinician and layperson prompts. We also evaluated how in-context prompting and keyword removal affected open-source LLM performance. Models were provided with 2 types of in-context prompts: list-type prompts, which improved LLM performance, and definition-type prompts, which did not. We further analyzed removal of rare terms from descriptions, which decreased accuracy for 5 of the 7 evaluated LLMs. Finally, we observed much lower performance with real individuals' descriptions; LLMs answered these questions with a maximum accuracy of 21%.
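
    The "list-type" in-context prompt that improved open-source model accuracy can be pictured as constraining the answer space to a candidate list of conditions. The sketch below is a guess at the general shape of such a prompt and a simple exact-match scorer; the condition list, wording, and clinical vignette are illustrative assumptions, not the study's materials.

```python
# Sketch of a "list-type" in-context prompt: the model is shown candidate genetic
# conditions and asked to pick one, constraining its answer space.
# The wording, condition list, and vignette are illustrative assumptions.

CANDIDATE_CONDITIONS = [
    "Marfan syndrome",
    "Neurofibromatosis type 1",
    "Noonan syndrome",
    "Achondroplasia",
]

def build_list_type_prompt(description: str) -> str:
    options = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(CANDIDATE_CONDITIONS))
    return (
        "You are assisting with identifying a genetic condition.\n"
        f"Choose the single most likely condition from this list:\n{options}\n\n"
        f"Patient description: {description}\n"
        "Answer with the condition name only."
    )

def score_response(response: str, correct: str) -> bool:
    """Exact-match style scoring: did the model name the correct condition?"""
    return correct.lower() in response.lower()

prompt = build_list_type_prompt(
    "Tall stature, long limbs, lens dislocation, and aortic root dilation."
)
print(prompt)
print(score_response("This is most consistent with Marfan syndrome.", "Marfan syndrome"))
```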

  • Article type: Journal Article
    OBJECTIVE: The process of generating radiology reports is often time-consuming and labor-intensive, prone to incompleteness, heterogeneity, and errors. By employing natural language processing (NLP)-based techniques, this study explores the potential for enhancing the efficiency of radiology report generation through the remarkable capabilities of ChatGPT (Generative Pre-training Transformer), a prominent large language model (LLM).
    METHODS: Using a sample of 1000 records from the Medical Information Mart for Intensive Care (MIMIC) Chest X-ray Database, this investigation employed Claude.ai to extract initial radiological report keywords. ChatGPT then generated radiology reports using a consistent 3-step prompt template outline. Various lexical and sentence similarity techniques were employed to evaluate the correspondence between the AI assistant-generated reports and reference reports authored by medical professionals.
    RESULTS: Results showed varying performance among NLP models, with Bart (Bidirectional and Auto-Regressive Transformers) and XLM (Cross-lingual Language Model) displaying high proficiency (mean similarity scores up to 99.3%), closely mirroring physician reports. Conversely, DeBERTa (Decoding-enhanced BERT with disentangled attention) and sequence-matching models scored lower, indicating less alignment with medical language. In the Impression section, the Word-Embedding model excelled with a mean similarity of 84.4%, while others like the Jaccard index showed lower performance.
    CONCLUSIONS: Overall, the study highlights significant variations across NLP models in their ability to generate radiology reports consistent with medical professionals' language. Pairwise comparisons and Kruskal-Wallis tests confirmed these differences, emphasizing the need for careful selection and evaluation of NLP models in radiology report generation. This research underscores the potential of ChatGPT to streamline and improve the radiology reporting process, with implications for enhancing efficiency and accuracy in clinical practice.
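
    Among the lexical similarity measures mentioned above, the Jaccard index is the simplest: the ratio of shared tokens to all distinct tokens in two reports. A minimal sketch with made-up report text is shown below; the study's exact tokenization and its embedding-based metrics are not reproduced.

```python
# Minimal sketch of the Jaccard index between a generated and a reference report.
# The two report snippets are made up; the study's tokenization may differ.
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; a deliberately simple stand-in for real preprocessing."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(generated: str, reference: str) -> float:
    a, b = tokenize(generated), tokenize(reference)
    return len(a & b) / len(a | b) if (a | b) else 0.0

generated = "No focal consolidation. Heart size is normal. No pleural effusion."
reference = "Heart size is normal. No consolidation or pleural effusion is seen."
print(f"Jaccard similarity: {jaccard(generated, reference):.2f}")
```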

  • Article type: Journal Article
    BACKGROUND: Artificial intelligence (AI) chatbots, such as ChatGPT, have made significant progress. These chatbots, which are particularly popular among health care professionals and patients, are transforming patient education and the disease experience with personalized information. Accurate, timely patient education is crucial for informed decision-making, especially regarding prostate-specific antigen screening and treatment options. However, the accuracy and reliability of AI chatbots' medical information must be rigorously evaluated. Studies testing ChatGPT's knowledge of prostate cancer are emerging, but there is a need for ongoing evaluation to ensure the quality and safety of information provided to patients.
    OBJECTIVE: This study aims to evaluate the quality, accuracy, and readability of ChatGPT-4's responses to common prostate cancer questions posed by patients.
    METHODS: Overall, 8 questions were formulated with an inductive approach based on information topics in peer-reviewed literature and Google Trends data. Adapted versions of the Patient Education Materials Assessment Tool for AI (PEMAT-AI), Global Quality Score, and DISCERN-AI tools were used by 4 independent reviewers to assess the quality of the AI responses. The 8 AI outputs were judged by 7 expert urologists, using an assessment framework developed to assess accuracy, safety, appropriateness, actionability, and effectiveness. The readability of the AI responses was assessed using established algorithms (Flesch Reading Ease score, Gunning Fog Index, Flesch-Kincaid Grade Level, Coleman-Liau Index, and Simple Measure of Gobbledygook [SMOG] Index). A brief tool (Reference Assessment AI [REF-AI]) was developed to analyze the references provided in the AI outputs, assessing for reference hallucination, relevance, and quality of references.
    RESULTS: The PEMAT-AI understandability score was very good (mean 79.44%, SD 10.44%), the DISCERN-AI rating was scored as "good" quality (mean 13.88, SD 0.93), and the Global Quality Score was high (mean 4.46/5, SD 0.50). The Natural Language Assessment Tool for AI had a pooled mean accuracy of 3.96 (SD 0.91), safety of 4.32 (SD 0.86), appropriateness of 4.45 (SD 0.81), actionability of 4.05 (SD 1.15), and effectiveness of 4.09 (SD 0.98). The readability algorithm consensus was "difficult to read" (Flesch Reading Ease score mean 45.97, SD 8.69; Gunning Fog Index mean 14.55, SD 4.79), averaging an 11th-grade reading level, equivalent to 15- to 17-year-olds (Flesch-Kincaid Grade Level mean 12.12, SD 4.34; Coleman-Liau Index mean 12.75, SD 1.98; SMOG Index mean 11.06, SD 3.20). REF-AI identified 2 reference hallucinations, while the majority of references (28/30, 93%) appropriately supplemented the text. Most references (26/30, 86%) were from reputable government organizations, while a handful were direct citations from the scientific literature.
    CONCLUSIONS: Our analysis found that ChatGPT-4 provides generally good responses to common prostate cancer queries, making it a potentially valuable tool for patient education in prostate cancer care. Objective quality assessment tools indicated that the natural language processing outputs were generally reliable and appropriate, but there is room for improvement.
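
    For reference, two of the readability measures used above are closed-form formulas over word, sentence, and syllable counts. The sketch below shows them with a crude syllable-counting heuristic; published calculators handle edge cases more carefully, so scores will differ slightly from the study's.

```python
# Flesch Reading Ease and Flesch-Kincaid Grade Level from word/sentence/syllable counts.
# The syllable counter is a crude vowel-group heuristic, so results are approximate.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences          # words per sentence
    spw = syllables / len(words)          # syllables per word
    flesch_reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    flesch_kincaid_grade = 0.39 * wps + 11.8 * spw - 15.59
    return flesch_reading_ease, flesch_kincaid_grade

fre, fkgl = readability(
    "Prostate-specific antigen screening should be discussed with your clinician. "
    "The decision depends on your age, family history, and personal preferences."
)
print(f"Flesch Reading Ease: {fre:.1f}, Flesch-Kincaid Grade Level: {fkgl:.1f}")
```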

  • Article type: English Abstract
    OBJECTIVE: To evaluate the quality of recommendations provided by ChatGPT regarding inguinal hernia repair.
    METHODS: ChatGPT was asked 5 questions about the surgical management of inguinal hernias. The chatbot was assigned the role of an expert in herniology and requested to search only specialized medical databases and to provide information about references and evidence. Herniology experts and surgeons (non-experts) rated the quality of the recommendations generated by ChatGPT using a 4-point scale (from 0 to 3 points). Statistical correlations were explored between participants' ratings and their stance regarding artificial intelligence.
    RESULTS: Experts scored the quality of the ChatGPT responses lower than non-experts did (2 (1-2) vs. 2 (2-3), p<0.001). The chatbot failed to provide valid references and actual evidence and falsified half of the references it cited. Respondents were optimistic about the future of neural networks for clinical decision-making support, and most of them were against restricting their use in healthcare.
    CONCLUSIONS: We would not recommend non-specialized large language models as a sole or primary source of information for clinical decision-making or as a virtual search assistant.

  • Article type: Journal Article
    BACKGROUND: The consumer availability and automated response functions of Chat Generative Pre-trained Transformer (ChatGPT-4), a large language model, position this application to be used for patient health queries, and it may serve as an adjunct to minimize administrative and clinical burden.
    OBJECTIVE: To evaluate the ability of ChatGPT-4 to respond to patient inquiries concerning ulnar collateral ligament (UCL) injuries and to compare these results with the performance of Google.
    STUDY DESIGN: Cross-sectional study.
    METHODS: Google Web Search was used as a benchmark, as it is the most widely used search engine worldwide and the only search engine that generates frequently asked questions (FAQs) when prompted with a query, allowing comparisons through a systematic approach. The query "ulnar collateral ligament reconstruction" was entered into Google, and the top 10 FAQs, answers, and their sources were recorded. ChatGPT-4 was prompted to perform a Google search of FAQs with the same query and to record the sources of its answers for comparison. This process was then replicated to obtain 10 new questions requiring numeric instead of open-ended responses. Finally, responses were graded independently for clinical accuracy (grade 0 = inaccurate, grade 1 = somewhat accurate, grade 2 = accurate) by 2 fellowship-trained sports medicine surgeons (D.W.A., J.S.D.) blinded to the search engine and answer source.
    RESULTS: ChatGPT-4 used a greater proportion of academic sources than Google to provide answers to the top 10 FAQs, although this difference was not statistically significant (90% vs 50%; P = .14). In terms of question overlap, 40% of the most common questions on Google and ChatGPT-4 were the same. When comparing FAQs with numeric responses, 20% of answers were completely overlapping, 30% demonstrated partial overlap, and the remaining 50% did not demonstrate any overlap. All sources used by ChatGPT-4 to answer these FAQs were academic, while only 20% of sources used by Google were academic (P = .0007). The remaining Google sources included social media (40%), medical practices (20%), single-surgeon websites (10%), and commercial websites (10%). The mean (± standard deviation) accuracy of answers given by ChatGPT-4 was significantly greater than that of Google for the top 10 FAQs (1.9 ± 0.2 vs 1.2 ± 0.6; P = .001) and the top 10 questions with numeric answers (1.8 ± 0.4 vs 1 ± 0.8; P = .013).
    CONCLUSIONS: ChatGPT-4 is capable of providing responses with clinically relevant content concerning UCL injuries and reconstruction. ChatGPT-4 utilized a greater proportion of academic websites to provide responses to FAQs representative of patient inquiries compared with Google Web Search and provided significantly more accurate answers. Moving forward, ChatGPT has the potential to be used as a clinical adjunct when answering queries about UCL injuries and reconstruction, but further validation is warranted before integrated or autonomous use in clinical settings.

  • Article type: Journal Article
    BACKGROUND: The complex nature of rheumatic diseases poses considerable challenges for clinicians when developing individualized treatment plans. Large language models (LLMs) such as ChatGPT could enable treatment decision support.
    OBJECTIVE: To compare treatment plans generated by ChatGPT-3.5 and GPT-4 to those of a clinical rheumatology board (RB).
    METHODS: Fictional patient vignettes were created, and GPT-3.5, GPT-4, and the RB were queried to provide respective first- and second-line treatment plans with underlying justifications. Four rheumatologists from different centers, blinded to the origin of the treatment plans, selected the overall preferred treatment concept and assessed the treatment plans' safety, EULAR guideline adherence, medical adequacy, overall quality, justification of the treatment plans and their completeness, as well as patient vignette difficulty, using a 5-point Likert scale.
    RESULTS: Twenty fictional vignettes covering various rheumatic diseases and varying difficulty levels were assembled, and a total of 160 ratings were assessed. In 68.8% (110/160) of cases, raters preferred the RB's treatment plans over those generated by GPT-4 (16.3%; 26/160) and GPT-3.5 (15.0%; 24/160). GPT-4's plans were chosen more frequently for first-line treatments compared with GPT-3.5. No significant safety differences were observed between the RB's and GPT-4's first-line treatment plans. Rheumatologists' plans received significantly higher ratings for guideline adherence, medical appropriateness, completeness, and overall quality. Ratings did not correlate with vignette difficulty. LLM-generated plans were notably longer and more detailed.
    CONCLUSIONS: GPT-4 and GPT-3.5 generated safe, high-quality treatment plans for rheumatic diseases, demonstrating promise in clinical decision support. Future research should investigate detailed standardized prompts and the impact of LLM usage on clinical decisions.

  • Article type: Journal Article
    BACKGROUND: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes.
    OBJECTIVE: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students' free-text history and physical notes.
    METHODS: This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students' notes were scored independently by the standardized patients and by ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct.
    RESULTS: The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was thus 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002).
    CONCLUSIONS: ChatGPT demonstrated a significantly lower error rate compared with standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students' standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice.
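
    The accuracy measure described above (percent of rubric-element scores assigned correctly) reduces to a simple comparison against a gold-standard key. The sketch below illustrates that computation with invented rubric elements and scores; it is not the study's 85-element rubric or its data.

```python
# Sketch: percent-correct agreement between a scorer's rubric marks and a gold key.
# Rubric elements and scores are invented; the study used an 85-element rubric.

gold_key = {
    "asked_about_chest_pain_onset": 1,
    "asked_about_medication_allergies": 1,
    "documented_family_history": 0,
    "noted_smoking_status": 1,
}

chatgpt_scores = {
    "asked_about_chest_pain_onset": 1,
    "asked_about_medication_allergies": 1,
    "documented_family_history": 1,   # disagreement with the key
    "noted_smoking_status": 1,
}

def percent_correct(scores: dict[str, int], key: dict[str, int]) -> float:
    correct = sum(scores[element] == expected for element, expected in key.items())
    return 100.0 * correct / len(key)

print(f"Percent correct: {percent_correct(chatgpt_scores, gold_key):.1f}%")
```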