language model

  • Article type: Journal Article
    BACKGROUND: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.
    OBJECTIVE: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.
    METHODS: A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.
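    To make the regression setup concrete, the sketch below (not the authors' code) fits the two-predictor model F1 ~ sentences + EPS and then grid-searches a breakpoint for a single-predictor piecewise-linear fit, the usual way a threshold of diminishing returns is estimated; the file name and column names (subsamples.csv, f1, sentences, eps) are hypothetical.
    ```python
    # Minimal sketch, assuming a table of subsample results with columns
    # f1, sentences, eps (hypothetical names, not the study's data files).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("subsamples.csv")

    # Two-predictor multiple linear regression: F1-score on sample size and entity density.
    ols = smf.ols("f1 ~ sentences + eps", data=df).fit()
    print(ols.summary())  # multiple R-squared and per-predictor P values

    # Single-predictor threshold regression: grid-search the breakpoint that
    # minimizes SSE of a hinge (piecewise-linear) fit, i.e., diminishing returns.
    def threshold_fit(x, y):
        best_tau, best_sse = None, np.inf
        for tau in np.quantile(x, np.linspace(0.05, 0.95, 181)):
            hinge = np.maximum(x - tau, 0.0)  # slope change past the threshold
            X = np.column_stack([np.ones_like(x), x, hinge])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ beta
            sse = float(resid @ resid)
            if sse < best_sse:
                best_tau, best_sse = tau, sse
        return best_tau

    print("sentence threshold:",
          threshold_fit(df["sentences"].to_numpy(float), df["f1"].to_numpy(float)))
    ```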
    RESULTS: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R² ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38.
    CONCLUSIONS: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.

  • Article type: Journal Article
    BACKGROUND: ChatGPT (OpenAI) is a state-of-the-art large language model that uses artificial intelligence (AI) to address questions across diverse topics. The American Society of Clinical Oncology Self-Evaluation Program (ASCO-SEP) created a comprehensive educational program to help physicians keep up to date with the many rapid advances in the field. The question bank consists of multiple choice questions addressing the many facets of cancer care, including diagnosis, treatment, and supportive care. As ChatGPT applications rapidly expand, it becomes vital to ascertain if the knowledge of ChatGPT-3.5 matches the established standards that oncologists are recommended to follow.
    OBJECTIVE: This study aims to evaluate whether ChatGPT-3.5's knowledge aligns with the established benchmarks that oncologists are expected to adhere to. This will furnish us with a deeper understanding of the potential applications of this tool as a support for clinical decision-making.
    METHODS: We conducted a systematic assessment of the performance of ChatGPT-3.5 on the ASCO-SEP, the leading educational and assessment tool for medical oncologists in training and practice. Over 1000 multiple choice questions covering the spectrum of cancer care were extracted. Questions were categorized by cancer type or discipline, with subcategorization as treatment, diagnosis, or other. Answers were scored as correct if ChatGPT-3.5 selected the answer defined as correct by ASCO-SEP.
    RESULTS: Overall, ChatGPT-3.5 answered 56.1% (583/1040) of questions correctly. The program demonstrated varying levels of accuracy across cancer types or disciplines. The highest accuracy was observed in questions related to developmental therapeutics (8/10; 80% correct), while the lowest accuracy was observed in questions related to gastrointestinal cancer (102/209; 48.8% correct). There was no significant difference in the program's performance across the predefined subcategories of diagnosis, treatment, and other (P=.16).
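    The abstract does not name the statistical test behind P=.16; a chi-square test of independence on a subcategory-by-outcome contingency table is one plausible reading. A minimal sketch with scipy follows; the per-subcategory counts are placeholders (chosen only to be consistent with the reported overall 583/1040), not the study's data.
    ```python
    # Hedged sketch: compare accuracy across the 3 subcategories with a
    # chi-square test of independence (the test used is not named in the abstract).
    from scipy.stats import chi2_contingency

    # Rows: diagnosis, treatment, other; columns: correct, incorrect.
    # PLACEHOLDER counts; only their grand totals match the reported 583/1040.
    table = [
        [150, 120],  # diagnosis
        [300, 260],  # treatment
        [133, 77],   # other
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2={chi2:.2f}, dof={dof}, P={p:.3f}")
    ```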
    CONCLUSIONS: This study evaluated ChatGPT-3.5's oncology knowledge using the ASCO-SEP, aiming to address uncertainties regarding AI tools like ChatGPT in clinical decision-making. Our findings suggest that while ChatGPT-3.5 offers a hopeful outlook for AI in oncology, its present performance in ASCO-SEP tests necessitates further refinement to reach the requisite competency levels. Future assessments could explore ChatGPT's clinical decision support capabilities with real-world clinical scenarios, its ease of integration into medical workflows, and its potential to foster interdisciplinary collaboration and patient engagement in health care settings.

  • Article type: Journal Article
    BACKGROUND: A large language model (LLM) is a machine learning model inferred from text data that captures subtle patterns of language use in context. Modern LLMs are based on neural network architectures that incorporate transformer methods. They allow the model to relate words together through attention to multiple words in a text sequence. LLMs have been shown to be highly effective for a range of tasks in natural language processing (NLP), including classification and information extraction tasks and generative applications.
    OBJECTIVE: The aim of this adapted Delphi study was to collect researchers' opinions on how LLMs might influence health care and on the strengths, weaknesses, opportunities, and threats of LLM use in health care.
    METHODS: We invited researchers in the fields of health informatics, nursing informatics, and medical NLP to share their opinions on LLM use in health care. We started the first round with open questions based on our strengths, weaknesses, opportunities, and threats framework. In the second and third rounds, the participants scored these items.
    RESULTS: The first, second, and third rounds had 28, 23, and 21 participants, respectively. Almost all participants (26/28, 93% in round 1 and 20/21, 95% in round 3) were affiliated with academic institutions. Agreement was reached on 103 items related to use cases, benefits, risks, reliability, adoption aspects, and the future of LLMs in health care. Participants offered several use cases, including supporting clinical tasks, documentation tasks, and medical research and education, and agreed that LLM-based systems will act as health assistants for patient education. The agreed-upon benefits included increased efficiency in data handling and extraction, improved automation of processes, improved quality of health care services and overall health outcomes, provision of personalized care, accelerated diagnosis and treatment processes, and improved interaction between patients and health care professionals. In total, 5 risks to health care in general were identified: cybersecurity breaches, the potential for patient misinformation, ethical concerns, the likelihood of biased decision-making, and the risk associated with inaccurate communication. Overconfidence in LLM-based systems was recognized as a risk to the medical profession. The 6 agreed-upon privacy risks included the use of unregulated cloud services that compromise data security, exposure of sensitive patient data, breaches of confidentiality, fraudulent use of information, vulnerabilities in data storage and communication, and inappropriate access or use of patient data.
    CONCLUSIONS: Future research related to LLMs should not only focus on testing their possibilities for NLP-related tasks but also consider the workflows the models could contribute to and the requirements regarding quality, integration, and regulations needed for successful implementation in practice.

  • Article type: Journal Article
    In this cross-sectional study, we evaluated the completeness, readability, and syntactic complexity of cardiovascular disease prevention information produced by GPT-4 in response to 4 kinds of prompts.
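    Readability metrics of the kind assessed here are straightforward to reproduce; a minimal sketch using the textstat package (a tool assumption, as the study does not specify its instruments) scores a sample response:
    ```python
    # Hedged sketch: score an LLM response for readability, as one might when
    # evaluating GPT-4 outputs; textstat is an assumed tool choice, not the study's.
    import textstat

    response = "Regular exercise and a balanced diet lower the risk of heart disease."
    print("Flesch Reading Ease:", textstat.flesch_reading_ease(response))
    print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(response))
    ```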

  • Article type: Journal Article
    BACKGROUND: Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches.
    OBJECTIVE: The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types (heuristic and ensemble prompts), for zero-shot and few-shot clinical information extraction using pretrained language models.
    METHODS: This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches.
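    For illustration, a minimal sketch of three of the prompt types and an ensemble majority vote; `ask_llm`, the template wording, and the example note are hypothetical stand-ins, not the study's materials.
    ```python
    # Hedged sketch of zero-shot prompt types and an ensemble majority vote.
    # `ask_llm` stands in for any chat-completion client (GPT-3.5, Gemini, LLaMA-2).
    from collections import Counter

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("wire up a model client here")

    note = "Pt c/o CP on exertion."  # illustrative clinical snippet
    abbrev = "CP"

    prompts = {
        "simple_prefix": f"Expand the clinical abbreviation '{abbrev}' in: {note}",
        "simple_cloze": f"In the note '{note}', {abbrev} stands for ___.",
        "chain_of_thought": (
            f"Note: {note}\nThink step by step about the clinical context, "
            f"then state what '{abbrev}' means here."
        ),
    }

    def ensemble_answer(prompt_set: dict) -> str:
        # Majority vote across prompt variants, leveraging multiple prompt strengths.
        votes = [ask_llm(p).strip().lower() for p in prompt_set.values()]
        return Counter(votes).most_common(1)[0][0]
    ```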
    RESULTS: The study revealed that task-specific prompt tailoring is vital for the high performance of LLMs for zero-shot clinical NLP. In clinical sense disambiguation, GPT-3.5 achieved an accuracy of 0.96 with heuristic prompts and 0.94 in biomedical evidence extraction. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on multiple prompt strengths. GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types.
    CONCLUSIONS: This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research, facilitating engagement by non-NLP experts in clinical NLP advancements. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative artificial intelligence, and we hope that it will inspire and inform future research in this area.

  • Article type: Journal Article
    BACKGROUND: Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival human experts. However, challenges remain in the analysis of complex data containing images and diagrams.
    OBJECTIVE: This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) for a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination.
    METHODS: Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the presence of images, clinical area of the questions, and variations in the answer content were examined.
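    A minimal sketch of submitting one text-plus-image question to GPT-4V through the OpenAI Python client follows; the client version, model identifier, and file name are assumptions, as the paper does not describe its tooling.
    ```python
    # Hedged sketch: submit an exam question plus its image to GPT-4V.
    # Assumes openai-python >= 1.0 and an OPENAI_API_KEY in the environment.
    import base64
    from openai import OpenAI

    client = OpenAI()

    with open("question42.png", "rb") as f:  # hypothetical image file
        b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Answer with one choice (a-e): <question text>"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)
    ```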
    RESULTS: The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate. For all content types, the addition of translation and prompts increased the accuracy rate. For image-based questions, the average correct answer rate was 30.4% with text-only input and 41.3% with text-plus-image input (P=.02).
    CONCLUSIONS: Examination of artificial intelligence's answering capabilities for the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although improvement was noted with the addition of translation and prompts, the accuracy rate for image-based questions was lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input yielded a higher correct answer rate on image-based questions than text-only input. Our findings imply the usefulness and potential of GPT-4V in medicine; however, future consideration of safe use methods is needed.

  • Article type: Journal Article
    BACKGROUND: The current paradigm in mental health care focuses on clinical recovery and symptom remission. This model's efficacy is influenced by therapist trust in patient recovery potential and the depth of the therapeutic relationship. Schizophrenia is a chronic illness with severe symptoms where the possibility of recovery is a matter of debate. As artificial intelligence (AI) becomes integrated into the health care field, it is important to examine its ability to assess recovery potential in major psychiatric disorders such as schizophrenia.
    OBJECTIVE: This study aimed to evaluate the ability of large language models (LLMs), in comparison to mental health professionals, to assess the prognosis of schizophrenia with and without professional treatment and the long-term positive and negative outcomes.
    METHODS: Vignettes were input into LLM interfaces and assessed 10 times by 4 AI platforms: ChatGPT-3.5, ChatGPT-4, Google Bard, and Claude. A total of 80 evaluations were collected and benchmarked against existing norms to analyze what mental health professionals (general practitioners, psychiatrists, clinical psychologists, and mental health nurses) and the general public think about schizophrenia prognosis with and without professional treatment and the positive and negative long-term outcomes of schizophrenia interventions.
    RESULTS: For the prognosis of schizophrenia with professional treatment, ChatGPT-3.5 was notably pessimistic, whereas ChatGPT-4, Claude, and Bard aligned with professional views but differed from the general public. All LLMs predicted that schizophrenia without professional treatment would remain static or worsen. For long-term outcomes, ChatGPT-4 and Claude predicted more negative outcomes than Bard and ChatGPT-3.5. For positive outcomes, ChatGPT-3.5 and Claude were more pessimistic than Bard and ChatGPT-4.
    CONCLUSIONS: The finding that 3 out of the 4 LLMs aligned closely with the predictions of mental health professionals when considering the "with treatment" condition demonstrates the potential of this technology in providing professional clinical prognosis. The pessimistic assessment of ChatGPT-3.5 is a disturbing finding since it may reduce the motivation of patients to start or persist with treatment for schizophrenia. Overall, although LLMs hold promise in augmenting health care, their application necessitates rigorous validation and a harmonious blend with human expertise.

  • Article type: Journal Article
    BACKGROUND: Previous research applying large language models (LLMs) to medicine was focused on text-based information. Recently, multimodal variants of LLMs acquired the capability of recognizing images.
    OBJECTIVE: We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance in answering questions from the 117th Japanese National Medical Licensing Examination.
    METHODS: We focused on 108 questions that had 1 or more images as part of a question and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test.
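    The exact McNemar test operates on the 2×2 table of paired with-images versus without-images outcomes per question; a minimal sketch with statsmodels follows, where the discordant-pair split is an assumption (the abstract reports only the marginals, 73 vs 78 of 108).
    ```python
    # Hedged sketch of the exact McNemar test on paired correct/incorrect outcomes.
    # Rows: with images (correct, incorrect); columns: without images.
    # PLACEHOLDER cell counts, chosen only to match the reported marginals.
    from statsmodels.stats.contingency_tables import mcnemar

    table = [[65, 8],    # correct with images: correct / incorrect without images
             [13, 22]]   # incorrect with images: correct / incorrect without images
    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
    print(f"statistic={result.statistic}, P={result.pvalue:.2f}")
    ```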
    RESULTS: Among the 108 questions with images, GPT-4V's accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and those without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively.
    CONCLUSIONS: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination.

  • Article type: Clinical Study
    BACKGROUND: Generative artificial intelligence has the potential to revolutionize health technology product development by improving coding quality, efficiency, documentation, quality assessment and review, and troubleshooting.
    OBJECTIVE: This paper explores the application of a commercially available generative artificial intelligence tool (ChatGPT) to the development of a digital health behavior change intervention designed to support patient engagement in a commercial digital diabetes prevention program.
    METHODS: We examined the capacity, advantages, and limitations of ChatGPT to support digital product idea conceptualization, intervention content development, and the software engineering process, including software requirement generation, software design, and code production. In total, 11 evaluators, each with at least 10 years of experience in fields of study ranging from medicine and implementation science to computer science, participated in the output review process (ChatGPT vs human-generated output). All had familiarity or prior exposure to the original personalized automatic messaging system intervention. The evaluators rated the ChatGPT-produced outputs in terms of understandability, usability, novelty, relevance, completeness, and efficiency.
    RESULTS: Most metrics received positive scores. We identified that ChatGPT can (1) support developers to achieve high-quality products faster and (2) facilitate nontechnical communication and system understanding between technical and nontechnical team members around the development goal of rapid and easy-to-build computational solutions for medical technologies.
    CONCLUSIONS: ChatGPT can serve as a usable facilitator for researchers engaging in the software development life cycle, from product conceptualization to feature identification and user story development to code generation.
    TRIAL REGISTRATION: ClinicalTrials.gov NCT04049500; https://clinicaltrials.gov/ct2/show/NCT04049500.

  • Article type: Journal Article
    BACKGROUND: Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains.
    OBJECTIVE: This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases while investigating the impact of prompt engineering on their performance.
    METHODS: We conducted experiments on publicly available complex and rare cases to achieve these objectives. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability. Furthermore, we compared their performance with the performance of human respondents and MedAlpaca, a generative LLM specifically designed for medical tasks.
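    A minimal sketch of the majority voting idea: sample several reasoning paths for the same case and keep the modal diagnosis. `diagnose_once` is a hypothetical stand-in for a call to any of the evaluated LLMs, not the study's harness.
    ```python
    # Hedged sketch of majority voting over diverse reasoning paths.
    from collections import Counter

    def diagnose_once(case_text: str) -> str:
        raise NotImplementedError("query Bard, ChatGPT-3.5, or GPT-4 with temperature > 0")

    def majority_vote_diagnosis(case_text: str, n_samples: int = 5) -> str:
        # Repeated sampling yields different reasoning paths; the most frequent
        # final answer is kept, which tends to improve reliability.
        answers = [diagnose_once(case_text).strip().lower() for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]
    ```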
    RESULTS: Notably, all LLMs outperformed the average human consensus and MedAlpaca, with a minimum margin of 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. On the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents on the moderately often misdiagnosed cases category with minimum accuracy scores of 28% and 11%, respectively. The majority voting strategy, particularly with GPT-4, demonstrated the highest overall score across all cases from the diagnostic complex case collection, surpassing that of other LLMs. On the Medical Information Mart for Intensive Care-III data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores, with multiple-choice prompts scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results demonstrate that there is no one-size-fits-all prompting approach for improving the performance of LLMs and that a single strategy does not universally apply to all LLMs.
    CONCLUSIONS: Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes.
