language model

  • Article Type: Journal Article
    OBJECTIVE: To develop and validate a novel measure, action entropy, for assessing the cognitive effort associated with electronic health record (EHR)-based work activities.
    METHODS: EHR-based audit logs of attending physicians and advanced practice providers (APPs) from four surgical intensive care units in 2019 were included. Neural language models (LMs) were trained and validated separately for attendings' and APPs' action sequences. Action entropy was calculated as the cross-entropy associated with the predicted probability of the next action, based on prior actions. To validate the measure, a matched pairs study was conducted to assess the difference in action entropy during known high cognitive effort scenarios, namely, attention switching between patients and to or from the EHR inbox.
    RESULTS: Sixty-five clinicians performing 5 904 429 EHR-based audit log actions on 8956 unique patients were included. All attention switching scenarios were associated with a higher action entropy compared to non-switching scenarios (P < .001), except for the from-inbox switching scenario among APPs. The highest difference among attendings was for the from-inbox attention switching: Action entropy was 1.288 (95% CI, 1.256-1.320) standard deviations (SDs) higher for switching compared to non-switching scenarios. For APPs, the highest difference was for the to-inbox switching, where action entropy was 2.354 (95% CI, 2.311-2.397) SDs higher for switching compared to non-switching scenarios.
    CONCLUSIONS: We developed an LM-based metric, action entropy, for assessing cognitive burden associated with EHR-based actions. The metric showed discriminant validity and statistical significance when evaluated against known situations of high cognitive effort (ie, attention switching). With additional validation, this metric can potentially be used as a screening tool for assessing behavioral action phenotypes that are associated with higher cognitive burden.
    CONCLUSIONS: An LM-based action entropy metric, relying on sequences of EHR actions, offers opportunities for assessing cognitive effort in EHR-based workflows.
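    As a concrete illustration of the measure: action entropy is the surprisal of each observed action under the LM's predicted next-action distribution, conditioned on the preceding actions. Below is a minimal sketch, assuming a hypothetical `next_action_probs` interface standing in for the paper's trained neural LM:

    ```python
    import math

    def action_entropies(model, actions):
        """Per-action cross-entropy (surprisal) given prior actions.

        `model.next_action_probs(prefix)` is a hypothetical interface that
        returns a dict mapping candidate next actions to probabilities; it
        stands in for the paper's trained LM, whose API is not given here.
        """
        entropies = []
        for i in range(1, len(actions)):
            probs = model.next_action_probs(actions[:i])
            p = max(probs.get(actions[i], 0.0), 1e-12)  # floor avoids log(0)
            entropies.append(-math.log(p))              # surprisal in nats
        return entropies
    ```

    Hard-to-predict steps, such as a sudden switch to another patient's chart, then surface as high-entropy actions.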

  • Article Type: Journal Article
    BACKGROUND: The introduction of ChatGPT by OpenAI has garnered significant attention. Among its capabilities, paraphrasing stands out.
    OBJECTIVE: This study aims to investigate the levels of plagiarism in the paraphrased text produced by this chatbot.
    METHODS: Three texts of varying lengths were presented to ChatGPT. ChatGPT was then instructed to paraphrase the provided texts using five different prompts. In the subsequent stage of the study, the texts were divided into separate paragraphs, and ChatGPT was requested to paraphrase each paragraph individually. Lastly, in the third stage, ChatGPT was asked to paraphrase the texts it had previously generated.
    RESULTS: The average plagiarism rate in the texts generated by ChatGPT was 45% (SD 10%). ChatGPT exhibited a substantial reduction in plagiarism for the provided texts (mean difference -0.51, 95% CI -0.54 to -0.48; P<.001). Furthermore, when comparing the second attempt with the initial attempt, a significant decrease in the plagiarism rate was observed (mean difference -0.06, 95% CI -0.08 to -0.03; P<.001). The number of paragraphs in the texts demonstrated a noteworthy association with the percentage of plagiarism, with texts consisting of a single paragraph exhibiting the lowest plagiarism rate (P<.001).
    CONCLUSIONS: Although ChatGPT demonstrates a notable reduction of plagiarism within texts, the existing levels of plagiarism remain relatively high. This underscores the need for caution when researchers incorporate this chatbot into their work.
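    The prompting stage of the methods can be sketched with the openai Python client. The model name and prompt wording below are illustrative assumptions, and the plagiarism scoring itself used the proprietary iThenticate service, which is not reproduced here:

    ```python
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def paraphrase(text: str, instruction: str, model: str = "gpt-3.5-turbo") -> str:
        """Ask the chat model to paraphrase `text` under one prompt framing."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": instruction},
                {"role": "user", "content": text},
            ],
        )
        return response.choices[0].message.content

    # One of the five prompt framings would be applied per source text.
    rewritten = paraphrase("...source text...", "Paraphrase the following text.")
    ```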

  • Article Type: Journal Article
    BACKGROUND: Suicide is a leading cause of death worldwide. Journalistic reporting guidelines were created to curb the impact of unsafe reporting; however, how suicide is framed in news reports may differ by important characteristics such as the circumstances and the decedent's gender.
    OBJECTIVE: This study aimed to examine the degree to which news media reports of suicides are framed using stigmatized or glorified language and differences in such framing by gender and circumstance of suicide.
    METHODS: We analyzed 200 news articles regarding suicides and applied the validated Stigma of Suicide Scale to identify stigmatized and glorified language. We assessed linguistic similarity with 2 widely used metrics, cosine similarity and mutual information scores, using a machine learning-based large language model.
    RESULTS: News reports of male suicides were framed more similarly to stigmatizing (P<.001) and glorifying (P=.005) language than reports of female suicides. Considering the circumstances of suicide, mutual information scores indicated that differences in the use of stigmatizing or glorifying language by gender were most pronounced for articles attributing legal (0.155), relationship (0.268), or mental health problems (0.251) as the cause.
    CONCLUSIONS: Linguistic differences, by gender, in stigmatizing or glorifying language when reporting suicide may exacerbate suicide disparities.
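    The linguistic-similarity step can be sketched as follows; the embedding model and scale terms are illustrative assumptions, not the study's exact setup:

    ```python
    from sentence_transformers import SentenceTransformer, util

    # Illustrative embedding model; the paper's exact LM is not named here.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    article = "News article text about a suicide..."
    stigma_terms = ["cowardly", "selfish", "weak"]  # illustrative scale items

    article_vec = model.encode(article, convert_to_tensor=True)
    stigma_vecs = model.encode(stigma_terms, convert_to_tensor=True)

    # Mean cosine similarity between the article and stigma-scale language.
    stigma_score = util.cos_sim(article_vec, stigma_vecs).mean().item()
    print(f"stigma-framing similarity: {stigma_score:.3f}")
    ```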

  • Article Type: Journal Article
    BACKGROUND: Given the strikingly high diagnostic error rate in hospitals, and the recent development of Large Language Models (LLMs), we set out to measure the diagnostic sensitivity of two popular LLMs: GPT-4 and PaLM2. Small-scale studies to evaluate the diagnostic ability of LLMs have shown promising results, with GPT-4 demonstrating high accuracy in diagnosing test cases. However, larger evaluations on real electronic patient data are needed to provide more reliable estimates.
    METHODS: To fill this gap in the literature, we used a deidentified Electronic Health Record (EHR) data set of about 300,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston. This data set contained blood, imaging, microbiology and vital sign information as well as the patients' medical diagnostic codes. Based on the available EHR data, doctors curated a set of diagnoses for each patient, which we refer to as ground truth diagnoses. We then designed carefully written prompts to elicit patient diagnostic predictions from the LLMs and compared these to the ground truth diagnoses in a random sample of 1000 patients.
    RESULTS: Based on the proportion of correctly predicted ground truth diagnoses, we estimated the diagnostic hit rate of GPT-4 to be 93.9%. PaLM2 achieved 84.7% on the same data set. On these 1000 randomly selected EHRs, GPT-4 correctly identified 1116 unique diagnoses.
    CONCLUSIONS: The results suggest that artificial intelligence (AI) has the potential, when working alongside clinicians, to reduce the cognitive errors that lead to hundreds of thousands of misdiagnoses every year. However, human oversight of AI remains essential: LLMs cannot replace clinicians, especially when it comes to human understanding and empathy. Furthermore, a significant number of challenges in incorporating AI into health care exist, including ethical, liability and regulatory barriers.
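    The headline hit rate is the share of ground-truth diagnoses recovered by the model's predictions. A minimal sketch, assuming diagnoses on both sides are already normalized to comparable strings (the paper's matching procedure is not detailed here):

    ```python
    def diagnostic_hit_rate(ground_truth, predictions):
        """Proportion of ground-truth diagnoses recovered, pooled over
        patients; each element is a set of normalized diagnosis strings."""
        hits = sum(len(gt & pred) for gt, pred in zip(ground_truth, predictions))
        total = sum(len(gt) for gt in ground_truth)
        return hits / total if total else 0.0

    # Illustrative two-patient example: 2 of 3 diagnoses recovered.
    gt = [{"sepsis", "acute kidney injury"}, {"pneumonia"}]
    pred = [{"sepsis"}, {"pneumonia", "copd"}]
    print(f"{diagnostic_hit_rate(gt, pred):.3f}")  # 0.667
    ```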

  • Article Type: Journal Article
    BACKGROUND: Due to recent advances in artificial intelligence (AI), language model applications can generate logical text output that is difficult to distinguish from human writing. ChatGPT (OpenAI) and Bard (subsequently rebranded as "Gemini"; Google AI) were developed using distinct approaches, but little is known about differences in their capability to generate abstracts. The use of AI to write scientific abstracts in the field of spine surgery is the center of much debate and controversy.
    OBJECTIVE: The objective of this study is to assess the reproducibility of the structured abstracts generated by ChatGPT and Bard compared to human-written abstracts in the field of spine surgery.
    METHODS: In total, 60 abstracts dealing with spine sections were randomly selected from 7 reputable journals and used as ChatGPT and Bard input statements to generate abstracts based on supplied paper titles. A total of 174 abstracts, divided into human-written abstracts, ChatGPT-generated abstracts, and Bard-generated abstracts, were evaluated for compliance with the structured format of journal guidelines and consistency of content. The likelihood of plagiarism and AI output was assessed using the iThenticate and ZeroGPT programs, respectively. A total of 8 reviewers in the spinal field evaluated 30 randomly extracted abstracts to determine whether they were produced by AI or human authors.
    RESULTS: The proportion of abstracts that met journal formatting guidelines was greater among ChatGPT abstracts (34/60, 56.6%) compared with those generated by Bard (6/54, 11.1%; P<.001). However, a higher proportion of Bard abstracts (49/54, 90.7%) had word counts that met journal guidelines compared with ChatGPT abstracts (30/60, 50%; P<.001). The similarity index was significantly lower among ChatGPT-generated abstracts (20.7%) compared with Bard-generated abstracts (32.1%; P<.001). The AI-detection program predicted that 21.7% (13/60) of the human group, 63.3% (38/60) of the ChatGPT group, and 87% (47/54) of the Bard group were possibly generated by AI, with an area under the curve value of 0.863 (P<.001). The mean detection rate by human reviewers was 53.8% (SD 11.2%), achieving a sensitivity of 56.3% and a specificity of 48.4%. A total of 56.3% (63/112) of the actual human-written abstracts and 55.9% (62/128) of AI-generated abstracts were recognized as human-written and AI-generated by human reviewers, respectively.
    CONCLUSIONS: Both ChatGPT and Bard can be used to help write abstracts, but most AI-generated abstracts are currently considered unethical due to high plagiarism and AI-detection rates. ChatGPT-generated abstracts appear to be superior to Bard-generated abstracts in meeting journal formatting guidelines. Because humans are unable to accurately distinguish abstracts written by humans from those produced by AI programs, it is crucial to exercise special caution and examine the ethical boundaries of using AI programs, including ChatGPT and Bard.
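    The human-reviewer evaluation reduces to binary classification of each abstract as AI-generated or human-written, from which sensitivity and specificity follow. A minimal scikit-learn sketch with illustrative labels:

    ```python
    from sklearn.metrics import confusion_matrix

    # 1 = AI-generated, 0 = human-written; reviewer judgments are illustrative.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)  # AI abstracts correctly flagged as AI
    specificity = tn / (tn + fp)  # human abstracts correctly left unflagged
    print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
    ```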

  • Article Type: Journal Article
    No abstract available.

  • Article Type: Journal Article
    BACKGROUND: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.
    OBJECTIVE: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.
    METHODS: A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.
    RESULTS: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant, with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS, with point estimates between 1.36 and 1.38.
    CONCLUSIONS: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
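    The diminishing-returns analysis rests on single-predictor threshold regression: F1 is modeled as linear in min(x, tau), and the best-fitting breakpoint tau estimates where additional training data stops paying off. A minimal numpy sketch under that assumption (the paper's exact estimator may differ):

    ```python
    import numpy as np

    def fit_threshold(x, y):
        """Grid-search tau in y ~ b0 + b1 * min(x, tau) by least squares."""
        best_tau, best_sse = None, np.inf
        for tau in np.unique(x):
            design = np.column_stack([np.ones_like(x), np.minimum(x, tau)])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            sse = np.sum((y - design @ coef) ** 2)
            if sse < best_sse:
                best_tau, best_sse = tau, sse
        return best_tau

    # Illustrative data: F1 rises with sample size, then plateaus near 500.
    rng = np.random.default_rng(0)
    x = rng.integers(50, 1500, size=200).astype(float)
    y = 0.70 + 0.0005 * np.minimum(x, 500) + rng.normal(0, 0.01, size=200)
    print(fit_threshold(x, y))  # point estimate of the plateau
    ```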

  • Article Type: Journal Article
    BACKGROUND: ChatGPT (OpenAI) is a state-of-the-art large language model that uses artificial intelligence (AI) to address questions across diverse topics. The American Society of Clinical Oncology Self-Evaluation Program (ASCO-SEP) created a comprehensive educational program to help physicians keep up to date with the many rapid advances in the field. The question bank consists of multiple choice questions addressing the many facets of cancer care, including diagnosis, treatment, and supportive care. As ChatGPT applications rapidly expand, it becomes vital to ascertain if the knowledge of ChatGPT-3.5 matches the established standards that oncologists are recommended to follow.
    OBJECTIVE: This study aims to evaluate whether ChatGPT-3.5's knowledge aligns with the established benchmarks that oncologists are expected to adhere to. This will furnish us with a deeper understanding of the potential applications of this tool as a support for clinical decision-making.
    METHODS: We conducted a systematic assessment of the performance of ChatGPT-3.5 on the ASCO-SEP, the leading educational and assessment tool for medical oncologists in training and practice. Over 1000 multiple choice questions covering the spectrum of cancer care were extracted. Questions were categorized by cancer type or discipline, with subcategorization as treatment, diagnosis, or other. Answers were scored as correct if ChatGPT-3.5 selected the answer as defined by ASCO-SEP.
    RESULTS: Overall, ChatGPT-3.5 achieved a score of 56.1% (583/1040) for the correct answers provided. The program demonstrated varying levels of accuracy across cancer types or disciplines. The highest accuracy was observed in questions related to developmental therapeutics (8/10; 80% correct), while the lowest accuracy was observed in questions related to gastrointestinal cancer (102/209; 48.8% correct). There was no significant difference in the program's performance across the predefined subcategories of diagnosis, treatment, and other (P=.16).
    CONCLUSIONS: This study evaluated ChatGPT-3.5's oncology knowledge using the ASCO-SEP, aiming to address uncertainties regarding AI tools like ChatGPT in clinical decision-making. Our findings suggest that while ChatGPT-3.5 offers a hopeful outlook for AI in oncology, its present performance on ASCO-SEP tests necessitates further refinement to reach the requisite competency levels. Future assessments could explore ChatGPT's clinical decision support capabilities with real-world clinical scenarios, its ease of integration into medical workflows, and its potential to foster interdisciplinary collaboration and patient engagement in health care settings.
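    Scoring reduces to per-category accuracy over multiple-choice answers. A minimal sketch, with illustrative records standing in for the ASCO-SEP question bank, which is not public:

    ```python
    from collections import defaultdict

    # Each record: (category, model_answer, correct_answer); values illustrative.
    records = [
        ("gastrointestinal cancer", "B", "C"),
        ("developmental therapeutics", "A", "A"),
        ("gastrointestinal cancer", "D", "D"),
    ]

    totals, correct = defaultdict(int), defaultdict(int)
    for category, answered, truth in records:
        totals[category] += 1
        correct[category] += int(answered == truth)

    for category in totals:
        print(f"{category}: {correct[category] / totals[category]:.1%}")
    ```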

  • Article Type: Journal Article
    BACKGROUND: Widespread misinformation in web resources can lead to serious implications for individuals seeking health advice. Despite that, information retrieval models are often focused only on the query-document relevance dimension to rank results.
    OBJECTIVE: We investigate a multidimensional information quality retrieval model based on deep learning to enhance the effectiveness of online health care information search results.
    METHODS: In this study, we simulated online health information search scenarios with a topic set of 32 different health-related inquiries and a corpus containing 1 billion web documents from the April 2019 snapshot of Common Crawl. Using state-of-the-art pretrained language models, we assessed the quality of the retrieved documents according to their usefulness, supportiveness, and credibility dimensions for a given search query on 6030 human-annotated query-document pairs. We evaluated this approach using transfer learning and more specific domain adaptation techniques.
    RESULTS: In the transfer learning setting, the usefulness model provided the largest distinction between help- and harm-compatible documents, with a difference of +5.6%, leading to a majority of helpful documents in the top 10 retrieved. The supportiveness model achieved the best harm compatibility (+2.4%), while the combination of usefulness, supportiveness, and credibility models achieved the largest distinction between help- and harm-compatibility on helpful topics (+16.9%). In the domain adaptation setting, the linear combination of different models showed robust performance, with help-harm compatibility above +4.4% for all dimensions and going as high as +6.8%.
    CONCLUSIONS: These results suggest that integrating automatic ranking models created for specific information quality dimensions can increase the effectiveness of health-related information retrieval. Thus, our approach could be used to enhance searches made by individuals seeking online health information.
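    The domain-adaptation result combines per-dimension quality scores linearly with relevance. A minimal re-ranking sketch; the weights are assumed for illustration, not the paper's fitted coefficients:

    ```python
    # Assumed weights for combining relevance with quality dimensions.
    WEIGHTS = {"relevance": 0.4, "usefulness": 0.3,
               "supportiveness": 0.2, "credibility": 0.1}

    def combined_score(doc):
        """Linear combination of per-dimension model scores for a document."""
        return sum(WEIGHTS[dim] * doc[dim] for dim in WEIGHTS)

    docs = [
        {"id": "a", "relevance": 0.9, "usefulness": 0.2,
         "supportiveness": 0.3, "credibility": 0.4},
        {"id": "b", "relevance": 0.7, "usefulness": 0.9,
         "supportiveness": 0.8, "credibility": 0.9},
    ]
    ranked = sorted(docs, key=combined_score, reverse=True)
    print([d["id"] for d in ranked])  # quality signals lift "b" above "a"
    ```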

  • Article Type: Journal Article
    Smooth interaction with a disaster-affected community can create and strengthen its social capital, leading to greater effectiveness in the provision of successful post-disaster recovery aid. To understand the relationship between the types of interaction, the strength of social capital generated, and the provision of successful post-disaster recovery aid, intricate ethnographic qualitative research is required, but it is likely to remain illustrative because it is based, at least to some degree, on the researcher's intuition. This paper thus offers an innovative research method employing a quantitative artificial intelligence (AI)-based language model, which allows researchers to re-examine data, thereby validating the findings of the qualitative research, and to glean additional insights that might otherwise have been missed. This paper argues that well-connected personnel and religiously-based communal activities help to enhance social capital by bonding within a community and linking to outside agencies and that mixed methods, based on the AI-based language model, effectively strengthen text-based qualitative research.