language models

  • Article type: Journal Article
    BACKGROUND: Although history taking is fundamental for diagnosing medical conditions, teaching and providing feedback on the skill can be challenging due to resource constraints. Virtual simulated patients and web-based chatbots have thus emerged as educational tools, with recent advancements in artificial intelligence (AI) such as large language models (LLMs) enhancing their realism and potential to provide feedback.
    OBJECTIVE: In our study, we aimed to evaluate the effectiveness of a Generative Pretrained Transformer (GPT) 4 model to provide structured feedback on medical students' performance in history taking with a simulated patient.
    METHODS: We conducted a prospective study involving medical students performing history taking with a GPT-powered chatbot. To that end, we designed a chatbot to simulate patients' responses and provide immediate feedback on the comprehensiveness of the students' history taking. Students' interactions with the chatbot were analyzed, and feedback from the chatbot was compared with feedback from a human rater. We measured interrater reliability and performed a descriptive analysis to assess the quality of feedback.
    RESULTS: Most of the study's participants were in their third year of medical school. A total of 1894 question-answer pairs from 106 conversations were included in our analysis. GPT-4's role-play and responses were medically plausible in more than 99% of cases. Interrater reliability between GPT-4 and the human rater showed "almost perfect" agreement (Cohen κ=0.832). Lower agreement (κ<0.6), detected for 8 of 45 feedback categories, highlighted topics on which the model's assessments were overly specific or diverged from human judgment.
    CONCLUSIONS: The GPT model was effective in providing structured feedback on history-taking dialogs conducted by medical students. Although we identified some limitations in the specificity of feedback for certain feedback categories, the overall high agreement with human raters suggests that LLMs can be a valuable tool for medical education. Our findings thus advocate for the careful integration of AI-driven feedback mechanisms in medical training and highlight important aspects of using LLMs in that context.
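
    The interrater agreement above is Cohen's κ. As a minimal sketch of that computation, assuming hypothetical binary "topic covered" judgments rather than the study's actual feedback data:

```python
# Hedged sketch: hypothetical rater labels, not the study's data.
from sklearn.metrics import cohen_kappa_score

# One judgment per feedback item, one vector per rater (1 = covered, 0 = missed).
gpt4_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human_labels = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

kappa = cohen_kappa_score(gpt4_labels, human_labels)
print(f"Cohen kappa: {kappa:.3f}")  # values above 0.8 are commonly read as "almost perfect"
```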

  • Article type: Journal Article
    BACKGROUND: With the increasing application of large language models such as ChatGPT across industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.
    OBJECTIVE: The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE).
    METHODS: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 or GPT-4.0), the prompt's designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was set at 60%. χ2 tests and κ values were employed to evaluate the models' accuracy and consistency.
    RESULTS: GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%) and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response.
    CONCLUSIONS: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role did not significantly enhance the model's reliability or answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.
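
    The headline comparison above reduces to a χ2 test on a 2x2 table of correct and incorrect answers per model. A minimal sketch with illustrative counts chosen to mirror the reported 72.7% vs 54% on 500 questions:

```python
# Hedged sketch: counts are illustrative, not the study's raw data.
from scipy.stats import chi2_contingency

#        correct  incorrect
table = [[364, 136],   # GPT-4.0, ~72.7% of 500
         [270, 230]]   # GPT-3.5, ~54% of 500

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2g}")
```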

  • Article type: Journal Article
    BACKGROUND: Although uncertainties exist regarding implementation, artificial intelligence-driven generative language models (GLMs) have enormous potential in medicine. Deployment of GLMs could improve patient comprehension of clinical texts and improve low health literacy.
    OBJECTIVE: The goal of this study is to evaluate the potential of ChatGPT-3.5 and GPT-4 to tailor the complexity of medical information to patient-specific input education level, which is crucial if it is to serve as a tool in addressing low health literacy.
    METHODS: Input templates related to 2 prevalent chronic diseases, type II diabetes and hypertension, were designed. Each clinical vignette was adjusted for hypothetical patient education levels to evaluate output personalization. To assess the success of the GLMs (GPT-3.5 and GPT-4) in tailoring output writing, the readability of pre- and posttransformation outputs was quantified using the Flesch reading ease score (FKRE) and the Flesch-Kincaid grade level (FKGL).
    RESULTS: Responses (n=80) were generated using GPT-3.5 and GPT-4 across 2 clinical vignettes. For GPT-3.5, FKRE means were 57.75 (SD 4.75), 51.28 (SD 5.14), 32.28 (SD 4.52), and 28.31 (SD 5.22) for 6th grade, 8th grade, high school, and bachelor's, respectively; FKGL mean scores were 9.08 (SD 0.90), 10.27 (SD 1.06), 13.4 (SD 0.80), and 13.74 (SD 1.18). GPT-3.5 aligned with the prespecified education level only at the bachelor's degree. Conversely, GPT-4's FKRE mean scores were 74.54 (SD 2.6), 71.25 (SD 4.96), 47.61 (SD 6.13), and 13.71 (SD 5.77), with FKGL mean scores of 6.3 (SD 0.73), 6.7 (SD 1.11), 11.09 (SD 1.26), and 17.03 (SD 1.11) for the same respective education levels. GPT-4 met the target readability for all groups except the 6th-grade FKRE average. Both GLMs produced outputs with statistically significant differences in mean FKRE and FKGL across input education levels (FKRE: 6th grade P<.001; 8th grade P<.001; high school P<.001; bachelor's P=.003; FKGL: 6th grade P=.001; 8th grade P<.001; high school P<.001; bachelor's P<.001).
    CONCLUSIONS: GLMs can change the structure and readability of medical text outputs according to input-specified education. However, GLMs categorize input education designation into 3 broad tiers of output readability: easy (6th and 8th grade), medium (high school), and difficult (bachelor\'s degree). This is the first result to suggest that there are broader boundaries in the success of GLMs in output text simplification. Future research must establish how GLMs can reliably personalize medical texts to prespecified education levels to enable a broader impact on health care literacy.
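
    Both metrics above are closed-form functions of sentence, word, and syllable counts. A minimal sketch using the standard Flesch formulas and a deliberately naive vowel-group syllable counter (a library such as textstat would be used in practice):

```python
# Hedged sketch: the syllable heuristic is crude and for illustration only.
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # words per sentence
    spw = syllables / len(words)   # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch reading ease
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid grade level
    return fre, fkgl

fre, fkgl = readability("High blood pressure makes the heart work too hard. "
                        "Over time it can damage blood vessels.")
print(f"reading ease={fre:.1f}, grade level={fkgl:.1f}")
```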

  • Article type: English Abstract
    BACKGROUND: The medical coding of radiology reports is essential for a good quality of care and correct billing, but at the same time a complex and error-prone task.
    OBJECTIVE: To assess the performance of natural language processing (NLP) for ICD-10 coding of German radiology reports by fine-tuning suitable language models.
    METHODS: This retrospective study included all magnetic resonance imaging (MRI) radiology reports acquired at our institution between 2010 and 2020. The ICD-10 codes at discharge were matched to the corresponding reports to construct a dataset for multiclass classification. Fine-tuning of GermanBERT and flanT5 was carried out on the total dataset (dstotal) containing 1035 different ICD-10 codes and on 2 reduced subsets containing the 100 (ds100) and 50 (ds50) most frequent codes. Model performance was assessed using top-k accuracy for k = 1, 3, and 5. In an ablation study, both models were additionally trained on the radiology report alone, without the accompanying metadata.
    RESULTS: The total dataset consisted of 100,672 radiology reports; the reduced subsets ds100 and ds50 comprised 68,103 and 52,293 reports, respectively. Model performance increased when several of the model's best predictions were taken into consideration, when the number of target classes was reduced, and when the metadata were combined with the report. FlanT5 outperformed GermanBERT across all datasets and metrics and was best suited as a medical coding assistant, achieving a top-3 accuracy of nearly 70% on the real-world dataset dstotal.
    CONCLUSIONS: Fine-tuned language models can reliably predict ICD-10 codes of German magnetic resonance imaging (MRI) radiology reports across various settings. As a coding assistant, flanT5 can guide medical coders to make informed decisions and potentially reduce their workload.
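
    Top-k accuracy, the metric used above, counts a report as correct when the true code appears among the model's k highest-scoring predictions. A minimal sketch on random scores (the study's models and data are not reproduced):

```python
# Hedged sketch: random scores stand in for real model outputs.
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    # scores: (n_reports, n_codes) predicted logits; labels: true class indices.
    topk = np.argsort(scores, axis=1)[:, -k:]        # k best codes per report
    return float((topk == labels[:, None]).any(axis=1).mean())

rng = np.random.default_rng(0)
scores = rng.random((8, 5))              # 8 reports, 5 candidate ICD-10 codes
labels = rng.integers(0, 5, size=8)
for k in (1, 3, 5):
    print(f"top-{k} accuracy: {top_k_accuracy(scores, labels, k):.2f}")
```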

  • Article type: Journal Article
    Self-supervised neural language models have recently achieved unprecedented success, from natural language processing to learning the languages of biological sequences and organic molecules. These models have demonstrated superior performance in generation, structure classification, and functional prediction for proteins and molecules with learned representations. However, most masking-based pretrained language models are not designed for generative design, and their black-box nature makes it difficult to interpret their design logic. Here the Blank-filling Language Model for Materials (BLMM) Crystal Transformer is proposed: a neural network-based probabilistic generative model for generative and tinkering design of inorganic materials. The model is built on the blank-filling language model for text generation and has demonstrated unique advantages in learning the "materials grammars" together with high-quality generation, interpretability, and data efficiency. It can generate chemically valid materials compositions with as high as 89.7% charge neutrality and 84.8% balanced electronegativity, more than four and eight times higher, respectively, than a pseudo-random sampling baseline. The probabilistic generation process of BLMM allows it to recommend materials tinkering operations based on learned materials chemistry, which makes it useful for materials doping. The model was applied to discover a set of new materials, validated using density functional theory (DFT) calculations. This work thus brings generative artificial intelligence based on unsupervised transformer language models to inorganic materials. A user-friendly web app for tinkering materials design has been developed and can be accessed freely at www.materialsatlas.org/blmtinker.
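
    The charge-neutrality criterion above can be illustrated with a simple check. A minimal sketch assuming one common oxidation state per element, which real compositions often violate (the BLMM itself is not reproduced here):

```python
# Hedged sketch: fixed oxidation states are a simplifying assumption.
COMMON_OXIDATION = {"Na": +1, "K": +1, "Mg": +2, "Ca": +2, "Al": +3,
                    "O": -2, "Cl": -1, "F": -1, "S": -2}

def is_charge_neutral(composition: dict[str, int]) -> bool:
    # composition maps element symbol -> atom count in the formula unit
    return sum(COMMON_OXIDATION[el] * n for el, n in composition.items()) == 0

print(is_charge_neutral({"Na": 1, "Cl": 1}))  # True: NaCl balances +1 and -1
print(is_charge_neutral({"Mg": 1, "O": 2}))   # False under these assumed states
```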

  • Article type: Journal Article
    The advent of large language models (LLMs) such as ChatGPT has potential implications for psychological therapies such as cognitive behavioral therapy (CBT). We systematically investigated whether LLMs could recognize an unhelpful thought, examine its validity, and reframe it to a more helpful one. LLMs currently have the potential to offer reasonable suggestions for the identification and reframing of unhelpful thoughts but should not be relied on to lead CBT delivery.
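
    As a minimal sketch of how such an investigation might be set up, here is one way to prompt a model for the three steps (identify, examine, reframe); the paper's actual prompts and model configuration are not reproduced, and the openai client plus an API key are assumed:

```python
# Hedged sketch: an illustrative prompt, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

thought = "I failed one exam, so I will never be a good doctor."
prompt = (
    "1) Does the following thought contain an unhelpful thinking pattern, and "
    "if so, which one? 2) What evidence speaks for and against the thought? "
    f"3) Suggest a more balanced reframing.\n\nThought: {thought}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```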

  • Article type: Journal Article
    A better understanding of the emergent computation and problem-solving capabilities of recent large language models is of paramount importance to further improve them and broaden their applicability. This work investigates how a language model, trained to predict the next token, can perform arithmetic computations generalizing beyond training data. Binary addition and multiplication constitute a good testbed for this purpose, since they require a very small vocabulary and exhibit relevant input/output discontinuities making smooth input interpolation ineffective for novel data. We successfully trained a light language model to learn these tasks and ran a number of experiments to investigate the extrapolation capabilities and internal information processing. Our findings support the hypothesis that the language model works as an Encoding-Regression-Decoding machine where the computation takes place in the value space once the input token representation is mapped to an appropriate internal representation.
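
    A minimal sketch of the kind of training data such a testbed implies: binary addition rendered as plain next-token sequences (the paper's exact formatting and tokenization are assumptions here):

```python
# Hedged sketch: illustrative data generation, not the paper's pipeline.
import random

def make_example(n_bits: int = 4) -> str:
    a = random.randrange(2 ** n_bits)
    b = random.randrange(2 ** n_bits)
    # Fixed-width operands; the sum needs one extra bit for a possible carry.
    return f"{a:0{n_bits}b}+{b:0{n_bits}b}={a + b:0{n_bits + 1}b}"

random.seed(0)
for _ in range(3):
    print(make_example())   # e.g. 0110+1101=10011
```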

  • Article type: Journal Article
    BACKGROUND: Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. The performance of each version of ChatGPT on medical licensing examinations in multiple environments has shown remarkable differences. At this stage, a comprehensive understanding of the variability in ChatGPT's performance on different medical licensing examinations is still lacking.
    OBJECTIVE: In this study, we reviewed all studies on ChatGPT performance in medical licensing examinations up to March 2024. This review aims to contribute to the evolving discourse on artificial intelligence (AI) in medical education by providing a comprehensive analysis of the performance of ChatGPT in various environments. The insights gained from this systematic review will guide educators, policymakers, and technical experts to effectively and judiciously use AI in medical education.
    METHODS: We searched Web of Science, PubMed, and Scopus with query strings for literature published between January 1, 2022, and March 29, 2024. Two authors screened the literature according to the inclusion and exclusion criteria, extracted data, and independently assessed the quality of the literature using the Quality Assessment of Diagnostic Accuracy Studies-2 tool. We conducted both qualitative and quantitative analyses.
    RESULTS: A total of 45 studies on the performance of different versions of ChatGPT in medical licensing examinations were included in this study. GPT-4 achieved an overall accuracy rate of 81% (95% CI 78-84; P<.01), significantly surpassing the 58% (95% CI 53-63; P<.01) accuracy rate of GPT-3.5. GPT-4 passed the medical examinations in 26 of 29 cases, outperforming the average scores of medical students in 13 of 17 cases. Translating the examination questions into English improved GPT-3.5's performance but did not affect GPT-4. GPT-3.5 showed no difference in performance between examinations from English-speaking and non-English-speaking countries (P=.72), but GPT-4 performed significantly better on examinations from English-speaking countries (P=.02). Any type of prompt significantly improved GPT-3.5's (P=.03) and GPT-4's (P<.01) performance. GPT-3.5 performed better on short-text questions than on long-text questions. The difficulty of the questions affected the performance of both GPT-3.5 and GPT-4. In image-based multiple-choice questions (MCQs), ChatGPT's accuracy rate ranged from 13.1% to 100%. ChatGPT performed significantly worse on open-ended questions than on MCQs.
    CONCLUSIONS: GPT-4 demonstrates considerable potential for future use in medical education. However, due to its insufficient accuracy, inconsistent performance, and the challenges posed by differing medical policies and knowledge across countries, GPT-4 is not yet suitable for use in medical education.
    TRIAL REGISTRATION: PROSPERO CRD42024506687; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=506687.
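
    The pooled accuracies above come with 95% CIs. A minimal sketch of how such an interval is obtained for a single proportion (illustrative counts, not the review's meta-analytic model):

```python
# Hedged sketch: a Wilson interval on made-up counts.
from statsmodels.stats.proportion import proportion_confint

correct, total = 810, 1000   # illustrative: 81% overall accuracy
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"accuracy={correct / total:.0%}, 95% CI {low:.0%}-{high:.0%}")
```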

  • Article type: Journal Article
    BACKGROUND: Language disturbances are a core feature of schizophrenia, often studied as a formal thought disorder. The neurobiology of language in schizophrenia has been addressed within the same framework, treating language and thought as equivalents and considering symptoms rather than signs. This review aims to systematically examine published peer-reviewed studies that employed neuroimaging techniques to investigate aberrant brain-language networks in individuals with schizophrenia in relation to linguistic signs.
    METHODS: We employed a language model for automatic data extraction. We selected our studies according to the PRISMA recommendations and conducted the quality assessment of the selected studies according to the STROBE guidance.
    RESULTS: We analyzed the findings from 37 studies, categorizing them based on patient characteristics, brain measures, and language task types. The inferior frontal gyrus (IFG) and superior temporal gyrus (STG) exhibited the most significant differences across these studies and paradigms.
    CONCLUSIONS: We propose guidelines for future research in this field based on our analysis. It is crucial to investigate larger networks involved in language processing, and language models must be integrated with brain metrics to enhance our understanding of the relationship between language and brain abnormalities in schizophrenia.

  • Article type: Journal Article
    Reasoning is a key ability for an intelligent system. Large language models (LMs) achieve above-chance performance on abstract reasoning tasks but exhibit many imperfections. However, human abstract reasoning is also imperfect. Human reasoning is affected by our real-world knowledge and beliefs, and shows notable "content effects"; humans reason more reliably when the semantic content of a problem supports the correct logical inferences. These content-entangled reasoning patterns are central to debates about the fundamental nature of human intelligence. Here, we investigate whether language models, whose prior expectations capture some aspects of human knowledge, similarly mix content into their answers to logic problems. We explored this question across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task. We evaluated state-of-the-art LMs, as well as humans, and found that the LMs reflect many of the same qualitative human patterns on these tasks: like humans, models answer more accurately when the semantic content of a task supports the logical inferences. These parallels are reflected in accuracy patterns and in some lower-level features, such as the relationship between LM confidence over possible answers and human response times. However, in some cases humans and models behave differently, particularly on the Wason task, where humans perform much worse than large models and exhibit a distinct error pattern. Our findings have implications for understanding possible contributors to these human cognitive effects, as well as the factors that influence language model performance.
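
    A minimal sketch of how content effects can be probed with syllogisms, pairing the same valid logical form with believable and unbelievable conclusions; the items are illustrative, not the study's materials:

```python
# Hedged sketch: two hand-written items, not the paper's stimulus set.
items = [
    {"premises": ["All mammals are animals.", "All dogs are mammals."],
     "conclusion": "All dogs are animals.",   # valid and believable
     "valid": True},
    {"premises": ["All mammals are dogs.", "All animals are mammals."],
     "conclusion": "All animals are dogs.",   # valid but unbelievable
     "valid": True},
]

for item in items:
    prompt = (" ".join(item["premises"])
              + f" Does it follow that: {item['conclusion']} Answer yes or no.")
    print(prompt, "| ground truth:", "yes" if item["valid"] else "no")
```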