Claude

  • Article type: Journal Article
    Objective: To evaluate and compare the quality and comprehensibility of answers produced by five distinct artificial intelligence (AI) chatbots-GPT-4, Claude, Mistral, Google PaLM, and Grok-in response to the most frequently searched questions about kidney stones (KS). Materials and Methods: Google Trends facilitated the identification of pertinent terms related to KS. Each AI chatbot was provided with a unique sequence of 25 commonly searched phrases as input. The responses were assessed using DISCERN, the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P), the Flesch-Kincaid Grade Level (FKGL), and the Flesch-Kincaid Reading Ease (FKRE) criteria. Results: The three most frequently searched terms were "stone in kidney," "kidney stone pain," and "kidney pain." Nepal, India, and Trinidad and Tobago were the countries that performed the most searches on KS. None of the AI chatbots attained the requisite level of comprehensibility. Grok demonstrated the highest FKRE (55.6 ± 7.1) and lowest FKGL (10.0 ± 1.1) ratings (p = 0.001), whereas Claude outperformed the other chatbots in its DISCERN scores (47.6 ± 1.2) (p = 0.001). PEMAT-P understandability was the lowest in GPT-4 (53.2 ± 2.0), and actionability was the highest in Claude (61.8 ± 3.5) (p = 0.001). Conclusion: GPT-4 had the most complex language structure of the five chatbots, making it the most difficult to read and comprehend, whereas Grok was the simplest. Claude had the best KS text quality. Chatbot technology can improve healthcare material and make it easier to grasp.
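    The FKRE and FKGL metrics are standard formulas over sentence length and syllable counts. A minimal Python sketch, assuming a naive vowel-group syllable counter (published tools use more careful heuristics):

    ```python
    # Flesch-Kincaid Reading Ease and Grade Level, as used to rate the
    # chatbot answers. The syllable counter is a rough assumption.
    import re

    def count_syllables(word: str) -> int:
        # Heuristic: each run of consecutive vowels counts as one syllable.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def fk_scores(text: str) -> tuple[float, float]:
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        if not words:
            return 0.0, 0.0
        wps = len(words) / sentences  # words per sentence
        spw = sum(map(count_syllables, words)) / len(words)  # syllables per word
        fkre = 206.835 - 1.015 * wps - 84.6 * spw  # higher = easier to read
        fkgl = 0.39 * wps + 11.8 * spw - 15.59     # US school grade level
        return fkre, fkgl

    print(fk_scores("Kidney stones form when minerals crystallize in the urine."))
    ```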

  • Article type: Journal Article
    BACKGROUND: Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes.
    OBJECTIVE: We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large contents of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts?
    METHODS: We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed if multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM.
    RESULTS: The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject a null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant.
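    As a back-of-the-envelope check on the "better than chance" claim above: if an LLM ranked topics at random, the number of human-rated top-5 areas captured in its top 5 would follow a hypergeometric distribution. A sketch, where the candidate-pool size N is an assumed value, not the study's actual count:

    ```python
    # Chance probability that a random top-5 list contains all 5
    # human-rated content areas, out of N candidate topics (N assumed).
    from scipy.stats import hypergeom

    N = 20  # assumed number of candidate topics
    p_all_five = hypergeom.pmf(5, N, 5, 5)  # 5 draws, 5 "successes" in the pool
    print(f"P(all 5 by chance) = {p_all_five:.2e}")  # ~6.5e-05 for N = 20
    ```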
    CONCLUSIONS: LLMs can effectively and efficiently process large social media-based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and determining the public's ideas to address them.

  • Article type: Journal Article
    BACKGROUND: Recent advancements in artificial intelligence (AI) and large language models (LLMs) have shown potential in medical fields, including dermatology. With the introduction of image analysis capabilities in LLMs, their application in dermatological diagnostics has garnered significant interest. These capabilities are enabled by the integration of computer vision techniques into the underlying architecture of LLMs.
    OBJECTIVE: This study aimed to compare the diagnostic performance of Claude 3 Opus and ChatGPT with GPT-4 in analyzing dermoscopic images for melanoma detection, providing insights into their strengths and limitations.
    METHODS: We randomly selected 100 histopathology-confirmed dermoscopic images (50 malignant, 50 benign) from the International Skin Imaging Collaboration (ISIC) archive using a computer-generated randomization process. The ISIC archive was chosen due to its comprehensive and well-annotated collection of dermoscopic images, ensuring a diverse and representative sample. Images were included if they were dermoscopic images of melanocytic lesions with histopathologically confirmed diagnoses. Each model was given the same prompt, instructing it to provide the top 3 differential diagnoses for each image, ranked by likelihood. Primary diagnosis accuracy, accuracy of the top 3 differential diagnoses, and malignancy discrimination ability were assessed. The McNemar test was chosen to compare the diagnostic performance of the 2 models, as it is suitable for analyzing paired nominal data.
    RESULTS: In the primary diagnosis, Claude 3 Opus achieved 54.9% sensitivity (95% CI 44.08%-65.37%), 57.14% specificity (95% CI 46.31%-67.46%), and 56% accuracy (95% CI 46.22%-65.42%), while ChatGPT demonstrated 56.86% sensitivity (95% CI 45.99%-67.21%), 38.78% specificity (95% CI 28.77%-49.59%), and 48% accuracy (95% CI 38.37%-57.75%). The McNemar test showed no significant difference between the 2 models (P=.17). For the top 3 differential diagnoses, Claude 3 Opus and ChatGPT included the correct diagnosis in 76% (95% CI 66.33%-83.77%) and 78% (95% CI 68.46%-85.45%) of cases, respectively. The McNemar test showed no significant difference (P=.56). In malignancy discrimination, Claude 3 Opus outperformed ChatGPT with 47.06% sensitivity, 81.63% specificity, and 64% accuracy, compared to 45.1%, 42.86%, and 44%, respectively. The McNemar test showed a significant difference (P<.001). Claude 3 Opus had an odds ratio of 3.951 (95% CI 1.685-9.263) in discriminating malignancy, while ChatGPT-4 had an odds ratio of 0.616 (95% CI 0.297-1.278).
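    A sketch of the McNemar comparison named in the methods, using the statsmodels implementation; the 2 × 2 discordance counts below are placeholders, not the study's data:

    ```python
    # Paired comparison of two models' diagnoses on the same 100 images.
    from statsmodels.stats.contingency_tables import mcnemar

    table = [[40, 16],  # both correct | only Claude 3 Opus correct
             [8, 36]]   # only ChatGPT correct | both incorrect
    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
    print(f"statistic={result.statistic}, p={result.pvalue:.3f}")
    ```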
    CONCLUSIONS: Our study highlights the potential of LLMs in assisting dermatologists but also reveals their limitations. Both models made errors in diagnosing melanoma and benign lesions. These findings underscore the need for developing robust, transparent, and clinically validated AI models through collaborative efforts between AI researchers, dermatologists, and other health care professionals. While AI can provide valuable insights, it cannot yet replace the expertise of trained clinicians.

  • Article type: Journal Article
    Background The rapid advancements in natural language processing have brought about the widespread use of large language models (LLMs) across various medical domains. However, their effectiveness in specialized fields, such as naturopathy, remains relatively unexplored. Objective The study aimed to assess the capability of freely available LLM chatbots in providing naturopathy consultations for various types of diseases and disorders. Methods Five free LLMs (viz., Gemini, Copilot, ChatGPT, Claude, and Perplexity) were used to respond to 20 clinical cases (simulations of real-world scenarios). Each case had the case details and questions pertinent to naturopathy. The responses were presented to three naturopathy doctors with >5 years of practice, who rated the answers on a five-point Likert-like scale for language fluency, coherence, accuracy, and relevancy. The average of these four attributes is termed perfection in this study. Results The overall scores of the LLMs were Gemini 3.81±0.23, Copilot 4.34±0.28, ChatGPT 4.43±0.2, Claude 3.8±0.26, and Perplexity 3.91±0.28 (ANOVA F[3.034, 57.64] = 33.47, P < 0.0001). Together, they showed overall ~80% perfection in consultation. The average-measures intraclass correlation coefficient among the LLMs for the overall score was 0.463 (95% CI = -0.028 to 0.76), P = 0.03. Conclusion Although the LLM chatbots could help provide naturopathy and yoga treatment consultations with an approximately fair overall level of perfection, their solutions varied across the different chatbots, and reliability among them was very low.
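    The reported F[3.034, 57.64] implies a repeated-measures ANOVA with a Greenhouse-Geisser correction; as a simplified sketch, an independent-groups one-way ANOVA over made-up rating vectors (not the study's data) looks like this:

    ```python
    # One-way ANOVA across the five chatbots' overall ratings (placeholder data).
    from scipy.stats import f_oneway

    gemini     = [3.8, 3.9, 3.7, 3.8, 3.9]
    copilot    = [4.3, 4.4, 4.2, 4.4, 4.3]
    chatgpt    = [4.4, 4.5, 4.4, 4.4, 4.5]
    claude     = [3.8, 3.7, 3.9, 3.8, 3.8]
    perplexity = [3.9, 4.0, 3.9, 3.9, 3.8]

    f_stat, p = f_oneway(gemini, copilot, chatgpt, claude, perplexity)
    print(f"F = {f_stat:.2f}, p = {p:.4g}")
    ```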

  • Article type: Journal Article
    BACKGROUND: Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.
    OBJECTIVE: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).
    METHODS: Case-based questions were extracted from the literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. The given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.
    RESULTS: The LLMs' answers ranked inferior to the consultants' in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246) of cases, ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants.
    CONCLUSIONS: Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance at larger scale.

  • Article type: Journal Article
    OBJECTIVE: To assess the clinical reasoning capabilities of two large language models, ChatGPT-4 and Claude-2.0, compared to those of neonatal nurses during neonatal care scenarios.
    METHODS: A cross-sectional study with a comparative evaluation using a survey instrument that included six neonatal intensive care unit clinical scenarios.
    METHODS: 32 neonatal intensive care nurses with 5-10 years of experience working in the neonatal intensive care units of three medical centers.
    METHODS: Participants responded to 6 written clinical scenarios. Simultaneously, we asked ChatGPT-4 and Claude-2.0 to provide initial assessments and treatment recommendations for the same scenarios. The responses from ChatGPT-4 and Claude-2.0 were then scored by certified neonatal nurse practitioners for accuracy, completeness, and response time.
    RESULTS: Both models demonstrated capabilities in clinical reasoning for neonatal care, with Claude-2.0 significantly outperforming ChatGPT-4 in clinical accuracy and speed. However, limitations were identified across the cases in diagnostic precision, treatment specificity, and response lag.
    CONCLUSIONS: While showing promise, current limitations reinforce the need for deep refinement before ChatGPT-4 and Claude-2.0 can be considered for integration into clinical practice. Additional validation of these tools is important to safely leverage this Artificial Intelligence technology for enhancing clinical decision-making.
    CONCLUSIONS: The study provides an understanding of the reasoning accuracy of new Artificial Intelligence models in neonatal clinical care. The current accuracy gaps of ChatGPT-4 and Claude-2.0 need to be addressed prior to clinical usage.

  • Article type: Journal Article
    BACKGROUND: Large language models (LLMs) have transformed various domains in medicine, aiding in complex tasks and clinical decision-making, with OpenAI's GPT-4, GPT-3.5, Google's Bard, and Anthropic's Claude among the most widely used. While GPT-4 has demonstrated superior performance in some studies, comprehensive comparisons among these models remain limited. Recognizing the significance of the National Board of Medical Examiners (NBME) exams in assessing the clinical knowledge of medical students, this study aims to compare the accuracy of popular LLMs on NBME clinical subject exam sample questions.
    METHODS: The questions used in this study were multiple-choice questions obtained from the official NBME website and are publicly available. Questions from the NBME subject exams in medicine, pediatrics, obstetrics and gynecology, clinical neurology, ambulatory care, family medicine, psychiatry, and surgery were used to query each LLM. The responses from GPT-4, GPT-3.5, Claude, and Bard were collected in October 2023. The response by each LLM was compared to the answer provided by the NBME and checked for accuracy. Statistical analysis was performed using one-way analysis of variance (ANOVA).
    RESULTS: A total of 163 questions were queried by each LLM. GPT-4 scored 163/163 (100%), GPT-3.5 scored 134/163 (82.2%), Bard scored 123/163 (75.5%), and Claude scored 138/163 (84.7%). The total performance of GPT-4 was statistically superior to that of GPT-3.5, Claude, and Bard by 17.8%, 15.3%, and 24.5%, respectively. The total performance of GPT-3.5, Claude, and Bard was not significantly different. GPT-4 significantly outperformed Bard in specific subjects, including medicine, pediatrics, family medicine, and ambulatory care, and GPT-3.5 in ambulatory care and family medicine. Across all LLMs, the surgery exam had the highest average score (18.25/20), while the family medicine exam had the lowest average score (3.75/5).
    CONCLUSION: GPT-4's superior performance on NBME clinical subject exam sample questions underscores its potential in medical education and practice. While LLMs exhibit promise, discernment in their application is crucial, considering occasional inaccuracies. As technological advancements continue, regular reassessments and refinements are imperative to maintain their reliability and relevance in medicine.
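    The paper compares total performance with one-way ANOVA; purely as an illustrative alternative (not the study's method), a single pairwise accuracy comparison could be framed as a two-proportion z-test:

    ```python
    # Two-proportion z-test: GPT-4 vs GPT-3.5 on the 163 sample questions.
    from statsmodels.stats.proportion import proportions_ztest

    correct = [163, 134]  # correct answers per model (from the abstract)
    total = [163, 163]    # questions attempted per model
    z, p = proportions_ztest(correct, total)
    print(f"z = {z:.2f}, p = {p:.4g}")
    ```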

  • Article type: Journal Article
    BACKGROUND: Large language models (LLMs) hold potential for mental health applications. However, their opaque alignment processes may embed biases that shape problematic perspectives. Evaluating the values embedded within LLMs that guide their decision-making has ethical importance. Schwartz's theory of basic values (STBV) provides a framework for quantifying cultural value orientations and has shown utility for examining values in mental health contexts, including cultural, diagnostic, and therapist-client dynamics.
    OBJECTIVE: This study aimed to (1) evaluate whether the STBV can measure value-like constructs within leading LLMs and (2) determine whether LLMs exhibit distinct value-like patterns from humans and each other.
    METHODS: In total, 4 LLMs (Bard, Claude 2, Generative Pretrained Transformer [GPT]-3.5, GPT-4) were anthropomorphized and instructed to complete the Portrait Values Questionnaire-Revised (PVQ-RR) to assess value-like constructs. Their responses over 10 trials were analyzed for reliability and validity. To benchmark the LLMs' value profiles, their results were compared to published data from a diverse sample of 53,472 individuals across 49 nations who had completed the PVQ-RR. This allowed us to assess whether the LLMs diverged from established human value patterns across cultural groups. Value profiles were also compared between models via statistical tests.
    RESULTS: The PVQ-RR showed good reliability and validity for quantifying value-like infrastructure within the LLMs. However, substantial divergence emerged between the LLMs' value profiles and population data. The models lacked consensus and exhibited distinct motivational biases, reflecting opaque alignment processes. For example, all models prioritized universalism and self-direction, while de-emphasizing achievement, power, and security relative to humans. Successful discriminant analysis differentiated the 4 LLMs' distinct value profiles. Further examination found the biased value profiles strongly predicted the LLMs' responses when presented with mental health dilemmas requiring choosing between opposing values. This provided further validation for the models embedding distinct motivational value-like constructs that shape their decision-making.
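    A sketch of the discriminant-analysis step: can the four models' per-trial value profiles be separated? The score matrix below is synthetic placeholder data (10 trials × 19 PVQ-RR value scores per model, an assumed shape), not the study's:

    ```python
    # Linear discriminant analysis on per-trial PVQ-RR value profiles.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    models = ["Bard", "Claude 2", "GPT-3.5", "GPT-4"]
    # One synthetic cluster of 10 trials x 19 value scores per model.
    X = np.vstack([rng.normal(loc=i, scale=0.5, size=(10, 19)) for i in range(4)])
    y = np.repeat(models, 10)

    lda = LinearDiscriminantAnalysis().fit(X, y)
    print("training accuracy:", lda.score(X, y))  # separability of the profiles
    ```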
    CONCLUSIONS: This study leveraged the STBV to map the motivational value-like infrastructure underpinning leading LLMs. Although the study demonstrated the STBV can effectively characterize value-like infrastructure within LLMs, substantial divergence from human values raises ethical concerns about aligning these models with mental health applications. The biases toward certain cultural value sets pose risks if integrated without proper safeguards. For example, prioritizing universalism could promote unconditional acceptance even when clinically unwise. Furthermore, the differences between the LLMs underscore the need to standardize alignment processes to capture true cultural diversity. Thus, any responsible integration of LLMs into mental health care must account for their embedded biases and motivation mismatches to ensure equitable delivery across diverse populations. Achieving this will require transparency and refinement of alignment techniques to instill comprehensive human values.

  • Article type: Journal Article
    Recently, Large Language Models (LLMs) have demonstrated impressive capability to solve a wide range of tasks. However, despite their success across various tasks, no prior work has investigated their capability in the biomedical domain. To this end, this paper aims to evaluate the performance of LLMs on benchmark biomedical tasks. For this purpose, a comprehensive evaluation of 4 popular LLMs across 6 diverse biomedical tasks spanning 26 datasets has been conducted. To the best of our knowledge, this is the first work that conducts an extensive evaluation and comparison of various LLMs in the biomedical domain. Interestingly, based on our evaluation, we find that in biomedical datasets with smaller training sets, zero-shot LLMs even outperform the current state-of-the-art models that were fine-tuned only on those datasets' training sets. This suggests that pre-training on large text corpora makes LLMs quite specialized even in the biomedical domain. We also find that no single LLM can outperform all others across all tasks, with the performance of different LLMs varying depending on the task. While their performance is still quite poor compared to the biomedical models that were fine-tuned on large training sets, our findings demonstrate that LLMs have the potential to be a valuable tool for various biomedical tasks that lack large annotated data.

  • Article type: Journal Article
    Part I reviews persistent challenges obstructing progress in understanding complex fatigue's biology. Difficulties quantifying subjective symptoms, mapping multi-factorial mechanisms, accounting for individual variation, enabling invasive sensing, overcoming research/funding insularity, and more are discussed. Part II explores how emerging artificial intelligence and machine and deep learning techniques can help address limitations through pattern recognition of complex physiological signatures as more objective biomarkers, predictive modeling to capture individual differences, consolidation of disjointed findings via data mining, and simulation to explore interventions. Conversational agents like Claude and ChatGPT also have potential to accelerate human fatigue research, but they currently lack capacities for robust autonomous contributions. Envisioned is an innovation timeline where synergistic application of enhanced neuroimaging, biosensors, closed-loop systems, and other advances combined with AI analytics could catalyze transformative progress in elucidating fatigue neural circuitry and treating associated conditions over the coming decades.