Large language models

  • Article type: Journal Article
    Large language model (LLM)-powered services are gaining popularity in various applications due to their exceptional performance in many tasks, such as sentiment analysis and answering questions. Recently, research has been exploring their potential use in digital health contexts, particularly in the mental health domain. However, implementing LLM-enhanced conversational artificial intelligence (CAI) presents significant ethical, technical, and clinical challenges. In this viewpoint paper, we discuss 2 challenges that affect the use of LLM-enhanced CAI for individuals with mental health issues, focusing on the use case of patients with depression: the tendency to humanize LLM-enhanced CAI and their lack of contextualized robustness. Our approach is interdisciplinary, relying on considerations from philosophy, psychology, and computer science. We argue that the humanization of LLM-enhanced CAI hinges on reflection about what it means to simulate "human-like" features with LLMs and what role these systems should play in interactions with humans. Further, ensuring the contextualized robustness of LLMs requires considering the specificities of language production in individuals with depression, as well as its evolution over time. Finally, we provide a series of recommendations to foster the responsible design and deployment of LLM-enhanced CAI for the therapeutic support of individuals with depression.

  • Article type: Journal Article
    OBJECTIVE: In the digital age, patients turn to online sources for lumbar spine fusion information, necessitating a careful study of large language models (LLMs) like chat generative pre-trained transformer (ChatGPT) for patient education.
    METHODS: Our study aims to assess the response quality of Open AI (artificial intelligence)'s ChatGPT 3.5 and Google's Bard to patient questions on lumbar spine fusion surgery. We identified 10 critical questions from 158 frequently asked ones via Google search, which were then presented to both chatbots. Five blinded spine surgeons rated the responses on a 4-point scale from 'unsatisfactory' to 'excellent.' The clarity and professionalism of the answers were also evaluated using a 5-point Likert scale.
    RESULTS: In our evaluation of 10 questions across ChatGPT 3.5 and Bard, 97% of responses were rated as excellent or satisfactory. Specifically, ChatGPT had 62% excellent and 32% minimally clarifying responses, with only 6% needing moderate or substantial clarification. Bard's responses were 66% excellent and 24% minimally clarifying, with 10% requiring more clarification. No significant difference was found in the overall rating distribution between the 2 models. Both struggled with 3 specific questions regarding surgical risks, success rates, and selection of surgical approaches (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: k = 0.041, p = 0.622; Bard: k = -0.040, p = 0.601). While both scored well on understanding and empathy, Bard received marginally lower ratings in empathy and professionalism.
    CONCLUSIONS: ChatGPT 3.5 and Bard effectively answered lumbar spine fusion FAQs, but further training and research are needed to solidify LLMs' role in medical education and healthcare communication.
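
    For readers unfamiliar with the kappa values reported above, the following is a minimal sketch, not the authors' analysis code, of how pairwise interrater agreement between two of the blinded raters could be computed; the rating vectors are invented for illustration.

        from sklearn.metrics import cohen_kappa_score

        # Hypothetical ratings from two blinded surgeons on the 4-point scale
        # (1 = unsatisfactory ... 4 = excellent); values are illustrative only.
        rater_a = [4, 4, 3, 2, 4, 3, 4, 4, 2, 3]
        rater_b = [4, 3, 3, 3, 4, 4, 4, 3, 2, 4]

        kappa = cohen_kappa_score(rater_a, rater_b)
        print(f"Cohen's kappa: {kappa:.3f}")  # values near 0 indicate chance-level agreement

    Agreement across all five raters at once would instead call for a multi-rater statistic such as Fleiss' kappa.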

  • Article type: Journal Article
    BACKGROUND: Tuberculosis (TB) kills approximately 1.6 million people yearly despite the fact that anti-TB drugs are generally curative. Therefore, TB case detection and monitoring of therapy need a comprehensive approach. Automated radiological analysis by machine learning (ML), combined with clinical, microbiological, and immunological data, can help achieve this.
    METHODS: Six rhesus macaques were experimentally inoculated with pathogenic Mycobacterium tuberculosis in the lung. Data, including computed tomography (CT), were collected at 0, 2, 4, 8, 12, 16, and 20 weeks.
    RESULTS: Our ML-based CT analysis (TB-Net) efficiently and accurately analyzed disease progression, performing better than a standard deep learning model (LLM OpenAI's CLIP Vi4). TB-Net-based results were more consistent than, and confirmed independently by, blinded manual disease scoring by two radiologists, and exhibited strong correlations with blood biomarkers, TB-lesion volumes, and disease signs during disease pathogenesis.
    CONCLUSIONS: The proposed approach is valuable in early disease detection, monitoring the efficacy of therapy, and clinical decision making.

  • Article type: Journal Article
    This article presents a risk analysis of large language models (LLMs), a type of "generative" artificial intelligence (AI) system that produces text, commonly in response to textual inputs from human users. The article is specifically focused on the risk of LLMs causing an extreme catastrophe in which they do something akin to taking over the world and killing everyone. The possibility of LLM takeover catastrophe has been a major point of public discussion since the recent release of remarkably capable LLMs such as ChatGPT and GPT-4. This arguably marks the first time when actual AI systems (and not hypothetical future systems) have sparked concern about takeover catastrophe. The article's analysis compares (A) characteristics of AI systems that may be needed for takeover, as identified in prior theoretical literature on AI takeover risk, with (B) characteristics observed in current LLMs. This comparison reveals that the capabilities of current LLMs appear to fall well short of what may be needed for takeover catastrophe. Future LLMs may be similarly incapable due to fundamental limitations of deep learning algorithms. However, divided expert opinion on deep learning and surprise capabilities found in current LLMs suggests some risk of takeover catastrophe from future LLMs. LLM governance should monitor for changes in takeover characteristics and be prepared to proceed more aggressively if warning signs emerge. Unless and until such signs emerge, more aggressive governance measures may be unwarranted.

  • Article type: Journal Article
    Prompt engineering, the process of arranging input or prompts given to a large language model to guide it in producing desired outputs, is an emerging field of research that shapes how these models understand tasks, process information, and generate responses in a wide range of natural language processing (NLP) applications. Digital mental health, on the other hand, is becoming increasingly important for several reasons including early detection and intervention, and to mitigate the limited availability of highly skilled medical staff for clinical diagnosis. This short review outlines the latest advances in prompt engineering in the field of NLP for digital mental health. To our knowledge, this review is the first attempt to discuss the latest prompt engineering types, methods, and tasks that are used in digital mental health applications. We discuss three types of digital mental health tasks: classification, generation, and question answering. To conclude, we discuss the challenges, limitations, ethical considerations, and future directions in prompt engineering for digital mental health. We believe that this short review contributes a useful point of departure for future research in prompt engineering for digital mental health.
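
    As a concrete illustration of the classification-type prompting discussed in the review, the sketch below builds a generic zero-shot prompt for labeling a user message; the label set, wording, and triage framing are assumptions for illustration, not taken from the review.

        # Minimal zero-shot classification prompt template (illustrative only).
        LABELS = ["no distress indicated", "possible distress", "acute distress"]

        def build_prompt(user_text: str) -> str:
            label_list = ", ".join(LABELS)
            return (
                "You are assisting a digital mental health triage workflow.\n"
                f"Classify the following message into exactly one of: {label_list}.\n"
                "Reply with the label only.\n\n"
                f"Message: {user_text}"
            )

        prompt = build_prompt("I haven't been sleeping and everything feels pointless.")
        # `prompt` would then be sent to whichever LLM endpoint the application uses.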

  • Article type: Journal Article
    In the U.S., diagnostic errors are common across various healthcare settings due to factors like complex procedures and multiple healthcare providers, often exacerbated by inadequate initial evaluations. This study explores the role of Large Language Models (LLMs), specifically OpenAI's ChatGPT-4 and Google Gemini, in improving emergency decision-making in plastic and reconstructive surgery by evaluating their effectiveness both with and without physical examination data. Thirty medical vignettes covering emergency conditions such as fractures and nerve injuries were used to assess the diagnostic and management responses of the models. These responses were evaluated by medical professionals against established clinical guidelines, using statistical analyses including the Wilcoxon rank-sum test. Results showed that ChatGPT-4 consistently outperformed Gemini in both diagnosis and management, irrespective of the presence of physical examination data, though no significant differences were noted within each model's performance across different data scenarios. Conclusively, while ChatGPT-4 demonstrates superior accuracy and management capabilities, the addition of physical examination data, though enhancing response detail, did not significantly surpass traditional medical resources. This underscores the utility of AI in supporting clinical decision-making, particularly in scenarios with limited data, suggesting its role as a complement to, rather than a replacement for, comprehensive clinical evaluation and expertise.
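
    The Wilcoxon rank-sum comparison mentioned above can be outlined with scipy; the score arrays below are placeholders, not the study's data.

        from scipy.stats import ranksums

        # Hypothetical guideline-concordance scores for the two models (illustrative only).
        chatgpt4_scores = [5, 4, 5, 4, 5, 3, 5, 4, 4, 5]
        gemini_scores = [4, 3, 4, 4, 3, 3, 4, 4, 3, 4]

        stat, p_value = ranksums(chatgpt4_scores, gemini_scores)
        print(f"Wilcoxon rank-sum statistic = {stat:.2f}, p = {p_value:.3f}")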

  • Article type: Journal Article
    Background and Objectives: Large language models (LLMs) are emerging as valuable tools in plastic surgery, potentially reducing surgeons' cognitive loads and improving patients' outcomes. This study aimed to assess and compare the current state of the two most common and readily available LLMs, Open AI's ChatGPT-4 and Google's Gemini Pro (1.0 Pro), in providing intraoperative decision support in plastic and reconstructive surgery procedures. Materials and Methods: We presented each LLM with 32 independent intraoperative scenarios spanning 5 procedures. We utilized a 5-point and a 3-point Likert scale for medical accuracy and relevance, respectively. We determined the readability of the responses using the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) score. Additionally, we measured the models' response time. We compared the performance using the Mann-Whitney U test and Student's t-test. Results: ChatGPT-4 significantly outperformed Gemini in providing accurate (3.59 ± 0.84 vs. 3.13 ± 0.83, p-value = 0.022) and relevant (2.28 ± 0.77 vs. 1.88 ± 0.83, p-value = 0.032) responses. Alternatively, Gemini provided more concise and readable responses, with an average FKGL (12.80 ± 1.56) significantly lower than ChatGPT-4's (15.00 ± 1.89) (p < 0.0001). However, there was no difference in the FRE scores (p = 0.174). Moreover, Gemini's average response time was significantly faster (8.15 ± 1.42 s) than ChatGPT-4's (13.70 ± 2.87 s) (p < 0.0001). Conclusions: Although ChatGPT-4 provided more accurate and relevant responses, both models demonstrated potential as intraoperative tools. Nevertheless, their performance inconsistency across the different procedures underscores the need for further training and optimization to ensure their reliability as intraoperative decision-support tools.
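
    For reference, the FRE and FKGL readability metrics used above are simple functions of sentence, word, and syllable counts; the sketch below applies the standard published formulas, leaving the text-counting utilities abstract and using made-up counts.

        def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
            # Flesch Reading Ease: higher scores indicate easier text.
            return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

        def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
            # Flesch-Kincaid Grade Level: an approximate US school grade.
            return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

        # Example with invented counts for a short model response.
        print(flesch_reading_ease(words=180, sentences=9, syllables=290))
        print(flesch_kincaid_grade(words=180, sentences=9, syllables=290))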

  • Article type: Journal Article
    BACKGROUND: Large language models (LLMs) are becoming increasingly important as they are being used more frequently for providing medical information. Our aim is to evaluate the effectiveness of electronic artificial intelligence (AI) large language models (LLMs), such as ChatGPT-4, BingAI, and Gemini, in responding to patient inquiries about retinopathy of prematurity (ROP).
    METHODS: The answers of the LLMs to fifty real-life patient inquiries were assessed using a 5-point Likert scale by three ophthalmologists. The models' responses were also evaluated for reliability with the DISCERN instrument and the EQIP framework, and for readability using the Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), and Coleman-Liau Index.
    RESULTS: ChatGPT-4 outperformed BingAI and Gemini, scoring the highest with 5 points in 90% (45 out of 50) and achieving ratings of "agreed" or "strongly agreed" in 98% (49 out of 50) of responses. It led in accuracy and reliability with DISCERN and EQIP scores of 63 and 72.2, respectively. BingAI followed with scores of 53 and 61.1, while Gemini was noted for the best readability (FRE score of 39.1) but lower reliability scores. Statistically significant performance differences were observed particularly in the screening, diagnosis, and treatment categories.
    CONCLUSIONS: ChatGPT-4 excelled in providing detailed and reliable responses to ROP-related queries, although its texts were more complex. All models delivered generally accurate information as per DISCERN and EQIP assessments.
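
    The Coleman-Liau Index used above, unlike the FRE and FKGL metrics sketched earlier, depends only on letter and sentence counts per 100 words; a minimal sketch of the standard formula with invented counts:

        def coleman_liau_index(letters: int, words: int, sentences: int) -> float:
            # Coleman-Liau Index: L = letters per 100 words, S = sentences per 100 words.
            L = letters / words * 100
            S = sentences / words * 100
            return 0.0588 * L - 0.296 * S - 15.8

        # Example with invented counts for a chatbot answer.
        print(coleman_liau_index(letters=820, words=170, sentences=8))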

  • Article type: Journal Article
    OBJECTIVE: Large Language Models (LLMs) like Generative Pre-trained Transformer (GPT) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in the field of cheminformatics, particularly in understanding the Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs also have the ability to decode SMILES strings into vector representations.
    METHODS: We investigate the performance of GPT and LLaMA, compared to models pre-trained on SMILES, in embedding SMILES strings for downstream tasks, focusing on two key applications: molecular property prediction and drug-drug interaction (DDI) prediction.
    RESULTS: We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to the models pre-trained on SMILES in molecular property prediction tasks and outperform them in the DDI prediction tasks.
    CONCLUSIONS: The performance of LLMs in generating SMILES embeddings shows great potential for further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding, motivating additional research into the potential of LLMs in the molecular representation field. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT.
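
    The abstract above describes embedding SMILES strings with decoder-only LLMs for downstream prediction; the following is a minimal sketch of one common approach (mean-pooling the final hidden states), using a small public checkpoint as a stand-in for the GPT and LLaMA models. It is not the authors' pipeline; see their GitHub repository for that.

        import torch
        from transformers import AutoModel, AutoTokenizer

        # Stand-in checkpoint; the paper evaluates GPT- and LLaMA-family models instead.
        model_name = "gpt2"
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name)

        def embed_smiles(smiles: str) -> torch.Tensor:
            # Tokenize the SMILES string and mean-pool the final hidden states.
            inputs = tokenizer(smiles, return_tensors="pt")
            with torch.no_grad():
                hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, dim)
            return hidden.mean(dim=1).squeeze(0)  # shape: (dim,)

        embedding = embed_smiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin, as an example
        # The frozen embedding can then feed a classifier or regressor for
        # molecular property or DDI prediction.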

  • Article type: Journal Article
    ChatGPT and other artificial intelligence (AI) systems have captivated the attention of healthcare providers and researchers for their potential to improve care processes and outcomes. While these technologies hold promise to automate processes, increase efficiency, and reduce cognitive burden, their use also carries risks. In this commentary, we review basic concepts of AI, outline some of the capabilities and limitations of currently available tools, discuss current and future applications in pediatric hematology/oncology, and provide an evaluation and implementation framework that can be used by pediatric hematologists/oncologists considering the use of AI in clinical practice.
