Large language models

  • Article type: Journal Article
    BACKGROUND: The optimal management of distal radius fractures remains a challenge for orthopaedic surgeons. The emergence of Artificial Intelligence (AI) and Large Language Models (LLMs), especially ChatGPT, affords significant potential in improving healthcare and research. This study aims to assess the accuracy and consistency of ChatGPT's knowledge in managing distal radius fractures, with a focus on its capability to provide information for patients and assist in the decision-making processes of orthopaedic clinicians.
    METHODS: We presented ChatGPT with seven questions on distal radius fracture management over two sessions, resulting in 14 responses. These questions covered a range of topics, including patient inquiries and orthopaedic clinical decision-making. We requested references for each response and involved two orthopaedic registrars and two senior orthopaedic surgeons to evaluate response accuracy and consistency.
    RESULTS: All 14 responses contained a mix of both correct and incorrect information. Among the 47 cited references, 13% were accurate, 28% appeared to be fabricated, 57% were incorrect, and 2% were correct but deemed inappropriate. Consistency was observed in 71% of the responses.
    CONCLUSIONS: ChatGPT demonstrates significant limitations in accuracy and consistency when providing information on distal radius fractures. In its current format, it offers limited utility for patient education and clinical decision-making.
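    The protocol above (same questions posed in two independent sessions, responses and references saved for expert review) is straightforward to reproduce programmatically. The sketch below is a minimal illustration assuming the OpenAI Python client; the example questions, model name, and output file are placeholders, not the authors' exact protocol.

```python
# Sketch: ask the same clinical questions in two independent sessions and
# save the responses for later expert review. Assumes the OpenAI Python
# client (`pip install openai`) and an OPENAI_API_KEY in the environment.
# Questions, model name, and file name are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

QUESTIONS = [
    "What are the non-operative treatment options for a distal radius fracture?",
    "When is surgical fixation indicated for a distal radius fracture?",
]

def run_session(session_id: int) -> list[dict]:
    responses = []
    for q in QUESTIONS:
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": q + " Please cite your references."}],
        )
        responses.append({
            "session": session_id,
            "question": q,
            "answer": reply.choices[0].message.content,
        })
    return responses

if __name__ == "__main__":
    all_responses = run_session(1) + run_session(2)  # two sessions -> 2x answers
    with open("chatgpt_drf_responses.json", "w") as f:
        json.dump(all_responses, f, indent=2)
```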

  • Article type: Journal Article
    In this update, we discuss recent US FDA guidance offering more specific guidelines on appropriate study design and analysis to support causal inference for non-interventional studies and the launch of the European Medicines Agency (EMA) and the Heads of Medicines Agencies (HMA) public electronic catalogues. We also highlight an article recommending assessing data quality and suitability prior to protocol finalization and a Journal of the American Medical Association-endorsed framework for using causal language when publishing real-world evidence studies. Finally, we explore the potential of large language models to automate the development of health economic models.

  • DOI:
    Article type: Journal Article
    AI technologies can pose a major national security concern. AI programs could be used to develop chemical and biological agents which circumvent existing protective measures or medical treatments, or to design pathogens with capabilities they do not naturally possess (gain-of-function research). Although Australia has a strong legislative framework relating to research into genetically modified organisms, the framework requires the interaction of more than 10 different government departments, universities and funding agencies. Further, there are few guidelines about the responsible use of AI in biological research where existing laws and policies do not apply to research that is conducted "virtually", even where that research may have national security implications. This article explores these under-scrutinised concepts in Australia's biological security frameworks.

  • Article type: Journal Article
    Accurately identifying clinical phenotypes from Electronic Health Records (EHRs) provides additional insights into patients' health, especially when such information is unavailable in structured data. This study evaluates the application of OpenAI's Generative Pre-trained Transformer (GPT)-4 model to identify clinical phenotypes from EHR text in non-small cell lung cancer (NSCLC) patients. The goal was to identify disease stages, treatments and progression utilizing GPT-4, and compare its performance against GPT-3.5-turbo, Flan-T5-xl, Flan-T5-xxl, Llama-3-8B, and 2 rule-based and machine learning-based methods, namely, scispaCy and medspaCy.
    Phenotypes such as initial cancer stage, initial treatment, evidence of cancer recurrence, and affected organs during recurrence were identified from 13 646 clinical notes for 63 NSCLC patients from Washington University in St. Louis, Missouri. The performance of the GPT-4 model was evaluated against GPT-3.5-turbo, Flan-T5-xxl, Flan-T5-xl, Llama-3-8B, medspaCy, and scispaCy by comparing precision, recall, and micro-F1 scores.
    GPT-4 achieved higher F1 score, precision, and recall compared to the Flan-T5-xl, Flan-T5-xxl, Llama-3-8B, medspaCy, and scispaCy models. GPT-3.5-turbo performed similarly to GPT-4. The GPT, Flan-T5, and Llama models were not constrained by explicit rule requirements for contextual pattern recognition. The spaCy models relied on predefined patterns, leading to their suboptimal performance.
    GPT-4 improves clinical phenotype identification due to its robust pre-training and remarkable pattern recognition capability on the embedded tokens. It demonstrates data-driven effectiveness even with limited context in the input. While rule-based models remain useful for some tasks, GPT models offer improved contextual understanding of the text and robust clinical phenotype extraction.
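    The comparison above rests on micro-averaged precision, recall, and F1 over extracted phenotype labels. The sketch below is a minimal illustration of that metric computation, assuming scikit-learn; the gold and predicted labels are toy values, not study data.

```python
# Sketch: micro-averaged precision/recall/F1 for phenotype labels extracted
# by a model vs. chart-review gold labels, as in the evaluation described
# above. Uses scikit-learn; the label values below are toy examples.
from sklearn.metrics import precision_recall_fscore_support

# One gold and one predicted stage label per patient (illustrative values).
gold_stage = ["I", "III", "IV", "II", "III", "I"]
pred_stage = ["I", "III", "IV", "II", "II", "I"]

precision, recall, f1, _ = precision_recall_fscore_support(
    gold_stage, pred_stage, average="micro", zero_division=0
)
print(f"micro-P={precision:.2f}  micro-R={recall:.2f}  micro-F1={f1:.2f}")
```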

  • Article type: Journal Article
    OBJECTIVE: We compared the performance of generative AI (G-AI, ATARI) and natural language processing (NLP) tools for identifying laterality errors in radiology reports and images.
    METHODS: We used an NLP-based (mPower) tool to identify radiology reports flagged for laterality errors in its QA Dashboard. The NLP model detects and highlights laterality mismatches in radiology reports. From an initial pool of 1124 radiology reports flagged by the NLP for laterality errors, we selected and evaluated 898 reports that encompassed radiography, CT, MRI, and ultrasound modalities to ensure comprehensive coverage. A radiologist reviewed each radiology report to assess if the flagged laterality errors were present (reporting error - true positive) or absent (NLP error - false positive). Next, we applied ATARI to 237 radiology reports and images with consecutive NLP true positive (118 reports) and false positive (119 reports) laterality errors. We estimated accuracy of NLP and G-AI tools to identify overall and modality-wise laterality errors.
    RESULTS: Among the 898 NLP-flagged laterality errors, 64% (574/898) were NLP errors and 36% (324/898) were reporting errors. The text query ATARI feature correctly identified the absence of laterality mismatch (NLP false positives) with a 97.4% accuracy (115/118 reports; 95% CI = 96.5%-98.3%). Combined vision and text query resulted in 98.3% accuracy (116/118 reports/images; 95% CI = 97.6%-99.0%), and vision query alone had a 98.3% accuracy (116/118 images; 95% CI = 97.6%-99.0%).
    CONCLUSIONS: The generative AI-empowered ATARI prototype outperformed the assessed NLP tool for determining true and false laterality errors in radiology reports while enabling an image-based laterality determination. Underlying errors in ATARI text query in complex radiology reports emphasize the need for further improvement in the technology.
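    Neither mPower nor ATARI is documented here, so as a rough, hedged illustration of the kind of rule-based laterality check an NLP QA tool performs, the sketch below flags a report whose exam description and impression name opposite sides; it is not related to either product's actual implementation.

```python
# Sketch: a crude rule-based laterality-mismatch flag of the kind an NLP QA
# tool might raise. Illustrative only; unrelated to the actual mPower or
# ATARI implementations.
import re

def sides_mentioned(text: str) -> set[str]:
    """Return the set of laterality terms ('left'/'right') found in report text."""
    return {m.lower() for m in re.findall(r"\b(left|right)\b", text, re.IGNORECASE)}

def laterality_mismatch(exam_description: str, impression: str) -> bool:
    """Flag when the exam header and impression each name a single, different side."""
    exam_sides = sides_mentioned(exam_description)
    impression_sides = sides_mentioned(impression)
    # Only flag unambiguous single-sided sections that disagree.
    if len(exam_sides) == 1 and len(impression_sides) == 1:
        return exam_sides != impression_sides
    return False

print(laterality_mismatch("XR RIGHT wrist, 2 views",
                          "Fracture of the left distal radius."))  # True
```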

  • Article type: Journal Article
    Large language model (LLM)-powered services are gaining popularity in various applications due to their exceptional performance in many tasks, such as sentiment analysis and answering questions. Recently, research has been exploring their potential use in digital health contexts, particularly in the mental health domain. However, implementing LLM-enhanced conversational artificial intelligence (CAI) presents significant ethical, technical, and clinical challenges. In this viewpoint paper, we discuss 2 challenges that affect the use of LLM-enhanced CAI for individuals with mental health issues, focusing on the use case of patients with depression: the tendency to humanize LLM-enhanced CAI and their lack of contextualized robustness. Our approach is interdisciplinary, relying on considerations from philosophy, psychology, and computer science. We argue that the humanization of LLM-enhanced CAI hinges on the reflection of what it means to simulate "human-like" features with LLMs and what role these systems should play in interactions with humans. Further, ensuring the contextualization of the robustness of LLMs requires considering the specificities of language production in individuals with depression, as well as its evolution over time. Finally, we provide a series of recommendations to foster the responsible design and deployment of LLM-enhanced CAI for the therapeutic support of individuals with depression.

  • Article type: Journal Article
    OBJECTIVE: In the digital age, patients turn to online sources for lumbar spine fusion information, necessitating a careful study of large language models (LLMs) like chat generative pre-trained transformer (ChatGPT) for patient education.
    METHODS: Our study aims to assess the response quality of Open AI (artificial intelligence)'s ChatGPT 3.5 and Google's Bard to patient questions on lumbar spine fusion surgery. We identified 10 critical questions from 158 frequently asked ones via Google search, which were then presented to both chatbots. Five blinded spine surgeons rated the responses on a 4-point scale from 'unsatisfactory' to 'excellent.' The clarity and professionalism of the answers were also evaluated using a 5-point Likert scale.
    RESULTS: In our evaluation of 10 questions across ChatGPT 3.5 and Bard, 97% of responses were rated as excellent or satisfactory. Specifically, ChatGPT had 62% excellent and 32% minimally clarifying responses, with only 6% needing moderate or substantial clarification. Bard's responses were 66% excellent and 24% minimally clarifying, with 10% requiring more clarification. No significant difference was found in the overall rating distribution between the 2 models. Both struggled with 3 specific questions regarding surgical risks, success rates, and selection of surgical approaches (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: k = 0.041, p = 0.622; Bard: k = -0.040, p = 0.601). While both scored well on understanding and empathy, Bard received marginally lower ratings in empathy and professionalism.
    CONCLUSIONS: ChatGPT 3.5 and Bard effectively answered lumbar spine fusion FAQs, but further training and research are needed to solidify LLMs' role in medical education and healthcare communication.
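    The abstract reports interrater kappa values but does not state which kappa statistic was used. As a hedged sketch, the snippet below assumes Fleiss' kappa for five raters scoring ten responses on the 4-point scale, computed with statsmodels on made-up ratings.

```python
# Sketch: interrater reliability for 5 raters x 10 responses on a 4-point
# scale. The abstract does not specify the kappa variant; this sketch
# assumes Fleiss' kappa (statsmodels) and uses made-up ratings.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = responses, columns = raters, values = ordinal score 1..4 (toy data)
ratings = np.array([
    [4, 4, 3, 4, 4],
    [3, 3, 3, 2, 3],
    [4, 3, 4, 4, 4],
    [2, 2, 3, 2, 2],
    [4, 4, 4, 4, 3],
    [3, 4, 3, 3, 3],
    [1, 2, 2, 1, 2],
    [4, 4, 4, 4, 4],
    [3, 3, 2, 3, 3],
    [2, 3, 2, 2, 2],
])

counts, _ = aggregate_raters(ratings)  # responses x categories count table
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")
```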

  • Article type: Journal Article
    BACKGROUND: Tuberculosis (TB) kills approximately 1.6 million people yearly despite the fact that anti-TB drugs are generally curative. Therefore, TB case detection and monitoring of therapy need a comprehensive approach. Automated radiological analysis by machine learning (ML), combined with clinical, microbiological, and immunological data, can help achieve this.
    METHODS: Six rhesus macaques were experimentally inoculated with pathogenic Mycobacterium tuberculosis in the lung. Data, including Computed Tomography (CT), were collected at 0, 2, 4, 8, 12, 16, and 20 weeks.
    RESULTS: Our ML-based CT analysis (TB-Net) efficiently and accurately analyzed disease progression, performing better than a standard deep learning model (LLM OpenAI's CLIP Vi4). TB-Net-based results were more consistent than, and independently confirmed by, blinded manual disease scoring by two radiologists, and exhibited strong correlations with blood biomarkers, TB-lesion volumes, and disease signs during disease pathogenesis.
    CONCLUSIONS: The proposed approach is valuable in early disease detection, monitoring efficacy of therapy, and clinical decision making.
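    As a minimal, hedged sketch of the kind of association reported above (an imaging-derived lesion volume correlated with a blood biomarker across the serial time points), the snippet below uses SciPy on toy values; the numbers are illustrative, not study data.

```python
# Sketch: Pearson correlation between an imaging-derived lesion volume and a
# blood biomarker across serial time points, as one of the associations
# described above. Values below are toy numbers, not study data.
from scipy.stats import pearsonr

weeks         = [0, 2, 4, 8, 12, 16, 20]
lesion_volume = [0.0, 1.2, 3.5, 6.1, 8.4, 9.0, 9.8]      # e.g. mL (illustrative)
biomarker     = [1.0, 4.5, 9.8, 15.2, 20.1, 22.5, 24.0]  # e.g. mg/L (illustrative)

r, p = pearsonr(lesion_volume, biomarker)
print(f"Pearson r = {r:.2f} (p = {p:.3g}) across {len(weeks)} time points")
```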

  • Article type: Journal Article
    This article presents a risk analysis of large language models (LLMs), a type of "generative" artificial intelligence (AI) system that produces text, commonly in response to textual inputs from human users. The article is specifically focused on the risk of LLMs causing an extreme catastrophe in which they do something akin to taking over the world and killing everyone. The possibility of LLM takeover catastrophe has been a major point of public discussion since the recent release of remarkably capable LLMs such as ChatGPT and GPT-4. This arguably marks the first time when actual AI systems (and not hypothetical future systems) have sparked concern about takeover catastrophe. The article's analysis compares (A) characteristics of AI systems that may be needed for takeover, as identified in prior theoretical literature on AI takeover risk, with (B) characteristics observed in current LLMs. This comparison reveals that the capabilities of current LLMs appear to fall well short of what may be needed for takeover catastrophe. Future LLMs may be similarly incapable due to fundamental limitations of deep learning algorithms. However, divided expert opinion on deep learning and surprise capabilities found in current LLMs suggests some risk of takeover catastrophe from future LLMs. LLM governance should monitor for changes in takeover characteristics and be prepared to proceed more aggressively if warning signs emerge. Unless and until such signs emerge, more aggressive governance measures may be unwarranted.

  • Article type: Journal Article
    Prompt engineering, the process of arranging the input or prompts given to a large language model to guide it in producing desired outputs, is an emerging field of research that shapes how these models understand tasks, process information, and generate responses in a wide range of natural language processing (NLP) applications. Digital mental health, on the other hand, is becoming increasingly important for several reasons, including early detection and intervention and the need to mitigate the limited availability of highly skilled medical staff for clinical diagnosis. This short review outlines the latest advances in prompt engineering in the field of NLP for digital mental health. To our knowledge, this review is the first attempt to discuss the latest prompt engineering types, methods, and tasks that are used in digital mental health applications. We discuss three types of digital mental health tasks: classification, generation, and question answering. To conclude, we discuss the challenges, limitations, ethical considerations, and future directions in prompt engineering for digital mental health. We believe that this short review provides a useful point of departure for future research in prompt engineering for digital mental health.
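    As a concrete illustration of the classification task type reviewed here, the sketch below shows a few-shot prompt for labelling a short post as showing or not showing depressive-symptom language. The prompt wording, labels, model name, and API call are illustrative assumptions, not drawn from the review, and any clinical use would require proper validation.

```python
# Sketch: a few-shot classification prompt of the kind discussed in the
# review (digital mental health, classification task). Examples, labels,
# and model name are illustrative; clinical use would need validation.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_PROMPT = """Label each post as SYMPTOMS or NO_SYMPTOMS of low mood.

Post: "I haven't enjoyed anything in weeks and I can't get out of bed."
Label: SYMPTOMS

Post: "Had a great hike with friends this weekend, feeling refreshed."
Label: NO_SYMPTOMS

Post: "{post}"
Label:"""

def classify(post: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": FEW_SHOT_PROMPT.format(post=post)}],
        max_tokens=5,
    )
    return reply.choices[0].message.content.strip()

print(classify("Everything feels pointless lately and I barely sleep."))
```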