chat-gpt

  • Article type: Journal Article
    BACKGROUND: Artificial intelligence (AI) can be a tool in the diagnosis and acquisition of knowledge, particularly in dentistry, sparking debates on its application in clinical decision-making.
    OBJECTIVE: This study aims to evaluate the accuracy, completeness, and reliability of the responses generated by Chatbot Generative Pre-Trained Transformer (ChatGPT) 3.5 in dentistry using expert-formulated questions.
    METHODS: Experts were invited to create three questions, answers, and respective references according to their specialized fields of activity. A Likert scale was used to evaluate agreement levels between expert and ChatGPT responses. Statistical analysis compared descriptive and binary question groups in terms of accuracy and completeness. Questions with low accuracy underwent re-evaluation, and subsequent responses were compared for improvement. The Wilcoxon test was utilized (α = 0.05).
    RESULTS: Ten experts across six dental specialties generated 30 binary and descriptive dental questions and references. The accuracy score had a median of 5.50 and a mean of 4.17. For completeness, the median was 2.00 and the mean was 2.07. No difference was observed between descriptive and binary responses for accuracy or completeness. However, re-evaluated responses improved significantly in both accuracy (median 5.50 vs. 6.00; mean 4.17 vs. 4.80; p=0.042) and completeness (median 2.0 vs. 2.0; mean 2.07 vs. 2.30; p=0.011). References were more often incorrect than correct, with no differences between descriptive and binary questions.
    CONCLUSIONS: ChatGPT initially demonstrated good accuracy and completeness, which improved further with machine learning (ML) over time. However, some inaccurate answers and references persisted. Human critical discernment remains essential for facing complex clinical cases and advancing theoretical knowledge and evidence-based practice.
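    To make the statistical design above concrete, the following is a minimal sketch (Python with SciPy) of a Wilcoxon signed-rank test on paired ratings at α = 0.05, as the methods describe; the score lists are illustrative placeholders, not the study's data.
```python
# Minimal sketch: Wilcoxon signed-rank test on paired Likert-style ratings,
# mirroring the alpha = 0.05 design described in the abstract.
# The score lists below are illustrative placeholders, not the study's data.
from scipy.stats import wilcoxon

initial_accuracy = [4, 5, 3, 6, 4, 5, 2, 6, 5, 4]      # first-pass ratings
reevaluated_accuracy = [5, 5, 4, 6, 5, 6, 3, 6, 6, 5]  # after re-evaluation

stat, p_value = wilcoxon(initial_accuracy, reevaluated_accuracy)
print(f"W = {stat:.1f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Significant change at alpha = 0.05")
```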

  • Article type: Letter
    No abstract available.

  • Article type: Journal Article
    OBJECTIVE: To investigate the capacity of chat-generative pre-trained transformer (Chat-GPT) to understand the S2k guideline of the German Society for Gynecology and Obstetrics on intrauterine growth restriction.
    METHODS: The German-language free Chat-GPT version was used to test the ability of Chat-GPT to understand the definitions of small for gestational age and intrauterine growth restriction, to indicate the correct time and place of delivery, and to evaluate its ability to recommend a spontaneous delivery versus a primary caesarean section in accordance with the guideline recommendations. In order to evaluate the suggestions objectively, a simple three-color 'traffic light' evaluation system was employed.
    RESULTS: Almost all of Chat-GPT's suggestions concerning the definition of small for gestational age/intrauterine growth restriction and the correct time of delivery were adequate, whereas more than half of the suggestions made in terms of the correct delivery mode needed reformulation or even correction.
    CONCLUSIONS: Chat-GPT appears to be a valuable form of artificial intelligence that could be integrated into everyday clinical practice.
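    As an illustration of the evaluation scheme above, here is a minimal sketch of tallying three-color 'traffic light' ratings per question category; the category names, ratings, and 'adequate' criterion are assumptions for demonstration only.
```python
# Minimal sketch: tallying three-color "traffic light" ratings per question
# category. All category names and ratings below are illustrative.
from collections import Counter

ratings = {
    "definition_sga_iugr": ["green", "green", "yellow", "green"],
    "timing_of_delivery":  ["green", "green", "green", "yellow"],
    "mode_of_delivery":    ["yellow", "red", "green", "yellow"],
}

for category, colors in ratings.items():
    counts = Counter(colors)
    adequate = counts["green"] / len(colors)  # assumed criterion: green = adequate
    print(f"{category}: {dict(counts)} -> {adequate:.0%} adequate")
```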

  • Article type: Journal Article
    BACKGROUND: Patients find technology tools to be more approachable for seeking sensitive health-related information, such as reproductive health information. The inventive conversational ability of artificial intelligence (AI) chatbots, such as ChatGPT (OpenAI Inc), offers a potential means for patients to effectively locate answers to their health-related questions digitally.
    OBJECTIVE: A pilot study was conducted to compare the novel ChatGPT with existing Google Search technology on their ability to offer accurate, effective, and current information about the actions to take after missing a dose of an oral contraceptive pill.
    METHODS: A sequence of 11 questions, mimicking a patient inquiring about the action to take after missing a dose of an oral contraceptive pill, was input into ChatGPT as a cascade, given the conversational ability of ChatGPT. The questions were input into 4 different ChatGPT accounts, with the account holders being of various demographics, to evaluate potential differences and biases in the responses given to different account holders. The leading question, "what should I do if I missed a day of my oral contraception birth control?", alone was then input into Google Search, given its nonconversational nature. The ChatGPT responses and the Google Search results for the leading question were evaluated on their readability, accuracy, and effective delivery of information.
    RESULTS: The ChatGPT results were at an overall higher-grade reading level and had a longer reading duration; they were also less accurate, less current, and delivered information less effectively. In contrast, the Google Search answer box and snippets were at a lower-grade reading level, had a shorter reading duration, were more current, referenced the origin of the information (transparent), and provided the information in various formats in addition to text.
    CONCLUSIONS: ChatGPT has room for improvement in accuracy, transparency, recency, and reliability before it can be equitably implemented into health care information delivery and provide the potential benefits it promises. However, AI may be used as a tool for providers to educate their patients in preferred, creative, and efficient ways, such as using AI to generate accessible short educational videos from health care provider-vetted information. Larger studies representing a diverse group of users are needed.
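    The cascade design described above, with each question submitted in the context of the previous turns, can be sketched with the openai Python client as follows; the model name, question wording, and loop structure are assumptions for illustration, not the study's protocol.
```python
# Sketch: feeding a cascade of questions into a chat model so each follow-up
# sees the prior turns, as a conversational study design requires.
# Model name and question wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
questions = [
    "What should I do if I missed a day of my oral contraception birth control?",
    "Does the advice change if it was a progestin-only pill?",
    # ... remaining follow-up questions in the cascade
]

messages = []
for q in questions:
    messages.append({"role": "user", "content": q})
    reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep context
    print(f"Q: {q}\nA: {answer}\n")
```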

  • Article type: Journal Article
    BACKGROUND: In reconstructive plastic surgery, the need for comprehensive research and systematic reviews is apparent due to the field's intricacies, influencing the evidence supporting specific procedures. Although Chat-GPT's knowledge is limited to September 2021, its integration into research proves valuable for efficiently identifying knowledge gaps. Therefore, this tool becomes a potent asset, directing researchers to focus on conducting systematic reviews where they are most necessary.
    METHODS: Chat-GPT 3.5 was prompted to generate 10 unpublished, innovative research topics on breast reconstruction surgery, followed by 10 additional subtopics. Results were filtered for systematic reviews in PubMed, and novel ideas were identified. To evaluate Chat-GPT's ability to generate improved responses, two additional searches were conducted using search terms generated by Chat-GPT.
    RESULTS: Chat-GPT produced 83 novel ideas, an accuracy rate of 83%. The number of novel ideas varied widely across topics: transgender women generated 10 ideas, whereas acellular dermal matrix (ADM) generated five. Chat-GPT increased the total number of manuscripts generated by a factor of 2.3, 3.9, and 4.0 in the first, second, and third trials, respectively. While the search results were consistent with our manual searches (95.2% accuracy), the greater number of manuscripts potentially diluted the quality of articles, resulting in fewer novel systematic review ideas.
    CONCLUSIONS: Chat-GPT proves valuable in identifying gaps in the literature and offering insights into areas lacking research in breast reconstruction surgery. While it displays high sensitivity, refining its specificity is imperative. Prudent practice involves evaluating accomplished work and conducting a comprehensive review of all components involved.
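    A filtering step like the one described, checking whether a candidate topic already has systematic reviews in PubMed, could be scripted against the NCBI E-utilities esearch endpoint, as in this sketch; the query wording and topic string are illustrative assumptions, not the study's pipeline.
```python
# Sketch: screening a candidate topic for existing systematic reviews with
# the NCBI E-utilities esearch endpoint. Query wording is illustrative.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def systematic_review_count(topic: str) -> int:
    params = {
        "db": "pubmed",
        "term": f"({topic}) AND systematic review[Publication Type]",
        "retmode": "json",
    }
    resp = requests.get(ESEARCH, params=params, timeout=30)
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

topic = "acellular dermal matrix breast reconstruction"  # illustrative topic
count = systematic_review_count(topic)
print(f"'{topic}': {count} systematic reviews found "
      f"-> {'novel' if count == 0 else 'already covered'}")
```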

  • Article type: Journal Article
    BACKGROUND: Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement.
    OBJECTIVE: This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas, and to illustrate their potential role in health care decision-making while comparing seniors' and residents' ratings, and specific question types.
    METHODS: A total of 4 specialized physicians formulated 176 real-world clinical questions. A total of 8 senior physicians and residents assessed responses from GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations were conducted within internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across classifications.
    RESULTS: Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), and rated GPT-3.5 similarly (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across accuracy and completeness criteria. Distinctions among question types were significant, particularly for the GPT-4 mean completeness scores across emergency, internal, and ethical questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5's accuracy, benefit, and completeness dimensions.
    CONCLUSIONS: ChatGPT's potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatments, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments.
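    A per-model, per-category summary of ratings like those reported above can be sketched with pandas; the ratings below are illustrative placeholders, not the study's data.
```python
# Sketch: mean (SD) of 1-5 ratings per model and question category,
# mirroring the kind of summary reported above. All ratings are illustrative.
import pandas as pd

rows = [
    ("GPT-4",   "ethics",    5), ("GPT-4",   "ethics",    4),
    ("GPT-4",   "emergency", 4), ("GPT-4",   "emergency", 5),
    ("GPT-3.5", "ethics",    4), ("GPT-3.5", "ethics",    4),
    ("GPT-3.5", "emergency", 3), ("GPT-3.5", "emergency", 4),
]
df = pd.DataFrame(rows, columns=["model", "category", "completeness"])

summary = (df.groupby(["model", "category"])["completeness"]
             .agg(["mean", "std"])
             .round(2))
print(summary)
```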

  • Article type: Journal Article
    Generative Artificial Intelligence foundation models (for example Generative Pre-trained Transformer - GPT - models) can generate the next token given a sequence of tokens. How can this 'generative AI' be compared with the 'real' intelligence of the human brain, when for example a human generates a whole memory in response to an incomplete retrieval cue, and then generates further prospective thoughts? Here these two types of generative intelligence, artificial in machines and real in the human brain, are compared, and it is shown that when whole memories are generated by hippocampal recall in response to an incomplete retrieval cue, what the human brain computes, and how it computes it, are very different from generative AI. Key differences are the use of local associative learning rules in the hippocampal memory system, and of non-local backpropagation of error learning in AI. Indeed, it is argued that the whole operation of the human brain is performed computationally very differently from what is implemented in generative AI. Moreover, it is emphasized that the primate (including human) hippocampal system includes computations about spatial view and where objects and people are located in scenes, whereas in rodents the emphasis is on place cells and path integration by movements between places. This comparison with generative memory and processing in the human brain has interesting implications for the further development of generative AI and for neuroscience research.
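    The contrast drawn above between local associative learning and non-local backpropagation can be made concrete with a textbook autoassociative (Hopfield-style) network, in which a Hebbian outer-product rule stores patterns and recurrent recall completes an incomplete retrieval cue; this is a generic sketch, not the paper's model.
```python
# Sketch: pattern completion from an incomplete cue with a Hopfield-style
# autoassociative network trained by a local Hebbian outer-product rule.
# A textbook illustration of "local associative learning", not the paper's model.
import numpy as np

rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(3, 64))  # three stored "memories"

# Hebbian learning: each weight depends only on its two endpoint units (local).
W = sum(np.outer(p, p) for p in patterns) / patterns.shape[1]
np.fill_diagonal(W, 0)

cue = patterns[0].copy()
cue[32:] = 0                                  # incomplete retrieval cue

state = cue.astype(float)
for _ in range(10):                           # recurrent recall dynamics
    state = np.sign(W @ state)
    state[state == 0] = 1

overlap = np.mean(state == patterns[0])
print(f"Recovered {overlap:.0%} of the stored memory from a half cue")
```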

  • Article type: Journal Article
    BACKGROUND: Large language models (LLMs) hold potential for mental health applications. However, their opaque alignment processes may embed biases that shape problematic perspectives. Evaluating the values embedded within LLMs that guide their decision-making has ethical importance. Schwartz's theory of basic values (STBV) provides a framework for quantifying cultural value orientations and has shown utility for examining values in mental health contexts, including cultural, diagnostic, and therapist-client dynamics.
    OBJECTIVE: This study aimed to (1) evaluate whether the STBV can measure value-like constructs within leading LLMs and (2) determine whether LLMs exhibit distinct value-like patterns from humans and each other.
    METHODS: In total, 4 LLMs (Bard, Claude 2, Generative Pretrained Transformer [GPT]-3.5, GPT-4) were anthropomorphized and instructed to complete the Portrait Values Questionnaire-Revised (PVQ-RR) to assess value-like constructs. Their responses over 10 trials were analyzed for reliability and validity. To benchmark the LLMs' value profiles, their results were compared to published data from a diverse sample of 53,472 individuals across 49 nations who had completed the PVQ-RR. This allowed us to assess whether the LLMs diverged from established human value patterns across cultural groups. Value profiles were also compared between models via statistical tests.
    RESULTS: The PVQ-RR showed good reliability and validity for quantifying value-like infrastructure within the LLMs. However, substantial divergence emerged between the LLMs' value profiles and the population data. The models lacked consensus and exhibited distinct motivational biases, reflecting opaque alignment processes. For example, all models prioritized universalism and self-direction, while de-emphasizing achievement, power, and security relative to humans. Successful discriminant analysis differentiated the 4 LLMs' distinct value profiles. Further examination found that the biased value profiles strongly predicted the LLMs' responses when presented with mental health dilemmas requiring a choice between opposing values. This provided further validation that the models embed distinct motivational value-like constructs that shape their decision-making.
    CONCLUSIONS: This study leveraged the STBV to map the motivational value-like infrastructure underpinning leading LLMs. Although the study demonstrated the STBV can effectively characterize value-like infrastructure within LLMs, substantial divergence from human values raises ethical concerns about aligning these models with mental health applications. The biases toward certain cultural value sets pose risks if integrated without proper safeguards. For example, prioritizing universalism could promote unconditional acceptance even when clinically unwise. Furthermore, the differences between the LLMs underscore the need to standardize alignment processes to capture true cultural diversity. Thus, any responsible integration of LLMs into mental health care must account for their embedded biases and motivation mismatches to ensure equitable delivery across diverse populations. Achieving this will require transparency and refinement of alignment techniques to instill comprehensive human values.
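    The discriminant analysis step described above can be sketched with scikit-learn's LinearDiscriminantAnalysis; the synthetic value-profile data below stand in for the models' PVQ-RR scores and are purely illustrative.
```python
# Sketch: separating models' value profiles with linear discriminant analysis,
# analogous to the discriminant analysis described above. Data are synthetic.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
models = ["Bard", "Claude 2", "GPT-3.5", "GPT-4"]
# 10 trials per model x 4 value-like dimensions (e.g., universalism, power, ...)
X = np.vstack([rng.normal(loc=i, scale=0.5, size=(10, 4)) for i in range(4)])
y = np.repeat(models, 10)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(f"Training classification accuracy: {lda.score(X, y):.0%}")
```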

  • Article type: Journal Article
    OBJECTIVE: Since the beginning of 2023, ChatGPT has emerged as a hot topic in healthcare research. Its potential to be a valuable tool in clinical practice is compelling, particularly in improving clinical decision support by helping physicians to make clinical decisions based on the best medical knowledge available. We aim to investigate ChatGPT's ability to identify, diagnose and manage patients with otorhinolaryngology-related symptoms.
    METHODS: A prospective, cross-sectional study was designed based on an idea suggested by ChatGPT to assess the level of agreement between ChatGPT and five otorhinolaryngologists (ENTs) in 20 reality-inspired clinical cases. The clinical cases were presented to the chatbot on two different occasions (ChatGPT-1 and ChatGPT-2) to assess its temporal stability.
    RESULTS: The mean score of ChatGPT-1 was 4.4 (SD 1.2; min 1, max 5) and of ChatGPT-2 was 4.15 (SD 1.3; min 1, max 5), while the ENTs' mean score was 4.91 (SD 0.3; min 3, max 5). The Mann-Whitney U test revealed a statistically significant difference (p < 0.001) between the ChatGPT and ENT scores. ChatGPT-1 and ChatGPT-2 gave different answers on five occasions.
    CONCLUSIONS: Artificial intelligence will be an important instrument in clinical decision-making in the near future, and ChatGPT is the most promising chatbot so far. Although further development is needed before it can be used safely, there is room for improvement and potential to aid otorhinolaryngology residents and specialists in making the most correct decisions for their patients.
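    The group comparison reported above can be sketched with SciPy's Mann-Whitney U test; the rating lists are illustrative placeholders, not the study's data.
```python
# Sketch: Mann-Whitney U test comparing chatbot ratings with specialist
# ratings, as in the analysis above. Scores are illustrative placeholders.
from scipy.stats import mannwhitneyu

chatgpt_scores = [5, 4, 5, 3, 5, 4, 5, 2, 5, 5]
ent_scores     = [5, 5, 5, 5, 5, 4, 5, 5, 5, 5]

stat, p = mannwhitneyu(chatgpt_scores, ent_scores, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```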

  • Article type: Journal Article
    BACKGROUND: The purpose of this study was to evaluate the efficacy of an Artificial Intelligence Large Language Model (AI-LLM) at improving the readability of foot and ankle orthopedic radiology reports.
    METHODS: The radiology reports from 100 foot or ankle X-Rays, 100 computed tomography (CT) scans and 100 magnetic resonance imaging (MRI) scans were randomly sampled from the institution's database. The following prompt command was inserted into the AI-LLM: "Explain this radiology report to a patient in layman's terms in the second person: [Report Text]". The mean report length, Flesch reading ease score (FRES) and Flesch-Kincaid reading level (FKRL) were evaluated for both the original radiology report and the AI-LLM generated report. The accuracy of the information contained within the AI-LLM report was assessed via a 5-point Likert scale. Additionally, any "hallucinations" generated by the AI-LLM report were recorded.
    RESULTS: There was a statistically significant improvement in mean FRES scores in the AI-LLM generated X-Ray report (33.8 ± 6.8 to 72.7 ± 5.4), CT report (27.8 ± 4.6 to 67.5 ± 4.9) and MRI report (20.3 ± 7.2 to 66.9 ± 3.9), all p < 0.001. There was also a statistically significant improvement in mean FKRL scores in the AI-LLM generated X-Ray report (12.2 ± 1.1 to 8.5 ± 0.4), CT report (15.4 ± 2.0 to 8.4 ± 0.6) and MRI report (14.1 ± 1.6 to 8.5 ± 0.5), all p < 0.001. Superior FRES scores were observed in the AI-LLM generated X-Ray report compared to the AI-LLM generated CT report and MRI report, p < 0.001. The mean Likert score for the AI-LLM generated X-Ray report, CT report and MRI report was 4.0 ± 0.3, 3.9 ± 0.4, and 3.9 ± 0.4, respectively. The rate of hallucinations in the AI-LLM generated X-Ray report, CT report and MRI report was 4%, 7% and 6%, respectively.
    CONCLUSIONS: AI-LLM was an efficacious tool for improving the readability of foot and ankle radiological reports across multiple imaging modalities. Superior FRES scores together with superior Likert scores were observed in the X-Ray AI-LLM reports compared to the CT and MRI AI-LLM reports. This study demonstrates the potential use of AI-LLMs as a new patient-centric approach for enhancing patient understanding of their foot and ankle radiology reports. Jel Classifications: IV.
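    The FRES and FKRL metrics used above can be computed with the textstat package, as in this minimal sketch; the report excerpts are invented examples, not reports from the study.
```python
# Sketch: Flesch reading ease (FRES) and Flesch-Kincaid grade level (FKRL)
# for a report before and after simplification, via the textstat package.
# Both text snippets are invented illustrations.
import textstat

original = (
    "Osseous structures demonstrate no acute fracture or dislocation. "
    "Mild tibiotalar joint space narrowing with marginal osteophytosis."
)
simplified = (
    "Your bones show no new breaks. There is mild wear in your ankle joint, "
    "with small bone spurs at its edges."
)

for label, text in [("original", original), ("simplified", simplified)]:
    fres = textstat.flesch_reading_ease(text)
    fkrl = textstat.flesch_kincaid_grade(text)
    print(f"{label}: FRES {fres:.1f}, FKRL {fkrl:.1f}")
```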
