BACKGROUND: Amyloidosis, a rare multisystem condition, often requires complex, multidisciplinary care. Its low prevalence underscores the importance of efforts to ensure the availability of high-quality patient education materials for better outcomes. ChatGPT (OpenAI) is a large language model powered by artificial intelligence that offers a potential avenue for disseminating accurate, reliable, and accessible educational resources for both patients and providers. Its user-friendly interface, engaging conversational responses, and the capability for users to ask follow-up questions make it a promising future tool for delivering accurate and tailored information to patients.
OBJECTIVE: We performed a multidisciplinary assessment of the accuracy, reproducibility, and readability of ChatGPT in answering questions related to amyloidosis.
METHODS: In total, 98 amyloidosis questions related to cardiology, gastroenterology, and neurology were curated from medical societies, institutions, and amyloidosis Facebook support groups and entered into ChatGPT-3.5 and ChatGPT-4. Cardiology- and gastroenterology-related responses were independently graded by a board-certified cardiologist and gastroenterologist, respectively, each specializing in amyloidosis. These 2 reviewers (RG and DCK) also graded the general questions, with disagreements resolved through discussion. Neurology-related responses were graded by a board-certified neurologist (AAH) who specializes in amyloidosis. Reviewers used the following grading scale: (1) comprehensive, (2) correct but inadequate, (3) some correct and some incorrect, and (4) completely incorrect. Questions were stratified by category for further analysis. Reproducibility was assessed by entering each question into each model twice. The readability of ChatGPT-4 responses was also evaluated using the Textstat library in Python (Python Software Foundation) and the Textstat readability package in R (R Foundation for Statistical Computing).
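To illustrate the kind of readability computation involved, the following is a minimal Python sketch using the Textstat library. The sample text and the choice of indices are assumptions for illustration; the abstract does not specify which Textstat indices the study applied.

```python
# Minimal sketch of a per-response readability check with the Textstat
# library (pip install textstat). The sample text and choice of indices
# are illustrative assumptions, not the study's actual script.
import textstat

# Hypothetical ChatGPT response text.
response = (
    "Amyloidosis is a group of rare diseases in which misfolded proteins "
    "accumulate in organs such as the heart, nerves, and gastrointestinal "
    "tract, progressively impairing their function."
)

# Flesch-Kincaid grade approximates the US school grade needed to read
# the text; the abstract reports a mean grade of 15.5 across responses.
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(response))

# text_standard returns a consensus estimate across several indices.
print("Consensus grade:", textstat.text_standard(response))
```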
RESULTS: ChatGPT-4 (n=98) provided 93 (95%) responses with accurate information, and 82 (84%) were comprehensive. ChatGPT-3.5 (n=83) provided 74 (89%) responses with accurate information, and 66 (79%) were comprehensive. When examined by question category, ChatGPT-4 and ChatGPT-3.5 provided 53 (95%) and 48 (86%) comprehensive responses, respectively, to "general questions" (n=56). When examined by subject, ChatGPT-4 and ChatGPT-3.5 performed best on cardiology questions (n=12), with both models producing 10 (83%) comprehensive responses. For gastroenterology (n=15), ChatGPT-4 received comprehensive grades for 9 (60%) responses and ChatGPT-3.5 for 8 (53%). Overall, 96 of 98 (98%) responses for ChatGPT-4 and 73 of 83 (88%) for ChatGPT-3.5 were reproducible. The readability of ChatGPT-4's responses ranged from 10th grade to beyond the US graduate level, with an average grade level of 15.5 (SD 1.9).
CONCLUSIONS: Large language models are a promising tool for delivering accurate and reliable health information to patients living with amyloidosis. However, ChatGPT's responses exceeded the American Medical Association's recommended fifth- to sixth-grade reading level. Future studies focused on improving response accuracy and readability are warranted. Before widespread adoption, the technology's limitations and ethical implications must be further explored to ensure patient safety and equitable implementation.