Keywords: AI, ChatGPT, Claude, LLM, artificial intelligence, dermatologist, large language model

Source: DOI:10.2196/59273   PDF (PubMed)

Abstract:
BACKGROUND: Recent advancements in artificial intelligence (AI) and large language models (LLMs) have shown potential in medical fields, including dermatology. With the introduction of image analysis capabilities in LLMs, their application in dermatological diagnostics has garnered significant interest. These capabilities are enabled by the integration of computer vision techniques into the underlying architecture of LLMs.
OBJECTIVE: This study aimed to compare the diagnostic performance of Claude 3 Opus and ChatGPT with GPT-4 in analyzing dermoscopic images for melanoma detection, providing insights into their strengths and limitations.
METHODS: We randomly selected 100 histopathology-confirmed dermoscopic images (50 malignant, 50 benign) from the International Skin Imaging Collaboration (ISIC) archive using a computer-generated randomization process. The ISIC archive was chosen due to its comprehensive and well-annotated collection of dermoscopic images, ensuring a diverse and representative sample. Images were included if they were dermoscopic images of melanocytic lesions with histopathologically confirmed diagnoses. Each model was given the same prompt, instructing it to provide the top 3 differential diagnoses for each image, ranked by likelihood. Primary diagnosis accuracy, accuracy of the top 3 differential diagnoses, and malignancy discrimination ability were assessed. The McNemar test was chosen to compare the diagnostic performance of the 2 models, as it is suitable for analyzing paired nominal data.
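Because both models graded the same 100 images, the per-image results form paired nominal data, which is exactly the setting the McNemar test handles: it looks only at the discordant pairs (images one model got right and the other got wrong). A minimal sketch of the chi-square variant with continuity correction, using hypothetical discordant counts since the abstract reports only P values, not the underlying pairings:

```python
import math

def mcnemar(b: int, c: int) -> tuple[float, float]:
    """McNemar test (chi-square form, continuity correction).

    b: images model A diagnosed correctly and model B did not.
    c: images model B diagnosed correctly and model A did not.
    Returns (chi-square statistic, two-sided P value).
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical discordant counts, for illustration only.
stat, p = mcnemar(10, 3)
print(f"statistic={stat:.3f}, P={p:.3f}")
```

For small discordant counts an exact binomial version of the test is usually preferred; the chi-square form shown here is the common large-sample approximation.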
RESULTS: In the primary diagnosis, Claude 3 Opus achieved 54.9% sensitivity (95% CI 44.08%-65.37%), 57.14% specificity (95% CI 46.31%-67.46%), and 56% accuracy (95% CI 46.22%-65.42%), while ChatGPT demonstrated 56.86% sensitivity (95% CI 45.99%-67.21%), 38.78% specificity (95% CI 28.77%-49.59%), and 48% accuracy (95% CI 38.37%-57.75%). The McNemar test showed no significant difference between the 2 models (P=.17). For the top 3 differential diagnoses, Claude 3 Opus and ChatGPT included the correct diagnosis in 76% (95% CI 66.33%-83.77%) and 78% (95% CI 68.46%-85.45%) of cases, respectively. The McNemar test showed no significant difference (P=.56). In malignancy discrimination, Claude 3 Opus outperformed ChatGPT with 47.06% sensitivity, 81.63% specificity, and 64% accuracy, compared to 45.1%, 42.86%, and 44%, respectively. The McNemar test showed a significant difference (P<.001). Claude 3 Opus had an odds ratio of 3.951 (95% CI 1.685-9.263) in discriminating malignancy, while ChatGPT had an odds ratio of 0.616 (95% CI 0.297-1.278).
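The reported odds ratio follows directly from the 2×2 malignancy-discrimination table. The abstract gives only rates, but the percentages are consistent with 51 malignant and 49 benign lesions, so the counts below (TP=24, FN=27, TN=40, FP=9 for Claude 3 Opus) are reconstructed assumptions that reproduce the reported figures, not values taken from the paper:

```python
def diagnostic_stats(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity, accuracy, and odds ratio from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        # Odds ratio: odds of a "malignant" call for malignant vs benign lesions.
        "odds_ratio": (tp * tn) / (fp * fn),
    }

# Counts reconstructed from the reported percentages (an assumption):
# 24/51 malignant lesions flagged, 40/49 benign lesions correctly cleared.
claude = diagnostic_stats(tp=24, fn=27, tn=40, fp=9)
print({k: round(v, 4) for k, v in claude.items()})
# odds_ratio rounds to 3.951, matching the reported value.
```

An odds ratio above 1 means the model calls malignant lesions "malignant" more readily than benign ones; ChatGPT's ratio of 0.616 (CI crossing 1) indicates no useful discrimination in that task.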
CONCLUSIONS: Our study highlights the potential of LLMs in assisting dermatologists but also reveals their limitations. Both models made errors in diagnosing melanoma and benign lesions. These findings underscore the need for developing robust, transparent, and clinically validated AI models through collaborative efforts between AI researchers, dermatologists, and other health care professionals. While AI can provide valuable insights, it cannot yet replace the expertise of trained clinicians.