BACKGROUND: Artificial intelligence (AI) symptom checker models should be trained using real-world patient data to improve their diagnostic accuracy. Given that AI-based symptom checkers are currently used in clinical practice, their performance should improve over time. However, longitudinal evaluations of the diagnostic accuracy of these symptom checkers are limited.
OBJECTIVE: This study aimed to assess the longitudinal changes in the accuracy of differential diagnosis lists created by an AI-based symptom checker used in the real world.
METHODS: This was a single-center, retrospective, observational study. Patients who visited an outpatient clinic without an appointment between May 1, 2019, and April 30, 2022, and who were admitted to a community hospital in Japan within 30 days of their index visit were considered eligible. We included only patients who underwent an AI-based symptom check at the index visit and whose final diagnosis was confirmed during follow-up. Final diagnoses were categorized as common or uncommon, and all cases were categorized as typical or atypical. The primary outcome measure was the accuracy of the differential diagnosis list created by the AI-based symptom checker, defined as the inclusion of the final diagnosis in the list of 10 differential diagnoses created by the symptom checker. To assess the change in the symptom checker's diagnostic accuracy over 3 years, we used a chi-square test to compare the primary outcome over 3 periods: from May 1, 2019, to April 30, 2020 (first year); from May 1, 2020, to April 30, 2021 (second year); and from May 1, 2021, to April 30, 2022 (third year).
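The year-over-year comparison can be reproduced from the counts reported in the Results. The following sketch assumes a standard Pearson chi-square test of independence on the 2x3 table of accurate versus inaccurate lists per year; it uses only the yearly counts stated in the abstract.

```python
import math

# Reported top-10 diagnostic accuracy per study year (accurate / total cases).
hits = [97, 32, 43]      # first, second, third year
totals = [219, 72, 90]
misses = [t - h for h, t in zip(hits, totals)]

# Pearson chi-square statistic on the 2x3 contingency table.
n = sum(totals)
row_totals = [sum(hits), sum(misses)]
chi2 = 0.0
for row, row_total in zip((hits, misses), row_totals):
    for observed, col_total in zip(row, totals):
        expected = row_total * col_total / n
        chi2 += (observed - expected) ** 2 / expected

# With 2 degrees of freedom, the chi-square survival function is exp(-x/2).
p = math.exp(-chi2 / 2)
print(f"chi2 = {chi2:.3f}, P = {p:.2f}")  # P = 0.85, matching the abstract
```

The resulting P value of .85 agrees with the value reported in the Results, supporting the conclusion that accuracy did not change across the 3 years.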
RESULTS: A total of 381 patients were included. Common diseases comprised 257 (67.5%) cases, and typical presentations were observed in 298 (78.2%) cases. Overall, the final diagnosis was included in the differential diagnosis list created by the AI-based symptom checker in 172 of 381 cases (45.1%), and this accuracy did not differ across the 3 years (first year: 97/219, 44.3%; second year: 32/72, 44.4%; third year: 43/90, 47.7%; P=.85). The accuracy of the differential diagnosis list created by the symptom checker was low in those with uncommon diseases (30/124, 24.2%) and atypical presentations (12/83, 14.5%). In the multivariate logistic regression model, common disease (P<.001; odds ratio 4.13, 95% CI 2.50-6.98) and typical presentation (P<.001; odds ratio 6.92, 95% CI 3.62-14.2) were significantly associated with the accuracy of the differential diagnosis list created by the symptom checker.
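The reported odds ratios come from a multivariate model, but a crude (unadjusted) odds ratio for common versus uncommon disease can be reconstructed from the abstract's counts: 172 accurate lists overall minus 30 among the 124 uncommon cases leaves 142 accurate lists among the 257 common cases. The sketch below computes this crude odds ratio with a 95% Wald confidence interval; it will differ somewhat from the adjusted value of 4.13 because the published model also adjusts for typicality of presentation.

```python
import math

# 2x2 table reconstructed from the reported counts:
# common disease: 142/257 accurate; uncommon disease: 30/124 accurate.
a, b = 142, 257 - 142   # common: accurate, inaccurate
c, d = 30, 124 - 30     # uncommon: accurate, inaccurate

# Crude odds ratio with a 95% Wald CI on the log-odds scale.
or_crude = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo = math.exp(math.log(or_crude) - 1.96 * se_log_or)
hi = math.exp(math.log(or_crude) + 1.96 * se_log_or)
print(f"crude OR = {or_crude:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

This yields a crude odds ratio of about 3.87 (95% CI roughly 2.40-6.25), close to but slightly below the adjusted estimate of 4.13 (95% CI 2.50-6.98) reported in the abstract.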
CONCLUSIONS: A 3-year longitudinal survey of the diagnostic accuracy of differential diagnosis lists developed by an AI-based symptom checker, which has been implemented in real-world clinical practice settings, showed no improvement over time. Uncommon diseases and atypical presentations were independently associated with a lower diagnostic accuracy. In the future, symptom checkers should be trained to recognize uncommon conditions.