BACKGROUND: Under- or late identification of pulmonary embolism (PE), a thrombosis of 1 or more pulmonary arteries that seriously threatens patients' lives, is a major challenge confronting modern medicine.
OBJECTIVE: We aimed to establish accurate and informative machine learning (ML) models to identify patients at high risk for PE as they are admitted to the hospital, before their initial clinical checkup, by using only the information in their medical records.
METHODS: We collected demographic, comorbidity, and medication data for 2568 patients with PE and 52,598 control patients. We focused on data available prior to emergency department admission, as these are the most universally accessible data. We trained an ML random forest algorithm to detect PE at the earliest possible time during a patient's hospitalization, that is, at the time of admission. We developed and applied 2 ML-based methods specifically to address the data imbalance between PE and non-PE patients, which causes misdiagnosis of PE.
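The abstract does not name the 2 imbalance-handling methods. Purely as a generic illustration, and not as the authors' approach, the following minimal sketch shows random undersampling of the majority class, one common way to balance a cohort of this shape. The function name `undersample_majority`, the record layout, and the label key `pe` are hypothetical; only the cohort sizes come from the text.

```python
import random

def undersample_majority(records, label_key="pe", seed=0):
    """Randomly drop majority-class records until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    # Order the two classes by size so the smaller one is kept in full.
    minority, majority = sorted((positives, negatives), key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Synthetic stand-in for the cohort: 2568 PE patients and 52,598 controls.
cohort = [{"id": i, "pe": 1} for i in range(2568)]
cohort += [{"id": 2568 + i, "pe": 0} for i in range(52598)]

balanced = undersample_majority(cohort)
n_pe = sum(r["pe"] for r in balanced)
print(len(balanced), n_pe)  # 5136 2568
```

Undersampling discards most control records, so it trades information for balance; class weighting or oversampling are alternatives with different trade-offs.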
RESULTS: The resulting models predicted PE based on age, sex, BMI, past clinical PE events, chronic lung disease, past thrombotic events, and usage of anticoagulants, obtaining an 80% geometric mean value for the PE and non-PE classification accuracies. Although on hospital admission only 4% (1942/46,639) of the patients had a diagnosis of PE, we identified 2 clustering schemes comprising subgroups with more than 61% (705/1120 in clustering scheme 1; 427/701 and 340/549 in clustering scheme 2) of patients positive for PE. One subgroup in the first clustering scheme included 36% (705/1942) of all patients with PE and was characterized by a definite past PE diagnosis, a 6-fold higher prevalence of deep vein thrombosis, and a 3-fold higher prevalence of pneumonia, compared with patients of the other subgroups in this scheme. In the second clustering scheme, 2 subgroups (1 of only men and 1 of only women) included patients who all had a past PE diagnosis and a relatively high prevalence of pneumonia, and a third subgroup included only those patients with a past diagnosis of pneumonia.
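The geometric mean of the per-class accuracies reported above is the square root of sensitivity times specificity. The sketch below, with a hypothetical helper `g_mean` and hypothetical confusion-matrix counts in the first example, suggests why this metric suits an imbalanced cohort: a classifier that never predicts PE would score about 96% plain accuracy at the reported 4% admission prevalence (1942/46,639), yet its geometric mean is 0.

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (PE accuracy) and specificity (non-PE accuracy)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# Hypothetical counts giving 80% accuracy in each class.
example = g_mean(tp=80, fn=20, tn=800, fp=200)

# A classifier that labels every admitted patient as non-PE,
# using the admission counts from the text (1942 PE among 46,639).
naive_accuracy = (46639 - 1942) / 46639
naive_g_mean = g_mean(tp=0, fn=1942, tn=46639 - 1942, fp=0)

print(round(example, 3), round(naive_accuracy, 3), naive_g_mean)
```

Because the geometric mean collapses to 0 whenever either class is entirely missed, it cannot be inflated by the majority class alone.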
CONCLUSIONS: This study established an ML tool for early diagnosis of PE almost immediately upon hospital admission. Despite the highly imbalanced scenario, which undermines accurate PE prediction, and despite using only information available from the patient's medical history, our models were both accurate and informative, enabling the identification of patients already at high risk for PE upon hospital admission, even before the initial clinical checkup was performed. The fact that we did not restrict our patients to those at high risk for PE according to previously published scales (eg, Wells or revised Geneva scores) enabled us to accurately assess the application of ML on raw medical data and identify new, previously unidentified risk factors for PE, such as previous pulmonary disease, in general populations.