Keywords: Data Science; Data Visualization; Electronic Health Records; Machine Learning

MeSH: Humans; Unsupervised Machine Learning; Child; Electronic Health Records; Child, Preschool; Infant; Adolescent; Cluster Analysis; Infant, Newborn; Male; Female; Age Factors

Source: DOI:10.1136/bmjhci-2023-100963

Abstract:
BACKGROUND: Despite the increasing availability of electronic healthcare record (EHR) data and the wide availability of plug-and-play machine learning (ML) application programming interfaces, the adoption of data-driven decision-making within routine hospital workflows has so far remained limited. Through the lens of deriving clusters of diagnoses by age, this study investigated the types of ML analysis that can be performed using EHR data and how results could be communicated to lay stakeholders.
METHODS: After preprocessing, observational EHR data from a tertiary paediatric hospital, containing 61 522 unique patients and 3315 unique ICD-10 diagnosis codes, were used. K-means clustering was applied to identify age distributions of patient diagnoses. The final model was selected using quantitative metrics and expert assessment of the clinical validity of the clusters. Additionally, uncertainty over preprocessing decisions was analysed.
RESULTS: Four age clusters of diseases were identified, broadly aligning to ages 0 to 1, 1 to 5, 5 to 13, and 13 to 18. Diagnoses within the clusters aligned with existing knowledge regarding the propensity of presentation at different ages, and sequential clusters reflected known disease progressions. The results validated similar methodologies within the literature. The impact of uncertainty induced by preprocessing decisions was large at the level of individual diagnoses but not at the population level. Strategies for mitigating or communicating this uncertainty were successfully demonstrated.
CONCLUSIONS: Unsupervised ML applied to EHR data identifies clinically relevant age distributions of diagnoses, which can augment existing decision making. However, biases within healthcare datasets dramatically impact results if not appropriately mitigated or communicated.
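ILLUSTRATION: The abstract does not give implementation details, so the following is only a minimal sketch of how diagnoses might be clustered by age distribution with k-means. Everything in it is an assumption made for illustration: the diagnosis labels (`DX_A` to `DX_D`) and ages are synthetic, each diagnosis is represented by a normalised histogram of patient ages in whole years, the initialisation is a simple farthest-first rule, and `k=2` is used only because the toy data set is tiny (the study reports four clusters).

```python
def age_histogram(ages, n_bins=18):
    """Normalised histogram of patient ages in whole years (0 .. n_bins - 1)."""
    counts = [0] * n_bins
    for age in ages:
        counts[min(int(age), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, n_iter=50):
    """Plain k-means with deterministic farthest-first initialisation
    (an illustrative choice, not taken from the study)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Synthetic ages at diagnosis (hypothetical labels, not real ICD-10 codes).
diagnosis_ages = {
    "DX_A": [0, 0, 0, 1, 0, 0, 1, 0],
    "DX_B": [0, 1, 0, 0, 0, 0, 0, 1],
    "DX_C": [15, 16, 14, 17, 15, 16],
    "DX_D": [14, 15, 17, 16, 16, 13],
}

names = list(diagnosis_ages)
features = [age_histogram(diagnosis_ages[name]) for name in names]
cluster_ids, centroids = kmeans(features, k=2)
labels = dict(zip(names, cluster_ids))

for name in names:
    print(name, "-> cluster", labels[name])
```

In practice, the number of clusters would be chosen using quantitative metrics together with expert review, as the METHODS describe, and the clustering would be repeated under alternative preprocessing choices to gauge how stable each diagnosis's assignment is.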