Keywords: BERT model; COVID-19; EHR; NLP; disease identification; electronic health records; model development; multidisciplinary; natural language processing; prediction; primary care; public health

MeSH: Humans; Electronic Health Records; Natural Language Processing; Pandemics; COVID-19 / diagnosis, epidemiology; General Practice

Source: DOI:10.2196/49944

Abstract:
Background: Natural language processing (NLP) models such as bidirectional encoder representations from transformers (BERT) hold promise for revolutionizing disease identification from electronic health records (EHRs) by potentially enhancing efficiency and accuracy. However, their application in practice settings demands a comprehensive and multidisciplinary approach to development and validation. The COVID-19 pandemic highlighted how limited testing availability and the difficulty of handling unstructured data hamper disease identification. In the Netherlands, where general practitioners (GPs) serve as the first point of contact for health care, the EHRs generated by these primary care providers contain a wealth of potentially valuable information. Nonetheless, the unstructured nature of free-text entries in EHRs makes it difficult to identify trends, detect disease outbreaks, or accurately pinpoint COVID-19 cases.
Objective: This study aims to develop and validate a BERT model for detecting COVID-19 consultations in general practice EHRs in the Netherlands.
Methods: The BERT model was initially pretrained on Dutch language data and then fine-tuned on a comprehensive EHR data set comprising confirmed COVID-19 GP consultations and non-COVID-19-related consultations. The data set was partitioned into training and development sets, and the model's performance was evaluated on an independent test set, which served as the primary measure of its effectiveness in COVID-19 detection. The final model was then validated through 3 approaches. First, external validation was performed on an EHR data set from a different geographic region in the Netherlands. Second, validation was conducted using polymerase chain reaction (PCR) test results obtained from municipal health services. Lastly, the correlation between predicted outcomes and COVID-19-related hospitalizations in the Netherlands was assessed for the period around the outbreak of the pandemic in the Netherlands, that is, the period before widespread testing.
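The partitioning into training, development, and test sets described in the methods can be sketched as a stratified split, which keeps the share of COVID-19 consultations similar in every partition. This is only an illustration under stated assumptions: the abstract does not report the split ratios or whether the split was stratified, so the function name `stratified_split`, the 10% development and test fractions, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, dev_frac=0.1, test_frac=0.1, seed=0):
    """Return (train, dev, test) index lists that preserve label proportions.

    The fractions are illustrative assumptions, not values from the study.
    """
    # Group record indices by their class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, dev, test = [], [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        n_dev = round(len(idxs) * dev_frac)
        # Slice each class separately so every partition gets its share.
        test.extend(idxs[:n_test])
        dev.extend(idxs[n_test:n_test + n_dev])
        train.extend(idxs[n_test + n_dev:])
    return train, dev, test


# Toy labels: 1 = confirmed COVID-19 consultation, 0 = other consultation.
labels = [1] * 20 + [0] * 80
train, dev, test = stratified_split(labels)
```

Holding the test indices out before any fine-tuning is what allows the test set to act as an independent measure of performance.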
Results: Model development used 300,359 GP consultations. We developed a highly accurate model for COVID-19 consultations (accuracy 0.97, F1-score 0.90, precision 0.85, recall 0.85, specificity 0.99). External validation showed comparably high performance. Validation on PCR test data showed high recall but low precision and specificity. Validation using hospital data showed a significant correlation between the model's COVID-19 predictions and COVID-19-related hospitalizations (F statistic 96.8; P<.001; R²=0.69). Most importantly, the model was able to predict COVID-19 cases weeks before the first confirmed case in the Netherlands.
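For readers less familiar with the reported measures, the sketch below shows how accuracy, precision, recall, specificity, and F1-score follow from confusion-matrix counts, and how R² can be obtained for a simple linear relationship between two series, such as predicted consultations and hospitalizations per time period. The function names and all numbers are hypothetical toy values chosen for illustration; they are not the study's data, and the abstract does not describe the authors' exact statistical procedure.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of true positives that are found
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),  # share of true negatives that are found
        "f1": 2 * precision * recall / (precision + recall),
    }


def r_squared(x, y):
    """R^2 of a simple linear fit, i.e. the squared Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Toy confusion-matrix counts (hypothetical, not from the study).
metrics = classification_metrics(tp=8, fp=2, tn=88, fn=2)

# Toy series (hypothetical): predicted consultations vs hospitalizations.
fit = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
```

Because negatives typically far outnumber positives in consultation data, accuracy and specificity can stay high even when precision drops, which is why the abstract reports all of these measures side by side.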
Conclusions: The developed BERT model accurately identified COVID-19 cases among GP consultations, even before the first confirmed cases. The validated efficacy of our BERT model highlights the potential of NLP models to identify disease outbreaks early and exemplifies the power of multidisciplinary efforts in harnessing technology for disease identification. Moreover, the implications of this study extend beyond COVID-19 and offer a blueprint for the early recognition of various illnesses, suggesting that such models could revolutionize disease surveillance.