Keywords: Historical epidemiology; Infectious diseases; Machine learning; Natural language processing; Outbreaks; Plague

MeSH: Natural Language Processing; Algorithms

Source: DOI:10.1016/j.epidem.2022.100656

Abstract:
Pandemic diseases such as plague have produced a vast body of literature containing information about their spatiotemporal extent, transmission, and countermeasures. However, manually extracting such information from running text is a tedious process, and much of this information remains locked in a narrative format. Natural language processing (NLP) is a promising tool for the automated extraction of epidemiological data and can facilitate the establishment of datasets. In this paper, we explore the utility of NLP to assist in the creation of a plague outbreak dataset. We produced a gold-standard list of toponyms by manually annotating a German plague treatise published by Sticker in 1908. We compared the performance of five pre-trained NLP libraries (Google, Stanford CoreNLP, spaCy, germaNER, and Geoparser) for the automated extraction of location data against this gold standard. Of all tested algorithms, spaCy performed best (sensitivity 0.92, F1 score 0.83), followed closely by Stanford CoreNLP (sensitivity 0.81, F1 score 0.87). Google NLP performed slightly worse (F1 score 0.72, sensitivity 0.78), while Geoparser and germaNER showed poor sensitivity (0.41 and 0.61, respectively). We then evaluated how accurately automated geocoding services (Google geocoding, Geonames, and Geoparser) located these outbreaks. All geocoding services performed poorly, particularly for historical regions, and returned correct GIS information in only 60.4%, 52.7%, and 33.8% of all cases, respectively. Finally, we compared our newly digitized plague dataset to a re-digitized version of the plague treatise by Biraben and provide an update of the spatiotemporal extent of the second-pandemic plague outbreaks. We conclude that NLP tools have their limitations, but they are potentially useful for accelerating the collection of data and the generation of a global plague outbreak database.
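Since the F1 score is the harmonic mean of precision and recall (sensitivity), the abstract's paired figures also fix each library's implied precision. A minimal sketch in pure Python, using only values quoted in the abstract (the helper name `precision_from_f1` is my own):

```python
def precision_from_f1(f1: float, recall: float) -> float:
    """Solve F1 = 2*P*R / (P + R) for precision P, given F1 and recall R."""
    return f1 * recall / (2 * recall - f1)

# Implied precision for the two best-performing libraries (figures from the abstract):
spacy_precision = precision_from_f1(0.83, 0.92)    # spaCy: sensitivity 0.92, F1 0.83
corenlp_precision = precision_from_f1(0.87, 0.81)  # Stanford CoreNLP: sensitivity 0.81, F1 0.87
print(round(spacy_precision, 3), round(corenlp_precision, 3))
```

This makes the trade-off reported above explicit: spaCy attains the higher recall of the two, whereas Stanford CoreNLP's higher F1 at lower recall implies a markedly higher precision.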