Keywords: Alzheimer’s disease; clinical notes; electronic health records; information extraction; natural language processing; sleep

Source: DOI:10.1093/jamia/ocae177

Abstract:
OBJECTIVE: Alzheimer's disease (AD) is the most common form of dementia in the United States. Sleep is one of the lifestyle-related factors that has been shown to be critical for optimal cognitive function in old age. However, there is a lack of research studying the association between sleep and AD incidence. A major bottleneck for conducting such research is that the traditional way to acquire sleep information is time-consuming, inefficient, non-scalable, and limited to patients' subjective experience. We aim to automate the extraction of specific sleep-related patterns, such as snoring, napping, poor sleep quality, daytime sleepiness, night wakings, other sleep problems, and sleep duration, from clinical notes of AD patients. These sleep patterns are hypothesized to play a role in the incidence of AD, providing insight into the relationship between sleep and AD onset and progression.
METHODS: A gold standard dataset was created by manually annotating 570 randomly sampled clinical note documents from adSLEEP, a corpus of 192 000 de-identified clinical notes of 7266 AD patients retrieved from the University of Pittsburgh Medical Center (UPMC). We developed a rule-based natural language processing (NLP) algorithm, machine learning models, and large language model (LLM)-based NLP algorithms to automate the extraction of sleep-related concepts, including snoring, napping, sleep problem, bad sleep quality, daytime sleepiness, night wakings, and sleep duration, from the gold standard dataset.
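To make the rule-based approach concrete, the following is a minimal sketch of keyword-and-pattern extraction with a naive negation check. It is not the study's algorithm: the patterns, the concept keys, the 40-character negation window, and the function name `extract_sleep_concepts` are all assumptions chosen for illustration, and only a subset of the seven concepts is covered.

```python
import re

# Hypothetical patterns for illustration only; the study's actual rules
# are not given in the abstract.
CONCEPT_PATTERNS = {
    "snoring": re.compile(r"\bsnor(?:e|es|ed|ing)\b", re.IGNORECASE),
    "napping": re.compile(r"\bnap(?:s|ped|ping)?\b", re.IGNORECASE),
    "daytime_sleepiness": re.compile(
        r"\b(?:daytime sleepiness|sleepy during the day)\b", re.IGNORECASE
    ),
    "night_wakings": re.compile(
        r"\b(?:wakes?|waking|woke) up (?:at|during the) night\b", re.IGNORECASE
    ),
}

# Sleep duration is captured as a number of hours, e.g. "sleeps about 6 hours".
DURATION_PATTERN = re.compile(
    r"\bsleeps?\s+(?:about\s+|around\s+)?(\d+(?:\.\d+)?)\s*(?:hours?|hrs?)\b",
    re.IGNORECASE,
)

# Naive negation cue: a cue word shortly before the mention, with no
# sentence-ending period in between.
NEGATION_PATTERN = re.compile(
    r"\b(?:no|not|denies|denied|without)\b[^.]{0,30}$", re.IGNORECASE
)


def extract_sleep_concepts(note: str) -> dict:
    """Return non-negated sleep concept mentions found in a clinical note."""
    found = {}
    for concept, pattern in CONCEPT_PATTERNS.items():
        mentions = []
        for match in pattern.finditer(note):
            # Look at a short window of text before the mention.
            preceding = note[max(0, match.start() - 40):match.start()]
            if NEGATION_PATTERN.search(preceding):
                continue  # skip mentions such as "denies daytime sleepiness"
            mentions.append(match.group(0))
        if mentions:
            found[concept] = mentions

    durations = [float(m.group(1)) for m in DURATION_PATTERN.finditer(note)]
    if durations:
        found["sleep_duration_hours"] = durations
    return found


example_note = (
    "Patient snores loudly and naps daily. "
    "Denies daytime sleepiness. Sleeps about 6 hours per night."
)
result = extract_sleep_concepts(example_note)
```

In this example the snoring, napping, and sleep duration mentions are returned, while the negated daytime sleepiness mention is dropped. A production system would likely need section detection, richer negation and temporality handling, and a much larger lexicon.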
RESULTS: The annotated dataset of 482 patients comprised a predominantly White (89.2%), older adult population with an average age of 84.7 years; females represented 64.1%, and the vast majority were non-Hispanic or Latino (94.6%). The rule-based NLP algorithm achieved the best F1 score across all sleep-related concepts. In terms of positive predictive value (PPV), the rule-based NLP algorithm achieved the highest PPV scores for daytime sleepiness (1.00) and sleep duration (1.00), while the machine learning models had the highest PPV for napping (0.95) and bad sleep quality (0.86), and LLAMA2 with fine-tuning had the highest PPV for night wakings (0.93) and sleep problem (0.89).
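For reference, PPV and F1 follow the standard definitions below, where TP, FP, and FN are the counts of true positives, false positives, and false negatives for a given concept. The abstract does not state the unit of evaluation (mention, sentence, or document) or how scores were aggregated, so those details are left open here.

```latex
\begin{align}
\mathrm{PPV} &= \frac{TP}{TP + FP}, \\
\mathrm{Sensitivity} &= \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{Sensitivity}}{\mathrm{PPV} + \mathrm{Sensitivity}}.
\end{align}
```

A PPV of 1.00 therefore means that no extracted mention was a false positive. It says nothing about missed mentions, which is why a method can lead on PPV for a concept without leading on F1.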
CONCLUSIONS: Although sleep information is infrequently documented in the clinical notes, the proposed rule-based and LLM-based NLP algorithms still achieved promising results. In comparison, the machine learning-based approaches did not achieve good results, which is due to the small amount of sleep information in the training data. The results show that the rule-based NLP algorithm consistently achieved the best performance for all sleep concepts. This study focused on the clinical notes of patients with AD but could be extended to general sleep information extraction for other diseases.