Keywords: home healthcare; large language model; machine learning; patient–nurse verbal communication; synthetic data augmentation

Source: DOI: 10.1111/jnu.13004

Abstract:
BACKGROUND: Identifying health problems in audio-recorded patient-nurse communication is important to improve outcomes in home healthcare patients who have complex conditions with increased risks of hospital utilization. Training machine learning classifiers for identifying problems requires resource-intensive human annotation.
OBJECTIVE: To use GPT-4 to generate synthetic patient-nurse communication and automatically annotate it for common health problems encountered in home healthcare settings. We also examined whether augmenting real-world patient-nurse communication with synthetic data improves the performance of machine learning classifiers in identifying health problems.
DESIGN: Secondary data analysis of patient-nurse verbal communication data in home healthcare settings.
METHODS: The data were collected from one of the largest home healthcare organizations in the United States. We used 23 audio recordings of patient-nurse communications from 15 patients. The audio recordings were transcribed verbatim and manually annotated for health problems (e.g., circulation, skin, pain) indicated in the Omaha System Classification scheme. Synthetic data of patient-nurse communication were generated using the in-context learning prompting method, enhanced by chain-of-thought prompting to improve the automatic annotation performance. Machine learning classifiers were applied to three training datasets: real-world communication, synthetic communication, and real-world communication augmented by synthetic communication.
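The prompting approach described above (in-context learning with chain-of-thought demonstrations) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `AnnotatedExample` structure, the `build_prompt` helper, the prompt wording, and the sample conversation are hypothetical and are not taken from the study. The sketch only assembles the prompt text and makes no call to any model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedExample:
    """One hand-annotated demonstration shown to the model (hypothetical structure)."""
    transcript: str       # short patient-nurse exchange
    reasoning: str        # step-by-step rationale (the chain-of-thought part)
    problems: List[str]   # health-problem labels, e.g. "skin", "pain"


def build_prompt(examples: List[AnnotatedExample], target_problems: List[str]) -> str:
    """Assemble a few-shot prompt whose demonstrations include explicit reasoning."""
    parts = [
        "Below are annotated patient-nurse conversations from home healthcare visits."
    ]
    # In-context learning: each demonstration pairs a conversation with its labels.
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\n"
            f"Conversation: {ex.transcript}\n"
            f"Reasoning: {ex.reasoning}\n"
            f"Health problems: {', '.join(ex.problems)}"
        )
    # Chain-of-thought: ask for the rationale before the final annotation line.
    parts.append(
        "Write a new, realistic conversation that involves these health problems: "
        f"{', '.join(target_problems)}.\n"
        "Then explain step by step which utterances indicate each problem, "
        "and finish with a line starting with 'Health problems:'."
    )
    return "\n\n".join(parts)


# Made-up demonstration, for illustration only (not study data).
demo = AnnotatedExample(
    transcript="Nurse: How is your leg today? Patient: The sore still hurts when I walk.",
    reasoning="A sore suggests a skin problem; 'hurts' indicates pain.",
    problems=["skin", "pain"],
)
prompt = build_prompt([demo], target_problems=["circulation", "pain"])
print(prompt)
```

Placing the reasoning before the label line in each demonstration is one common way to encourage the model to justify its annotations; the study's actual prompt design may differ.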
RESULTS: Average F1 scores improved from 0.62 to 0.63 after training data were augmented with synthetic communication. The largest increase was observed with the XGBoost classifier, where F1 scores improved from 0.61 to 0.64 (a relative improvement of about 5%). When trained solely on real-world communication or solely on synthetic communication, the classifiers showed comparable F1 scores of 0.62 and 0.61, respectively.
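The relative gains implied by the reported F1 scores can be checked with a short calculation. This is a sketch only: the function names are hypothetical, the F1 formula is the standard definition from true-positive, false-positive, and false-negative counts, and the only study figures used are the four F1 scores quoted above.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall, 2*tp / (2*tp + fp + fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def relative_improvement(before: float, after: float) -> float:
    """Change in a score expressed as a fraction of the starting score."""
    return (after - before) / before


# F1 scores reported in the abstract (real-world only vs. augmented training data).
xgb_gain = relative_improvement(0.61, 0.64)
avg_gain = relative_improvement(0.62, 0.63)

print(f"XGBoost relative gain: {xgb_gain:.1%}")   # 4.9%
print(f"Average relative gain: {avg_gain:.1%}")   # 1.6%
```

The XGBoost gain of roughly 4.9% matches the "about 5%" in the text, while the average gain across classifiers is closer to 1.6%.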
CONCLUSIONS: Augmenting real-world data with synthetic data modestly improved machine learning classifiers' ability to identify health problems in home healthcare, and classifiers trained on synthetic data alone performed comparably to those trained on real-world data alone, highlighting the potential of synthetic data in healthcare analytics.
CLINICAL RELEVANCE: This study demonstrates the clinical relevance of leveraging synthetic patient-nurse communication data to enhance machine learning classifier performance in identifying health problems in home healthcare settings, which will contribute to more accurate and efficient identification of problems among home healthcare patients with complex health conditions.