Systematic evaluation of common natural language processing techniques to codify clinical notes

Abstract:

Proper codification of medical diagnoses and procedures is essential for optimized health care management, quality improvement, research, and reimbursement tasks within large healthcare systems. Assignment of diagnostic or procedure codes is a tedious manual process that is often prone to human error. Natural Language Processing (NLP) has been suggested as a way to facilitate this manual codification process, yet little is known about best practices for using NLP in such applications. With Large Language Models (LLMs) becoming more ubiquitous in daily life, it is critical to remember that not every task requires that level of resource and effort. Here we comprehensively assessed the performance of common NLP techniques in predicting Current Procedural Terminology (CPT) codes from operative notes. CPT codes are commonly used to track surgical procedures and interventions and are the primary means for reimbursement. Our analysis of the 100 most common musculoskeletal CPT codes suggests that traditional approaches can significantly outperform more resource-intensive approaches like BERT (P-value = 4.4e-17), with an average AUROC of 0.96 and accuracy of 0.97, in addition to providing interpretability, which can be very helpful and even crucial in the clinical domain. We also proposed a complexity measure to quantify the complexity of a classification task and examined how this measure could influence the effect of dataset size on a model's performance. Finally, we provide preliminary evidence that NLP can help minimize codification errors, including mislabeling due to human error.
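To make the idea of a "traditional" and interpretable approach concrete, the following is a minimal sketch of a bag-of-words multinomial Naive Bayes classifier that maps note text to a CPT code. The abstract does not say which traditional techniques were evaluated, so Naive Bayes is an assumption chosen purely for illustration. The class name `NaiveBayesCoder`, the whitespace tokenizer, and the four toy notes are hypothetical and were invented for this example; they are not data from the study. The codes 27447 and 27130 serve only as example labels.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayesCoder:
    """Hypothetical multinomial Naive Bayes over token counts, add-one smoothing."""

    def fit(self, notes, codes):
        self.class_counts = Counter(codes)
        self.token_counts = defaultdict(Counter)
        for note, code in zip(notes, codes):
            self.token_counts[code].update(tokenize(note))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def _log_likelihood(self, token, code):
        counts = self.token_counts[code]
        total = sum(counts.values())
        return math.log((counts[token] + 1) / (total + len(self.vocab)))

    def predict(self, note):
        n_notes = sum(self.class_counts.values())
        scores = {}
        for code, k in self.class_counts.items():
            score = math.log(k / n_notes)  # log prior
            for token in tokenize(note):
                if token in self.vocab:  # ignore tokens never seen in training
                    score += self._log_likelihood(token, code)
            scores[code] = score
        return max(scores, key=scores.get)

    def top_tokens(self, code, k=3):
        """Tokens whose likelihood most favors `code` over every other code.

        Requires at least two classes; ties are broken alphabetically.
        """
        others = [c for c in self.class_counts if c != code]

        def margin(token):
            best_other = max(self._log_likelihood(token, o) for o in others)
            return self._log_likelihood(token, code) - best_other

        return sorted(self.vocab, key=lambda t: (-margin(t), t))[:k]


# Toy, invented operative-note snippets (NOT data from the study).
train_notes = [
    "total knee arthroplasty with cemented femoral and tibial components",
    "knee arthroplasty performed patella resurfaced tibial tray cemented",
    "total hip arthroplasty with acetabular cup and femoral stem",
    "hip arthroplasty posterior approach acetabular cup impacted femoral stem",
]
train_codes = ["27447", "27447", "27130", "27130"]

model = NaiveBayesCoder().fit(train_notes, train_codes)
print(model.predict("cemented tibial component placed in the knee"))
print(model.top_tokens("27130", k=4))
```

The `top_tokens` method is where the interpretability mentioned in the abstract shows up in this sketch: each prediction can be traced back to specific words in the note, which is harder to do with a fine-tuned transformer such as BERT. A realistic pipeline would need proper tokenization, far more data, and held-out evaluation with metrics such as AUROC and accuracy.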
