OBJECTIVE: Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs could reduce the need for large-scale data annotations.
METHODS: We curated a dataset of 769 breast cancer pathology reports, manually labeled with 12 categories, and compared the zero-shot classification capability of four LLMs (GPT-4, GPT-3.5, Starling, and ClinicalCamel) with the task-specific supervised classification performance of 3 models: random forests, long short-term memory networks with attention (LSTM-Att), and the UCSF-BERT model.
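The zero-shot setup described above can be sketched as follows. This is a minimal, hypothetical illustration: the task wording, label set, and helper names (`build_zero_shot_prompt`, `parse_label`) are assumptions for exposition, not the study's actual prompts.

```python
# Hypothetical sketch of zero-shot classification of a pathology report.
# Prompt wording and labels are illustrative, not the study's exact text.

def build_zero_shot_prompt(report_text: str, task: str, labels: list) -> str:
    """Assemble a zero-shot classification prompt for one pathology task."""
    options = ", ".join(labels)
    return (
        "You are reviewing a breast cancer pathology report.\n"
        f"Task: {task}\n"
        f"Answer with exactly one of: {options}.\n\n"
        f"Report:\n{report_text}\n\nAnswer:"
    )

def parse_label(model_output: str, labels: list) -> str:
    """Map a free-text model reply back to one of the allowed labels."""
    reply = model_output.strip().lower()
    for label in labels:
        if label.lower() in reply:
            return label
    return "unparseable"  # reply did not match any allowed label

# Example usage with an illustrative task (ER status):
er_labels = ["positive", "negative", "not reported"]
prompt = build_zero_shot_prompt(
    "Invasive ductal carcinoma. Estrogen receptor: strongly positive.",
    "Determine estrogen receptor (ER) status.",
    er_labels,
)
```

In a zero-shot setting the model receives only the task description and label options, with no labeled training examples; the supervised baselines, by contrast, are trained on the manually annotated reports.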
RESULTS: Across all 12 tasks, the GPT-4 model performed either significantly better than or as well as the best supervised model, LSTM-Att (average macro F1-score of 0.86 vs 0.75), with an advantage on tasks with high label imbalance. The other LLMs demonstrated poor performance. Frequent GPT-4 error categories included incorrect inferences from multiple samples, incorrect inferences from patient history, and complex task design; several LSTM-Att errors were related to poor generalization to the test set.
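The macro F1-score reported above averages per-class F1 without weighting by class frequency, which is why it highlights performance on imbalanced tasks: a rare label counts as much as a common one. A minimal stdlib-only sketch of the metric (not the study's evaluation code):

```python
def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1, so rare classes count equally."""
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

For example, a classifier that predicts the majority class on a 3:1 imbalanced binary task scores 0 F1 on the minority class, dragging the macro average down even though plain accuracy looks high.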
DISCUSSION: On tasks where large annotated datasets cannot be easily collected, LLMs can reduce the burden of data labeling. However, where the use of LLMs is prohibitive, simpler models trained on large annotated datasets can provide comparable results.
CONCLUSIONS: GPT-4 demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for large annotated datasets. This may increase the utilization of NLP-based variables and outcomes in clinical studies.