Keywords: ChatGPT; GPT4; fine-tuning; large language models; section identification

Source: DOI:10.1093/jamiaopen/ooae075 (PDF via PubMed)

Abstract:
Objectives: Clinical note section identification helps locate relevant information and could benefit downstream tasks such as named entity recognition. However, traditional supervised methods suffer from transferability issues. This study proposes a new framework that uses large language models (LLMs) for section identification to overcome these limitations.
Materials and Methods: We framed section identification as question answering and provided the section definitions in free text. We evaluated multiple LLMs off-the-shelf, without any training. We also fine-tuned LLMs to investigate how the size and specificity of the fine-tuning dataset affect model performance.
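To make the question-answering framing concrete, the sketch below shows how a note segment and free-text section definitions might be posed to an LLM. It assumes the OpenAI Python client; the model name, section definitions, prompt wording, and the identify_section helper are illustrative placeholders, not the authors' actual prompts or code.

```python
# Minimal sketch of framing section identification as question answering.
# Assumes the OpenAI Python client; prompt text and section definitions are
# illustrative only, not the setup used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Free-text definitions of the section types of interest (illustrative).
SECTION_DEFINITIONS = """\
History of Present Illness: narrative describing the onset and course of the current problem.
Medications: drugs the patient is currently taking, with doses.
Assessment and Plan: the clinician's diagnostic impression and intended management steps.
"""

def identify_section(note_segment: str) -> str:
    """Ask the model which defined section a note segment belongs to."""
    prompt = (
        "You are given definitions of clinical note sections:\n"
        f"{SECTION_DEFINITIONS}\n"
        "Question: Which section does the following text belong to? "
        "Answer with the section name only.\n\n"
        f"Text: {note_segment}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labeling
    )
    return response.choices[0].message.content.strip()

# Example usage (synthetic text, not real patient data):
# print(identify_section("Continue lisinopril 10 mg daily; follow up in 2 weeks."))
```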
Results: GPT4 achieved the highest F1 score of 0.77. The best open-source model (Tulu2-70b) achieved 0.64, on par with GPT3.5 (ChatGPT). GPT4 also obtained F1 scores greater than 0.9 for 9 of the 27 (33%) section types and greater than 0.8 for 15 of the 27 (56%) section types. For the fine-tuned models, performance plateaued as the size of the general-domain dataset increased, while adding a reasonable number of section identification examples was beneficial.
Discussion: These results indicate that GPT4 is nearly production-ready for section identification: it appears to combine knowledge of note structure with the ability to follow complex instructions, and the best current open-source LLM is catching up.
Conclusion: Our study shows that LLMs are promising for generalizable clinical note section identification. They could be further improved by adding section identification examples to the fine-tuning dataset.