Keywords: annotation; conflict of interest; disclosure; disclosures; expert annotation; fine-tuning; language model; large language models; machine learning; named-entity recognition; natural language processing; sample; sample size; statement; statements; transfer learning

Source: DOI: 10.2196/52095   PDF (PubMed)

Abstract:
BACKGROUND: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.
OBJECTIVE: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflict of interest disclosure statements.
METHODS: A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.
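A minimal sketch of the regression step described above, not the authors' code: each fine-tuning run contributes one observation of training-subsample size (sentences), entity density (EPS), and the resulting F1-score, and a two-predictor linear model is fit to those observations. The column names and data values are illustrative placeholders (not the study's data), and the use of pandas and statsmodels is an assumption.

    # Two-predictor regression of F1-score on sample size and entity density (sketch).
    import pandas as pd
    import statsmodels.formula.api as smf

    # Each row represents one fine-tuning run on one stratified training subsample.
    # Values below are placeholders for illustration only.
    runs = pd.DataFrame({
        "n_sentences": [100, 250, 400, 550, 700],      # training sentences in the subsample
        "eps":         [1.10, 1.25, 1.32, 1.38, 1.40], # entities per sentence (entity density)
        "f1":          [0.72, 0.84, 0.90, 0.93, 0.94], # NER F1-score of the fine-tuned model
    })

    # F1 ~ sample size (sentences) + entity density (EPS)
    model = smf.ols("f1 ~ n_sentences + eps", data=runs).fit()
    print(model.summary())  # coefficients, multiple R2, and per-predictor P values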
RESULTS: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38.
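The diminishing-returns point estimates reported above come from single-predictor threshold regressions. The sketch below shows one common way such a threshold can be estimated, assuming a piecewise-linear (hinge) form in which F1 rises with sample size up to a breakpoint and changes slope beyond it, with the breakpoint chosen by grid search over candidate values. The functional form, the candidate grid, and the data values are illustrative assumptions, not the study's procedure or results.

    # Single-predictor threshold (breakpoint) regression via grid search (sketch).
    import numpy as np

    def fit_threshold(x, y, candidates):
        best = None
        for t in candidates:
            # Hinge basis: intercept, slope below the threshold, slope change above it.
            X = np.column_stack([np.ones_like(x), np.minimum(x, t), np.maximum(x - t, 0.0)])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            sse = np.sum((y - X @ beta) ** 2)
            if best is None or sse < best[0]:
                best = (sse, t, beta)
        return best  # (sum of squared errors, threshold point estimate, coefficients)

    # Placeholder observations: training sentences vs. resulting F1-scores.
    x = np.array([100, 200, 300, 400, 500, 600, 700], dtype=float)
    y = np.array([0.70, 0.80, 0.87, 0.91, 0.93, 0.935, 0.94])
    sse, threshold, beta = fit_threshold(x, y, candidates=np.arange(150, 700, 10))
    print(f"estimated diminishing-returns threshold: {threshold:.0f} sentences")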
CONCLUSIONS: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.