fine-tuning

  • Article Type: Journal Article
    Breast cancer is a leading cause of mortality among women globally, necessitating precise classification of breast ultrasound images for early diagnosis and treatment. Traditional methods using CNN architectures such as VGG, ResNet, and DenseNet, though somewhat effective, often struggle with class imbalances and subtle texture variations, leading to reduced accuracy for minority classes such as malignant tumors. To address these issues, we propose a methodology that leverages EfficientNet-B7, a scalable CNN architecture, combined with advanced data augmentation techniques to enhance minority class representation and improve model robustness. Our approach involves fine-tuning EfficientNet-B7 on the BUSI dataset, implementing RandomHorizontalFlip, RandomRotation, and ColorJitter to balance the dataset and improve model robustness. The training process includes early stopping to prevent overfitting and optimize performance metrics. Additionally, we integrate Explainable AI (XAI) techniques, such as Grad-CAM, to enhance the interpretability and transparency of the model's predictions, providing visual and quantitative insights into the features and regions of ultrasound images influencing classification outcomes. Our model achieves a classification accuracy of 99.14%, significantly outperforming existing CNN-based approaches in breast ultrasound image classification. The incorporation of XAI techniques enhances our understanding of the model's decision-making process, thereby increasing its reliability and facilitating clinical adoption. This comprehensive framework offers a robust and interpretable tool for the early detection and diagnosis of breast cancer, advancing the capabilities of automated diagnostic systems and supporting clinical decision-making processes.
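    The paper's code is not shown here; as a rough illustration, a minimal PyTorch sketch of the setup the abstract describes (fine-tuning torchvision's efficientnet_b7 with the named augmentations and early stopping) could look like the following. The dataset path, image size, hyperparameters, and the three-class BUSI layout (benign/malignant/normal) are assumptions, not the authors' values.

```python
# Hedged sketch of the described pipeline: fine-tuning EfficientNet-B7 on BUSI
# with the augmentations named in the abstract. Paths and hyperparameters are
# assumptions, not the authors' code.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

train_tf = transforms.Compose([
    transforms.Resize((600, 600)),             # EfficientNet-B7's native resolution
    transforms.RandomHorizontalFlip(),         # augmentations named in the abstract
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

train_ds = datasets.ImageFolder("BUSI/train", transform=train_tf)  # assumed layout
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=8, shuffle=True)

model = models.efficientnet_b7(weights=models.EfficientNet_B7_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 3)  # 3 BUSI classes

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

best_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(50):
    model.train()
    epoch_loss = 0.0
    for x, y in train_dl:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    # Early stopping on training loss for brevity; use validation loss in practice.
    if epoch_loss < best_loss:
        best_loss, bad_epochs = epoch_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```

    Grad-CAM heatmaps for the fine-tuned model could then be produced with any standard implementation, e.g. the pytorch-grad-cam package, targeting the final convolutional block.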

  • Article Type: Journal Article
    Brain tumors are a leading cause of death globally, with numerous types varying in malignancy, and only 12% of adults diagnosed with brain cancer survive beyond five years. This research introduces a hyperparametric convolutional neural network (CNN) model to identify brain tumors, with significant practical implications. By fine-tuning the hyperparameters of the CNN model, we optimize feature extraction and systematically reduce model complexity, thereby enhancing the accuracy of brain tumor diagnosis. The critical hyperparameters include batch size, layer counts, learning rate, activation functions, pooling strategies, padding, and filter size. The hyperparameter-tuned CNN model was trained on three different brain MRI datasets available on Kaggle, producing outstanding performance scores, with an average value of 97% for accuracy, precision, recall, and F1-score. Our optimized model is effective, as demonstrated by our methodical comparisons with state-of-the-art approaches. Our hyperparameter modifications enhanced the model performance and strengthened its capacity for generalization, giving medical practitioners a more accurate and effective tool for making crucial judgments regarding brain tumor diagnosis. Our model is a significant step in the right direction toward trustworthy and accurate medical diagnosis, with practical implications for improving patient outcomes.
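    As a rough illustration of the kind of hyperparameter search the abstract describes, the sketch below grid-searches batch size, learning rate, activation function, and filter size for a small CNN. The search ranges, the CNN skeleton, and the train_and_evaluate helper are all assumptions for illustration, not the paper's protocol.

```python
# Hedged sketch of a hyperparameter grid search over the dimensions named in the
# abstract. Ranges and the evaluation helper are illustrative assumptions.
import itertools
import torch.nn as nn

search_space = {
    "batch_size": [16, 32, 64],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "activation": [nn.ReLU, nn.ELU],
    "filter_size": [3, 5],
}

def build_cnn(activation, filter_size, num_classes=4):
    """Small CNN whose structure follows the tuned hyperparameters."""
    return nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=filter_size, padding="same"), activation(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=filter_size, padding="same"), activation(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.LazyLinear(num_classes),
    )

best = None
for bs, lr, act, fs in itertools.product(*search_space.values()):
    model = build_cnn(act, fs)
    score = train_and_evaluate(model, batch_size=bs, lr=lr)  # hypothetical helper
    if best is None or score > best[0]:
        best = (score, {"batch_size": bs, "lr": lr,
                        "activation": act.__name__, "filter_size": fs})
```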

  • Article Type: Journal Article
    OBJECTIVE: Accurate occupation classification is essential in various fields, including policy development and epidemiological studies. This study aims to develop an occupation classification model based on DistilKoBERT.
    METHODS: This study used data from the 5th and 6th Korean Working Conditions Surveys conducted in 2017 and 2020, respectively. A total of 99,665 survey participants, who were nationally representative of Korean workers, were included. We used natural language responses regarding their job responsibilities and occupational codes based on the Korean Standard Classification of Occupations (7th version, 3-digit codes). The dataset was randomly split into training and test datasets in a ratio of 7:3. The occupation classification model based on DistilKoBERT was fine-tuned using the training dataset, and the model was evaluated using the test dataset. The accuracy, precision, recall, and F1 score were calculated as evaluation metrics.
    RESULTS: The final model, which classified 28,996 survey participants in the test dataset into 142 occupational codes, exhibited an accuracy of 84.44%. For the evaluation metrics, the precision, recall, and F1 score of the model, calculated by weighting based on the sample size, were 0.83, 0.84, and 0.83, respectively. The model demonstrated high precision in the classification of service and sales workers yet exhibited low precision in the classification of managers. In addition, it displayed high precision in classifying occupations prominently represented in the training dataset.
    CONCLUSIONS: This study developed an occupation classification system based on DistilKoBERT, which demonstrated reasonable performance. Although further efforts are needed to enhance the classification accuracy, this automated occupation classification model holds promise for advancing epidemiological studies in the fields of occupational safety and health.
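    A minimal fine-tuning sketch in the spirit of the paper, using the Hugging Face Trainer API. The checkpoint ID monologg/distilkobert (a community DistilKoBERT release) and the placeholder dataset are assumptions, not details confirmed by the paper; the real training data would be the survey free-text responses and their KSCO labels.

```python
# Hedged sketch: fine-tuning a DistilKoBERT-style encoder for 142-way occupation
# classification. Checkpoint ID and placeholder data are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "monologg/distilkobert"  # community DistilKoBERT release (assumed)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=142)  # 142 KSCO 3-digit occupation codes

# Placeholder in place of the KWCS job-duty responses and occupation labels
train_ds = Dataset.from_dict({"text": ["placeholder job description"], "label": [0]})

def encode(batch):
    # Free-text job-duty responses -> token IDs
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

args = TrainingArguments(output_dir="occupation-clf", num_train_epochs=3,
                         per_device_train_batch_size=32)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds.map(encode, batched=True))
trainer.train()
```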

  • Article Type: Journal Article
    Biological membranes consist of a lipid bilayer in which integral membrane proteins are embedded. Based on the compositional complexity of the lipid species found in membranes, and on their specific and selective interactions with membrane proteins, we recently suggested that membrane bilayers can be best described as "finely-tuned molecular machines." We now discuss one such set of lipid-protein interactions by describing a negative feedback mechanism operating in the de novo sphingolipid biosynthetic pathway, which occurs in the membrane of the endoplasmic reticulum, and describe the atomic interactions between the first enzyme in the pathway, namely serine palmitoyl transferase, and the product of the fourth enzyme in the pathway, ceramide. We explore how hydrogen-bonding and hydrophobic interactions formed between Asn13 and Phe63 in the serine palmitoyl transferase complex and ceramide can influence the ceramide content of the endoplasmic reticulum. This example of finely-tuned biochemical interactions raises intriguing mechanistic questions about how sphingolipids and their biosynthetic enzymes could have evolved, particularly in light of their metabolic co-dependence.

  • Article Type: Journal Article
    BACKGROUND: Clinical note section identification helps locate relevant information and could be beneficial for downstream tasks such as named entity recognition. However, traditional supervised methods suffer from transferability issues. This study proposes a new framework that uses large language models (LLMs) for section identification to overcome these limitations.
    METHODS: We framed section identification as question answering and provided the section definitions in free text. We evaluated multiple LLMs off the shelf without any training. We also fine-tuned our LLMs to investigate how the size and specificity of the fine-tuning dataset impact model performance.
    RESULTS: GPT4 achieved the highest F1 score of 0.77. The best open-source model (Tulu2-70b) achieved 0.64 and is on par with GPT3.5 (ChatGPT). GPT4 also obtained F1 scores greater than 0.9 for 9 out of the 27 (33%) section types and greater than 0.8 for 15 out of the 27 (56%) section types. For our fine-tuned models, performance plateaued as the size of the general-domain dataset increased. We also found that adding a reasonable number of section identification examples is beneficial.
    DISCUSSION: These results indicate that GPT4 is nearly production-ready for section identification, seemingly combining knowledge of note structure with the ability to follow complex instructions, and that the best current open-source LLMs are catching up.
    CONCLUSIONS: Our study shows that LLMs are promising for generalizable clinical note section identification. They have the potential to be further improved by adding section identification examples to the fine-tuning dataset.
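    A rough sketch of the question-answering framing the abstract describes: each note segment is paired with free-text section definitions, and the LLM is asked to answer with a section name. The definitions and prompt wording below are illustrative assumptions, not the authors' templates.

```python
# Hedged sketch of section identification framed as question answering.
# Section names, definitions, and prompt wording are illustrative assumptions.
SECTION_DEFINITIONS = {
    "HPI": "History of present illness: the narrative of the current complaint.",
    "MEDICATIONS": "Current and recently prescribed drugs with doses.",
    "ASSESSMENT_PLAN": "The clinician's diagnostic impression and next steps.",
}

def build_prompt(note_segment: str) -> str:
    defs = "\n".join(f"- {name}: {desc}" for name, desc in SECTION_DEFINITIONS.items())
    return (
        "You are labeling sections of a clinical note.\n"
        f"Section definitions:\n{defs}\n\n"
        "Question: Which section does the following text belong to? "
        "Answer with one section name only.\n\n"
        f"Text:\n{note_segment}\n\nAnswer:"
    )

# response = llm(build_prompt(segment))  # any chat/completions client (assumed)
```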

  • Article Type: Journal Article
    BACKGROUND: Decoding human genomic sequences requires comprehensive analysis of DNA sequence functionality. Through computational and experimental approaches, researchers have studied the genotype-phenotype relationship and generated important datasets that help unravel complicated genetic blueprints. Thus, recently developed artificial intelligence methods can be used to interpret the functions of those DNA sequences.
    METHODS: This study explores the use of deep learning, particularly pre-trained genomic models such as DNA_bert_6 and human_gpt2-v1, in interpreting and representing human genome sequences. Initially, we meticulously constructed multiple datasets linking genotypes and phenotypes to fine-tune those models for precise DNA sequence classification. Additionally, we evaluated the influence of sequence length on classification results and analyzed the impact of feature extraction in the hidden layers of our model using the HERV dataset. To enhance our understanding of phenotype-specific patterns recognized by the model, we performed enrichment, pathogenicity, and conservation analyses of specific motifs in the human endogenous retrovirus (HERV) sequences with high average local representation weight (ALRW) scores.
    RESULTS: We have constructed multiple genotype-phenotype datasets displaying commendable classification performance in comparison with random genomic sequences, particularly in the HERV dataset, which achieved binary and multi-classification accuracies and F1 values exceeding 0.935 and 0.888, respectively. Notably, the fine-tuning of the HERV dataset not only improved our ability to identify and distinguish diverse information types within DNA sequences but also successfully identified specific motifs associated with neurological disorders and cancers in regions with high ALRW scores. Subsequent analysis of these motifs shed light on the adaptive responses of species to environmental pressures and their co-evolution with pathogens.
    CONCLUSIONS: These findings highlight the potential of pre-trained genomic models in learning DNA sequence representations, particularly when utilizing the HERV dataset, and provide valuable insights for future research endeavors. This study represents an innovative strategy that combines pre-trained genomic model representations with classical methods for analyzing the functionality of genome sequences, thereby promoting cross-fertilization between genomics and artificial intelligence.
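    As an illustration of the fine-tuning setup described, the sketch below loads a DNABERT-style 6-mer model for sequence classification. Equating the public Hugging Face checkpoint zhihan1996/DNA_bert_6 with the paper's DNA_bert_6, and the k-mer preprocessing shown, are assumptions.

```python
# Hedged sketch: a DNABERT-style 6-mer checkpoint prepared for binary sequence
# classification. Checkpoint ID and preprocessing are assumptions.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "zhihan1996/DNA_bert_6"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

def to_kmers(seq: str, k: int = 6) -> str:
    """DNABERT tokenizers expect overlapping k-mers separated by spaces."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

inputs = tokenizer(to_kmers("ATGCGTACGTTAGCAT"), return_tensors="pt")
logits = model(**inputs).logits  # fine-tune from here with a standard training loop
```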

  • Article Type: Journal Article
    Fine-tuning is an important technique in transfer learning that has achieved significant success in tasks that lack training data. However, because it is difficult to extract effective features in single-source-domain fine-tuning when the data distributions of the source and target domains differ substantially, we propose a multi-source-domain transfer learning framework called adaptive multi-source domain collaborative fine-tuning (AMCF) to address this issue. AMCF utilizes multiple source domain models for collaborative fine-tuning, thereby improving the feature extraction capability of the model in the target task. Specifically, AMCF employs an adaptive multi-source-domain layer selection strategy to customize appropriate layer fine-tuning schemes for the target task among multiple source domain models, aiming to extract more effective features. Furthermore, a novel multi-source-domain collaborative loss function is designed to facilitate the precise extraction of target data features by each source domain model. Simultaneously, it works toward minimizing the output differences among the source domain models, thereby enhancing their adaptability to the target data. To validate the effectiveness of AMCF, we applied it to seven public visual classification datasets commonly used in transfer learning and compared it with the most widely used single-source-domain fine-tuning methods. Experimental results demonstrate that, in comparison with existing fine-tuning methods, our method not only enhances the accuracy of feature extraction in the model but also provides precise layer fine-tuning schemes for the target task, thereby significantly improving fine-tuning performance.
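    The paper's loss is not reproduced here; a minimal sketch of a collaborative objective in the spirit of AMCF (a per-source-model task loss plus a penalty on pairwise output disagreement) might look like the following. The weighting lam and the choice of MSE between softmax outputs are assumptions.

```python
# Hedged sketch of a multi-source collaborative loss: each source-domain model is
# trained on the target task while pairwise output differences are penalized.
import itertools
import torch
import torch.nn.functional as F

def collaborative_loss(models, x, y, lam=0.1):
    outputs = [m(x) for m in models]  # logits from each source-domain model
    task_loss = sum(F.cross_entropy(o, y) for o in outputs)
    # Penalize disagreement between every pair of source-domain models
    consensus = sum(F.mse_loss(F.softmax(a, dim=-1), F.softmax(b, dim=-1))
                    for a, b in itertools.combinations(outputs, 2))
    return task_loss + lam * consensus
```

    In this framing, the task term keeps each source model accurate on the target data, while the consensus term pulls the models toward agreement, which is one way to realize the "minimizing the output differences" goal the abstract states.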

  • Article Type: Journal Article
    Small-molecule drug design aims to generate compounds that target specific proteins, playing a crucial role in the early stages of drug discovery. Recently, research has emerged that utilizes the GPT model, which has achieved significant success in various fields, to generate molecular compounds. However, due to the persistent challenge of small datasets in the pharmaceutical field, there has been some degradation in the performance of generating target-specific compounds. To address this issue, we propose an enhanced target-specific drug generation model, Adapt-cMolGPT, which modifies the molecular representation and optimizes the fine-tuning process. In particular, we introduce a new fine-tuning method that incorporates an adapter module into a pre-trained base model and alternates weight updates by section. We evaluated the proposed model through multiple experiments and demonstrated performance improvements compared to previous models. In the experimental results, Adapt-cMolGPT generated a greater number of novel and valid compounds compared to other models, with these generated compounds exhibiting properties similar to those of real molecular data. These results indicate that our proposed method is highly effective in designing drugs targeting specific proteins.
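    As a rough sketch of the adapter idea the abstract describes, the module below implements a standard bottleneck adapter (down-project, nonlinearity, up-project, residual) that could be inserted into a pre-trained transformer block. The bottleneck width and placement are assumptions, not Adapt-cMolGPT's exact design.

```python
# Hedged sketch of a bottleneck adapter for parameter-efficient fine-tuning.
# Width and placement are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection.
    During fine-tuning only these weights are updated; the base model is frozen."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))  # residual keeps base behavior

# Typical usage: freeze the pre-trained weights, train only the adapters, e.g.
#   for p in base_model.parameters(): p.requires_grad = False
#   for p in adapter.parameters():    p.requires_grad = True
```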

  • Article Type: Journal Article
    Our best current science seems to suggest the laws of physics and the initial conditions of our universe are fine-tuned for the possibility of life. A significant number of scientists and philosophers believe that the fine-tuning is evidence for the multiverse hypothesis. This paper will focus on a much-discussed objection to the inference from the fine-tuning to the multiverse: the charge that this line of reasoning commits the inverse gambler's fallacy. Despite the existence of a literature going back decades, this philosophical debate has made little contact with scientific discussion of fine-tuning and the multiverse, which mainly revolves around a specific form of the multiverse hypothesis rooted in eternal inflation combined with string theory. Because of this, potentially important implications from science to philosophy, and vice versa, have been left underexplored. In this paper, I will take a first step at joining up these two discussions, by arguing that attention to the eternal inflation + string theory conception of the multiverse supports the inverse gambler's fallacy charge. It does this by supporting the idea that our universe is contingently fine-tuned, thus addressing the concern that proponents of the inverse gambler's fallacy charge have assumed this without argument.

  • Article Type: Journal Article
    BACKGROUND: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.
    OBJECTIVE: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.
    METHODS: A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.
    RESULTS: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38.
    CONCLUSIONS: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
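    A minimal sketch of a single-predictor threshold regression of the kind described, fit with scipy: F1 is modeled as rising linearly with sample size up to a breakpoint tau and flat afterward. The synthetic data points are illustrative only; the fitted breakpoint plays the role of the paper's 439-527-sentence estimates.

```python
# Hedged sketch of a threshold (piecewise-linear) regression of F1 on training-set
# size, used to locate diminishing marginal returns. Data are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def threshold_model(n, f0, slope, tau):
    """F1 rises linearly with sample size n until the threshold tau, then flattens."""
    return f0 + slope * np.minimum(n, tau)

sentences = np.array([50, 100, 200, 300, 450, 600, 800])
f1 = np.array([0.62, 0.74, 0.84, 0.90, 0.94, 0.945, 0.95])  # illustrative only

params, _ = curve_fit(threshold_model, sentences, f1, p0=[0.5, 0.001, 400])
print(f"estimated threshold: {params[2]:.0f} sentences")  # cf. 439-527 in the paper
```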