expert annotation

  • Article type: Journal Article
    BACKGROUND: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.
    OBJECTIVE: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.
    METHODS: A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.
    RESULTS: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R² ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38.
    CONCLUSIONS: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.

  • Article type: Journal Article
    Eucalyptus is a hardwood tree grown worldwide that is attracting increasing attention. To adapt to various biotic and abiotic stresses, Eucalyptus have evolved complex mechanisms, including increasing the cellular concentration of reactive oxygen species (ROS) through numerous ROS-controlling enzymes. To better analyse the ROS gene network and examine the differences between four Eucalyptus species, the ROS gene network, comprising 11 protein families (1CysPrx, 2CysPrx, APx, APx-R, CIII Prx, Diox, GPx, Kat, PrxII, PrxQ and Rboh), was annotated and compared in an expert and exhaustive manner using the genomic data available for E. camaldulensis, E. globulus, E. grandis, and E. gunnii. In addition, a specific sequencing strategy was used to determine whether sequences missing in at least one organism result from gain/loss events or merely from sequencing gaps. We observed that automatic annotation applied to multigenic families is a source of misannotation. Based on family size, the 11 families can be categorized into duplicated gene families (CIII Prx, Kat, 1CysPrx, and GPx), which contain many gene duplication events, and non-duplicated families (APx, APx-R, Rboh, DiOx, 2CysPrx, PrxII, and PrxQ). Gene family sizes are much larger in Eucalyptus than in most other angiosperms owing to recent gene duplications, which could confer greater adaptability to environmental changes and stresses. The cross-species comparative analysis shows gene gain and loss events during the evolutionary process. The 11 families possess different expression patterns, whereas within the Eucalyptus genus the ROS families present similar expression patterns. Overall, comparative analysis might be a good criterion for evaluating the adaptation of different species with different characters, but only if data mining is as exhaustive as possible. It is also a good indicator for exploring the evolutionary process.

  • Article type: Journal Article
    The poplar leaf rust fungus, Melampsora larici-populina, has been established as a tree-microbe interaction model. Understanding the molecular mechanisms controlling infection by pathogens appears essential for durable management of tree plantations. In biotrophic plant parasites, effectors are known to condition host cell colonization. Thus, investigation of candidate secreted effector proteins (CSEPs) is a major goal in the study of the poplar-poplar rust interaction. Unlike oomycete effectors, fungal effectors do not share conserved motifs, and candidate prediction relies on a set of a priori criteria established from reported bona fide effectors. Secretome prediction, genome-wide analysis of gene families and transcriptomics of M. larici-populina have led to catalogs of more than a thousand secreted proteins. Automated effector-mining pipelines hold great promise for rapid and systematic identification and prioritization of CSEPs for functional characterization. In this review, we report on and discuss the current status of the poplar rust fungus secretome and the prediction of candidate effectors from this species.