named-entity recognition

  • Article type: Journal Article
    BACKGROUND: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.
    OBJECTIVE: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.
    METHODS: A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.
    RESULTS: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38.
    CONCLUSIONS: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
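
    The study above fine-tunes BERT- and GPT-family models for PERSON/ORG recognition on disclosure statements. Below is a minimal sketch of that kind of token-classification fine-tuning with Hugging Face Transformers; the model name, label scheme, and hyperparameters are illustrative assumptions rather than the study's actual configuration, and train_ds/eval_ds are assumed to be datasets.Dataset objects with "tokens" and integer "ner_tags" columns.

    # Hedged sketch: fine-tuning a transformer for PERSON/ORG NER (not the authors' code).
    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              TrainingArguments, Trainer, DataCollatorForTokenClassification)

    labels = ["O", "B-PERSON", "I-PERSON", "B-ORG", "I-ORG"]          # assumed BIO scheme
    tokenizer = AutoTokenizer.from_pretrained("roberta-large", add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained("roberta-large", num_labels=len(labels))

    def tokenize_and_align(example):
        # Align word-level BIO tags with subword tokens; label only each word's first subword.
        enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
        aligned, prev = [], None
        for wid in enc.word_ids():
            aligned.append(-100 if wid is None or wid == prev else example["ner_tags"][wid])
            prev = wid
        enc["labels"] = aligned
        return enc

    args = TrainingArguments(output_dir="ner-coi", per_device_train_batch_size=16,
                             num_train_epochs=4, learning_rate=3e-5)
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds.map(tokenize_and_align),
                      eval_dataset=eval_ds.map(tokenize_and_align),
                      data_collator=DataCollatorForTokenClassification(tokenizer))
    trainer.train()
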
  • Article type: Journal Article
    OBJECTIVE: The expansion of research across various disciplines has led to a substantial increase in published papers and journals, highlighting the necessity for reliable text mining platforms for database construction and knowledge acquisition. This abstract introduces GPDMiner (Gene, Protein, and Disease Miner), a platform designed for the biomedical domain, addressing the challenges posed by the growing volume of academic papers.
    METHODS: GPDMiner is a text mining platform that utilizes advanced information retrieval techniques. It operates by searching PubMed for specific queries, extracting and analyzing information relevant to the biomedical field. This system is designed to discern and illustrate relationships between biomedical entities obtained from automated information extraction.
    RESULTS: The implementation of GPDMiner demonstrates its efficacy in navigating the extensive corpus of biomedical literature. It efficiently retrieves, extracts, and analyzes information, highlighting significant connections between genes, proteins, and diseases. The platform also allows users to save their analytical outcomes in various formats, including Excel and images.
    CONCLUSIONS: GPDMiner offers a notable additional functionality among the array of text mining tools available for the biomedical field. This tool presents an effective solution for researchers to navigate and extract relevant information from the vast unstructured texts found in biomedical literature, thereby providing distinctive capabilities that set it apart from existing methodologies. Its application is expected to greatly benefit researchers in this domain, enhancing their capacity for knowledge discovery and data management.
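
    GPDMiner, as described above, retrieves PubMed records for a query and extracts biomedical entities and their relationships. The sketch below is in the spirit of that pipeline rather than GPDMiner's own code: it fetches abstracts through NCBI E-utilities via Biopython and runs an off-the-shelf scispaCy NER model over them; the query, contact email, and model name are assumptions.

    # Hedged sketch: PubMed retrieval plus biomedical NER (not GPDMiner itself).
    from Bio import Entrez
    import spacy

    Entrez.email = "you@example.org"                 # NCBI requires a contact address
    ids = Entrez.read(Entrez.esearch(db="pubmed",
                                     term="Helicobacter pylori gastric cancer",
                                     retmax=10))["IdList"]
    abstracts = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text").read()

    nlp = spacy.load("en_ner_bc5cdr_md")             # scispaCy chemical/disease model, assumed installed
    for ent in nlp(abstracts).ents:
        print(ent.text, ent.label_)
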
  • Article type: Journal Article
    The adoption of electronic health records (EHRs) has produced enormous amounts of data, creating research opportunities in clinical data sciences. Several concept recognition systems have been developed to facilitate clinical information extraction from these data. While studies exist that compare the performance of many concept recognition systems, they are typically developed internally and may be biased due to different internal implementations, parameters used, and the limited number of systems included in the evaluations. The goal of this research is to evaluate the performance of existing systems to retrieve relevant clinical concepts from EHRs.
    We investigated six concept recognition systems, including CLAMP, cTAKES, MetaMap, NCBO Annotator, QuickUMLS, and ScispaCy. Clinical concepts extracted included procedures, disorders, medications, and anatomical location. The system performance was evaluated on two datasets: the 2010 i2b2 and the MIMIC-III. Additionally, we assessed the performance of these systems in five challenging situations, including negation, severity, abbreviation, ambiguity, and misspelling.
    For clinical concept extraction, CLAMP achieved the best performance on exact and inexact matching, with F-scores of 0.70 and 0.94, respectively, on i2b2, and 0.39 and 0.50, respectively, on MIMIC-III. Across the five challenging situations, ScispaCy excelled in extracting abbreviation information (F-score: 0.86), followed by NCBO Annotator (F-score: 0.79). CLAMP performed best in extracting severity terms (F-score: 0.73), followed by NCBO Annotator (F-score: 0.68). CLAMP also outperformed other systems in extracting negated concepts (F-score: 0.63).
    Several concept recognition systems exist to extract clinical information from unstructured data. This study provides an external evaluation by end-users of six commonly used systems across different extraction tasks. Our findings suggest that CLAMP provides the most comprehensive set of annotations for clinical concept extraction tasks and associated challenges. Comparing standard extraction tasks across systems provides guidance to other clinical researchers when selecting a concept recognition system relevant to their clinical information extraction task.
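
    One of the systems compared above, ScispaCy, can be paired with a UMLS entity linker to map clinical text spans to concept identifiers. The sketch below shows that approach; the model and pipe names follow recent scispaCy releases and are assumptions about the local install, not the exact configuration evaluated in the study, and the first run downloads a large UMLS knowledge base.

    # Hedged sketch: UMLS concept extraction with scispaCy (one of several evaluated tools).
    import spacy
    from scispacy.linking import EntityLinker        # registers the "scispacy_linker" factory

    nlp = spacy.load("en_core_sci_sm")
    nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

    note = "Patient denies chest pain; started metformin 500 mg for type 2 diabetes."
    doc = nlp(note)
    linker = nlp.get_pipe("scispacy_linker")
    for ent in doc.ents:
        for cui, score in ent._.kb_ents[:1]:         # top-ranked UMLS candidate, if any
            concept = linker.kb.cui_to_entity[cui]
            print(f"{ent.text!r} -> {cui} ({concept.canonical_name}), score={score:.2f}")
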
  • Article type: Journal Article
    The adoption of electronic health records has increased the volume of clinical data, which has opened an opportunity for healthcare research. Several biomedical annotation systems have been used to facilitate the analysis of clinical data. However, there is a lack of clinical annotation comparisons for selecting the most suitable tool for a specific clinical task. In this work, we used clinical notes from the MIMIC-III database and evaluated three annotation systems to identify four types of entities: (1) procedure, (2) disorder, (3) drug, and (4) anatomy. Our preliminary results demonstrate that BioPortal performs well when extracting disorder and drug entities. This can provide clinical researchers with real clinical insights into patients' health patterns, and it may allow the creation of a first version of an annotated dataset.
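
    Comparisons like the two studies above come down to entity-level precision, recall, and F1 against a gold standard. A minimal sketch of that exact-span evaluation follows; the spans are invented placeholders, not MIMIC-III or i2b2 data.

    # Hedged sketch: exact-match entity-level evaluation of one annotator against gold spans.
    def prf1(gold, predicted):
        """gold/predicted: sets of (start, end, entity_type) tuples for one document."""
        tp = len(gold & predicted)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    gold = {(0, 12, "procedure"), (20, 29, "drug"), (35, 44, "disorder")}
    pred = {(0, 12, "procedure"), (20, 29, "drug"), (50, 58, "anatomy")}
    print(prf1(gold, pred))                          # -> (0.667, 0.667, 0.667), rounded
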
  • Article type: Journal Article
    Finding, exploring and filtering frequent sentence-based associations between a disease and a biomedical entity, co-mentioned in disease-related PubMed literature, is a challenge as the volume of publications increases. Darling is a web application that utilizes named entity recognition to identify human-related biomedical terms in PubMed articles, mentioned in OMIM, DisGeNET and Human Phenotype Ontology (HPO) disease records, and generates an interactive biomedical entity association network. Nodes in this network represent genes, proteins, chemicals, functions, tissues, diseases, environments and phenotypes. Users can search by identifiers, terms/entities or free text and explore the relevant abstracts in an annotated format.
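
    The association network described above is built from sentence-level co-mentions of recognized entities. A small sketch of that idea follows, with the NER step stubbed out and the entity sets invented for illustration; the graph library is networkx.

    # Hedged sketch: weighted co-mention network from per-sentence entity sets.
    from itertools import combinations
    from collections import Counter
    import networkx as nx

    # Assume each sentence has already been run through an NER step.
    sentence_entities = [
        {"BRCA1", "breast cancer"},
        {"BRCA1", "TP53", "breast cancer"},
        {"TP53", "apoptosis"},
    ]

    weights = Counter()
    for ents in sentence_entities:
        for a, b in combinations(sorted(ents), 2):
            weights[(a, b)] += 1                     # one co-mention per shared sentence

    G = nx.Graph()
    for (a, b), w in weights.items():
        G.add_edge(a, b, weight=w)
    print(G.edges(data=True))
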
  • Article type: Journal Article
    BACKGROUND: Interactions of microbes and diseases are of great importance for biomedical research. However, a large number of microbe-disease interactions remain hidden in the biomedical literature, and structured databases of microbe-disease interactions are limited. In this paper, we aim to construct a large-scale database of microbe-disease interactions automatically. We attained this goal by applying text mining methods based on a deep learning model, with a moderate curation cost. We also built a user-friendly web interface that allows researchers to navigate and query required information.
    RESULTS: First, we manually constructed a gold-standard corpus and a silver-standard corpus (SSC) of microbe-disease interactions for curation. We then proposed a text mining framework for microbe-disease interaction extraction based on BERE, a pretrained model originally built for drug-target and drug-drug interactions. We applied named entity recognition tools to detect microbe and disease mentions in free biomedical texts and then fine-tuned BERE to recognize relations between the targeted entities. The introduction of the SSC for model fine-tuning greatly improved detection performance for microbe-disease interactions, with an average reduction in error of approximately 10%. The MDIDB website offers data browsing, custom searching for specific diseases or microbes, and batch downloading.
    CONCLUSIONS: Evaluation results demonstrate that our method outperforms the baseline model (rule-based PKDE4J) with an average F1-score of 73.81%. For further validation, we randomly sampled nearly 1000 interactions predicted by our model and manually checked the correctness of each interaction, which gave an accuracy of 73%. The MDIDB website is freely available through http://dbmdi.com/index/.
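
    Relation extraction of the kind described above typically starts by pairing each microbe mention with each disease mention in the same sentence and marking the pair for a downstream classifier. The sketch below shows only that candidate-generation step; the marker tokens and character offsets are illustrative assumptions and do not reproduce the BERE setup.

    # Hedged sketch: build marked candidate pairs for microbe-disease relation classification.
    from itertools import product

    def candidates(sentence, microbes, diseases):
        """microbes/diseases: lists of (start, end) character spans in `sentence`."""
        pairs = []
        for (ms, me), (ds, de) in product(microbes, diseases):
            marked = sentence
            # Insert markers right-to-left so earlier offsets stay valid.
            for start, end, tag in sorted([(ms, me, "MICROBE"), (ds, de, "DISEASE")], reverse=True):
                marked = marked[:start] + f"[{tag}]" + marked[start:end] + f"[/{tag}]" + marked[end:]
            pairs.append(marked)
        return pairs

    s = "Helicobacter pylori infection is strongly associated with gastric ulcers."
    print(candidates(s, microbes=[(0, 19)], diseases=[(58, 72)]))
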
  • Article type: Journal Article
    Recently, food science has been garnering a lot of attention. There are many open research questions on food interactions, as one of the main environmental factors, with other health-related entities such as diseases, treatments, and drugs. In the last 2 decades, a large amount of work has been done in natural language processing and machine learning to enable biomedical information extraction. However, machine learning in food science domains remains inadequately resourced, which draws attention to the problem of developing methods for food information extraction. There are only a few food semantic resources and a few rule-based methods for food information extraction, which often depend on external resources. However, an annotated corpus with food entities, normalized using several food semantic resources, was published in 2019.
    In this study, we investigated how the recently published Bidirectional Encoder Representations from Transformers (BERT) model, which provides state-of-the-art results in information extraction, can be fine-tuned for food information extraction.
    We introduce FoodNER, which is a collection of corpus-based food named-entity recognition methods. It consists of 15 different models obtained by fine-tuning 3 pretrained BERT models on 5 groups of semantic resources: food versus nonfood entity, 2 subsets of Hansard food semantic tags, FoodOn semantic tags, and Systematized Nomenclature of Medicine Clinical Terms food semantic tags.
    All BERT models provided very promising results, with macro F1-scores of 93.30% to 94.31% in the task of distinguishing food versus nonfood entities, which represents a new state of the art in food information extraction. For the tasks where semantic tags are predicted, all BERT models again obtained very promising results, with macro F1-scores ranging from 73.39% to 78.96%.
    FoodNER can be used to extract and annotate food entities in 5 different tasks: food versus nonfood entities and distinguishing food entities on the level of food groups by using the closest Hansard semantic tags, the parent Hansard semantic tags, the FoodOn semantic tags, or the Systematized Nomenclature of Medicine Clinical Terms semantic tags.
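
    Across the five FoodNER tag schemes, the headline numbers above are macro-averaged F1 scores. Below is a tiny sketch of that metric on token-level labels; the gold and predicted tags are invented placeholders, not FoodNER outputs.

    # Hedged sketch: macro F1 over token-level labels, as used to compare tag schemes.
    from sklearn.metrics import f1_score

    gold = ["FOOD", "O", "O", "FOOD", "FOOD", "O"]   # binary food/non-food task
    pred = ["FOOD", "O", "FOOD", "FOOD", "O", "O"]
    print("macro F1:", f1_score(gold, pred, average="macro"))
    # The same call applies to the richer Hansard / FoodOn / SNOMED CT schemes, since
    # macro averaging weights every label equally regardless of frequency.
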
  • Article type: Journal Article
    It has been over a year since the first known case of coronavirus disease (COVID-19) emerged, yet the pandemic is far from over. To date, the coronavirus pandemic has infected over eighty million people and has killed more than 1.78 million worldwide. This study aims to explore "how useful is the Reddit social media platform for surveilling the COVID-19 pandemic?" and "how do people's concerns/behaviors change over the course of the COVID-19 pandemic in North Carolina?". The purpose of this study was to compare people's thoughts, behavior changes, discussion topics, and the number of confirmed cases and deaths by applying natural language processing (NLP) to COVID-19 related data.
    In this study, we collected COVID-19 related data from 18 subreddits of North Carolina from March to August 2020. Next, we applied methods from natural language processing and machine learning to analyze the collected Reddit posts using feature engineering, topic modeling, custom named-entity recognition (NER), and BERT-based (Bidirectional Encoder Representations from Transformers) sentence clustering. Using these methods, we were able to glean people's responses and their concerns about the COVID-19 pandemic in North Carolina.
    We observed a positive change in attitudes towards masks for residents in North Carolina. The high-frequency words in all subreddit corpora for each of the COVID-19 mitigation strategy categories are: Distancing (DIST)-"social distance/distancing", "lockdown", and "work from home"; Disinfection (DIT)-"(hand) sanitizer/soap", "hygiene", and "wipe"; Personal Protective Equipment (PPE)-"mask/facemask(s)/face shield", "n95(s)/kn95", and "cloth/gown"; Symptoms (SYM)-"death", "flu/influenza", and "cough/coughed"; Testing (TEST)-"cases", "(antibody) test", and "test results (positive/negative)".
    The findings in our study show that the use of Reddit data to monitor the COVID-19 pandemic in North Carolina (NC) was effective. The study shows the utility of NLP methods (e.g., cosine similarity, Latent Dirichlet Allocation (LDA) topic modeling, custom NER, and BERT-based sentence clustering) in discovering the change in the public's concerns/behaviors over the course of the COVID-19 pandemic in NC using Reddit data. Moreover, the results show that social media data can be utilized to surveil the epidemic situation in a specific community.
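
    One step in the pipeline above is LDA topic modeling over the collected posts. Below is a hedged sketch of that step with scikit-learn; the example posts, topic count, and vectorizer settings are illustrative, not the study's configuration.

    # Hedged sketch: LDA topic modeling over a handful of placeholder Reddit-style posts.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    posts = [
        "wearing a mask at the grocery store feels normal now",
        "waiting on my covid test results, fingers crossed",
        "working from home and social distancing since march",
    ]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(posts)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    terms = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-5:][::-1]]   # five highest-weight terms
        print(f"topic {k}: {', '.join(top)}")
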
  • Article type: Journal Article
    Increasingly popular online museums have significantly changed the way people acquire cultural knowledge. These online museums have been generating abundant amounts of cultural relics data. In recent years, researchers have used deep learning models that can automatically extract complex features and have rich representation capabilities to implement named-entity recognition (NER). However, the lack of labeled data in the field of cultural relics makes it difficult for deep learning models that rely on labeled data to achieve excellent performance. To address this problem, this paper proposes a semi-supervised deep learning model named SCRNER (Semi-supervised model for Cultural Relics' Named Entity Recognition) that utilizes a bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) model trained on a small amount of labeled data and abundant unlabeled data to attain effective performance. To satisfy the semi-supervised sample selection, we propose a repeat-labeled (relabeled) strategy to select samples of high confidence to enlarge the training set iteratively. In addition, we use Embeddings from Language Models (ELMo) representations to dynamically acquire word representations as the input of the model, to address the blurred boundaries of cultural objects and the Chinese characteristics of texts in the field of cultural relics. Experimental results demonstrate that our proposed model, trained on limited labeled data, achieves effective performance in the task of named entity recognition of cultural relics.
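
    The "relabeled" strategy above is an iterative self-training loop: train on the labeled set, pseudo-label the unlabeled pool, keep only high-confidence predictions, and retrain. The sketch below shows that loop with a stand-in scikit-learn classifier rather than the paper's BiLSTM-CRF with ELMo embeddings; the confidence threshold and toy data are assumed values.

    # Hedged sketch: generic high-confidence self-training loop (not the SCRNER tagger).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=5):
        X_l, y_l, X_u = X_labeled, y_labeled, X_unlabeled
        for _ in range(rounds):
            clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)
            if len(X_u) == 0:
                break
            proba = clf.predict_proba(X_u)
            confident = proba.max(axis=1) >= threshold
            if not confident.any():
                break
            # Add confidently pseudo-labeled samples to the training set and repeat.
            X_l = np.vstack([X_l, X_u[confident]])
            y_l = np.concatenate([y_l, clf.classes_[proba[confident].argmax(axis=1)]])
            X_u = X_u[~confident]
        return clf

    rng = np.random.default_rng(0)
    X_lab = rng.normal(size=(20, 4)); y_lab = (X_lab[:, 0] > 0).astype(int)
    model = self_train(X_lab, y_lab, rng.normal(size=(200, 4)))
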
  • Article type: Journal Article
    Free-text descriptions in electronic health records (EHRs) can be of interest for clinical research and care optimization. However, free text cannot be readily interpreted by a computer and, therefore, has limited value. Natural Language Processing (NLP) algorithms can make free text machine-interpretable by attaching ontology concepts to it. However, implementations of NLP algorithms are not evaluated consistently. Therefore, the objective of this study was to review the current methods used for developing and evaluating NLP algorithms that map clinical text fragments onto ontology concepts. To standardize the evaluation of algorithms and reduce heterogeneity between studies, we propose a list of recommendations.
    Two reviewers examined publications indexed by Scopus, IEEE, MEDLINE, EMBASE, the ACM Digital Library, and the ACL Anthology. Publications reporting on NLP for mapping clinical text from EHRs to ontology concepts were included. Year, country, setting, objective, evaluation and validation methods, NLP algorithms, terminology systems, dataset size and language, performance measures, reference standard, generalizability, operational use, and source code availability were extracted. The studies' objectives were categorized inductively. These results were used to define recommendations.
    Two thousand three hundred fifty-five unique studies were identified. Two hundred fifty-six studies reported on the development of NLP algorithms for mapping free text to ontology concepts. Seventy-seven described development and evaluation. Twenty-two studies did not perform a validation on unseen data and 68 studies did not perform external validation. Of 23 studies that claimed that their algorithm was generalizable, 5 tested this by external validation. A list of sixteen recommendations regarding the usage of NLP systems and algorithms, usage of data, evaluation and validation, presentation of results, and generalizability of results was developed.
    We found many heterogeneous approaches to reporting on the development and evaluation of NLP algorithms that map clinical text to ontology concepts. Over one-fourth of the identified publications did not perform an evaluation. In addition, over one-fourth of the included studies did not perform a validation, and 88% did not perform external validation. We believe that our recommendations, alongside an existing reporting standard, will increase the reproducibility and reusability of future studies and NLP algorithms in medicine.