information extraction

  • Article type: Journal Article
    BACKGROUND: In oncology, electronic health records contain textual key information for the diagnosis, staging, and treatment planning of patients with cancer. However, text data processing requires a lot of time and effort, which limits the utilization of these data. Recent advances in natural language processing (NLP) technology, including large language models, can be applied to cancer research. Particularly, extracting the information required for the pathological stage from surgical pathology reports can be utilized to update cancer staging according to the latest cancer staging guidelines.
    OBJECTIVE: This study has two main objectives. The first objective is to evaluate the performance of extracting information from text-based surgical pathology reports and determining pathological stages based on the extracted information using fine-tuned generative language models (GLMs) for patients with lung cancer. The second objective is to determine the feasibility of utilizing relatively small GLMs for information extraction in a resource-constrained computing environment.
    METHODS: Lung cancer surgical pathology reports were collected from the Common Data Model database of Seoul National University Bundang Hospital (SNUBH), a tertiary hospital in Korea. We selected 42 descriptors necessary for tumor-node (TN) classification based on these reports and created a gold standard validated by two clinical experts. The pathology reports and gold standard were used to generate prompt-response pairs for training and evaluating GLMs, which were then used to extract the information required for staging from pathology reports.
    RESULTS: We evaluated the information extraction performance of six trained models as well as their performance in TN classification using the extracted information. The Deductive Mistral-7B model, which was pre-trained with the deductive dataset, showed the best performance overall, with an exact match ratio of 92.24% in the information extraction problem and an accuracy of 0.9876 (predicting T and N classification concurrently) in classification.
    CONCLUSIONS: This study demonstrated that training GLMs with deductive datasets can improve information extraction performance, and that GLMs with a relatively small number of parameters, at approximately seven billion, can achieve high performance in this problem. The proposed GLM-based information extraction method is expected to be useful in clinical decision-making support, lung cancer staging, and research.
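    The prompt-response pairing and exact-match evaluation described in this abstract can be sketched roughly as follows. The descriptor names, report text, and prompt wording below are illustrative assumptions, not the study's actual 42-descriptor schema:

```python
# Minimal sketch of prompt construction and exact-match evaluation for
# GLM-based information extraction. Field names are invented examples.

def build_prompt(report_text: str, descriptors: list) -> str:
    """Compose an extraction prompt asking the GLM for each descriptor."""
    fields = "\n".join("- %s:" % d for d in descriptors)
    return ("Extract the following staging descriptors from the pathology "
            "report below. Answer one value per field.\n%s\n\nReport:\n%s"
            % (fields, report_text))

def exact_match_ratio(predictions: list, gold: list) -> float:
    """Fraction of reports whose extracted fields ALL match the gold standard."""
    hits = sum(p == g for p, g in zip(predictions, gold))
    return hits / len(gold)

preds = [{"tumor_size_cm": "2.5", "visceral_pleura_invasion": "present"},
         {"tumor_size_cm": "4.1", "visceral_pleura_invasion": "absent"}]
gold  = [{"tumor_size_cm": "2.5", "visceral_pleura_invasion": "present"},
         {"tumor_size_cm": "4.0", "visceral_pleura_invasion": "absent"}]
print(exact_match_ratio(preds, gold))  # 0.5: only the first report matches fully
```

    An "exact match" here is all-or-nothing per report, which is why it is a stricter metric than per-field accuracy.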

  • Article type: Journal Article
    There have been significant advances in literature mining, allowing for the extraction of target information from the literature. However, biological literature often includes biological pathway images that are difficult to extract in an easily editable format. To address this challenge, this study aims to develop a machine learning framework called the "Extraction of Biological Pathway Information" (EBPI). The framework automates the search for relevant publications, extracts biological pathway information from images within the literature, including genes, enzymes, and metabolites, and generates the output in a tabular format. To this end, the framework determines the direction of biochemical reactions, and detects and classifies text within biological pathway images. The performance of EBPI was evaluated by comparing the extracted pathway information with manually curated pathway maps. EBPI will be useful for extracting biological pathway information from the literature in a high-throughput manner, and can be used for pathway studies, including metabolic engineering.

  • Article type: Journal Article
    OBJECTIVE: Clinical notes contain unstructured representations of patient histories, including the relationships between medical problems and prescription drugs. To investigate the relationship between cancer drugs and their associated symptom burden, we extract structured, semantic representations of medical problem and drug information from the clinical narratives of oncology notes.
    METHODS: We present Clinical concept Annotations for Cancer Events and Relations (CACER), a novel corpus with fine-grained annotations for over 48 000 medical problems and drug events and 10 000 drug-problem and problem-problem relations. Leveraging CACER, we develop and evaluate transformer-based information extraction models such as Bidirectional Encoder Representations from Transformers (BERT), Fine-tuned Language Net Text-To-Text Transfer Transformer (Flan-T5), Large Language Model Meta AI (Llama3), and Generative Pre-trained Transformers-4 (GPT-4) using fine-tuning and in-context learning (ICL).
    RESULTS: In event extraction, the fine-tuned BERT and Llama3 models achieved the highest performance at 88.2-88.0 F1, which is comparable to the inter-annotator agreement (IAA) of 88.4 F1. In relation extraction, the fine-tuned BERT, Flan-T5, and Llama3 achieved the highest performance at 61.8-65.3 F1. GPT-4 with ICL achieved the worst performance across both tasks.
    DISCUSSION: The fine-tuned models significantly outperformed GPT-4 with ICL, highlighting the importance of annotated training data and model optimization. Furthermore, the BERT models performed similarly to Llama3. For our task, large language models offer no performance advantage over the smaller BERT models.
    CONCLUSIONS: We introduce CACER, a novel corpus with fine-grained annotations for medical problems, drugs, and their relationships in clinical narratives of oncology notes. State-of-the-art transformer models achieved performance comparable to IAA for several extraction tasks.
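    The F1 figures reported above are span-level metrics; the family can be sketched as below. The spans and labels are invented for illustration, not CACER data:

```python
# Illustrative span-level precision/recall/F1 for event extraction.
# Spans are (start, end, label) tuples; a prediction counts as a true
# positive only if boundaries and label both match the gold annotation.

def prf1(pred: set, gold: set):
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {(0, 9, "problem"), (15, 24, "drug"), (30, 41, "problem")}
pred = {(0, 9, "problem"), (15, 24, "drug"), (50, 55, "drug")}
p, r, f1 = prf1(pred, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.667 0.667
```

    Inter-annotator agreement (IAA) can be computed with the same function by treating one annotator's spans as "predictions" against the other's, which is why model F1 and IAA are directly comparable.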

  • Article type: Journal Article
    Numerous methods and pipelines have recently emerged for the automatic extraction of knowledge graphs from documents such as scientific publications and patents. However, adapting these methods to incorporate alternative text sources like micro-blogging posts and news has proven challenging, as they struggle to model the open-domain entities and relations typically found in these sources. In this paper, we propose an enhanced information extraction pipeline tailored to the extraction of a knowledge graph comprising open-domain entities from micro-blogging posts on social media platforms. Our pipeline leverages dependency parsing and classifies entity relations in an unsupervised manner through hierarchical clustering over word embeddings. We provide a use case on extracting semantic triples from a corpus of 100 thousand tweets about digital transformation and publicly release the generated knowledge graph. On the same dataset, we conduct two experimental evaluations, showing that the system produces triples with precision over 95% and outperforms similar pipelines by around 5 percentage points in precision, while generating a comparatively higher number of triples.
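    The unsupervised relation-typing step described above can be sketched as hierarchical clustering of relation-phrase embeddings, so that phrases with similar vectors fall into the same (unlabeled) relation type. The tiny 2-D "embeddings", phrases, and distance cutoff below are invented for illustration:

```python
# Single-linkage agglomerative clustering with a merge-distance cutoff,
# standing in for hierarchical clustering over real word embeddings.
import math

def agglomerate(points: dict, threshold: float) -> list:
    """Merge the closest pair of clusters until no pair is within threshold."""
    clusters = [{name} for name in points]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]

emb = {"acquired": (0.0, 0.1), "bought": (0.1, 0.0),   # one relation type
       "hired": (5.0, 5.1), "recruited": (5.1, 5.0)}   # another
print(sorted(sorted(c) for c in agglomerate(emb, threshold=1.0)))
# [['acquired', 'bought'], ['hired', 'recruited']]
```

    Each resulting cluster becomes one relation type in the knowledge graph, without any labeled relation inventory being fixed in advance.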

  • Article type: Journal Article
    Open source, lightweight and offline generative large language models (LLMs) hold promise for clinical information extraction due to their suitability to operate in secured environments using commodity hardware without token cost. By creating a simple lupus nephritis (LN) renal histopathology annotation schema and generating gold standard data, this study investigates prompt-based strategies using three state-of-the-art lightweight LLMs, namely BioMistral-DARE-7B (BioMistral), Llama-2-13B (Llama 2), and Mistral-7B-instruct-v0.2 (Mistral). We examine the performance of these LLMs within a zero-shot learning environment for renal histopathology report information extraction. Incorporating four prompting strategies, including combinations of batch prompt (BP), single task prompt (SP), chain of thought (CoT) and standard simple prompt (SSP), our findings indicate that both Mistral and BioMistral consistently demonstrated higher performance compared to Llama 2. Mistral recorded the highest performance, achieving an F1-score of 0.996 [95% CI: 0.993, 0.999] for extracting the numbers of various subtypes of glomeruli across all BP settings and 0.898 [95% CI: 0.871, 0.921] in extracting relational values of immune markers under the BP+SSP setting. This study underscores the capability of offline LLMs to provide accurate and secure clinical information extraction, which can serve as a promising alternative to their heavy-weight online counterparts.
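    The batch-prompt (BP) versus single-task-prompt (SP) contrast described above can be sketched roughly as follows. The task wording, field names, and report text are assumptions, not the study's actual schema:

```python
# Rough sketch of batch vs single-task prompting for report extraction:
# BP asks one prompt to answer several extraction tasks at once; SP issues
# one prompt per field. Questions and fields here are invented examples.

TASKS = {
    "glomeruli_total": "How many glomeruli are described in total?",
    "glomeruli_sclerosed": "How many globally sclerosed glomeruli are described?",
    "c1q": "What is the reported C1q immunofluorescence intensity?",
}

def batch_prompt(report: str) -> str:
    questions = "\n".join("%d. %s" % (i + 1, q)
                          for i, q in enumerate(TASKS.values()))
    return ("Answer each question from the renal histopathology report.\n"
            "%s\n\nReport:\n%s" % (questions, report))

def single_task_prompts(report: str) -> list:
    return ["%s\n\nReport:\n%s" % (q, report) for q in TASKS.values()]

report = "Twenty glomeruli sampled; three globally sclerosed. C1q: 2+."
print(len(single_task_prompts(report)))  # 3 separate SP calls vs 1 BP call
```

    The trade-off the study probes is that BP amortizes context over one call, while SP isolates each field and may be easier for a small model to answer reliably.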

  • Article type: Journal Article
    With cancer being a leading cause of death globally, epidemiological and clinical cancer registration is paramount for enhancing oncological care and facilitating scientific research. However, the heterogeneous landscape of medical data presents significant challenges to the current manual process of tumor documentation. This paper explores the potential of Large Language Models (LLMs) for transforming unstructured medical reports into the structured format mandated by the German Basic Oncology Dataset. Our findings indicate that integrating LLMs into existing hospital data management systems or cancer registries can significantly enhance the quality and completeness of cancer data collection - a vital component for diagnosing and treating cancer and improving the effectiveness and benefits of therapies. This work contributes to the broader discussion on the potential of artificial intelligence or LLMs to revolutionize medical data processing and reporting in general and cancer care in particular.

  • Article type: Journal Article
    Automatic extraction of body text within clinical PDF documents is necessary to enhance downstream NLP tasks but remains a challenge. This study presents an unsupervised algorithm designed to extract body text by leveraging a large volume of data. Using DBSCAN clustering over aggregated pages, our method extracts and organizes text blocks using their content and coordinates. Evaluation results demonstrate precision scores ranging from 0.82 to 0.98, recall scores from 0.62 to 0.94, and F1-scores from 0.71 to 0.96 across various medical specialty sources. Future work includes dynamic parameter adjustment for improved accuracy and the use of larger datasets.
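    The coordinate-based clustering step described above can be sketched with a toy DBSCAN over text-block positions, so that blocks recurring at the same coordinates across pages group together and isolated blocks (e.g. a page number) fall out as noise. The block coordinates and the eps/min_samples values are illustrative assumptions:

```python
# Minimal DBSCAN over (x, y) text-block coordinates aggregated across pages.
import math

def dbscan(points, eps, min_samples):
    """Return a cluster id per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1          # not a core point: mark as noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:                # expand the cluster from core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster # reclaim a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_samples:
                queue.extend(more)
    return labels

# Two recurring body-text regions plus one lone page-number block.
blocks = [(50, 700), (52, 698), (51, 702), (50, 405), (52, 403), (51, 404),
          (300, 30)]
print(dbscan(blocks, eps=10, min_samples=2))
```

    In the study's setting, clusters of blocks that recur across many pages can then be kept or discarded based on their content, separating body text from headers and footers.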

  • Article type: Journal Article
    Hand-labelling clinical corpora can be costly and inflexible, requiring re-annotation every time new classes need to be extracted. PICO (Participant, Intervention, Comparator, Outcome) information extraction can expedite conducting systematic reviews to answer clinical questions. However, PICO frequently extends to other entities such as study type and design, trial context, and timeframe, requiring manual re-annotation of existing corpora. In this paper, we adapt Snorkel's weak supervision methodology to extend clinical corpora to new entities without extensive hand labelling. Specifically, we enrich the EBM-PICO corpus with new entities through an example of "Study type and design" extraction. Using weak supervision, we obtain programmatic labels on 4,081 EBM-PICO documents, achieving an F1-score of 85.02% on the test set.
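    The weak-supervision idea above can be sketched in plain Python: several noisy labeling functions vote on whether a sentence mentions study type/design, and a simple majority resolves them. (Snorkel replaces the majority vote with a learned label model; the keyword rules here are invented examples, not the paper's labeling functions.)

```python
# Hand-rolled weak supervision: labeling functions may vote or abstain,
# and programmatic labels come from aggregating their votes.
ABSTAIN, OTHER, STUDY_DESIGN = -1, 0, 1

def lf_design_keywords(sentence: str) -> int:
    kws = ("randomized", "double-blind", "cohort", "cross-sectional")
    return STUDY_DESIGN if any(k in sentence.lower() for k in kws) else ABSTAIN

def lf_trial_phrase(sentence: str) -> int:
    return STUDY_DESIGN if "controlled trial" in sentence.lower() else ABSTAIN

def lf_outcome_verb(sentence: str) -> int:
    return OTHER if "reduced" in sentence.lower() else ABSTAIN

LFS = [lf_design_keywords, lf_trial_phrase, lf_outcome_verb]

def majority_label(sentence: str) -> int:
    votes = [lf(sentence) for lf in LFS if lf(sentence) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

print(majority_label("A randomized, double-blind controlled trial."))  # 1
print(majority_label("Treatment reduced mortality."))                  # 0
```

    The appeal is that adding a new entity class means writing a handful of such functions rather than re-annotating the whole corpus by hand.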

  • Article type: Journal Article
    BACKGROUND: The rapid technical progress in the domain of clinical Natural Language Processing and information extraction (IE) has resulted in challenges concerning the comparability and replicability of studies.
    OBJECTIVE: This paper proposes a reporting guideline to standardize the description of methodologies and outcomes for studies involving IE from clinical texts.
    METHODS: The guideline is developed based on the experiences gained from data extraction for a previously conducted scoping review on IE from free-text radiology reports including 34 studies.
    RESULTS: The guideline comprises the five top-level categories information model, architecture, data, annotation, and outcomes. In total, we define 28 aspects to be reported on in IE studies related to these categories.
    CONCLUSIONS: The proposed guideline is expected to set a standard for reporting in studies describing IE from clinical text and to promote uniformity across the research field. Expected future technological advancements may make regular updates of the guideline necessary. In future research, we plan to develop a taxonomy that clearly defines the corresponding value sets and to integrate both this guideline and the taxonomy by following a consensus-based methodology.

  • Article type: Journal Article
    In ecology and evolutionary biology, the synthesis and modelling of data from published literature are commonly used to generate insights and test theories across systems. However, the tasks of searching, screening, and extracting data from literature are often arduous. Researchers may manually process hundreds to thousands of articles for systematic reviews, meta-analyses, and compiling synthetic datasets. As relevant articles expand to tens or hundreds of thousands, computer-based approaches can increase the efficiency, transparency and reproducibility of literature-based research. Methods available for text mining are rapidly changing owing to developments in machine learning-based language models. We review the growing landscape of approaches, mapping them onto three broad paradigms (frequency-based approaches, traditional Natural Language Processing and deep learning-based language models). This serves as an entry point to learn foundational and cutting-edge concepts, vocabularies, and methods to foster integration of these tools into ecological and evolutionary research. We cover approaches for modelling ecological texts, generating training data, developing custom models and interacting with large language models and discuss challenges and possible solutions to implementing these methods in ecology and evolution.