Keywords: Bio-medical NER; In-context learning; Instruction tuning; LLM; Llama; PICO frame extraction

MeSH: Humans; Clinical Trials as Topic / methods; Natural Language Processing; Data Mining / methods; Machine Learning

Source: DOI: 10.1016/j.ymeth.2024.04.005

Abstract:
In recent years, there has been a surge in the publication of clinical trial reports, making it challenging to conduct systematic reviews. Automatically extracting the Population, Intervention, Comparator, and Outcome (PICO) elements from clinical trial studies can alleviate the traditionally time-consuming process of manually scrutinizing systematic reviews. Existing approaches to PICO frame extraction involve supervised methods that rely on manually annotated data points in the form of BIO label tagging. More recent approaches, such as In-Context Learning (ICL), which has been shown to be effective for a number of downstream NLP tasks, still require labeled examples. In this work, we adopt an ICL strategy that leverages the pretrained knowledge of Large Language Models (LLMs), gathered during the pretraining phase, to automatically extract PICO-related terminology from clinical trial documents in an unsupervised setup, thereby bypassing the need for a large number of annotated data instances. Additionally, to showcase the full effectiveness of LLMs in the oracle scenario where a large number of annotated samples is available, we adopt an instruction-tuning strategy that employs Low-Rank Adaptation (LoRA) to train such a large model for the PICO frame extraction task in a low-resource environment. More specifically, both of the proposed frameworks use AlpaCare as the base LLM, applying few-shot in-context learning and instruction tuning, respectively, to extract PICO-related terms from clinical trial reports. We applied these approaches to the widely used coarse-grained datasets such as EBM-NLP and EBM-COMET, and fine-grained datasets such as EBM-NLPrev and EBM-NLPh. Our empirical results show that the proposed ICL-based framework produces comparable results on all versions of the EBM-NLP datasets, while the instruction-tuned version of our framework produces state-of-the-art results on all of the different EBM-NLP datasets. Our project is available at https://github.com/shrimonmuke0202/AlpaPICO.git.
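The abstract describes two strategies built on a causal LLM: few-shot in-context prompting (no gradient updates) and parameter-efficient instruction tuning with LoRA. The following is a minimal sketch of both, not the authors' released code: the AlpaCare checkpoint name, the prompt wording, the example sentence, and the LoRA hyperparameters (rank, alpha, target modules) are illustrative assumptions, and the fine-tuning loop itself is omitted.

```python
# Minimal sketch of the two strategies described in the abstract, using the
# Hugging Face `transformers` and `peft` libraries. All names and settings
# marked "assumed" are illustrative, not taken from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "xz97/AlpaCare-llama2-7b"  # assumed AlpaCare checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# --- (1) Few-shot in-context learning: extraction by prompting alone ---
few_shot_prompt = (
    "Extract the Population, Intervention, Comparator and Outcome (PICO) spans "
    "from the clinical trial sentence.\n\n"
    "Sentence: 120 adults with type 2 diabetes received metformin or placebo; "
    "HbA1c was measured at 12 weeks.\n"
    "Population: adults with type 2 diabetes\n"
    "Intervention: metformin\n"
    "Comparator: placebo\n"
    "Outcome: HbA1c\n\n"
    "Sentence: <new clinical trial sentence>\n"
)
inputs = tokenizer(few_shot_prompt, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

# --- (2) Instruction tuning with LoRA: train only low-rank adapter weights ---
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank updates (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only a small fraction of weights train
# `peft_model` would then be fine-tuned on instruction-formatted PICO annotations
# (e.g., derived from EBM-NLP) with a standard Trainer loop.
```

Because only the adapter matrices receive gradients, this kind of LoRA setup is what makes fine-tuning a multi-billion-parameter model feasible in the low-resource environment the abstract refers to.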