Keywords: Clinical trials; Domain-specific language models; Normalization; Vaccine ontology

MeSH: Clinical Trials as Topic; Biological Ontologies; Vaccines / immunology; Humans; Natural Language Processing; Unified Medical Language System

Source: DOI:10.1186/s13326-024-00318-x

Abstract:
BACKGROUND: Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects. ClinicalTrials.gov is a valuable repository of clinical trial information, but its vaccine data lacks standardization, which creates challenges for automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance.
RESULTS: In this study, we developed a cascaded framework that draws on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of VO from clinical trials. VO is a community-based ontology developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, producing CTPubMedBERT. Next, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further fine-tuning on the Vaccine Ontology corpus and vaccine data from clinical trials yielded the CTPubMedBERT + SAPBERT + VO model. Finally, we used a collection of pre-trained models together with a weighted rule-based ensemble approach to normalize the vaccine corpus and improve accuracy. Ranking in concept normalization means ordering the candidate concepts so that the most suitable match for a given context comes first. We ranked the top 10 concepts, and our experimental results show that the proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% top-1 accuracy and 90.0% top-10 accuracy.
CONCLUSIONS: This study describes in detail a cascaded framework of fine-tuned domain-specific language models that improves the mapping of VO from clinical trials. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance the mapping of VO from clinical trials.
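
Illustrative sketch (not taken from the paper): the abstract does not specify how the weighted rule-based ensemble combines model outputs, so the example below assumes the simplest possible rule, a weighted sum of per-model similarity scores, followed by the top-k accuracy measure used to report the 71.8% and 90.0% figures. The function names, model names, weights, scores, and concept identifiers are all hypothetical placeholders.

```python
from collections import defaultdict


def ensemble_rank(model_scores, weights, k=10):
    """Rank candidate concepts by a weighted sum of per-model scores.

    model_scores: {model_name: {concept_id: similarity_score}}
    weights:      {model_name: weight}; models without a weight count as 0.
    Returns the k highest-scoring concept IDs, best first.
    ASSUMPTION: a plain weighted sum; the paper's actual rules are not given.
    """
    combined = defaultdict(float)
    for model_name, scores in model_scores.items():
        weight = weights.get(model_name, 0.0)
        for concept_id, score in scores.items():
            combined[concept_id] += weight * score
    # Sort by descending score; break ties by concept ID for determinism.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return [concept_id for concept_id, _ in ranked[:k]]


def top_k_accuracy(ranked_lists, gold_ids, k):
    """Fraction of mentions whose gold concept is among the first k candidates."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)


# Hypothetical scores from two models for a single vaccine mention.
example_scores = {
    "model_a": {"CONCEPT_A": 0.9, "CONCEPT_B": 0.5},
    "model_b": {"CONCEPT_B": 0.8, "CONCEPT_C": 0.7},
}
example_weights = {"model_a": 0.5, "model_b": 0.5}

example_ranking = ensemble_rank(example_scores, example_weights, k=10)
top1 = top_k_accuracy([example_ranking], ["CONCEPT_A"], k=1)
top2 = top_k_accuracy([example_ranking], ["CONCEPT_A"], k=2)
```

In this toy case the gold concept is ranked second, so it counts as a miss at k=1 and a hit at k=2, which is why top-10 accuracy is always at least as high as top-1 accuracy.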