BACKGROUND: Research gaps are unanswered questions in the existing body of knowledge, arising either from a lack of studies or from inconclusive results. They are essential starting points and motivators for scientific research. Traditional methods for identifying research gaps, such as literature reviews and expert opinions, can be time-consuming, labor-intensive, and prone to bias. They may also fall short when dealing with rapidly evolving or time-sensitive subjects. Thus, innovative, scalable approaches are needed to identify research gaps, systematically assess the literature, and prioritize areas for further study in the topic of interest.
OBJECTIVE: In this paper, we propose a machine learning-based approach for identifying research gaps through the analysis of scientific literature. We used the COVID-19 pandemic as a case study.
METHODS: We conducted an analysis to identify research gaps in COVID-19 literature using the COVID-19 Open Research Dataset (CORD-19), which comprises 1,121,433 papers related to the COVID-19 pandemic. Our approach is based on the BERTopic topic modeling technique, which leverages transformers and class-based term frequency-inverse document frequency (c-TF-IDF) to create dense clusters that yield easily interpretable topics. Our BERTopic-based approach involves 3 stages: embedding documents, clustering documents (dimension reduction and clustering), and representing topics (generating candidates and maximizing candidate relevance).
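The topic representation stage can be illustrated with a minimal sketch of class-based TF-IDF. It is not the pipeline used in this study: it assumes the documents have already been embedded and clustered, uses toy tokenized texts invented for illustration, and applies the weighting W(t, c) = tf(t, c) * log(1 + A / f(t)), where A is the average number of words per cluster and f(t) is the frequency of term t across all clusters. The function names `c_tf_idf` and `top_terms` are hypothetical, and the raw (unnormalized) term counts are a simplification that does not change the ranking of terms within a cluster.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster with class-based TF-IDF.

    clusters: dict mapping a cluster id to a list of tokenized documents.
    Returns a dict mapping each cluster id to {term: score}.
    """
    # Treat all documents of a cluster as one long document and count terms.
    tf = {
        cid: Counter(token for doc in docs for token in doc)
        for cid, docs in clusters.items()
    }

    # f(t): frequency of each term summed over all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(tf)

    # W(t, c) = tf(t, c) * log(1 + A / f(t))
    return {
        cid: {t: n * math.log(1 + avg_words / total[t]) for t, n in counts.items()}
        for cid, counts in tf.items()
    }


def top_terms(scores, k=2):
    """Return the k highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        cid: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for cid, s in scores.items()
    }


# Toy clusters (invented for illustration; not taken from CORD-19).
example_clusters = {
    "vaccine": [
        ["covid", "vaccine", "dose", "efficacy"],
        ["covid", "vaccine", "booster", "dose"],
    ],
    "mental_health": [
        ["covid", "anxiety", "lockdown", "students"],
        ["covid", "anxiety", "depression", "lockdown"],
    ],
}

scores = c_tf_idf(example_clusters)
print(top_terms(scores, k=2))
```

In this toy example, the term "covid" appears in every cluster and therefore receives a lower weight than cluster-specific terms such as "vaccine" or "anxiety", which is the property that makes the resulting topic labels interpretable.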
RESULTS: After applying the study selection criteria, we included 33,206 abstracts in the analysis. The final list of research gaps comprised 21 different areas, which were grouped into 6 principal topics: "virus of COVID-19," "risk factors of COVID-19," "prevention of COVID-19," "treatment of COVID-19," "health care delivery during COVID-19," and "impact of COVID-19." The most prominent topic, observed in over half of the analyzed studies, was "impact of COVID-19."
CONCLUSIONS: The proposed machine learning-based approach has the potential to identify research gaps in scientific literature. This study is not intended to replace individual literature research within a selected topic. Instead, it can serve as a guide for formulating precise literature search queries in specific areas associated with research questions that previous publications have earmarked for future exploration. Future research should leverage an up-to-date list of studies retrieved from the most common databases in the target area. When feasible, full texts or, at minimum, discussion sections should be analyzed rather than limiting the analysis to abstracts. Furthermore, future studies could evaluate more efficient modeling algorithms, especially those combining topic modeling with statistical uncertainty quantification, such as conformal prediction.