Keywords: BERT; BiLSTM; BioBERT; Cancer detection; Deep learning; Genetic mutation; LSTM; Text classification

Source: DOI: 10.1016/j.heliyon.2024.e32279

Abstract:
Early cancer detection and treatment depend on identifying the specific genes that cause cancer. Genetic mutations were initially classified manually, a process that relies on pathologists and can be time-consuming. To improve the precision of clinical interpretation, researchers have therefore developed computational algorithms that leverage next-generation sequencing technologies for automated mutation analysis. This paper applies four deep learning classification models trained on collections of biomedical texts. The first is bidirectional encoder representations from transformers for biomedical text mining (BioBERT), a specialized language model for biological contexts; simply adding an extra layer to BioBERT can yield impressive results on multiple tasks, including text classification, language inference, and question answering. The other three, bidirectional encoder representations from transformers (BERT), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM), have also been leveraged to produce very good results in categorizing genetic mutations based on textual evidence. The dataset used in this work was created by Memorial Sloan Kettering Cancer Center (MSKCC) and contains several mutations; it also poses a major classification challenge in the Kaggle research prediction competitions. Three challenges were identified in carrying out the work: enormous text length, biased representation of the data, and repeated data instances. On commonly used evaluation metrics, the experimental results show that the BioBERT model outperforms the other models, with an F1 score of 0.87 and a Matthews correlation coefficient (MCC) of 0.850, an improvement over comparable results in the literature, where the BERT model achieved an F1 score of 0.70.
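For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how an F1 score and a multiclass MCC can be computed from true and predicted class labels. It is an illustration only, not the authors' evaluation code: the abstract does not say which F1 averaging was used, so macro averaging is an assumption here, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (Gorodkin's R_K form)."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each class truly occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    labels = set(true_counts) | set(pred_counts)
    numerator = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    denom = sqrt(
        (n * n - sum(v * v for v in true_counts.values()))
        * (n * n - sum(v * v for v in pred_counts.values()))
    )
    # Return 0 when the coefficient is undefined (e.g. a single predicted class)
    return numerator / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical toy labels for three mutation classes
    y_true = [1, 1, 2, 2, 3, 3]
    y_pred = [1, 1, 2, 3, 3, 3]
    print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
    print(f"MCC      = {multiclass_mcc(y_true, y_pred):.3f}")
```

MCC takes every cell of the confusion matrix into account, which is why it is often reported alongside F1 when class frequencies are uneven, as with the biased data representation noted above.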