Keywords: BERT; GNN; LLM; MBC; XAI; extra trees classifier; node classification; random forest classifier; univariate selection

Source: DOI:10.3390/diagnostics14131365 (PDF via PubMed)

Abstract:
Metastatic breast cancer (MBC) continues to be a leading cause of cancer-related deaths among women. This work introduces a non-invasive breast cancer classification model designed to improve the identification of cancer metastases. Although this study is an initial exploration of MBC prediction, further investigation is needed to validate the occurrence of MBC. Our approach combines large language models (LLMs), specifically the bidirectional encoder representations from transformers (BERT) model, with graph neural networks (GNNs) to predict MBC patients based on their histopathology reports. This paper introduces a BERT-GNN approach for metastatic breast cancer prediction (BG-MBC) that integrates graph information derived from the BERT model. In this model, nodes are constructed from patient medical records, while BERT embeddings are employed to vectorise the words in histopathology reports, capturing semantic information crucial for classification. Three distinct approaches (univariate selection, an extra trees classifier for feature importance, and Shapley values) are used to identify the features that have the most significant impact. By selecting the 30 most important of the 676 features generated as embeddings during model training, the model further improves its predictive performance. The BG-MBC model achieves a detection rate of 0.98 and an area under the curve (AUC) of 0.98 in identifying MBC patients. This performance is attributed to the model's utilisation of attention scores generated by the LLM from histopathology reports, which effectively capture pertinent features for classification.
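As a rough illustration of the univariate selection step mentioned in the abstract, the sketch below ranks 676 features by a one-way ANOVA F-score and keeps the top 30. The abstract does not state which scoring function, data layout, or implementation the authors used, so the F-score, the synthetic data, and the names `anova_f_score`, `select_top_k`, and `INFORMATIVE` are all assumptions made for illustration only.

```python
import random
from collections import defaultdict


def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of a single feature against class labels."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    n, g = len(values), len(groups)
    if g < 2:
        raise ValueError("need at least two classes")
    grand_mean = sum(values) / n
    means = {label: sum(vs) / len(vs) for label, vs in groups.items()}
    # Variation of the class means around the overall mean.
    ss_between = sum(len(vs) * (means[label] - grand_mean) ** 2
                     for label, vs in groups.items())
    # Variation of the samples around their own class mean.
    ss_within = sum((v - means[label]) ** 2
                    for label, vs in groups.items() for v in vs)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (g - 1)) / (ss_within / (n - g))


def select_top_k(X, y, k):
    """Return the indices of the k features with the highest F-scores."""
    n_features = len(X[0])
    scores = [anova_f_score([row[j] for row in X], y)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Synthetic stand-in for the embedding features (assumption: not the paper's data).
random.seed(0)
N_SAMPLES, N_FEATURES, K = 200, 676, 30
INFORMATIVE = set(range(30))  # hypothetical features that depend on the label
y = [i % 2 for i in range(N_SAMPLES)]
X = [[random.gauss(3.0 * label if j in INFORMATIVE else 0.0, 1.0)
      for j in range(N_FEATURES)]
     for label in y]

selected = select_top_k(X, y, K)
print(sorted(selected))
```

The extra trees and Shapley-value rankings named in the abstract would replace `anova_f_score` with a model-based importance score; the top-k selection step would stay the same.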