Self-supervised learning

  • Article Type: Journal Article
    The rapid dissemination of unverified information through social platforms like Twitter poses considerable dangers to societal stability. Distinguishing real from fake claims is challenging, and previous rumor detection methods often fail to capture propagation-structure features effectively. These methods also frequently overlook comments irrelevant to the discussion topic of the source post. To address this, we introduce a novel approach: the Structure-Aware Multilevel Graph Attention Network (SAMGAT) for rumor classification. SAMGAT employs a dynamic attention mechanism that blends GATv2 and dot-product attention to capture the contextual relationships between posts, allowing attention scores to vary with the stance of the central node. The model incorporates a structure-aware attention mechanism that learns attention weights indicating the existence of edges, effectively reflecting the propagation structure of rumors. Moreover, SAMGAT incorporates a top-k attention filtering mechanism to select the most relevant neighboring nodes, enhancing its ability to focus on the key structural features of rumor propagation. Furthermore, SAMGAT includes a claim-guided attention pooling mechanism with a thresholding step to focus on the most informative posts when constructing the event representation. Experimental results on benchmark datasets demonstrate that SAMGAT outperforms state-of-the-art methods in identifying rumors and improves the effectiveness of early rumor detection.
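
    As a rough illustration of the top-k attention filtering idea described above, the sketch below masks all but the k largest attention logits per node before the softmax. It is a minimal PyTorch sketch under assumed shapes, not the authors' implementation; `topk_attention_filter` is a hypothetical name.

```python
# Hypothetical sketch of top-k attention filtering (not the SAMGAT source).
import torch

def topk_attention_filter(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Keep each node's k largest attention logits; push the rest to -inf
    so they vanish after softmax. scores: (num_nodes, num_neighbors)."""
    k = min(k, scores.size(-1))
    topk_vals, _ = scores.topk(k, dim=-1)
    threshold = topk_vals[..., -1, None]           # k-th largest per row
    masked = scores.masked_fill(scores < threshold, float("-inf"))
    return torch.softmax(masked, dim=-1)           # ties may keep > k entries

scores = torch.randn(4, 10)                        # toy logits for 4 nodes
weights = topk_attention_filter(scores, k=3)
print(weights.sum(dim=-1))                         # each row sums to 1
```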

  • Article Type: Journal Article
    Pediatric respiratory disease diagnosis and subsequent treatment require accurate and interpretable analysis. A chest X-ray is the most cost-effective and rapid method for identifying and monitoring various thoracic diseases in children. Recent developments in self-supervised and transfer learning have shown their potential in medical imaging, including the chest X-ray domain. In this article, we propose a three-stage framework with knowledge transfer from adult chest X-rays to aid the diagnosis and interpretation of pediatric thorax diseases. We conducted comprehensive experiments with different pre-training and fine-tuning strategies to develop transformer and convolutional neural network models, then evaluated them qualitatively and quantitatively. The ViT-Base/16 model, fine-tuned on CheXpert, a large chest X-ray dataset, emerged as the most effective, achieving a mean AUC of 0.761 (95% CI: 0.759-0.763) across six disease categories and demonstrating high sensitivity (average 0.639) and specificity (average 0.683), indicative of its strong discriminative ability. The baseline models, ViT-Small/16 and ViT-Base/16, when trained directly on the pediatric CXR dataset, achieved mean AUC scores of only 0.646 (95% CI: 0.641-0.651) and 0.654 (95% CI: 0.648-0.660), respectively. Qualitatively, our model excels at localizing diseased regions, outperforming models pre-trained on ImageNet and other fine-tuning approaches, thus providing superior explanations. The source code is available online, and the data can be obtained from PhysioNet.
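
    The fine-tuning recipe the abstract describes (ViT-Base/16 adapted on CheXpert for six findings) could look roughly like the following, assuming the timm library and a multi-label setup; the random tensors stand in for a real data loader, and the hyperparameters are guesses rather than the paper's values.

```python
# Minimal fine-tuning sketch; hyperparameters and data are placeholders.
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=6)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = torch.nn.BCEWithLogitsLoss()      # multi-label CXR findings

images = torch.randn(2, 3, 224, 224)          # stand-in for a CheXpert batch
labels = torch.randint(0, 2, (2, 6)).float()  # six disease categories
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```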

  • Article Type: Journal Article
    Nowadays, autonomous driving technology has become widely prevalent. Intelligent vehicles are equipped with various sensors (e.g., vision sensors, LiDAR, depth cameras, etc.). Among them, vision systems with tailored semantic segmentation and perception algorithms play a critical role in scene understanding. However, traditional supervised semantic segmentation requires a large number of pixel-level manual annotations for model training. Although few-shot methods reduce the annotation work to some extent, they are still labor-intensive. In this paper, a self-supervised few-shot semantic segmentation method based on Multi-task Learning and Dense Attention Computation (dubbed MLDAC) is proposed. The salient part of an image is split into two parts; one serves as the support mask for few-shot segmentation, while cross-entropy losses are computed between the predicted results and, separately, the other part and the entire region, as multi-task learning to improve the model's generalization ability. Swin Transformer is used as our backbone to extract feature maps at different scales. These feature maps are then fed into multiple levels of dense attention computation blocks to enhance pixel-level correspondence. The final prediction is obtained through inter-scale mixing and feature skip connections. Experimental results indicate that MLDAC achieves 55.1% and 26.8% one-shot mIoU for self-supervised few-shot segmentation on the PASCAL-5i and COCO-20i datasets, respectively. In addition, it achieves 78.1% on the FSS-1000 few-shot dataset, demonstrating its efficacy.
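
    The self-supervision signal comes from splitting the salient region into a support part and a target part. A toy version of that split, under the assumption of a precomputed binary saliency mask and a simple midline split (the paper's actual partitioning may differ), might be:

```python
# Toy split of a saliency mask into support/query halves (illustrative only).
import torch

def split_salient_mask(mask: torch.Tensor):
    """mask: (H, W) binary saliency map. Returns (support, query) halves
    split at the vertical midline."""
    support, query = mask.clone(), mask.clone()
    mid = mask.size(1) // 2
    support[:, mid:] = 0      # left half acts as the few-shot support mask
    query[:, :mid] = 0        # right half serves as the pseudo target
    return support, query

mask = (torch.rand(64, 64) > 0.7).float()
support, query = split_salient_mask(mask)
assert torch.all(support + query == mask)     # the halves partition the mask
```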

  • Article Type: Journal Article
    Robust object detection in complex environments, poor visual conditions, and open scenarios presents significant technical challenges in autonomous driving. These challenges necessitate the development of advanced fusion methods for millimeter-wave (mmWave) radar point cloud data and visual images. To address these issues, this paper proposes a radar-camera robust fusion network (RCRFNet), which leverages self-supervised learning and open-set recognition to effectively utilise the complementary information from both sensors. Specifically, the network uses matched radar-camera data through a frustum association approach to generate self-supervised signals, enhancing network training. The integration of global and local depth consistencies between radar point clouds and visual images, along with image features, helps construct object class confidence levels for detecting unknown targets. Additionally, these techniques are combined with a multi-layer feature extraction backbone and a multimodal feature detection head to achieve robust object detection. Experiments on the nuScenes public dataset demonstrate that RCRFNet outperforms state-of-the-art (SOTA) methods, particularly in conditions of low visual visibility and when detecting unknown class objects.
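
    The frustum association step can be pictured as projecting radar points into the image and keeping those that fall inside a camera detection's 2D box. The sketch below is a simplified stand-in with toy intrinsics and a toy box, not the RCRFNet code.

```python
# Simplified frustum association: match radar points to one 2D detection.
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # toy intrinsics
radar_xyz = np.random.uniform([-10, -2, 1], [10, 2, 50], size=(100, 3))
box = (300, 200, 360, 280)                  # (x1, y1, x2, y2) from the camera

uvw = radar_xyz @ K.T                       # project into homogeneous pixels
uv = uvw[:, :2] / uvw[:, 2:3]               # perspective divide
x1, y1, x2, y2 = box
inside = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & \
         (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
matched = radar_xyz[inside]                 # points associated with the box
print(f"{len(matched)} radar points fall inside the detection frustum")
```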

  • Article Type: Journal Article
    Cognitive scientists believe that adaptable intelligent agents like humans perform spatial reasoning tasks through learned causal mental simulation. The problem of learning these simulations is called predictive world modeling. We present the first framework for learning an open-vocabulary predictive world model (OV-PWM) from sensor observations. The model is implemented through a hierarchical variational autoencoder (HVAE) capable of predicting diverse and accurate fully observed environments from accumulated partial observations. We show that the OV-PWM can model high-dimensional embedding maps of latent compositional embeddings representing sets of overlapping semantics inferable by sufficient similarity inference. The OV-PWM simplifies the prior two-stage closed-set PWM approach to a single-stage end-to-end learning method. CARLA simulator experiments show that the OV-PWM can learn compact latent representations and generate diverse and accurate worlds with fine details like road markings, achieving 69 mIoU over six query semantics on an urban evaluation sequence. We propose the OV-PWM as a versatile continual learning paradigm for providing spatio-semantic memory and learned internal simulation capabilities to future general-purpose mobile robots.
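
    The "sufficient similarity inference" step can be read as labeling each cell of a predicted embedding map by its nearest query embedding. A toy rendering of that query step, with random tensors standing in for the HVAE output and real text embeddings, might be:

```python
# Toy open-vocabulary query over a predicted embedding map (stand-in tensors).
import torch
import torch.nn.functional as F

emb_map = F.normalize(torch.randn(64, 64, 512), dim=-1)  # predicted world map
queries = F.normalize(torch.randn(6, 512), dim=-1)       # six query semantics

sims = torch.einsum("hwc,qc->hwq", emb_map, queries)     # cosine similarities
labels = sims.argmax(dim=-1)                             # per-cell semantic id
print(labels.shape)                                      # (64, 64) label map
```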

  • Article Type: Journal Article
    The Unified Parkinson's Disease Rating Scale (UPDRS) is used to recognize patients with Parkinson's disease (PD) and rate its severity. The rating is crucial for disease progression monitoring and treatment adjustment. This study aims to advance the capabilities of PD management by developing an innovative framework that integrates deep learning with wearable sensor technology to enhance the precision of UPDRS assessments. We introduce a series of deep learning models to estimate UPDRS Part III scores, utilizing motion data from wearable sensors. Our approach leverages a novel Multi-shared-task Self-supervised Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) framework that processes raw gyroscope signals and their spectrogram representations. This technique aims to refine the estimation accuracy of PD severity during naturalistic human activities. Utilizing 526 min of data from 24 PD patients engaged in everyday activities, our methodology demonstrates a strong correlation of 0.89 between estimated and clinically assessed UPDRS-III scores. This model outperforms the benchmark set by single and multichannel CNN, LSTM, and CNN-LSTM models and establishes a new standard in UPDRS-III score estimation for free-body movements compared to recent state-of-the-art methods. These results signify a substantial step forward in bioengineering applications for PD monitoring, providing a robust framework for reliable and continuous assessment of PD symptoms in daily living settings.
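
    A bare-bones CNN-LSTM regressor over raw gyroscope windows, in the spirit of the framework above (the multi-shared-task and spectrogram branches are omitted, and all layer sizes are assumptions), could look like this:

```python
# Bare-bones CNN-LSTM score regressor; layer sizes are illustrative guesses.
import torch
import torch.nn as nn

class CnnLstmRegressor(nn.Module):
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(             # per-window feature extractor
            nn.Conv1d(channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)      # scalar UPDRS-III estimate

    def forward(self, x):                     # x: (batch, channels, time)
        feats = self.cnn(x).transpose(1, 2)   # (batch, time', 64)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])          # regress from the last step

model = CnnLstmRegressor()
print(model(torch.randn(8, 3, 256)).shape)    # torch.Size([8, 1])
```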

  • Article Type: Journal Article
    In this study, we present a simple multi-channel framework for contrastive learning (MC-SimCLR) to encode the 'what' and 'where' of spatial audio. MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audio, thereby enhancing both event classification and sound localization in downstream tasks. At its core, we propose a multi-level data augmentation pipeline that augments different levels of audio features, including waveforms, Mel spectrograms, and generalized cross-correlation (GCC) features. In addition, we introduce simple yet effective channel-wise augmentation methods that randomly swap the order of the microphones and mask Mel and GCC channels. Using these augmentations, we find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error. We also perform a comprehensive analysis of the effect of each augmentation method and compare fine-tuning performance using different amounts of labeled data.
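
    The channel-wise augmentations are simple enough to state directly. Below is a short sketch of microphone-order swapping and channel masking under assumed tensor layouts; the function names are ours, not the paper's.

```python
# Sketch of channel-wise augmentations; tensor layouts are assumptions.
import torch

def swap_mic_channels(x: torch.Tensor) -> torch.Tensor:
    """x: (mics, time) multichannel waveform; permute microphone order."""
    return x[torch.randperm(x.size(0))]

def mask_channels(feats: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """feats: (channels, freq, time) Mel/GCC stack; drop channels w.p. p."""
    keep = (torch.rand(feats.size(0)) > p).float().view(-1, 1, 1)
    return feats * keep

wave_aug = swap_mic_channels(torch.randn(4, 16000))  # 4-mic toy waveform
feat_aug = mask_channels(torch.randn(10, 64, 100))   # e.g., Mel + GCC stack
```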

  • Article Type: Journal Article
    BACKGROUND: Personal sensing, leveraging data passively and near-continuously collected with wearables from patients in their ecological environment, is a promising paradigm to monitor mood disorders (MDs), a major determinant of the worldwide disease burden. However, collecting and annotating wearable data is resource intensive. Studies of this kind can thus typically afford to recruit only a few dozen patients. This constitutes one of the major obstacles to applying modern supervised machine learning techniques to MD detection.
    OBJECTIVE: In this paper, we overcame this data bottleneck and advanced the detection of acute MD episodes from wearables' data on the back of recent advances in self-supervised learning (SSL). This approach leverages unlabeled data to learn representations during pretraining, subsequently exploited for a supervised task.
    METHODS: We collected open access datasets recorded with the Empatica E4 wristband, spanning personal sensing tasks unrelated to MD monitoring (from emotion recognition in Super Mario players to stress detection in undergraduates), and devised a preprocessing pipeline performing on-/off-body detection, sleep/wake detection, segmentation, and (optionally) feature extraction. With 161 E4-recorded subjects, we introduced E4SelfLearning, the largest-to-date open access collection, and its preprocessing pipeline. We developed a novel E4-tailored transformer (E4mer) architecture, serving as the blueprint for both SSL and fully supervised learning; we assessed whether and under which conditions self-supervised pretraining led to an improvement over fully supervised baselines (ie, the fully supervised E4mer and pre-deep learning algorithms) in detecting acute MD episodes from recording segments taken from 64 patients (n=32, 50%, acute; n=32, 50%, stable).
    RESULTS: SSL significantly outperformed fully supervised pipelines using either our novel E4mer or extreme gradient boosting (XGBoost): n=3353 (81.23%) against n=3110 (75.35%; E4mer) and n=2973 (72.02%; XGBoost) correctly classified recording segments from a total of 4128 segments. SSL performance was strongly associated with the specific surrogate task used for pretraining, as well as with unlabeled data availability.
    CONCLUSIONS: We showed that SSL, a paradigm where a model is pretrained on unlabeled data with no need for human annotations before deployment on the supervised target task of interest, helps overcome the annotation bottleneck; the choice of the pretraining surrogate task and the size of unlabeled data for pretraining are key determinants of SSL success. We introduced E4mer, which can be used for SSL, and shared the E4SelfLearning collection, along with its preprocessing pipeline, which can foster and expedite future research into SSL for personal sensing.
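
    As one concrete picture of SSL pretraining on unlabeled wearable segments, the sketch below uses masked-segment reconstruction with a small transformer encoder. This illustrates the paradigm only; the paper's actual E4mer architecture and surrogate tasks may differ, and all dimensions here are assumed.

```python
# Illustrative masked-reconstruction pretraining (not the E4mer source).
import torch
import torch.nn as nn
import torch.nn.functional as F

enc_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
proj_in, proj_out = nn.Linear(4, 64), nn.Linear(64, 4)  # 4 sensor channels

x = torch.randn(16, 128, 4)                # (batch, time, channels) segments
mask = torch.rand(16, 128) < 0.15          # hide 15% of time steps
x_masked = x.masked_fill(mask.unsqueeze(-1), 0.0)

recon = proj_out(encoder(proj_in(x_masked)))
loss = F.mse_loss(recon[mask], x[mask])    # reconstruct only hidden steps
loss.backward()
```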

  • Article Type: Journal Article
    BACKGROUND: Soft tissue sarcomas, similar in incidence to cervical and esophageal cancers, arise from various soft tissues such as smooth muscle, fat, and fibrous tissue. Effective segmentation of sarcomas in imaging is crucial for accurate diagnosis.
    METHODS: This study collected multi-modal MRI images from 45 patients with thigh soft tissue sarcoma, totaling 8,640 images. These images were annotated by clinicians to delineate the sarcoma regions, creating a comprehensive dataset. We developed a novel segmentation model based on the UNet framework, enhanced with residual networks and attention mechanisms for improved modality-specific information extraction. Additionally, self-supervised learning strategies were employed to optimize the feature extraction capabilities of the encoders.
    RESULTS: The new model demonstrated superior segmentation performance when using multi-modal MRI images compared to single-modal inputs. The effectiveness of the model in utilizing the created dataset was validated through various experimental setups, confirming its enhanced ability to characterize tumor regions across different modalities.
    CONCLUSIONS: The integration of multi-modal MRI images and advanced machine learning techniques in our model significantly improves the segmentation of soft tissue sarcomas in thigh imaging. This advancement aids clinicians in better diagnosing and understanding the patient's condition by leveraging the strengths of different imaging modalities. Further studies could explore the application of these techniques to other types of soft tissue sarcomas and additional anatomical sites.
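
    The abstract does not publish the model internals, but the kind of attention gate commonly added to UNet skip connections gives a feel for the mechanism; the block below is a generic sketch with assumed channel sizes, not the paper's architecture.

```python
# Generic UNet-style attention gate (illustrative; not the paper's model).
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.w_skip = nn.Conv2d(skip_ch, inter_ch, 1)
        self.w_gate = nn.Conv2d(gate_ch, inter_ch, 1)
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, 1), nn.Sigmoid())

    def forward(self, skip, gate):
        # gate: coarser decoder features, resized to the skip resolution
        gate = nn.functional.interpolate(gate, size=skip.shape[2:])
        attn = self.psi(torch.relu(self.w_skip(skip) + self.w_gate(gate)))
        return skip * attn                   # suppress irrelevant regions

block = AttentionGate(skip_ch=64, gate_ch=128, inter_ch=32)
out = block(torch.randn(1, 64, 56, 56), torch.randn(1, 128, 28, 28))
print(out.shape)                             # torch.Size([1, 64, 56, 56])
```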

  • Article Type: Journal Article
    In digital pathology, whole-slide images (WSIs) are widely used for applications such as cancer diagnosis and prognosis prediction. Vision transformer (ViT) models have recently emerged as a promising method for encoding large regions of WSIs while preserving spatial relationships among patches. However, due to the large number of model parameters and the limited labeled data, applying transformer models to WSIs remains challenging. In this study, we propose a pretext task to train the transformer model in a self-supervised manner. Our model, MaskHIT, uses the transformer output to reconstruct masked patches, measured by a contrastive loss. We pre-trained the MaskHIT model using over 7000 WSIs from TCGA and extensively evaluated its performance in multiple experiments, covering survival prediction, cancer subtype classification, and grade prediction tasks. Our experiments demonstrate that the pre-training procedure enables context-aware understanding of WSIs, facilitates the learning of representative histological features based on patch positions and visual patterns, and is essential for the ViT model to achieve optimal results on WSI-level tasks. The pre-trained MaskHIT surpasses various multiple instance learning approaches by 3% and 2% on survival prediction and cancer subtype classification tasks, respectively, and also outperforms recent state-of-the-art transformer-based methods. Finally, a comparison between the attention maps generated by the MaskHIT model and the pathologist's annotations indicates that the model can accurately identify clinically relevant histological structures on the whole slide for each task.
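
    The pretext task pairs masked-patch reconstruction with a contrastive objective. One plausible reading, sketched below with stand-in tensors, scores each reconstructed patch embedding against the true patch embeddings with an InfoNCE-style loss; this is our interpretation, not the released MaskHIT code.

```python
# InfoNCE-style loss over reconstructed patch embeddings (interpretive sketch).
import torch
import torch.nn.functional as F

def contrastive_reconstruction_loss(pred, target, tau=0.1):
    """pred, target: (num_masked, dim). Each prediction should match its
    own target more closely than any other masked patch in the batch."""
    pred, target = F.normalize(pred, dim=-1), F.normalize(target, dim=-1)
    logits = pred @ target.t() / tau          # pairwise similarities
    labels = torch.arange(pred.size(0))       # the diagonal is the positive
    return F.cross_entropy(logits, labels)

pred = torch.randn(32, 256)    # transformer outputs at masked positions
target = torch.randn(32, 256)  # embeddings of the original patches
print(contrastive_reconstruction_loss(pred, target))
```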