cross-modal

  • Article type: Journal Article
    The influence of extrinsic hand-feel touch cues on consumer experiences in food and beverage consumption is well established. However, their impact on trigeminal perception, particularly the oral irritation caused by capsaicin or spicy foods, is less understood. This study aimed to determine the existence of cross-modal associations between hand-feel touch and capsaicin-induced oral irritation. It further investigated whether these potential associations were driven by the sensory contributions of the hand-feel tactile materials (measured by instrumental physical parameters) or by affective responses (evaluated through hedonic scales and the self-reported emotion questionnaire, EsSense Profile®, by consumers). In our study, 96 participants tasted a capsaicin solution while engaging with nine hand-feel tactile materials, i.e., cardboard, linen, rattan, silicone, stainless steel, sandpaper (fine), sandpaper (rough), sponge, and towel. They subsequently rated their liking and emotional responses, the perceived intensity of oral irritation, and the congruency between hand-feel tactile sensation and oral irritation. Instrumental measurements characterized the surface texture of the hand-feel tactile materials, and these measurements were correlated with the collected sensory data. The results revealed unique cross-modal associations between hand-feel touch and capsaicin-induced oral irritation. Specifically, while the sandpapers demonstrated high congruence with the sensation of oral irritation, stainless steel was found to be the least congruent. These associations were influenced both by the common emotional responses ("active," "aggressive," "daring," "energetic," "guilty," and "worried") evoked by the hand-feel tactile materials and the capsaicin, and by participants' liking for the hand-feel tactile materials and the characteristics of the surface textures. This study provides empirical evidence of the cross-modality between hand-feel tactile sensations and capsaicin-induced oral irritation, opening new avenues for future research in this area.
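
    As a rough illustration of the kind of analysis described above (relating instrumental surface-texture parameters to congruency ratings), a minimal Python sketch follows. The material list matches the study, but the numeric values and the choice of Spearman correlation are illustrative assumptions, not the study's data or code.

    # Hypothetical sketch: correlating instrumental surface-texture parameters with
    # mean congruency ratings. Values are illustrative placeholders, not study data.
    import numpy as np
    from scipy.stats import spearmanr

    materials = ["cardboard", "linen", "rattan", "silicone", "stainless steel",
                 "sandpaper (fine)", "sandpaper (rough)", "sponge", "towel"]

    # Illustrative instrumental roughness values (e.g., arithmetic mean roughness, um)
    roughness = np.array([1.2, 2.5, 3.8, 0.6, 0.1, 8.4, 15.0, 5.1, 4.2])

    # Illustrative mean congruency ratings with capsaicin-induced oral irritation (0-100)
    congruency = np.array([35, 42, 50, 28, 15, 78, 85, 55, 48])

    rho, p_value = spearmanr(roughness, congruency)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")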

  • Article type: Journal Article
    Skeleton-based action recognition, renowned for its computational efficiency and indifference to lighting variations, has become a focal point in the realm of motion analysis. However, most current methods typically only extract global skeleton features, overlooking the potential semantic relationships among various partial limb motions. For instance, the subtle differences between actions such as "brush teeth" and "brush hair" are mainly distinguished by specific elements. Although combining limb movements provides a more holistic representation of an action, relying solely on skeleton points proves inadequate for capturing these nuances. Therefore, integrating detailed linguistic descriptions into the learning process of skeleton features is essential. This motivates us to explore integrating fine-grained language descriptions into the learning process of skeleton features to capture more discriminative skeleton behavior representations. To this end, we introduce a new Linguistic-Driven Partial Semantic Relevance Learning framework (LPSR) in this work. While using state-of-the-art large language models to generate linguistic descriptions of local limb motions and further constrain the learning of local motions, we also aggregate global skeleton point representations and textual representations (generated from an LLM) to obtain a more generalized cross-modal behavioral representation. On this basis, we propose a cyclic attentional interaction module to model the implicit correlations between partial limb motions. Numerous ablation experiments demonstrate the effectiveness of the method proposed in this paper, and our method also obtains state-of-the-art results.
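
    A minimal sketch, assuming a PyTorch setting, of how global skeleton features might be aggregated with LLM-generated text embeddings via cross-attention, in the spirit of the framework described above; the module, dimensions, and variable names are hypothetical and not the authors' LPSR implementation.

    # Hypothetical fusion module: skeleton queries attend to text-token embeddings.
    import torch
    import torch.nn as nn

    class SkeletonTextFusion(nn.Module):
        def __init__(self, skeleton_dim=256, text_dim=768, shared_dim=256, num_heads=4):
            super().__init__()
            self.skel_proj = nn.Linear(skeleton_dim, shared_dim)   # project skeleton features
            self.text_proj = nn.Linear(text_dim, shared_dim)       # project text embeddings
            self.cross_attn = nn.MultiheadAttention(shared_dim, num_heads, batch_first=True)

        def forward(self, skel_feats, text_feats):
            # skel_feats: (batch, joints, skeleton_dim); text_feats: (batch, tokens, text_dim)
            q = self.skel_proj(skel_feats)
            kv = self.text_proj(text_feats)
            fused, _ = self.cross_attn(q, kv, kv)   # skeleton queries attend to text tokens
            return fused.mean(dim=1)                # pooled cross-modal representation

    # Usage with random tensors standing in for real features
    fusion = SkeletonTextFusion()
    out = fusion(torch.randn(2, 25, 256), torch.randn(2, 16, 768))
    print(out.shape)  # torch.Size([2, 256])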

  • Article type: Journal Article
    Continuous Sign Language Recognition (CSLR) is a task which converts a sign language video into a gloss sequence. Existing deep-learning-based sign language recognition methods usually rely on large-scale training data and rich supervised information. However, current sign language datasets are limited, and they are only annotated at sentence-level rather than frame-level. Inadequate supervision of sign language data poses a serious challenge for sign language recognition, which may result in insufficient training of sign language recognition models. To address the above problems, we propose a cross-modal knowledge distillation method for continuous sign language recognition, which contains two teacher models and one student model. One of the teacher models is the Sign2Text dialogue teacher model, which takes a sign language video and a dialogue sentence as input and outputs the sign language recognition result. The other teacher model is the Text2Gloss translation teacher model, which aims to translate a text sentence into a gloss sequence. Both teacher models can provide information-rich soft labels to assist the training of the student model, which is a general sign language recognition model. We conduct extensive experiments on multiple commonly used sign language datasets, i.e., PHOENIX 2014T, CSL-Daily and QSL. The results show that the proposed cross-modal knowledge distillation method can effectively improve the sign language recognition accuracy by transferring multi-modal information from teacher models to the student model. Code is available at https://github.com/glq-1992/cross-modal-knowledge-distillation_new.
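
    A minimal sketch, not the paper's released code, of the general mechanism described above: distilling soft labels from two teacher models into one student via a temperature-scaled KL-divergence loss. The class count and logits are placeholders.

    # Hypothetical two-teacher distillation loss; logits are random stand-ins.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits_list, T=2.0):
        """Average KL(teacher || student) over all teachers, with temperature T."""
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        loss = 0.0
        for t_logits in teacher_logits_list:
            p_teacher = F.softmax(t_logits / T, dim=-1)
            loss = loss + F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
        return loss / len(teacher_logits_list)

    student = torch.randn(8, 120)                           # e.g., 120 gloss classes (illustrative)
    teachers = [torch.randn(8, 120), torch.randn(8, 120)]   # stand-ins for the two teacher outputs
    print(distillation_loss(student, teachers))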

  • Article type: Journal Article
    Chronic neuropathic pain and chronic tinnitus have been likened to phantom percepts, in which a complete or partial sensory deafferentation results in a filling in of the missing information derived from memory. One hundred and fifty participants (50 with tinnitus, 50 with chronic pain, and 50 healthy controls) underwent a resting-state EEG. Source-localized current density is recorded from all the sensory cortices (olfactory, gustatory, somatosensory, auditory, vestibular, visual) as well as the parahippocampal area. Functional connectivity by means of lagged phase synchronization is also computed between these regions of interest. Pain and tinnitus are associated with gamma band activity, reflecting prediction errors, in all sensory cortices except the olfactory and gustatory cortex. Functional connectivity identifies theta frequency connectivity between each of the sensory cortices, except the chemical senses, and the parahippocampus, but not between the individual sensory cortices. When one sensory domain is deprived, the other senses may provide the parahippocampal 'contextual' area with the most likely sound or somatosensory sensation to fill in the gap, applying an abductive 'duck test' approach, i.e., based on stored multisensory congruence. This novel concept paves the way to develop novel treatments for pain and tinnitus, using multisensory (i.e. visual, vestibular, somatosensory, auditory) modulation with or without associated parahippocampal targeting.
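
    As a simplified stand-in for the connectivity computation described above, the sketch below estimates a phase-locking value between two synthetic theta-band signals via the Hilbert transform. Note that the study uses lagged phase synchronization, which additionally discounts zero-lag (volume-conduction) contributions, so this is only an illustrative proxy with made-up signals.

    # Simplified phase-locking value (PLV) between two synthetic signals.
    import numpy as np
    from scipy.signal import hilbert

    fs = 250                      # sampling rate (Hz)
    t = np.arange(0, 10, 1 / fs)  # 10 s of data
    x = np.sin(2 * np.pi * 6 * t) + 0.5 * np.random.randn(t.size)        # theta-band source 1
    y = np.sin(2 * np.pi * 6 * t + 0.8) + 0.5 * np.random.randn(t.size)  # source 2, phase-shifted

    phase_x = np.angle(hilbert(x))
    phase_y = np.angle(hilbert(y))
    plv = np.abs(np.mean(np.exp(1j * (phase_x - phase_y))))
    print(f"Phase-locking value: {plv:.2f}")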

  • Article type: Journal Article
    Midbrain multisensory neurons undergo a significant postnatal transition in how they process cross-modal (e.g. visual-auditory) signals. In early stages, signals derived from common events are processed competitively; however, at later stages they are processed cooperatively such that their salience is enhanced. This transition reflects adaptation to cross-modal configurations that are consistently experienced and become informative about which of them correspond to common events. Tested here was the assumption that overt behaviors follow a similar maturation. Cats were reared in omnidirectional sound, thereby compromising the experience needed for this developmental process. Animals were then repeatedly exposed to different configurations of visual and auditory stimuli (e.g. spatiotemporally congruent or spatially disparate) that varied on each side of space, and their behavior was assessed using a detection/localization task. Animals showed enhanced performance to stimuli consistent with the experience provided: congruent stimuli elicited enhanced behaviors where spatially congruent cross-modal experience was provided, and spatially disparate stimuli elicited enhanced behaviors where spatially disparate cross-modal experience was provided. Cross-modal configurations not consistent with experience did not enhance responses. The presumptive benefit of such flexibility in the multisensory developmental process is to sensitize neural circuits (and the behaviors they control) to the features of the environment in which they will function. These experiments reveal that these processes have a high degree of flexibility, such that two (conflicting) multisensory principles can be implemented by cross-modal experience on opposite sides of space, even within the same animal.

  • Article type: Journal Article
    While it is well established that sensory cortical regions traditionally thought to be unimodal can be activated by stimuli from modalities other than the dominant one, functions of such foreign-modal activations are still not clear. Here we show that visual activations in early auditory cortex can be related to whether or not the monkeys engaged in audio-visual tasks, to the time when the monkeys reacted to the visual component of such tasks, and to the correctness of the monkeys' response to the auditory component of such tasks. These relationships between visual activations and behavior suggest that auditory cortex can be recruited for visually-guided behavior and that visual activations can prime auditory cortex such that it is prepared for processing future sounds. Our study thus provides evidence that foreign-modal activations in sensory cortex can contribute to a subject's ability to perform tasks on stimuli from foreign and dominant modalities.

  • Article type: Journal Article
    Flexible responses to sensory stimuli based on changing rules are critical for adapting to a dynamic environment. However, it remains unclear how the brain encodes and uses rule information to guide behavior. Here, we made single-unit recordings while head-fixed mice performed a cross-modal sensory selection task where they switched between two rules: licking in response to tactile stimuli while rejecting visual stimuli, or vice versa. Along a cortical sensorimotor processing stream including the primary (S1) and secondary (S2) somatosensory areas, and the medial (MM) and anterolateral (ALM) motor areas, single-neuron activity distinguished between the two rules both prior to and in response to the tactile stimulus. We hypothesized that neural populations in these areas would show rule-dependent preparatory states, which would shape the subsequent sensory processing and behavior. This hypothesis was supported for the motor cortical areas (MM and ALM) by findings that (1) the current task rule could be decoded from pre-stimulus population activity; (2) neural subspaces containing the population activity differed between the two rules; and (3) optogenetic disruption of pre-stimulus states impaired task performance. Our findings indicate that flexible action selection in response to sensory input can occur via configuration of preparatory states in the motor cortex.
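
    A minimal sketch of finding (1) above, decoding the task rule from pre-stimulus population activity with a cross-validated linear classifier; the synthetic data, classifier choice, and dimensions are illustrative assumptions rather than the study's analysis pipeline.

    # Hypothetical rule decoder on synthetic pre-stimulus population activity.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_trials, n_neurons = 200, 50
    rule_labels = rng.integers(0, 2, size=n_trials)       # 0 = "respond to touch", 1 = "respond to vision"
    activity = rng.normal(size=(n_trials, n_neurons))     # pre-stimulus firing rates (synthetic)
    activity[rule_labels == 1, :10] += 0.5                # inject a weak rule signal for illustration

    scores = cross_val_score(LogisticRegression(max_iter=1000), activity, rule_labels, cv=5)
    print(f"Rule decoding accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")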

  • Article type: Journal Article
    Fine-grained representation is fundamental to species classification based on deep learning, and in this context, cross-modal contrastive learning is an effective method. The diversity of species coupled with the inherent contextual ambiguity of natural language poses a primary challenge for the cross-modal representation alignment of conservation area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment based on contextual understanding. However, during the contrastive learning process, apart from learning the differences in the data itself, a pair of encoders inevitably learns the differences caused by encoder fluctuations. The latter leads to convergence shortcuts, resulting in poor representation quality and an inaccurate reflection, within the shared feature space, of the similarity relationships between samples in the original dataset. To achieve fine-grained cross-modal representation alignment, we first propose a residual attention network to enhance consistency during momentum updates in cross-modal encoders. Building upon this, we propose momentum encoding from a multi-task perspective as a bridge for cross-modal information, effectively improving cross-modal mutual information and representation quality, and optimizing the distribution of feature points within the cross-modal shared semantic space. By acquiring momentum encoding queues for cross-modal semantic understanding through multi-tasking, we align ambiguous natural language representations around the invariant image features of factual information, alleviating contextual ambiguity and enhancing model robustness. Experimental validation shows that our proposed multi-task perspective on cross-modal momentum encoders outperforms similar models by up to 8% on leaderboards for standardized image classification and image-text cross-modal retrieval tasks on public datasets, demonstrating the effectiveness of the proposed method. Qualitative experiments on our self-built conservation area image-text paired dataset show that the proposed method accurately performs cross-modal retrieval and generation tasks among 8142 species, proving its effectiveness on fine-grained cross-modal image-text conservation area datasets.
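
    A minimal sketch of the momentum (exponential-moving-average) encoder update that the consistency argument above revolves around; this assumes the standard EMA rule used in momentum-based contrastive learning, not the authors' exact implementation.

    # Hypothetical EMA update keeping a momentum encoder consistent with its online counterpart.
    import torch
    import torch.nn as nn

    def momentum_update(online_encoder: nn.Module, momentum_encoder: nn.Module, m: float = 0.999):
        """theta_momentum <- m * theta_momentum + (1 - m) * theta_online."""
        with torch.no_grad():
            for p_online, p_momentum in zip(online_encoder.parameters(),
                                            momentum_encoder.parameters()):
                p_momentum.data.mul_(m).add_(p_online.data, alpha=1 - m)

    # Usage with toy encoders standing in for the image/text branches
    online = nn.Linear(128, 64)
    momentum = nn.Linear(128, 64)
    momentum.load_state_dict(online.state_dict())  # start from identical weights
    momentum_update(online, momentum, m=0.999)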

  • Article type: Journal Article
    Tomato leaf disease control in the field of smart agriculture urgently requires attention and reinforcement. This paper proposes a method called LAFANet for image-text retrieval, which integrates image and text information for joint analysis of multimodal data, helping agricultural practitioners obtain more comprehensive and in-depth diagnostic evidence to ensure the quality and yield of tomatoes. First, we focus on six common tomato leaf diseases with images and text descriptions, creating a Tomato Leaf Disease Image-Text Retrieval Dataset (TLDITRD) and introducing image-text retrieval into the field of tomato leaf disease retrieval. Then, utilizing ViT and BERT models, we extract detailed image features and sequences of textual features, incorporating contextual information from image-text pairs. To address errors in image-text retrieval caused by complex backgrounds, we propose Learnable Fusion Attention (LFA) to amplify the fusion of textual and image features, thereby extracting substantial semantic insights from both modalities. To delve further into the semantic connections across modalities, we propose a False Negative Elimination-Adversarial Negative Selection (FNE-ANS) approach. This method aims to identify adversarial negative instances that specifically target false negatives within the triplet function, thereby imposing constraints on the model. To bolster the model's capacity for generalization and precision, we propose Adversarial Regularization (AR). This approach involves incorporating adversarial perturbations during model training, thereby fortifying its resilience and adaptability to slight variations in input data. Experimental results show that, compared with existing state-of-the-art models, LAFANet performed best on the TLDITRD dataset, with top1, top5, and top10 reaching 83.3% and 90.0%, and top1, top5, and top10 reaching 80.3%, 93.7%, and 96.3%. LAFANet offers fresh technical backing and algorithmic insights for the retrieval of tomato leaf disease through image-text correlation.
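
    A minimal sketch of a margin-based triplet loss for image-text retrieval, the kind of objective the FNE-ANS strategy above constrains; the false-negative filtering itself is only indicated by a placeholder mask, and all names and values are hypothetical rather than LAFANet code.

    # Hypothetical hinge-style triplet loss on cosine similarity, with an optional
    # mask to ignore suspected false negatives.
    import torch
    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, margin=0.2, valid_negative_mask=None):
        sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
        sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
        losses = F.relu(margin + sim_neg - sim_pos)
        if valid_negative_mask is not None:
            losses = losses * valid_negative_mask      # drop suspected false negatives
        return losses.mean()

    img = F.normalize(torch.randn(8, 256), dim=-1)      # image embeddings (placeholders)
    txt_pos = F.normalize(torch.randn(8, 256), dim=-1)  # matching text embeddings
    txt_neg = F.normalize(torch.randn(8, 256), dim=-1)  # mined negative text embeddings
    print(triplet_loss(img, txt_pos, txt_neg))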

  • Article type: Journal Article
    Analysis of large-scale digital whole slide image (WSI) datasets has gained significant attention in computer-aided cancer diagnosis. Content-based histopathological image retrieval (CBHIR) is a technique that searches a large database for data samples matching input objects in both detail and semantics, offering relevant diagnostic information to pathologists. However, current methods are limited by the gigapixel scale and variable size of WSIs, and by the dependence on manual annotations. In this work, we propose a novel histopathology language-image representation learning framework for fine-grained digital pathology cross-modal retrieval, which utilizes paired diagnosis reports to learn fine-grained semantics from the WSI. An anchor-based WSI encoder is built to extract hierarchical region features, and a prompt-based text encoder is introduced to learn fine-grained semantics from the diagnosis reports. The proposed framework is trained with a multivariate cross-modal loss function to learn semantic information from the diagnosis report at both the instance level and the region level. After training, it can perform four types of retrieval tasks based on the multi-modal database to support diagnostic requirements. We conducted experiments on an in-house dataset and a public dataset to evaluate the proposed method. Extensive experiments have demonstrated the effectiveness of the proposed method and its advantages over present histopathology retrieval methods. The code is available at https://github.com/hudingyi/FGCR.
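
    A minimal sketch, under the assumption that instance-level alignment uses a standard symmetric contrastive objective, of aligning WSI embeddings with report embeddings; this is not the released FGCR code, and the dimensions and temperature are illustrative.

    # Hypothetical instance-level symmetric contrastive loss between WSI and report embeddings.
    import torch
    import torch.nn.functional as F

    def symmetric_contrastive_loss(wsi_emb, text_emb, temperature=0.07):
        wsi_emb = F.normalize(wsi_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = wsi_emb @ text_emb.t() / temperature   # pairwise similarities
        targets = torch.arange(wsi_emb.size(0))         # matched pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    print(symmetric_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512)))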
