
  • Article type: Journal Article
    The influence of extrinsic hand-feel touch cues on consumer experiences in food and beverage consumption is well established. However, their impact on trigeminal perception, particularly the oral irritation caused by capsaicin or spicy foods, is less understood. This study aimed to determine whether cross-modal associations exist between hand-feel touch and capsaicin-induced oral irritation. It also investigated whether these potential associations were driven by the sensory contributions of the hand-feel tactile materials (measured by instrumental physical parameters) or by affective responses (evaluated by consumers through hedonic scales and the self-reported emotion questionnaire, EsSense Profile®). In our study, 96 participants tasted a capsaicin solution while engaging with nine hand-feel tactile materials, i.e., cardboard, linen, rattan, silicone, stainless steel, sandpaper (fine), sandpaper (rough), sponge, and towel. They subsequently rated their liking and emotional responses, perceived intensity of oral irritation, and the congruency between hand-feel tactile sensation and oral irritation. Instrumental measurements characterized the surface texture of the hand-feel tactile materials, and these measurements were correlated with the collected sensory data. The results revealed unique cross-modal associations between hand-feel touch and capsaicin-induced oral irritation. Specifically, while sandpapers demonstrated high congruence with the sensation of oral irritation, stainless steel was found to be the least congruent. These associations were influenced both by the common emotional responses ("active," "aggressive," "daring," "energetic," "guilty," and "worried") evoked by the hand-feel tactile materials and the capsaicin, and by participants' liking for the hand-feel tactile materials and the characteristics of the surface textures. This study provides empirical evidence of the cross-modality between hand-feel tactile sensations and capsaicin-induced oral irritation, opening new avenues for future research in this area.






  • Article type: Journal Article
    Skeleton-based action recognition, known for its computational efficiency and robustness to lighting variations, has become a focal point in motion analysis. However, most current methods extract only global skeleton features, overlooking the potential semantic relationships among partial limb motions. For instance, the subtle differences between actions such as "brush teeth" and "brush hair" are mainly distinguished by specific elements. Although combining limb movements provides a more holistic representation of an action, relying solely on skeleton points proves inadequate for capturing these nuances. This motivates us to integrate fine-grained language descriptions into the learning process of skeleton features in order to capture more discriminative skeleton behavior representations. To this end, we introduce a new Linguistic-Driven Partial Semantic Relevance Learning framework (LPSR) in this work. We use state-of-the-art large language models (LLMs) to generate linguistic descriptions of local limb motions, which further constrain the learning of local motions, and we aggregate global skeleton point representations with textual representations (generated by an LLM) to obtain a more generalized cross-modal behavioral representation. On this basis, we propose a cyclic attentional interaction module to model the implicit correlations between partial limb motions. Extensive ablation experiments demonstrate the effectiveness of the proposed method, which also obtains state-of-the-art results.






  • Article type: Journal Article
    Continuous Sign Language Recognition (CSLR) is the task of converting a sign language video into a gloss sequence. Existing deep-learning-based sign language recognition methods usually rely on large-scale training data and rich supervised information. However, current sign language datasets are limited, and they are annotated only at the sentence level rather than the frame level. Inadequate supervision of sign language data poses a serious challenge for sign language recognition and may result in insufficient training of recognition models. To address these problems, we propose a cross-modal knowledge distillation method for continuous sign language recognition, which contains two teacher models and one student model. One of the teacher models is the Sign2Text dialogue teacher model, which takes a sign language video and a dialogue sentence as input and outputs the sign language recognition result. The other teacher model is the Text2Gloss translation teacher model, which aims to translate a text sentence into a gloss sequence. Both teacher models can provide information-rich soft labels to assist the training of the student model, which is a general sign language recognition model. We conduct extensive experiments on multiple commonly used sign language datasets, i.e., PHOENIX 2014T, CSL-Daily and QSL. The results show that the proposed cross-modal knowledge distillation method can effectively improve sign language recognition accuracy by transferring multi-modal information from the teacher models to the student model. Code is available at
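
    The abstract does not state the exact distillation objective, so the following is only a minimal sketch of the general idea of training a student on soft labels from two teachers. It assumes the standard temperature-softened KL-divergence loss averaged over teachers; the function names, the temperature value, and the toy logits are hypothetical and are not taken from the paper, which would also involve sequence-level gloss prediction not shown here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Average KL(teacher || student) over all teachers (assumed objective).

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    student = softmax(student_logits, temperature)
    losses = [
        kl_divergence(softmax(teacher, temperature), student)
        for teacher in teacher_logits_list
    ]
    return (temperature ** 2) * sum(losses) / len(losses)


# Toy example: one frame, three gloss classes, two teachers (hypothetical values).
sign2text_teacher = [0.5, 2.0, 3.5]
text2gloss_teacher = [0.0, 1.0, 4.0]
student = [3.0, 2.0, 1.0]
loss = distillation_loss(student, [sign2text_teacher, text2gloss_teacher])
```

    In practice this soft-label term would typically be added to the student's ordinary supervised loss with a weighting factor, which is again an assumption rather than something the abstract specifies.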






  • Article type: Journal Article
    Chronic neuropathic pain and chronic tinnitus have been likened to phantom percepts, in which a complete or partial sensory deafferentation results in a filling in of the missing information derived from memory. A total of 150 participants (50 with tinnitus, 50 with chronic pain, and 50 healthy controls) underwent resting-state EEG. Source-localized current density is recorded from all the sensory cortices (olfactory, gustatory, somatosensory, auditory, vestibular, visual) as well as the parahippocampal area. Functional connectivity by means of lagged phase synchronization is also computed between these regions of interest. Pain and tinnitus are associated with gamma band activity, reflecting prediction errors, in all sensory cortices except the olfactory and gustatory cortex. Functional connectivity analysis identifies theta-frequency connectivity between the parahippocampus and each of the sensory cortices except those of the chemical senses, but not between the individual sensory cortices. When one sensory domain is deprived, the other senses may provide the parahippocampal 'contextual' area with the most likely sound or somatosensory sensation to fill in the gap, applying an abductive 'duck test' approach, i.e., based on stored multisensory congruence. This novel concept paves the way for developing new treatments for pain and tinnitus, using multisensory (i.e., visual, vestibular, somatosensory, auditory) modulation with or without associated parahippocampal targeting.






  • Article type: Journal Article
    Midbrain multisensory neurons undergo a significant postnatal transition in how they process cross-modal (e.g. visual-auditory) signals. In early stages, signals derived from common events are processed competitively; however, at later stages they are processed cooperatively such that their salience is enhanced. This transition reflects adaptation to cross-modal configurations that are consistently experienced and become informative about which correspond to common events. Tested here was the assumption that overt behaviors follow a similar maturation. Cats were reared in omnidirectional sound thereby compromising the experience needed for this developmental process. Animals were then repeatedly exposed to different configurations of visual and auditory stimuli (e.g. spatiotemporally congruent or spatially disparate) that varied on each side of space and their behavior was assessed using a detection/localization task. Animals showed enhanced performance to stimuli consistent with the experience provided: congruent stimuli elicited enhanced behaviors where spatially congruent cross-modal experience was provided, and spatially disparate stimuli elicited enhanced behaviors where spatially disparate cross-modal experience was provided. Cross-modal configurations not consistent with experience did not enhance responses. The presumptive benefit of such flexibility in the multisensory developmental process is to sensitize neural circuits (and the behaviors they control) to the features of the environment in which they will function. These experiments reveal that these processes have a high degree of flexibility, such that two (conflicting) multisensory principles can be implemented by cross-modal experience on opposite sides of space even within the same animal.






  • Article type: Journal Article
    While it is well established that sensory cortical regions traditionally thought to be unimodal can be activated by stimuli from modalities other than the dominant one, functions of such foreign-modal activations are still not clear. Here we show that visual activations in early auditory cortex can be related to whether or not the monkeys engaged in audio-visual tasks, to the time when the monkeys reacted to the visual component of such tasks, and to the correctness of the monkeys' response to the auditory component of such tasks. These relationships between visual activations and behavior suggest that auditory cortex can be recruited for visually-guided behavior and that visual activations can prime auditory cortex such that it is prepared for processing future sounds. Our study thus provides evidence that foreign-modal activations in sensory cortex can contribute to a subject's ability to perform tasks on stimuli from foreign and dominant modalities.






  • Article type: Journal Article
    Flexible responses to sensory stimuli based on changing rules are critical for adapting to a dynamic environment. However, it remains unclear how the brain encodes and uses rule information to guide behavior. Here, we made single-unit recordings while head-fixed mice performed a cross-modal sensory selection task where they switched between two rules: licking in response to tactile stimuli while rejecting visual stimuli, or vice versa. Along a cortical sensorimotor processing stream including the primary (S1) and secondary (S2) somatosensory areas, and the medial (MM) and anterolateral (ALM) motor areas, single-neuron activity distinguished between the two rules both prior to and in response to the tactile stimulus. We hypothesized that neural populations in these areas would show rule-dependent preparatory states, which would shape the subsequent sensory processing and behavior. This hypothesis was supported for the motor cortical areas (MM and ALM) by findings that (1) the current task rule could be decoded from pre-stimulus population activity; (2) neural subspaces containing the population activity differed between the two rules; and (3) optogenetic disruption of pre-stimulus states impaired task performance. Our findings indicate that flexible action selection in response to sensory input can occur via configuration of preparatory states in the motor cortex.






  • Article type: Journal Article
    Fine-grained representation is fundamental to species classification based on deep learning, and in this context, cross-modal contrastive learning is an effective method. The diversity of species coupled with the inherent contextual ambiguity of natural language poses a primary challenge in the cross-modal representation alignment of conservation area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment based on contextual understanding. However, during the contrastive learning process, apart from learning the differences in the data itself, a pair of encoders inevitably learns the differences caused by encoder fluctuations. The latter leads to convergence shortcuts, resulting in poor representation quality and an inaccurate reflection of the similarity relationships between samples in the original dataset within the shared space of features. To achieve fine-grained cross-modal representation alignment, we first propose a residual attention network to enhance consistency during momentum updates in cross-modal encoders. Building upon this, we propose momentum encoding from a multi-task perspective as a bridge for cross-modal information, effectively improving cross-modal mutual information, representation quality, and optimizing the distribution of feature points within the cross-modal shared semantic space. By acquiring momentum encoding queues for cross-modal semantic understanding through multi-tasking, we align ambiguous natural language representations around the invariant image features of factual information, alleviating contextual ambiguity and enhancing model robustness. 
Experimental validation shows that our proposed multi-task perspective of cross-modal momentum encoders outperforms similar models on standardized image classification tasks and image-text cross-modal retrieval tasks on public datasets by up to 8% on the leaderboard, demonstrating the effectiveness of the proposed method. Qualitative experiments on our self-built conservation area image-text paired dataset show that our proposed method accurately performs cross-modal retrieval and generation tasks among 8142 species, proving its effectiveness on fine-grained cross-modal image-text conservation area image datasets.
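
    The abstract refers to momentum updates and momentum encoding queues without giving their form. As a rough illustration only, the sketch below assumes the common exponential-moving-average scheme in which a slowly changing "key" encoder tracks a trained "query" encoder, together with a fixed-size first-in, first-out queue of past features used as negatives. The names `momentum_update` and `FeatureQueue`, the flat parameter lists, and the momentum value are hypothetical; the paper's residual attention network and multi-task design are not represented.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """EMA update: key <- m * key + (1 - m) * query, element by element."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class FeatureQueue:
    """Fixed-size FIFO of encoded features; the oldest entries are dropped first."""

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def enqueue(self, features):
        # Appending beyond max_size silently evicts from the front of the queue.
        self.items.extend(features)

    def negatives(self):
        return list(self.items)


# Toy usage with scalar "parameters" and scalar "features" (hypothetical values).
key = momentum_update([1.0, 0.0], [0.0, 1.0], m=0.9)
queue = FeatureQueue(max_size=3)
queue.enqueue([1, 2])
queue.enqueue([3, 4])
```

    A momentum close to 1 keeps the key encoder nearly constant between steps, which is the usual motivation for this scheme: features already stored in the queue stay comparable with newly encoded ones.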






  • Article type: Journal Article
    Tomato leaf disease control in the field of smart agriculture urgently requires attention and reinforcement. This paper proposes a method called LAFANet for image-text retrieval, which integrates image and text information for joint analysis of multimodal data, giving agricultural practitioners more comprehensive and in-depth diagnostic evidence to ensure the quality and yield of tomatoes. First, we focus on images and text descriptions of six common tomato leaf diseases, creating a Tomato Leaf Disease Image-Text Retrieval Dataset (TLDITRD) and introducing image-text retrieval into the field of tomato leaf disease retrieval. Then, utilizing ViT and BERT models, we extract detailed image features and sequences of textual features, incorporating contextual information from image-text pairs. To address errors in image-text retrieval caused by complex backgrounds, we propose Learnable Fusion Attention (LFA) to amplify the fusion of textual and image features, thereby extracting substantial semantic insights from both modalities. To delve further into the semantic connections across modalities, we propose a False Negative Elimination-Adversarial Negative Selection (FNE-ANS) approach. This method aims to identify adversarial negative instances that specifically target false negatives within the triplet function, thereby imposing constraints on the model. To bolster the model's capacity for generalization and precision, we propose Adversarial Regularization (AR). This approach incorporates adversarial perturbations during model training, thereby fortifying the model's resilience and adaptability to slight variations in input data. Experimental results show that LAFANet outperformed existing state-of-the-art models on the TLDITRD dataset, with top1, top5, and top10 reaching 83.3% and 90.0%, and top1, top5, and top10 reaching 80.3%, 93.7%, and 96.3%. 
LAFANet offers fresh technical backing and algorithmic insights for the retrieval of tomato leaf disease through image-text correlation.
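
    To make the triplet objective and the top-k figures mentioned above more concrete, here is a small sketch under stated assumptions. It implements a plain hinge-based triplet loss with the hardest in-batch negative and a top-k (recall@k) metric over an image-by-text similarity matrix. This is generic hardest-negative mining, not the FNE-ANS procedure, whose details the abstract does not give; the function names, the margin, and the toy similarity values are hypothetical.

```python
def hardest_negative_triplet_loss(sim, margin=0.2):
    """Bidirectional triplet loss using the hardest negative in the batch.

    sim[i][j] is the similarity between image i and text j; matched pairs
    lie on the diagonal. Requires at least two pairs.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        # Most similar non-matching text for image i.
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Most similar non-matching image for text i.
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - positive)
        total += max(0.0, margin + hardest_image - positive)
    return total / n


def recall_at_k(sim, k):
    """Fraction of images whose matching text is ranked within the top k."""
    n = len(sim)
    hits = 0
    for i in range(n):
        ranked = sorted(range(n), key=lambda j: sim[i][j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / n


# Toy 3x3 similarity matrix (hypothetical values).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.6],
]
loss = hardest_negative_triplet_loss(sim)
top1 = recall_at_k(sim, 1)
top2 = recall_at_k(sim, 2)
```

    In this toy matrix the third image is more similar to the first text than to its own, so it is missed at top1 but recovered at top2, and it is the only pair that contributes a non-zero loss term.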






  • Article type: Journal Article
    The analysis of large-scale digital whole slide image (WSI) datasets has gained significant attention in computer-aided cancer diagnosis. Content-based histopathological image retrieval (CBHIR) is a technique that searches a large database for data samples matching input objects in both details and semantics, offering relevant diagnostic information to pathologists. However, current methods are limited by the gigapixel scale and variable size of WSIs and by their dependence on manual annotations. In this work, we propose a novel histopathology language-image representation learning framework for fine-grained digital pathology cross-modal retrieval, which utilizes paired diagnosis reports to learn fine-grained semantics from the WSI. An anchor-based WSI encoder is built to extract hierarchical region features, and a prompt-based text encoder is introduced to learn fine-grained semantics from the diagnosis reports. The proposed framework is trained with a multivariate cross-modal loss function to learn semantic information from the diagnosis report at both the instance level and the region level. After training, it can perform four types of retrieval tasks based on the multi-modal database to support diagnostic requirements. We conducted experiments on an in-house dataset and a public dataset to evaluate the proposed method. Extensive experiments have demonstrated the effectiveness of the proposed method and its advantages over existing histopathology retrieval methods. The code is available at





