Keywords: audio-visual scene-aware dialog system; event keyword driven; multimodal representation learning; multimodal deep learning

Source: DOI:10.3390/s23187875

Abstract:
With the development of multimedia systems in wireless environments, there is a rising need for artificial intelligence systems that can properly communicate with humans, comprehensively understanding various types of information in a human-like manner. This paper therefore addresses an audio-visual scene-aware dialog system that can communicate with users about audio-visual scenes. Such a system must comprehensively understand not only visual and textual information but also audio information. Despite substantial progress in multimodal representation learning with language and visual modalities, two caveats remain: ineffective use of auditory information and a lack of interpretability in the deep learning systems' reasoning. To address these issues, we propose a novel audio-visual scene-aware dialog system that utilizes a set of explicit information from each modality, expressed as natural language, which can be fused into a language model in a natural way. It leverages a transformer-based decoder to generate a coherent and correct response based on multimodal knowledge in a multitask learning setting. In addition, we present a response-driven temporal moment localization method for interpreting the model, verifying how the system generates its responses. The system itself provides the user with the evidence referred to during response generation, in the form of a timestamp within the scene. We show the superiority of the proposed model over the baseline in all quantitative and qualitative measurements. In particular, the proposed model achieved robust performance even in environments using all three modalities, including audio. We also conducted extensive experiments to investigate the proposed model, and we obtained state-of-the-art performance on the system response reasoning task.
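To make the fusion idea concrete, below is a minimal, hypothetical sketch (not the paper's actual implementation) of how explicit per-modality information, once verbalized as natural language, can be concatenated into a single text sequence and fed to an off-the-shelf transformer decoder such as GPT-2. The variable names (audio_events, visual_caption, dialog_history) and the example scene are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: fusing per-modality information as natural language
# and generating a dialog response with a transformer decoder.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Explicit information from each modality, already converted to text
# (e.g., detected audio events, a visual caption, prior dialog turns).
audio_events = "audio: a door slams, then footsteps approach"
visual_caption = "video: a man enters the kitchen and opens the fridge"
dialog_history = "Q: what does the man do first? A: he walks into the kitchen."
question = "Q: what sound is heard before he enters?"

# Fuse all modalities by simple text concatenation, so the language model
# consumes them "in a natural way" as one prompt.
prompt = " ".join([audio_events, visual_caption, dialog_history, question, "A:"])

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
)
# Decode only the newly generated tokens (the response after the prompt).
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)
```

The actual system additionally fine-tunes the decoder with multitask objectives and localizes the temporal evidence for its response; this sketch only illustrates the text-level fusion step described in the abstract.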