Keywords: collaborative learning; language bias; visual question answering

Source: DOI:10.3390/jimaging10030056   PDF (PubMed)

Abstract:
Language bias is a notable concern in visual question answering (VQA): models tend to rely on spurious correlations between questions and answers for prediction. This prevents the models from generalizing effectively and degrades performance. To address this bias, we propose a novel modality-fusion collaborative de-biasing algorithm (CoD). In our approach, bias is viewed as the model's neglect of information from a particular modality during prediction. We employ a collaborative training approach to facilitate mutual modeling between modalities, achieving efficient feature fusion and enabling the model to fully leverage multimodal knowledge for prediction. Experiments on several datasets, including VQA-CP v2, VQA v2, and VQA-VS, under different validation strategies, demonstrate the effectiveness of our approach. Notably, with a basic baseline model, CoD reaches an accuracy of 60.14% on VQA-CP v2.
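To make the idea of "bias as neglect of a modality" concrete, here is a minimal sketch of one plausible reading of such a collaborative objective: a fused multimodal branch minimizes the task loss while auxiliary single-modality branches (question-only and vision-only) are trained alongside it, so that neither modality can be silently ignored. All function and branch names here are illustrative assumptions, not the paper's actual CoD implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cod_style_loss(logits_fused, logits_q, logits_v, target, alpha=0.5):
    """Hypothetical collaborative de-biasing objective (assumption, not
    the authors' exact algorithm): cross-entropy on the fused prediction
    plus weighted auxiliary cross-entropies on the question-only and
    vision-only branches, which keeps both modalities informative."""
    def ce(logits):
        p = softmax(logits)
        return -np.log(p[np.arange(len(target)), target]).mean()
    # Task loss on the fused branch + auxiliary single-modality losses.
    return ce(logits_fused) + alpha * (ce(logits_q) + ce(logits_v))
```

Setting `alpha=0` recovers a plain fused-branch baseline; a positive `alpha` adds pressure on each single-modality branch to carry task signal.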