Keywords: depression risk detection; emotion elicitation paradigm; multimodal data

MeSH: Humans; Depression / diagnosis; Video Recording; Emotions / physiology; Deep Learning; Facial Expression; Female; Male; Adult; Neural Networks, Computer

Source: DOI:10.3390/s24123714 | PDF (PubMed)

Abstract:
Depression is a major psychological disorder with a growing impact worldwide. Traditional methods for detecting the risk of depression, predominantly reliant on psychiatric evaluations and self-assessment questionnaires, are often criticized for their inefficiency and lack of objectivity. Advancements in deep learning have paved the way for innovations in depression risk detection methods that fuse multimodal data. This paper introduces a novel framework, the Audio, Video, and Text Fusion-Three Branch Network (AVTF-TBN), designed to amalgamate auditory, visual, and textual cues for a comprehensive analysis of depression risk. Our approach encompasses three dedicated branches-Audio Branch, Video Branch, and Text Branch-each responsible for extracting salient features from the corresponding modality. These features are subsequently fused through a multimodal fusion (MMF) module, yielding a robust feature vector that feeds into a predictive modeling layer. To further our research, we devised an emotion elicitation paradigm based on two distinct tasks-reading and interviewing-implemented to gather a rich, sensor-based depression risk detection dataset. The sensory equipment, such as cameras, captures subtle facial expressions and vocal characteristics essential for our analysis. The research thoroughly investigates the data generated by varying emotional stimuli and evaluates the contribution of different tasks to emotion evocation. During the experiment, the AVTF-TBN model has the best performance when the data from the two tasks are simultaneously used for detection, where the F1 Score is 0.78, Precision is 0.76, and Recall is 0.81. Our experimental results confirm the validity of the paradigm and demonstrate the efficacy of the AVTF-TBN model in detecting depression risk, showcasing the crucial role of sensor-based data in mental health detection.
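The abstract describes a three-branch design in which the Audio, Video, and Text Branches each extract modality-specific features that the MMF module fuses into a single vector for prediction. As a minimal sketch, the fusion can be illustrated with simple feature concatenation; the feature dimensions and concatenation-based fusion here are illustrative assumptions, not details taken from the paper, whose actual MMF module may be more elaborate. The reported F1 Score can also be sanity-checked as the harmonic mean of the reported Precision and Recall.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors (dimensions are illustrative,
# not taken from the paper).
audio_feat = rng.standard_normal(128)  # Audio Branch output
video_feat = rng.standard_normal(128)  # Video Branch output
text_feat = rng.standard_normal(128)   # Text Branch output

# Stand-in for the MMF module: concatenate branch features into one
# fused vector that would feed the predictive modeling layer.
fused = np.concatenate([audio_feat, video_feat, text_feat])
assert fused.shape == (384,)

# Sanity check of the reported metrics: F1 is the harmonic mean of
# precision and recall.
precision, recall = 0.76, 0.81
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.78
```

With Precision 0.76 and Recall 0.81, the harmonic mean is 0.784, which rounds to the reported F1 of 0.78, so the three figures are internally consistent.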