Speech Emotion Recognition

  • Article type: Journal Article
    In recent years, artificial intelligence and machine learning (ML) models have advanced significantly, offering transformative solutions across diverse sectors. Emotion recognition in speech has particularly benefited from ML techniques, revolutionizing its accuracy and applicability. This article proposes a method for emotion detection in Romanian speech by combining two distinct approaches: semantic analysis using a GPT Transformer and acoustic analysis using openSMILE. The results showed an accuracy of 74% and a precision of almost 82%. Several system limitations were observed due to the limited and low-quality dataset. However, the work also opened a new horizon for our research: analyzing emotions to identify mental health disorders.
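    As a rough illustration of how such a fusion could be wired up, the sketch below pairs openSMILE acoustic functionals with a placeholder text scorer standing in for the GPT-based semantic analysis; the feature set, file paths, label scheme, and fusion classifier are assumptions, not the authors' configuration.

```python
# Minimal sketch: fuse openSMILE acoustic functionals with a text-derived
# semantic score vector, then train a simple classifier on the combination.
# Assumes the `opensmile`, `scikit-learn`, and `numpy` packages; the transcript
# scorer below is only a placeholder for the GPT-based semantic analysis.
import numpy as np
import opensmile
from sklearn.linear_model import LogisticRegression

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # eGeMAPS functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

def semantic_scores(transcript: str) -> np.ndarray:
    """Placeholder for the GPT-based semantic analysis (per-emotion scores)."""
    return np.zeros(4)  # e.g. [happy, sad, angry, neutral]

def combined_features(wav_path: str, transcript: str) -> np.ndarray:
    acoustic = smile.process_file(wav_path).to_numpy().ravel()
    return np.concatenate([acoustic, semantic_scores(transcript)])

def train(pairs, labels):
    """pairs: list of (wav_path, transcript); labels: emotion classes."""
    X = np.vstack([combined_features(p, t) for p, t in pairs])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```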

  • Article type: Journal Article
    OBJECTIVE: The emotions of people at various stages of dementia need to be effectively utilized for prevention, early intervention, and care planning. With technology available for understanding and addressing people's emotional needs, this study aims to develop speech emotion recognition (SER) technology to classify emotions for people at high risk of dementia.
    METHODS: Speech samples from people at high risk of dementia were categorized into distinct emotions via human auditory assessment, the outcomes of which were annotated to guide the deep-learning method. The architecture incorporated a convolutional neural network, long short-term memory, attention layers, and Wav2Vec2, a novel feature extractor, to develop automated speech emotion recognition.
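    A compact PyTorch sketch of an architecture along these lines (Wav2Vec2 features feeding a convolutional layer, a bidirectional LSTM, and an attention pooling layer) is given below; the checkpoint name, layer sizes, and 6-class output are illustrative assumptions rather than the study's exact configuration.

```python
# Sketch of a Wav2Vec2 -> CNN -> LSTM -> attention classifier (PyTorch).
# The checkpoint, hidden sizes, and 6-class head are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SERNet(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.conv = nn.Conv1d(768, 256, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(256, 128, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(256, 1)                   # attention over time steps
        self.head = nn.Linear(256, n_classes)

    def forward(self, waveform):                        # waveform: (batch, samples)
        feats = self.encoder(waveform).last_hidden_state         # (B, T, 768)
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)     # (B, T, 256)
        x, _ = self.lstm(x)                                       # (B, T, 256)
        w = torch.softmax(self.attn(x), dim=1)                    # (B, T, 1)
        pooled = (w * x).sum(dim=1)                                # (B, 256)
        return self.head(pooled)

logits = SERNet()(torch.randn(2, 16000))    # two 1-second clips at 16 kHz
```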
    RESULTS: Twenty-seven kinds of emotions were found in the participants' speech. These emotions were grouped into 6 detailed emotions: happiness, interest, sadness, frustration, anger, and neutrality, and further into 3 basic emotions: positive, negative, and neutral. To improve algorithmic performance, multiple learning approaches were applied using different data sources (voice and text) and varying numbers of emotions. Ultimately, a 2-stage algorithm (initial text-based classification followed by voice-based analysis) achieved the highest accuracy, reaching 70%.
    CONCLUSIONS: The diverse emotions identified in this study were attributed to the characteristics of the participants and the method of data collection. The fact that the people at high risk of dementia were speaking to companion robots also explains the relatively low performance of the SER algorithm. Accordingly, this study suggests the systematic and comprehensive construction of a dataset from people with dementia.

  • Article type: Journal Article
    A speech emotion recognition (SER) system deployed in a real-world application can encounter speech contaminated with unconstrained background noise. To deal with this issue, a speech enhancement (SE) module can be attached to the SER system to compensate for the environmental differences of an input. Although the SE module can improve the quality and intelligibility of a given speech signal, there is a risk of affecting discriminative acoustic features for SER that are resilient to environmental differences. Exploring this idea, we propose to enhance only the weak features that degrade emotion recognition performance. Our model first identifies weak feature sets by using multiple models trained on clean speech, each with one acoustic feature at a time. After training the single-feature models, we rank each speech feature by measuring three criteria: performance, robustness, and a joint ranking that combines performance and robustness. We group the weak features by cumulatively incrementing the features from the bottom to the top of each rank. Once the weak feature set is defined, we enhance only those weak features, keeping the resilient features unchanged. We implement these ideas with the low-level descriptors (LLDs). We show that directly enhancing the weak LLDs leads to better performance than extracting LLDs from an enhanced speech signal. Our experiment with clean and noisy versions of the MSP-Podcast corpus shows that, under the 10 dB signal-to-noise ratio (SNR) condition, the proposed approach yields performance gains of 17.7% (arousal), 21.2% (dominance), and 3.3% (valence) over a system that enhances all the LLDs.
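    The ranking-and-grouping step can be pictured with the short sketch below: each feature gets a performance rank (clean speech), a robustness rank (noisy speech), and a joint rank, and candidate weak sets are built by accumulating features from the bottom of a rank upward. The placeholder scores and metric are assumptions for illustration only.

```python
# Sketch of the ranking-and-grouping idea: rank each LLD by the accuracy of a
# single-feature model on clean speech (performance) and on noisy speech
# (robustness), combine the ranks, and build candidate weak sets cumulatively
# from the bottom of a rank. The random scores are placeholders, not paper values.
import numpy as np

def rank_features(perf: np.ndarray, robust: np.ndarray):
    perf_rank = np.argsort(np.argsort(-perf))        # 0 = strongest feature
    robust_rank = np.argsort(np.argsort(-robust))
    joint_rank = np.argsort(np.argsort(perf_rank + robust_rank))
    return perf_rank, robust_rank, joint_rank

def weak_feature_sets(rank: np.ndarray):
    """Cumulative groups starting from the weakest (highest-ranked) feature."""
    order = np.argsort(-rank)                         # weakest first
    return [set(order[:k + 1]) for k in range(len(order))]

perf = np.random.rand(25)     # placeholder single-LLD scores on clean speech
robust = np.random.rand(25)   # placeholder scores under noisy conditions
_, _, joint = rank_features(perf, robust)
candidate_sets = weak_feature_sets(joint)
# Only the LLDs in the chosen weak set would be passed through speech enhancement;
# the remaining (resilient) LLDs are taken directly from the noisy signal.
```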

  • Article type: Journal Article
    Speech emotion recognition (SER) stands as a prominent and dynamic research field in data science due to its extensive application in various domains such as psychological assessment, mobile services, and computer games. In previous research, numerous studies utilized manually engineered features for emotion classification, resulting in commendable accuracy. However, these features tend to underperform in complex scenarios, leading to reduced classification accuracy. These scenarios include: 1. datasets that contain diverse speech patterns, dialects, accents, or variations in emotional expression; 2. data with background noise; 3. cases where the distribution of emotions varies significantly across datasets; 4. combined datasets from different sources, which introduce complexities due to variations in recording conditions, data quality, and emotional expression. Consequently, there is a need to improve the classification performance of SER techniques. To address this, a novel SER framework was introduced in this study. Prior to feature extraction, signal preprocessing and data augmentation methods were applied to augment the available data, and 18 informative features were derived from each signal. A discriminative feature set was obtained using feature selection techniques and was then utilized as input for emotion recognition on the SAVEE, RAVDESS, and EMO-DB datasets. Furthermore, this research also implemented a cross-corpus model that incorporated all speech files related to the common emotions from the three datasets. The experimental outcomes demonstrated the superior performance of the proposed SER framework compared to existing frameworks in the field. Notably, the framework achieved remarkable accuracy rates: the proposed model obtained accuracies of 95%, 94%, 97%, and 97% on the SAVEE, RAVDESS, EMO-DB, and cross-corpus datasets, respectively. These results underscore the significant contribution of our proposed framework to the field of SER.
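    For orientation, the sketch below shows the general shape of such a pipeline: simple noise augmentation, a handful of handcrafted features, univariate feature selection, and an SVM classifier. The specific features, augmentation, and classifier here are assumptions and do not reproduce the paper's 18-feature design.

```python
# Illustrative pipeline shape: noise augmentation, a few handcrafted acoustic
# features, univariate feature selection, and an SVM. The feature list and
# classifier are assumptions, not the paper's exact 18-feature design.
import librosa
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def add_noise(y: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Simple additive-noise augmentation applied to training waveforms."""
    noise = np.random.randn(len(y))
    scale = np.sqrt(np.mean(y ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    return y + scale * noise

def extract_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    rms = librosa.feature.rms(y=y).mean()
    return np.concatenate([mfcc, [zcr, centroid, rms]])

def build_model(k: int = 10):
    return make_pipeline(StandardScaler(),
                         SelectKBest(f_classif, k=k),
                         SVC(kernel="rbf"))
```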

  • Article type: Journal Article
    Emotions in speech are expressed in various ways, and a speech emotion recognition (SER) model may perform poorly on unseen corpora that contain emotional factors different from those expressed in the training databases. To construct an SER model robust to unseen corpora, regularization approaches and metric losses have been studied. In this paper, we propose an SER method that incorporates the relative difficulty and labeling reliability of each training sample. Inspired by the Proxy-Anchor loss, we propose a novel loss function that gives higher gradients to the samples whose emotion labels are more difficult to estimate among those in the given minibatch. Since annotators may label the emotion based on emotional expression that resides in the conversational context or in another modality but is not apparent in the given speech utterance, some of the emotion labels may not be reliable, and these unreliable labels may affect the proposed loss function more severely. In this regard, we propose to apply label smoothing to the samples misclassified by a pre-trained SER model. Experimental results showed that the performance of SER on unseen corpora was improved by adopting the proposed loss function with label smoothing on the misclassified data.
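    The two ingredients can be sketched as follows: a per-sample weighting that puts larger gradients on harder examples within the minibatch, and label smoothing applied only to samples a pre-trained model misclassified. The weighting function is an illustrative stand-in, not the paper's Proxy-Anchor-style loss.

```python
# Sketch of the two ingredients: (1) weight each sample's loss so that harder
# examples in the minibatch receive larger gradients, and (2) smooth the labels
# of samples that a pre-trained SER model misclassified. The weighting scheme is
# an illustrative stand-in for the paper's Proxy-Anchor-inspired loss.
import torch
import torch.nn.functional as F

def difficulty_weighted_loss(logits, targets, misclassified_mask, eps=0.1):
    n_classes = logits.size(1)
    hard = F.one_hot(targets, n_classes).float()
    smooth = hard * (1 - eps) + eps / n_classes
    # Label smoothing only for samples flagged as unreliable (misclassified
    # by the pre-trained reference model).
    soft_targets = torch.where(misclassified_mask.unsqueeze(1), smooth, hard)

    per_sample = -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1)
    # Emphasize harder samples: weights follow the per-sample loss, normalized
    # within the minibatch and detached so the weights themselves are not trained.
    weights = per_sample.detach()
    weights = weights / (weights.mean() + 1e-8)
    return (weights * per_sample).mean()

logits = torch.randn(8, 4)                      # SER model outputs
targets = torch.randint(0, 4, (8,))             # emotion labels
mask = torch.zeros(8, dtype=torch.bool)
mask[:2] = True                                 # samples the reference model got wrong
loss = difficulty_weighted_loss(logits, targets, mask)
```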

  • Article type: Journal Article
    Speech emotion recognition based on gender holds great importance for achieving more accurate, personalized, and empathetic interactions in technology, healthcare, psychology, and the social sciences. In this paper, we present a novel gender-emotion model. First, gender and emotion features were extracted from voice signals to lay the foundation for our recognition model. Second, a genetic algorithm (GA) processed the high-dimensional features, and the Fisher score was used for evaluation. Third, features were ranked by their importance, and the GA was refined through novel crossover and mutation methods based on feature importance to improve recognition accuracy. Finally, the proposed algorithm was compared with state-of-the-art algorithms on four common English datasets using support vector machines (SVM), and it demonstrated superior performance in accuracy, precision, recall, F1-score, the number of selected features, and running time. The proposed algorithm faced challenges in distinguishing between neutral, sad, and fearful emotions due to subtle vocal differences, overlapping pitch and tone variability, and similar prosodic features. Notably, the primary features for gender-based differentiation mainly involved mel-frequency cepstral coefficients (MFCC) and log MFCC.
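    The sketch below illustrates two of the building blocks described: the Fisher score for feature evaluation and an importance-biased mutation operator for the GA. The selection and crossover machinery is omitted, and the operator design is an assumption, not the authors' exact method.

```python
# Sketch of two building blocks: the Fisher score for feature evaluation and an
# importance-biased mutation operator. Selection and crossover are omitted, and
# the operator design is an assumption, not the authors' exact method.
import numpy as np

def fisher_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Between-class over within-class variance, per feature."""
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)

def importance_biased_mutation(mask: np.ndarray, scores: np.ndarray,
                               rate: float = 0.05) -> np.ndarray:
    """Flip feature-selection bits so that low-importance features are more
    likely to be dropped and high-importance features more likely to be kept."""
    importance = scores / scores.max()
    p_flip = np.where(mask == 1, rate * (1 - importance), rate * importance)
    flips = np.random.rand(len(mask)) < p_flip
    return np.where(flips, 1 - mask, mask)

X = np.random.randn(200, 40)                 # placeholder feature matrix
y = np.random.randint(0, 4, 200)             # placeholder emotion labels
scores = fisher_scores(X, y)
child = importance_biased_mutation(np.random.randint(0, 2, 40), scores)
```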

  • Article type: Journal Article
    In human-computer interaction systems, speech emotion recognition (SER) plays a crucial role because it enables computers to understand and react to users' emotions. In the past, SER has placed significant emphasis on acoustic properties extracted from speech signals. The use of visual representations to enhance SER performance, however, has been made possible by recent developments in deep learning and computer vision. This work utilizes a lightweight Vision Transformer (ViT) model to propose a novel method for improving speech emotion recognition. We leverage the ViT model's ability to capture spatial dependencies and high-level features in images, which are adequate indicators of emotional state, from the mel spectrogram input fed into the model. To determine the efficiency of our proposed approach, we conduct comprehensive experiments on two benchmark speech emotion datasets, the Toronto English Speech Set (TESS) and the Berlin Emotional Database (EMODB). The results of our extensive experiments demonstrate a considerable improvement in speech emotion recognition accuracy and attest to the method's generalizability, with accuracies of 98%, 91%, and 93% on TESS, EMODB, and the combined TESS-EMODB set, respectively. The outcomes of the comparative experiments show that the non-overlapping patch-based feature extraction method substantially improves speech emotion recognition. Our research indicates the potential of integrating Vision Transformer models into SER systems, opening up fresh opportunities for real-world applications requiring accurate emotion recognition from speech, with advantages over other state-of-the-art techniques.
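    The core idea can be sketched as follows: convert a waveform to a mel spectrogram, cut it into non-overlapping patches, and classify with a small Transformer encoder. Patch size, depth, and class count are assumptions, and positional embeddings are omitted for brevity; this is not the paper's ViT configuration.

```python
# Minimal sketch: waveform -> mel spectrogram -> non-overlapping patches -> small
# Transformer encoder. Patch size, depth, and the 7-class head are assumptions;
# positional embeddings are omitted for brevity.
import torch
import torch.nn as nn
import torchaudio

class SpectrogramViT(nn.Module):
    def __init__(self, n_mels=128, patch=16, dim=192, n_classes=7):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # A strided convolution cuts the spectrogram into non-overlapping patches.
        self.patchify = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, waveform):                              # (batch, samples)
        spec = self.to_db(self.mel(waveform)).unsqueeze(1)    # (B, 1, mels, frames)
        tokens = self.patchify(spec).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = torch.cat([self.cls.expand(len(tokens), -1, -1), tokens], dim=1)
        return self.head(self.encoder(tokens)[:, 0])          # classify the CLS token

logits = SpectrogramViT()(torch.randn(2, 48000))              # two 3-second clips
```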

  • Article type: Journal Article
    Deep learning has driven breakthroughs in emotion recognition in many fields, especially speech emotion recognition (SER). As an important part of speech emotion recognition, extracting the most relevant acoustic features has always attracted researchers' attention. To address the problem that the emotional information contained in speech signals is dispersed and that local and global information cannot be comprehensively integrated, this paper presents a network model based on gated recurrent units (GRU) and multi-head attention. We evaluate our proposed emotion model on the IEMOCAP and Emo-DB corpora. The experimental results show that the network model based on a Bi-GRU and multi-head attention is significantly better than traditional network models on multiple evaluation metrics. We also apply the model to a speech sentiment analysis task, where it shows excellent generalization performance on the CH-SIMS and MOSI datasets.
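    A minimal PyTorch sketch of a Bi-GRU plus multi-head self-attention classifier over acoustic frames is shown below; the feature dimension, hidden size, and 4-class output are assumptions, not the paper's configuration.

```python
# Sketch of a Bi-GRU + multi-head self-attention classifier over acoustic frames
# (PyTorch). Feature dimension, hidden size, and the 4-class head are assumptions.
import torch
import torch.nn as nn

class BiGRUAttention(nn.Module):
    def __init__(self, n_features=40, hidden=128, n_heads=4, n_classes=4):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, frames):                     # frames: (batch, time, n_features)
        local, _ = self.gru(frames)                # local context from the Bi-GRU
        glob, _ = self.attn(local, local, local)   # global context via self-attention
        return self.head(glob.mean(dim=1))         # average pooling over time

model = BiGRUAttention()
logits = model(torch.randn(2, 300, 40))            # 2 utterances, 300 frames each
```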

  • Article type: Journal Article
    Since 2020, the COVID-19 epidemic has changed our healthcare behaviors. Being required to wear masks has genuinely affected perceptions of doctor-patient interaction; building a satisfying relationship therefore no longer rests solely on empathizing through facial expressions. The voice becomes more important for overcoming the barrier imposed by masks. Hence, verbal and non-verbal communication will be key criteria for doctor-patient interaction during medical consultations and other conversations. In recent years, speech emotion recognition has been a popular research domain, but despite the abundant work conducted, nonverbal emotion recognition in medical scenarios remains to be explored. In this study, we investigate YAMNet transfer learning on the Chinese Mandarin speech corpus NTHU-NTUA Chinese Interactive Emotion Corpus (NNIME) and use real-world dermatology clinic recordings to test the generalization capability. The results showed that the accuracy validated on NNIME data was 0.59 for activation prediction and 0.57 for valence. Furthermore, the validation accuracy on the doctor-patient dataset was 0.24 for activation and 0.58 for valence, respectively.
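    A minimal sketch of YAMNet transfer learning, assuming the TensorFlow Hub release of YAMNet: the pre-trained embeddings are averaged per utterance and a small dense head is trained on top. The head architecture and the three-level activation labels are illustrative assumptions.

```python
# Sketch of YAMNet transfer learning: average the pre-trained 1024-dim embeddings
# over an utterance and train a small dense head. The head and the three-level
# activation labels are illustrative assumptions.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def utterance_embedding(waveform_16k_mono: np.ndarray) -> np.ndarray:
    # YAMNet returns (scores, embeddings, log_mel_spectrogram);
    # embeddings are 1024-dim vectors per ~0.48 s frame.
    _, embeddings, _ = yamnet(waveform_16k_mono.astype(np.float32))
    return embeddings.numpy().mean(axis=0)

head = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),   # e.g. low/mid/high activation
])
head.compile(optimizer="adam",
             loss="sparse_categorical_crossentropy",
             metrics=["accuracy"])
# head.fit(np.stack([utterance_embedding(w) for w in waveforms]), labels, epochs=20)
```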

  • Article type: Journal Article
    Speech emotion recognition (SER) identifies and categorizes emotional states by analyzing speech signals. SER is an emerging research area using machine learning and deep learning techniques due to its socio-cultural and business importance. An appropriate dataset is an important resource for SER-related studies in a particular language. There is an apparent lack of SER datasets in the Bangla language although it is one of the most spoken languages in the world. A few Bangla SER datasets exist, but they consist of only a few dialogs with a minimal number of actors, making them unsuitable for real-world applications. Moreover, the existing datasets do not consider the intensity level of emotions, even though the intensity of a specific emotional expression, such as anger or sadness, plays a crucial role in social behavior. Therefore, a realistic Bangla speech dataset, called the KUET Bangla Emotional Speech (KBES) dataset, is developed in this study. The dataset consists of 900 audio signals (i.e., speech dialogs) from 35 actors (20 female and 15 male) with diverse age ranges. The sources of the speech dialogs are Bangla telefilms, dramas, TV series, and web series. There are five emotional categories: Neutral, Happy, Sad, Angry, and Disgust. Except for Neutral, the samples of each emotion are divided into two intensity levels: Low and High. A distinguishing feature of the dataset is that the speech dialogs are almost unique and come from a relatively large number of actors, whereas existing datasets (such as SUBESCO and BanglaSER) contain a few pre-defined dialogs spoken repeatedly by a few actors/research volunteers in a laboratory environment. Finally, the KBES dataset is presented as a nine-class problem that classifies emotions into nine categories: Neutral, Happy (Low), Happy (High), Sad (Low), Sad (High), Angry (Low), Angry (High), Disgust (Low), and Disgust (High). The dataset is kept symmetrical, containing 100 samples for each of the nine classes; each class is also gender balanced, with 50 samples from male and 50 from female actors. Compared with existing SER datasets, the developed dataset appears realistic.
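    Purely for illustration, the sketch below shows one hypothetical way to encode such a labeling scheme (emotion, intensity, gender) as metadata and derive the nine-class label; the column names and file names are assumptions, not the released dataset's actual structure.

```python
# Hypothetical metadata layout for a KBES-style dataset (emotion, intensity,
# gender) and derivation of the nine-class label. Column names and file names
# are assumptions, not the released dataset's actual structure.
import pandas as pd

meta = pd.DataFrame({
    "file":      ["clip_0001.wav", "clip_0002.wav"],
    "emotion":   ["Angry", "Neutral"],
    "intensity": ["High", None],            # Neutral carries no intensity level
    "gender":    ["F", "M"],
})
meta["label"] = meta.apply(
    lambda r: r["emotion"] if r["intensity"] is None
    else f'{r["emotion"]} ({r["intensity"]})', axis=1)
print(meta[["file", "label", "gender"]])
# With the full 900-sample table, a split stratified on `label` (and optionally
# `gender`) keeps the nine classes and the 50/50 gender balance intact, e.g.:
# from sklearn.model_selection import train_test_split
# train, test = train_test_split(meta, test_size=0.2, stratify=meta["label"])
```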