Speech

  • Article type: Dataset
    The UK COVID-19 Vocal Audio Dataset is designed for the training and evaluation of machine learning models that classify SARS-CoV-2 infection status or associated respiratory symptoms using vocal audio. The UK Health Security Agency recruited voluntary participants through the national Test and Trace programme and the REACT-1 survey in England from March 2021 to March 2022, during dominant transmission of the Alpha and Delta SARS-CoV-2 variants and some Omicron variant sublineages. Audio recordings of volitional coughs, exhalations, and speech were collected in the 'Speak up and help beat coronavirus' digital survey alongside demographic, symptom, and self-reported respiratory condition data. Digital survey submissions were linked to SARS-CoV-2 test results. The UK COVID-19 Vocal Audio Dataset represents the largest collection of SARS-CoV-2 PCR-referenced audio recordings to date. PCR results were linked to 70,565 of 72,999 participants and 24,105 of 25,706 positive cases. Respiratory symptoms were reported by 45.6% of participants. This dataset has additional potential uses for bioacoustics research, with 11.3% of participants self-reporting asthma and 27.2% with linked influenza PCR test results.
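    To illustrate the dataset's intended use, below is a minimal sketch of a binary classifier over summary acoustic features. Synthetic waveforms stand in for the real recordings, and the feature choice (mean MFCCs) is an assumption rather than part of the dataset's documentation.

```python
# Minimal sketch: PCR-status classification from vocal audio features.
# Synthetic noise stands in for cough/exhalation/speech recordings.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
sr = 16000

def mean_mfcc(y):
    # Summarize one recording as the time-average of 13 MFCCs.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

waves = [rng.normal(size=sr).astype(np.float32) for _ in range(60)]
X = np.vstack([mean_mfcc(w) for w in waves])
y = np.tile([0, 1], 30)  # stand-in for linked PCR results (neg/pos)

clf = LogisticRegression(max_iter=1000)
print("CV AUC:", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```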

  • Article type: Journal Article
    Existing end-to-end speech recognition methods typically employ hybrid decoders based on CTC and Transformer. However, error accumulation in these hybrid decoders hinders further improvements in accuracy. Additionally, most existing models are built on the Transformer architecture, which tends to be complex and unfriendly to small datasets. We therefore propose a Nonlinear Regularization Decoding Method for Speech Recognition. First, we introduce a nonlinear Transformer decoder that breaks away from traditional left-to-right or right-to-left decoding orders and enables associations between any characters, mitigating the limitations of Transformer architectures on small datasets. Second, we propose a novel regularization attention module that optimizes the attention score matrix, reducing the impact of early errors on later outputs. Finally, we introduce a tiny model to address the challenge of excessive model parameters. Experimental results indicate that our model performs well: compared to the baseline, it achieves recognition improvements of 0.12%, 0.54%, 0.51%, and 1.2% on the Aishell1, Primewords, Free ST Chinese Corpus, and Uyghur Common Voice 16.1 datasets, respectively.
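    One plausible reading of the regularization attention module, sketched below: add a penalty on the attention score matrix that discourages over-confident rows, so a single early error is less able to dominate later outputs. The entropy penalty is an illustrative choice, not the authors' published formulation.

```python
# Hedged sketch: scaled dot-product attention with an entropy regularizer.
import torch
import torch.nn.functional as F

def regularized_attention(q, k, v, reg_weight=0.01):
    """Attention that also returns a penalty discouraging sharply peaked
    (error-propagating) attention rows."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, tq, tk)
    attn = F.softmax(scores, dim=-1)
    # Low-entropy rows let one early mistake dominate later outputs.
    entropy = -(attn * (attn + 1e-9).log()).sum(-1).mean()
    reg_loss = -reg_weight * entropy                # reward higher entropy
    return attn @ v, reg_loss

q = torch.randn(2, 5, 64)
k = torch.randn(2, 7, 64)
v = torch.randn(2, 7, 64)
out, reg = regularized_attention(q, k, v)
print(out.shape, float(reg))
```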

  • Article type: Journal Article
    Accommodating talker variability is a complex and multi-layered cognitive process. It involves shifting attention to the vocal characteristics of the talker as well as the linguistic content of their speech. Due to the interdependence between voice and phonological processing, multi-talker environments typically incur additional processing costs compared to single-talker environments. A failure or inability to efficiently distribute attention over multiple acoustic cues in the speech signal may have detrimental language-learning consequences. Yet no studies have examined the effects of multi-talker processing in populations with atypical perceptual, social, and language processing for communication, including autistic people. Employing a classic word-monitoring task, we investigated the effects of talker variability in Australian English autistic (n = 24) and non-autistic (n = 28) adults. Listeners responded to target words (e.g., apple, duck, corn) in randomised sequences of words. Half of the sequences were spoken by a single talker and the other half by multiple talkers. Results revealed that autistic participants' sensitivity scores to accurately spotted target words did not differ from those of non-autistic participants, regardless of whether the words were spoken by a single talker or multiple talkers. As expected, the non-autistic group showed the well-established processing cost associated with talker variability (e.g., slower response times). Remarkably, autistic listeners' response times did not differ across single- and multi-talker conditions, indicating that they did not show perceptual processing costs when accommodating talker variability. The present findings have implications for theories of autistic perception and of speech and language processing.
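    The sensitivity score referred to above is typically d' from signal detection theory. A minimal sketch of its computation, with illustrative counts rather than the study's data:

```python
# Sensitivity (d') from hit and false-alarm rates in word monitoring.
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    # Log-linear correction keeps rates away from 0 and 1.
    hr = (hits + 0.5) / (hits + misses + 1)
    far = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return norm.ppf(hr) - norm.ppf(far)

print(d_prime(hits=46, misses=2, false_alarms=3, correct_rejections=45))
```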

  • Article type: Journal Article
    Lip language recognition urgently needs wearable and easy-to-use interfaces for interference-free and high-fidelity lip-reading acquisition, along with accompanying data-efficient decoder-modeling methods. Existing solutions suffer from unreliable lip reading, are data hungry, and exhibit poor generalization. Here, we propose a wearable lip language decoding technology that enables interference-free and high-fidelity acquisition of lip movements and data-efficient recognition of fluent lip language, based on wearable motion capture and continuous lip speech movement reconstruction. The method allows us to artificially generate any desired continuous speech dataset from a very limited corpus of word samples from users. By using these artificial datasets to train the decoder, we achieve an average accuracy of 92.0% across individuals (n = 7) for actual continuous and fluent lip speech recognition of 93 English sentences, with no training burden on users because all training datasets are artificially generated. Our method greatly reduces users' training/learning load and presents a data-efficient and easy-to-use paradigm for lip language recognition.
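    The core data-efficiency idea is to synthesize continuous-sentence training sequences by concatenating word-level motion-capture clips. A hedged sketch under assumed array shapes; the crossfade blending and the corpus layout are illustrative, not the authors' exact pipeline:

```python
# Sketch: build artificial "continuous speech" motion sequences from a
# small per-word corpus of (frames, channels) lip-motion clips.
import numpy as np

rng = np.random.default_rng(0)
corpus = {w: rng.normal(size=(rng.integers(20, 40), 6))
          for w in ["speak", "up", "help", "beat"]}  # hypothetical corpus

def synthesize(sentence, xfade=5):
    """Concatenate word clips with a short linear crossfade at each joint."""
    out = corpus[sentence[0]].copy()
    for w in sentence[1:]:
        nxt = corpus[w]
        ramp = np.linspace(0, 1, xfade)[:, None]
        out[-xfade:] = (1 - ramp) * out[-xfade:] + ramp * nxt[:xfade]
        out = np.vstack([out, nxt[xfade:]])
    return out

seq = synthesize(["speak", "up", "help", "beat"])
print(seq.shape)  # one artificial continuous-sentence training sample
```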

  • Article type: Journal Article
    Hearing one's own speech allows for acoustic self-monitoring in real time. Left-hemisphere motor planning regions are thought to give rise to efferent predictions that can be compared to true feedback in sensory cortices, resulting in neural suppression commensurate with the degree of overlap between predicted and actual sensations. Sensory prediction errors thus serve as a possible mechanism of detection of deviant speech sounds, which can then feed back into corrective action, allowing for online control of speech acoustics. The goal of this study was to assess the integrity of this detection-correction circuit in persons with aphasia (PWA) whose left-hemisphere lesions may limit their ability to control variability in speech output. We recorded magnetoencephalography (MEG) while 15 PWA and age-matched controls spoke monosyllabic words and listened to playback of their utterances. From this, we measured speaking-induced suppression of the M100 neural response and related it to lesion profiles and speech behavior. Both speaking-induced suppression and cortical sensitivity to deviance were preserved at the group level in PWA. PWA with more spared tissue in pars opercularis had greater left-hemisphere neural suppression and greater behavioral correction of acoustically deviant pronunciations, whereas sparing of superior temporal gyrus was not related to neural suppression or acoustic behavior. In turn, PWA who made greater corrections had fewer overt speech errors in the MEG task. Thus, the motor planning regions that generate the efferent prediction are integral to performing corrections when that prediction is violated.
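    Speaking-induced suppression is typically quantified as the M100 response during playback (listening) minus the response while speaking. A minimal sketch with synthetic evoked amplitudes; the normalization is one common convention, not necessarily the authors' exact formula:

```python
# Sketch: speaking-induced suppression (SIS) of the M100 response.
import numpy as np

rng = np.random.default_rng(0)
n_trials = 40
listen_m100 = rng.normal(60, 10, n_trials)  # fT, playback condition
speak_m100 = rng.normal(40, 10, n_trials)   # fT, speaking condition

sis = listen_m100.mean() - speak_m100.mean()  # raw suppression
sis_norm = sis / listen_m100.mean()           # fraction of listen response
print(f"SIS = {sis:.1f} fT ({100 * sis_norm:.0f}% of listen response)")
```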

  • Article type: Journal Article
    Slow cortical oscillations play a crucial role in processing the speech amplitude envelope, which is perceived atypically by children with developmental dyslexia. Here we use electroencephalography (EEG) recorded during natural speech listening to identify neural processing patterns involving slow oscillations that may characterize children with dyslexia. In a story-listening paradigm, we find that atypical power dynamics and phase-amplitude coupling between delta and theta oscillations distinguish children with dyslexia from child control groups (typically developing controls and controls with other language disorders). We further isolate EEG common spatial patterns (CSP) during speech listening across delta and theta oscillations that identify dyslexic children. A linear classifier using four delta-band CSP variables predicted dyslexia status (0.77 AUC). Crucially, these spatial patterns also identified children with dyslexia when applied to EEG measured during a rhythmic syllable processing task. This transfer effect (i.e., the ability to use neural features derived from a story-listening task as input features to a classifier based on a rhythmic syllable task) is consistent with a core developmental deficit in the neural processing of speech rhythm. The findings are suggestive of distinct atypical neurocognitive speech encoding mechanisms underlying dyslexia, which could be targeted by novel interventions.
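    A hedged sketch of the reported pipeline: common spatial patterns over band-limited EEG epochs feeding a linear classifier, scored by cross-validated AUC. Synthetic epochs stand in for the delta-band story-listening data:

```python
# Sketch: CSP features + linear classifier for dyslexia status.
import numpy as np
from mne.decoding import CSP
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 32, 256))  # epochs x channels x samples
y = np.tile([0, 1], 40)             # dyslexic vs control (stand-in labels)

# Four CSP variables, matching the abstract's four delta-band features.
clf = make_pipeline(CSP(n_components=4, log=True),
                    LinearDiscriminantAnalysis())
print("CV AUC:", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```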

  • Article type: Journal Article
    Amyotrophic lateral sclerosis (ALS) is an idiopathic, fatal, and fast-progressing neurodegenerative disease characterized by the degeneration of motor neurons. ALS patients often experience an initial misdiagnosis or a diagnostic delay due to the current lack of an efficient biomarker. Since impaired speech is typical in ALS, we hypothesized that functional differences between healthy and ALS participants during speech tasks can be explained by cortical pattern changes, thereby leading to the identification of a neural biomarker for ALS. In this pilot study, we collected magnetoencephalography (MEG) recordings from three early-diagnosed patients with ALS and three healthy controls during imagined (covert) and overt speech tasks. First, we computed sensor correlations, which were greater for speakers with ALS than for healthy controls. Second, we compared the power of the MEG signals in canonical frequency bands between the two groups, which showed greater dissimilarity in the beta band for ALS participants. Third, we assessed differences in functional connectivity, which showed greater beta-band connectivity for ALS than for healthy controls. Finally, we performed single-trial classification, which yielded the highest performance with beta-band features (~98%). These findings were consistent across trials, phrases, and participants for both imagined and overt speech tasks. Our preliminary results indicate that speech-evoked beta oscillations could be a potential neural biomarker for diagnosing ALS. To our knowledge, this is the first demonstration of the detection of ALS from single-trial neural signals.
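    A minimal sketch of single-trial classification from beta-band (13-30 Hz) power, the feature family the abstract reports as most discriminative. The filter design, sensor count, and classifier are assumptions; synthetic MEG stands in for the recordings:

```python
# Sketch: beta-band power features for single-trial MEG classification.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
fs = 600                                   # Hz, assumed MEG sampling rate
X_raw = rng.normal(size=(60, 20, 2 * fs))  # trials x sensors x samples
y = np.tile([0, 1], 30)                    # ALS vs control (stand-in labels)

b, a = butter(4, [13, 30], btype="bandpass", fs=fs)  # beta band
beta = filtfilt(b, a, X_raw, axis=-1)
# Log band power per sensor via the Hilbert envelope, averaged over time.
power = np.log(np.abs(hilbert(beta, axis=-1)) ** 2).mean(axis=-1)

print("CV accuracy:", cross_val_score(SVC(), power, y, cv=5).mean())
```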

  • Article type: Journal Article
    Human language relies on the correct processing of syntactic information, as it is essential for successful communication between speakers. As an abstract level of language, syntax has often been studied separately from the physical form of the speech signal, thus often masking the interactions that can promote better syntactic processing in the human brain. However, behavioral and neural evidence from adults suggests that prosody and syntax interact, and studies in infants support the notion that prosody assists language learning. Here we analyze an MEG dataset to investigate how acoustic cues, specifically prosody, interact with syntactic representations in the brains of native English speakers. More specifically, to examine whether prosody enhances the cortical encoding of syntactic representations, we decode syntactic phrase boundaries directly from brain activity and evaluate possible modulations of this decoding by prosodic boundaries. Our findings demonstrate that the presence of prosodic boundaries improves the neural representation of phrase boundaries, indicating a facilitative role of prosodic cues in processing abstract linguistic features. This work has implications for interactive models of how the brain processes different linguistic features. Future research is needed to establish the neural underpinnings of prosody-syntax interactions in languages with different typological characteristics.
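    A hedged sketch of the decoding logic: classify phrase-boundary presence from MEG features, then compare decoding accuracy for epochs with versus without a prosodic boundary. All data below are synthetic stand-ins, and the feature representation is assumed:

```python
# Sketch: phrase-boundary decoding, split by prosodic-boundary condition.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # MEG features per word epoch
y = rng.integers(0, 2, size=200)         # syntactic boundary: yes/no
prosodic = rng.integers(0, 2, size=200)  # prosodic boundary present?

for label, mask in [("with prosodic cue", prosodic == 1),
                    ("without prosodic cue", prosodic == 0)]:
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X[mask], y[mask], cv=5).mean()
    print(f"boundary decoding {label}: {acc:.2f}")
```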

  • Article type: Journal Article
    Goal: As an essential human-machine interaction task, emotion recognition has become an emerging area over recent decades. Although previous attempts to classify emotions have achieved high performance, several challenges remain open: 1) how to effectively recognize emotions using different modalities, and 2) given the increasing computing power that deep learning requires, how to provide real-time detection and improve the robustness of deep neural networks. Method: In this paper, we propose a deep learning-based multimodal emotion recognition (MER) framework called Deep-Emotion, which can adaptively integrate the most discriminative features from facial expressions, speech, and electroencephalogram (EEG) signals to improve MER performance. Specifically, the proposed Deep-Emotion framework consists of three branches, i.e., the facial branch, the speech branch, and the EEG branch. The facial branch uses the improved GhostNet neural network proposed in this paper for feature extraction, which effectively alleviates overfitting during training and improves classification accuracy compared with the original GhostNet. For the speech branch, we propose a lightweight fully convolutional neural network (LFCNN) for the efficient extraction of speech emotion features. For the EEG branch, we propose a tree-like LSTM (tLSTM) model capable of fusing multi-stage features for EEG emotion feature extraction. Finally, we adopt decision-level fusion to integrate the recognition results of the three modalities, resulting in more comprehensive and accurate performance. Results and Conclusions: Extensive experiments on the CK+, EMO-DB, and MAHNOB-HCI datasets demonstrate the effectiveness of the proposed Deep-Emotion method, as well as the feasibility and superiority of the MER approach.
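    A minimal sketch of decision-level fusion as described: each branch (face, speech, EEG) outputs class probabilities, and the final decision combines them. Equal weighting is an assumption; the branch networks are replaced here by placeholder probability vectors:

```python
# Sketch: decision-level fusion of per-branch class probabilities.
import numpy as np

def fuse(prob_face, prob_speech, prob_eeg, weights=(1/3, 1/3, 1/3)):
    """Weighted average of branch probabilities, then argmax."""
    stacked = np.stack([prob_face, prob_speech, prob_eeg])
    fused = np.tensordot(np.asarray(weights), stacked, axes=1)
    return fused.argmax(-1), fused

# One sample, three emotion classes; placeholder branch outputs.
face = np.array([0.6, 0.3, 0.1])
speech = np.array([0.2, 0.5, 0.3])
eeg = np.array([0.3, 0.4, 0.3])
label, fused = fuse(face, speech, eeg)
print(label, fused.round(2))
```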

  • Article type: Journal Article
    Language is rooted in our ability to compose: we link words together, fusing their meanings. Links are not limited to neighboring words but often span intervening words. The ability to process these non-adjacent dependencies (NADs) conflicts with the brain's sampling of speech: we consume speech in chunks that are limited in time and contain only a limited number of words. It is unknown how we link together words that belong to separate chunks. Here, we report that we cannot, at least not so well. In our electroencephalography (EEG) study, 37 human listeners learned chunks and dependencies from an artificial grammar (AG) composed of syllables. The multi-syllable chunks to be learned were equal-sized, allowing us to employ a frequency-tagging approach. On top of the chunks, the syllable streams contained NADs that were either confined to a single chunk or crossed a chunk boundary. Frequency analyses of the EEG revealed a spectral peak at the chunk rate, showing that participants learned the chunks. NADs that crossed boundaries were associated with smaller electrophysiological responses than within-chunk NADs. This shows that NADs are processed readily when they are confined to the same chunk, but not as well when they cross a chunk boundary. Our findings help to reconcile the classical notion that language is processed incrementally with recent evidence for discrete perceptual sampling of speech. This has implications for language acquisition and processing, as well as for the general view of syntax in human language.
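    A minimal sketch of the frequency-tagging analysis: look for a spectral peak at the chunk presentation rate in the EEG spectrum. The syllable and chunk rates and the synthetic signal are illustrative; real analyses average over channels and participants:

```python
# Sketch: frequency tagging - power at the chunk rate vs the syllable rate.
import numpy as np

fs = 250             # Hz, assumed EEG sampling rate
syllable_rate = 3.0  # syllables per second (assumed)
chunk_rate = 1.0     # one 3-syllable chunk per second (assumed)

t = np.arange(0, 60, 1 / fs)
# Synthetic EEG: a response at the syllable rate plus a learned
# chunk-rate component, buried in noise.
eeg = (np.sin(2 * np.pi * syllable_rate * t)
       + 0.5 * np.sin(2 * np.pi * chunk_rate * t)
       + np.random.default_rng(0).normal(0, 2, t.size))

freqs = np.fft.rfftfreq(t.size, 1 / fs)
power = np.abs(np.fft.rfft(eeg)) ** 2
for f in (chunk_rate, syllable_rate):
    print(f"{f} Hz power:", power[np.argmin(np.abs(freqs - f))].round(1))
```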
