Speech Acoustics

  • Article type: Journal Article
    This study investigated the effect of different types of phonetic training on potential changes in the production and perception of English vowels by Arabic learners of English. Forty-six Arabic learners of English were randomly assigned to one of three high variability vowel training programs: Perception training (High Variability Phonetic Training), Production training, and a Hybrid Training program (production and perception training). Pre- and post-tests (vowel identification, category discrimination, speech recognition in noise, and vowel production) showed that all training types led to improvements in perception and production. There was some evidence that improvements were linked to training type: learners in the Perception Training condition improved in vowel identification but not vowel production, while those in the Production Training condition showed only small improvements in performance on perceptual tasks, but greater improvement in production. However, the effects of training modality were complicated by proficiency, with high proficiency learners benefitting more from different types of training regardless of training mode than lower proficiency learners.

  • Article type: Journal Article
    How we produce and perceive voice is constrained by laryngeal physiology and biomechanics. Such constraints may present themselves as principal dimensions in the voice outcome space that are shared among speakers. This study attempts to identify such principal dimensions in the voice outcome space and the underlying laryngeal control mechanisms in a three-dimensional computational model of voice production. A large-scale voice simulation was performed with parametric variations in vocal fold geometry and stiffness, glottal gap, vocal tract shape, and subglottal pressure. Principal component analysis was applied to data combining both the physiological control parameters and voice outcome measures. The results showed three dominant dimensions accounting for at least 50% of the total variance. The first two dimensions describe respiratory-laryngeal coordination in controlling the energy balance between low- and high-frequency harmonics in the produced voice, and the third dimension describes control of the fundamental frequency. The dominance of these three dimensions suggests that voice changes along these principal dimensions are likely to be more consistently produced and perceived by most speakers than other voice changes, and thus are more likely to have emerged during evolution and be used to convey important personal information, such as emotion and larynx size.

  • Article type: Journal Article
    Advancing age is associated with decreased sensitivity to temporal cues in word segments, particularly when target words follow non-informative carrier sentences or are spectrally degraded (e.g., vocoded to simulate cochlear-implant stimulation). This study investigated whether age, carrier sentences, and spectral degradation interacted to cause undue difficulty in processing speech temporal cues. Younger and older adults with normal hearing performed phonemic categorization tasks on two continua: a Buy/Pie contrast with voice onset time changes for the word-initial stop and a Dish/Ditch contrast with silent interval changes preceding the word-final fricative. Target words were presented in isolation or after non-informative carrier sentences, and were unprocessed or degraded via sinewave vocoding (2, 4, and 8 channels). Older listeners exhibited reduced sensitivity to both temporal cues compared to younger listeners. For the Buy/Pie contrast, age, carrier sentence, and spectral degradation interacted such that the largest age effects were seen for unprocessed words in the carrier sentence condition. This pattern differed from the Dish/Ditch contrast, where reducing spectral resolution exaggerated age effects, but introducing carrier sentences largely left the patterns unchanged. These results suggest that certain temporal cues are particularly susceptible to aging when placed in sentences, likely contributing to the difficulties of older cochlear-implant users in everyday environments.

  • Article type: Journal Article
    High-frequency speech information is susceptible to inaccurate perception even in mild to moderate forms of hearing loss. Some hearing aids employ frequency-lowering methods such as nonlinear frequency compression (NFC) to help hearing-impaired individuals access high-frequency speech information in more accessible lower-frequency regions. Because such techniques cause significant spectral distortion, tests such as the S-Sh Confusion Test help optimize NFC settings to provide high-frequency audibility with the least distortion. Such tests have traditionally been based on speech contrasts pertinent to English. Here, the effects of NFC processing on fricative perception are compared between English and Mandarin listeners. Small but significant differences in fricative discrimination were observed between the groups. The study demonstrates a possible need for language-specific clinical fitting procedures for NFC.
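    The frequency mapping behind NFC can be sketched as follows. This is a minimal illustration, assuming one commonly described log-domain formulation in which frequencies at or below a cutoff pass unchanged and frequencies above it are compressed by a fixed ratio. The function name `nfc_map` and the default cutoff and ratio are hypothetical; they are not taken from the study, and commercial hearing-aid implementations may differ.

```python
def nfc_map(freq_hz, cutoff_hz=2000.0, ratio=2.0):
    """Map an input frequency to its output frequency under a simple
    log-domain nonlinear frequency compression rule (assumed form).

    Below the cutoff the signal is untouched; above it, the distance
    from the cutoff on a log-frequency axis is divided by `ratio`.
    """
    if freq_hz <= cutoff_hz:
        return freq_hz
    # Equivalent to: log(out) = log(cutoff) + (log(in) - log(cutoff)) / ratio
    return cutoff_hz ** (1.0 - 1.0 / ratio) * freq_hz ** (1.0 / ratio)


if __name__ == "__main__":
    # Illustrative input frequencies only; not measurements from the study.
    for f in (1000.0, 2000.0, 4000.0, 8000.0):
        print(f, "->", round(nfc_map(f), 1))
```

    Under this rule, spectral peaks that lie far apart above the cutoff end up closer together after compression, which is one way to see why fricative contrasts can become harder to discriminate.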

  • Article type: Journal Article
    Speech can be defined as the human ability to communicate through a sequence of vocal sounds. Consequently, speech requires an emitter (the speaker) capable of generating the acoustic signal and a receiver (the listener) able to successfully decode the sounds produced by the emitter (i.e., the acoustic signal). Time plays a central role at both ends of this interaction. On the one hand, speech production requires precise and rapid coordination, typically within the order of milliseconds, of the upper vocal tract articulators (i.e., tongue, jaw, lips, and velum), their composite movements, and the activation of the vocal folds. On the other hand, the generated acoustic signal unfolds in time, carrying information at different timescales. This information must be parsed and integrated by the receiver for the correct transmission of meaning. This chapter describes the temporal patterns that characterize the speech signal and reviews research that explores the neural mechanisms underlying the generation of these patterns and the role they play in speech comprehension.

  • Article type: Journal Article
    Speech recognition by both humans and machines frequently fails in non-optimal yet common situations. For example, word recognition error rates for second-language (L2) speech can be high, especially under conditions involving background noise. At the same time, both human and machine speech recognition sometimes show remarkable robustness against signal- and noise-related degradation. Which acoustic features of speech explain this substantial variation in intelligibility? Current approaches align speech to text to extract a small set of pre-defined spectro-temporal properties from specific sounds in particular words. However, variation in these properties leaves much cross-talker variation in intelligibility unexplained. We examine an alternative approach utilizing a perceptual similarity space acquired using self-supervised learning. This approach encodes distinctions between speech samples without requiring pre-defined acoustic features or speech-to-text alignment. We show that L2 English speech samples are less tightly clustered in the space than L1 samples, reflecting variability in English proficiency among L2 talkers. Critically, distances in this similarity space are perceptually meaningful: L1 English listeners have lower recognition accuracy for L2 speakers whose speech is more distant in the space from L1 speech. These results indicate that perceptual similarity may form the basis for an entirely new speech and language analysis approach.
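    The idea of measuring how far a talker lies from L1 speech in a similarity space can be sketched as below. This is only an illustration under stated assumptions: each speech sample is taken to be already represented by a fixed-length embedding vector, distance is measured as Euclidean distance to the centroid of the L1 samples, and the names `centroid` and `distance_to_l1` are hypothetical. The abstract does not specify the distance measure the authors used.

```python
import math


def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance_to_l1(embedding, l1_embeddings):
    """Euclidean distance from one embedding to the centroid of L1 embeddings."""
    return math.dist(embedding, centroid(l1_embeddings))


if __name__ == "__main__":
    # Toy 2-D embeddings, invented purely for illustration.
    l1_samples = [[0.0, 0.0], [2.0, 2.0]]
    l2_talker = [4.0, 5.0]
    print(distance_to_l1(l2_talker, l1_samples))
```

    With per-talker distances computed this way, one could then check whether larger distances go with lower listener recognition accuracy, which is the relationship the abstract reports.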

  • Article type: Journal Article
    Unilateral vocal cord paralysis is frequently observed in patients who undergo thyroid surgery. This study explored the correlation between acoustic voice analysis (an objective measure) and the Voice Handicap Index (VHI, a self-assessment tool). One hundred and forty patients who had thyroid surgery with or without postoperative unilateral vocal cord paralysis (PVCP and NPVCP, respectively) were included. The patients were evaluated with the VHI and Dysphonia Severity Index (DSI) tools. VHI scores were significantly higher in PVCP patients than in NPVCP patients. Jitter (%) and shimmer (%) were significantly increased, whereas DSI was significantly decreased, in PVCP patients. Receiver operating characteristic (ROC) curve analysis revealed that VHI scores were associated with the diagnosis of PVCP; the VHI total score yielded an area under the curve (AUC) of 0.81. Among acoustic parameters, DSI was highly associated with PVCP (AUC = 0.82, 95% CI = 0.75 to 0.89). Moreover, we found a correlation between VHI scores and voice acoustic parameters. Among them, DSI had a moderate correlation with functional and VHI scores, with R values of 0.41 and 0.49, respectively. VHI scores and acoustic parameters were associated with the diagnosis of PVCP.
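    For readers unfamiliar with the AUC figures quoted above, the following sketch shows one standard way to compute an area under the ROC curve: as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy scores are hypothetical and are not data from the study.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties contribute 0.5.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Invented example scores: higher score = more likely paralysis.
    with_paralysis = [30, 45, 50]
    without_paralysis = [10, 30, 20]
    print(roc_auc(with_paralysis, without_paralysis))
```

    For a measure such as DSI, where lower values indicate worse voice quality, the scores would be negated (or the two groups swapped) before calling the function so that higher still means more likely positive.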

  • Article type: Journal Article
    The quality of speech input influences the efficiency of L1 and L2 acquisition. This study examined modifications in infant-directed speech (IDS) and foreigner-directed speech (FDS) in Standard Mandarin, a tonal language, and explored how IDS and FDS features were manifested in disyllabic words and a longer discourse. The study aimed to determine which characteristics of IDS and FDS were enhanced in comparison with adult-directed speech (ADS), and how IDS and FDS differed when measured on a common set of acoustic parameters. For words, tone-bearing vowel duration, the mean and range of fundamental frequency (F0), and the lexical tone contours were enhanced in IDS and FDS relative to ADS, except for the dipping Tone 3, which exhibited an unexpected lowering in FDS but no modification in IDS when compared with ADS. For the discourse, different aspects of temporal and F0 enhancements were emphasized in IDS and FDS: the mean F0 was higher in IDS, whereas the total discourse duration was greater in FDS. These findings add to the growing literature on L1 and L2 speech input characteristics and their role in language acquisition.

  • Article type: Journal Article
    Age-related changes in auditory processing may reduce physiological coding of acoustic cues, contributing to older adults' difficulty perceiving speech in background noise. This study investigated whether older adults differed from young adults in patterns of acoustic cue weighting for categorizing vowels in quiet and in noise. All participants relied primarily on spectral quality to categorize /ɛ/ and /æ/ sounds under both listening conditions. However, relative to young adults, older adults exhibited greater reliance on duration and less reliance on spectral quality. These results suggest that aging alters patterns of perceptual cue weights that may influence speech recognition abilities.

  • Article type: Journal Article
    Accurately classifying accents and assessing accentedness in non-native speakers are challenging tasks, due primarily to the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pretrained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy of accent classification and non-native accentedness assessment. Findings demonstrate that pretrained LID and SID models effectively encode accent/dialect information in speech. Furthermore, the accent information encoded by the LID and SID models complements an end-to-end (E2E) accent identification (AID) model trained from scratch. By incorporating all three embeddings, the proposed multi-embedding AID system achieves superior accuracy in AID. Next, the use of automatic speech recognition (ASR) and AID models for accentedness estimation is investigated. The ASR model is an E2E connectionist temporal classification model trained exclusively with American English (en-US) utterances. The ASR error rate and the en-US output of the AID model are leveraged as objective accentedness scores. Evaluation results demonstrate a strong correlation between scores estimated by the two models. Additionally, a robust correlation between objective accentedness scores and subjective scores based on human perception is demonstrated, providing evidence for the reliability and validity of using AID-based and ASR-based systems for accentedness assessment in non-native speech. Such advanced systems would benefit accent assessment in language learning, speech and speaker assessment for intelligibility and quality, and advances in speaker diarization and speech recognition.