MeSH: Humans; Female; Speech Perception / physiology; Male; Adult; Speech Acoustics; Phonetics; Young Adult; Cues; Voice Quality

Source: DOI:10.1121/10.0027932

Abstract:
Anticipatory coarticulation is a highly informative cue to upcoming linguistic information: listeners can identify that the word is "ben" and not "bed" by hearing the vowel alone. The present study compares the performance of human listeners and a self-supervised pre-trained speech model (wav2vec 2.0) in using nasal coarticulation to classify vowels. Stimuli consisted of nasalized (from CVN words) and non-nasalized (from CVC words) American English vowels produced by 60 humans and generated in 36 TTS voices. In aggregate, wav2vec 2.0 performance is similar to human listener performance. Broken down by vowel type, both wav2vec 2.0 and listeners show higher performance for non-nasalized vowels produced naturally by humans. For TTS voices, however, wav2vec 2.0 classifies nasalized vowels more accurately than non-nasalized vowels. Speaker-level patterns reveal that listeners' use of coarticulation is highly variable across talkers. wav2vec 2.0 also shows cross-talker variability in performance. Analyses also reveal differences between listeners and wav2vec 2.0 in the use of multiple acoustic cues in nasalized vowel classifications. The findings have implications for understanding how coarticulatory variation is used in speech perception. The results can also provide insight into how neural systems learn to attend to the unique acoustic features of coarticulation.