MeSH: Humans; Speech Perception / physiology; Speech Recognition Software; Speech Acoustics; Phonetics; Language; Speech Production Measurement / methods; Female; Male

Source: DOI:10.1121/10.0026235

Abstract:
Accurately classifying accents and assessing accentedness in non-native speakers are challenging tasks, primarily because of the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pretrained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy of accent classification and non-native accentedness assessment. Findings demonstrate that pretrained LID and SID models effectively encode accent/dialect information in speech. Furthermore, the accent information encoded by the LID and SID models complements an end-to-end (E2E) accent identification (AID) model trained from scratch. By incorporating all three embeddings, the proposed multi-embedding AID system achieves superior accuracy in AID. Next, automatic speech recognition (ASR) and AID models are investigated for accentedness estimation. The ASR model is an E2E connectionist temporal classification model trained exclusively with American English (en-US) utterances. The ASR error rate and the en-US output of the AID model are leveraged as objective accentedness scores. Evaluation results demonstrate a strong correlation between the scores estimated by the two models. Additionally, a robust correlation between objective accentedness scores and subjective scores based on human perception is demonstrated, providing evidence for the reliability and validity of using AID-based and ASR-based systems for accentedness assessment in non-native speech. Such advanced systems would benefit accent assessment in language learning, speech and speaker assessment for intelligibility and quality, and advancements in speaker diarization and speech recognition.
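As a rough illustration of the ASR-based scoring idea, the sketch below computes an error rate between a reference transcript and an ASR hypothesis, and a correlation between two lists of scores. It is not the authors' implementation. The abstract does not say whether the error rate is word-level or character-level, nor which correlation statistic was used, so the word-level error rate and the Pearson coefficient here are assumptions, and the function names `word_error_rate` and `pearson` are hypothetical.

```python
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: a word-level error rate stands in for the "ASR error rate"
    used as an objective accentedness score; a higher value would suggest
    speech further from the en-US training data.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equally long score lists.

    Assumption: Pearson is used only as an example statistic for comparing
    objective scores with subjective (human-rated) scores.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

In use, `word_error_rate` would be applied per utterance or per speaker to the output of an ASR model trained only on en-US speech, and `pearson` would compare the resulting scores with listener ratings for the same speakers.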