Keywords: hybrid Transformer decoder; nonlinear Transformer; regularization attention; speech recognition

MeSH: Humans; Speech Recognition Software; Algorithms; Speech / physiology; Nonlinear Dynamics; Pattern Recognition, Automated / methods

Source: DOI: 10.3390/s24123846

Abstract:
Existing end-to-end speech recognition methods typically employ hybrid decoders based on CTC and Transformer. However, error accumulation in these hybrid decoders hinders further improvements in accuracy. Additionally, most existing models are built on the Transformer architecture, which tends to be complex and ill-suited to small datasets. Hence, we propose a Nonlinear Regularization Decoding Method for Speech Recognition. Firstly, we introduce a nonlinear Transformer decoder that breaks away from the traditional left-to-right or right-to-left decoding order and enables associations between arbitrary characters, mitigating the limitations of Transformer architectures on small datasets. Secondly, we propose a novel regularization attention module that optimizes the attention score matrix, reducing the impact of early errors on later outputs. Finally, we introduce a tiny model to address the challenge of excessively large parameter counts. The experimental results indicate that our model performs well: compared to the baseline, it achieves recognition improvements of 0.12%, 0.54%, 0.51%, and 1.2% on Aishell1, Primewords, the Free ST Chinese Corpus, and the Uyghur subset of Common Voice 16.1, respectively.
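The abstract does not give implementation details, but the two decoder-side ideas can be illustrated concretely. Below is a minimal PyTorch sketch that is entirely our own assumption, not the authors' code: the `nonlinear` flag replaces the causal decoder mask with a full mask so any output position can attend to any other, and `lam` blends the softmax attention weights toward a uniform distribution, which is one simple way to regularize the score matrix so an early erroneous position cannot dominate later outputs. All names (`regularized_attention`, `lam`, `nonlinear`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def regularized_attention(q, k, v, lam=0.1, nonlinear=True):
    """Hypothetical sketch of the two decoder-side ideas in the abstract.

    q, k, v: (T, d) tensors for a single head; lam and nonlinear are
    illustrative parameters, not values from the paper.
    """
    T, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # raw attention scores (T, T)

    if not nonlinear:
        # Conventional left-to-right decoding: a causal mask blocks
        # attention to future positions.
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        scores = scores + causal
    # With nonlinear=True, no mask is applied, so any character position
    # may attend to any other, regardless of decoding order.

    attn = F.softmax(scores, dim=-1)
    # One simple regularization of the score matrix: blend each row toward
    # a uniform distribution so that no single (possibly erroneous) earlier
    # position dominates later outputs. Rows still sum to 1.
    attn = (1.0 - lam) * attn + lam / T
    return attn @ v, attn

# Usage with toy tensors.
q = k = v = torch.randn(5, 8)
out, attn = regularized_attention(q, k, v)
print(out.shape, attn.sum(dim=-1))  # torch.Size([5, 8]), rows sum to 1
```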