Keywords: convolutional neural networks; deep belief networks; forced alignment; hidden Markov models; vowel duration measurement

Source: DOI:10.1109/MLSP.2015.7324331

Abstract:
Vowel durations are most often utilized in studies addressing specific issues in phonetics. Thus far, such work has been hampered by a reliance on subjective, labor-intensive manual annotation. Our goal is to build an algorithm for automatic, accurate measurement of vowel duration, where the input to the algorithm is a speech segment that contains one vowel preceded and followed by consonants (CVC). Our algorithm is based on a deep neural network trained at the frame level on manually annotated data from a phonetic study. Specifically, we try two deep-network architectures, a convolutional neural network (CNN) and a deep belief network (DBN), and compare their accuracy to that of an HMM-based forced aligner. Results suggest that the CNN is better than the DBN, and that the CNN and the HMM-based forced aligner are comparable in their results, but neither of them yielded the same predictions as models fit to manually annotated data.
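As a rough illustration of how frame-level network outputs could be turned into a single duration for a CVC segment, the sketch below picks the one contiguous run of frames that is most likely to be the vowel. The abstract does not say how the authors decode frame predictions, so this is only one plausible approach under stated assumptions: the function name vowel_span, the 10 ms frame shift, and the example posteriors are all hypothetical.

```python
import math

def vowel_span(p_vowel, frame_shift_ms=10.0):
    """Return (onset_frame, offset_frame, duration_ms) for one vowel span.

    p_vowel: per-frame probability of "vowel" from any frame-level
    classifier (hypothetical input, not the authors' actual model output).
    The half-open span [onset, offset) maximizes the summed log-odds of
    "vowel", which is a maximum-sum contiguous run (Kadane's algorithm).
    """
    best, best_start, best_end = -math.inf, 0, 0
    run, run_start = 0.0, 0
    for i, p in enumerate(p_vowel):
        # Clamp to avoid log(0) on saturated posteriors.
        p = min(max(p, 1e-6), 1.0 - 1e-6)
        # Gain from labeling this frame "vowel" rather than "consonant".
        gain = math.log(p) - math.log(1.0 - p)
        if run <= 0.0:
            # A non-positive running sum cannot help; restart the span here.
            run, run_start = gain, i
        else:
            run += gain
        if run > best:
            best, best_start, best_end = run, run_start, i + 1
    # Assumed frame shift of 10 ms converts a frame count to milliseconds.
    return best_start, best_end, (best_end - best_start) * frame_shift_ms

# Made-up posteriors for an 8-frame CVC segment.
print(vowel_span([0.1, 0.2, 0.9, 0.8, 0.4, 0.95, 0.1, 0.05]))  # (2, 6, 40.0)
```

Constraining the output to a single span reflects the stated input format (exactly one vowel between two consonants), so an isolated low-confidence frame inside the vowel, such as frame 4 above, does not split the measurement.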