Keywords: automatic speech recognition; computational paralinguistics; language diversity; language model; machine learning; speech pathology; stuttering; Whisper

Source: DOI:10.3389/fpsyg.2024.1155285

Abstract:
Automatic recognition of stutters (ARS) from speech recordings can facilitate objective assessment and intervention for people who stutter. However, the performance of ARS systems may depend on how the speech data are segmented and labelled for training and testing. This study compared two segmentation methods: event-based, which delimits speech segments by their fluency status, and interval-based, which uses fixed-length segments regardless of fluency.
Machine learning models were trained and evaluated on interval-based and event-based stuttered speech corpora. The models used acoustic and linguistic features extracted from the speech signal and the transcriptions generated by a state-of-the-art automatic speech recognition system.
Event-based segmentation led to better ARS performance than interval-based segmentation, as measured by the area under the receiver operating characteristic curve (AUC). This gap points to differences in the quality and quantity of the data arising from the segmentation method. The inclusion of linguistic features improved the detection of whole-word repetitions, but not of other types of stutters.
The findings suggest that event-based segmentation is more suitable for ARS than interval-based segmentation, as it preserves the exact boundaries and types of stutters. The linguistic features provide useful information for separating supra-lexical disfluencies from fluent speech but may not capture the acoustic characteristics of stutters. Future work should explore more robust and diverse features, as well as larger and more representative datasets, for developing effective ARS systems.
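
The two segmentation schemes compared in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Event` structure, the label names, the 2-second window, and the rule that a window takes the labels of every stutter overlapping it are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A hypothetical annotated stretch of speech (times in seconds)."""
    start: float
    end: float
    label: str  # e.g. "fluent", "block", "repetition" (illustrative names)


def event_based_segments(events):
    """Event-based: one segment per annotated event, so segment
    boundaries coincide with the boundaries of the fluency status."""
    return [(e.start, e.end, e.label) for e in events]


def interval_based_segments(events, total_duration, window=2.0):
    """Interval-based: fixed-length windows regardless of fluency.

    Assumed labelling rule: a window carries the labels of all
    non-fluent events that overlap it, or "fluent" if none do.
    A trailing partial window is dropped.
    """
    segments = []
    n_windows = int(total_duration // window)
    for i in range(n_windows):
        w_start, w_end = i * window, (i + 1) * window
        labels = sorted({
            e.label for e in events
            if e.label != "fluent" and e.start < w_end and e.end > w_start
        })
        segments.append((w_start, w_end, labels or ["fluent"]))
    return segments


# Made-up annotation of a 6-second recording.
events = [
    Event(0.0, 1.2, "fluent"),
    Event(1.2, 1.9, "block"),
    Event(1.9, 4.5, "fluent"),
    Event(4.5, 5.0, "repetition"),
    Event(5.0, 6.0, "fluent"),
]

for segment in event_based_segments(events):
    print("event   ", segment)
for segment in interval_based_segments(events, total_duration=6.0):
    print("interval", segment)
```

In this toy case the event-based segments keep the exact onset and offset of the block (1.2 s to 1.9 s), while the interval-based window that contains it also includes more than a second of fluent speech, which is one way a segmentation method can change the quality of the training data.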