forced alignment

  • Article type: Journal Article
    Problem: Phonetic transcription is crucial in diagnosing speech sound disorders (SSDs) but is susceptible to transcriber experience and perceptual bias. Current forced alignment (FA) tools, which annotate audio files to determine spoken content and its placement, often require manual transcription, limiting their effectiveness. Method: We introduce a novel, text-independent forced alignment model that autonomously recognises individual phonemes and their boundaries, addressing these limitations. Our approach leverages an advanced, pre-trained wav2vec 2.0 model to segment speech into tokens and recognise them automatically. To accurately identify phoneme boundaries, we utilise an unsupervised segmentation tool, UnsupSeg. Labelling of segments employs nearest-neighbour classification with wav2vec 2.0 labels, before connectionist temporal classification (CTC) collapse, determining class labels based on maximum overlap. Additional post-processing, including overfitting cleaning and voice activity detection, is implemented to enhance segmentation. Results: We benchmarked our model against existing methods using the TIMIT dataset for normal speakers and, for the first time, evaluated its performance on the TORGO dataset containing SSD speakers. Our model demonstrated competitive performance, achieving a harmonic mean score of 76.88% on TIMIT and 70.31% on TORGO. Implications: This research presents a significant advancement in the assessment and diagnosis of SSDs, offering a more objective and less biased approach than traditional methods. Our model's effectiveness, particularly with SSD speakers, opens new avenues for research and clinical application in speech pathology.
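
    A minimal sketch of the maximum-overlap labelling step described above, assuming a per-frame phoneme sequence from wav2vec 2.0 (before CTC collapse) and segment boundaries from UnsupSeg; the frame rate, the function name, and the toy data are illustrative assumptions, not the authors' code.

```python
from collections import Counter

FRAME_SEC = 0.02  # wav2vec 2.0 emits roughly one prediction every 20 ms

def label_segments(frame_labels, segments):
    """frame_labels: per-frame phoneme labels, before CTC collapse.
    segments: (start_sec, end_sec) pairs from the unsupervised segmenter.
    Each segment receives the phoneme label covering the largest share
    of its frames (maximum overlap)."""
    labelled = []
    for start, end in segments:
        first = int(start / FRAME_SEC)
        last = max(first + 1, int(end / FRAME_SEC))  # at least one frame
        counts = Counter(frame_labels[first:last])
        labelled.append((start, end, counts.most_common(1)[0][0]))
    return labelled

# Toy example: silence, a vowel, then a stop over 0.32 s of audio.
frames = ["sil"] * 5 + ["ah"] * 8 + ["t"] * 3
print(label_segments(frames, [(0.0, 0.1), (0.1, 0.26), (0.26, 0.32)]))
# -> [(0.0, 0.1, 'sil'), (0.1, 0.26, 'ah'), (0.26, 0.32, 't')]
```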

  • Article type: Journal Article
    Impressionistic coding of sociolinguistic variables like English (ING), the alternation between pronunciations like talkin' and talking, has been a central part of the analytic workflow in studies of language variation and change for over half a century. Techniques for automating the measurement and coding of a wide range of sociolinguistic data have been on the rise over recent decades, but procedures for coding some features, especially those without clearly defined acoustic correlates like (ING), have lagged behind others, such as vowels and sibilants. This paper explores computational methods for automatically coding variable (ING) in speech recordings, examining the use of automatic speech recognition procedures related to forced alignment (using the Montreal Forced Aligner) as well as supervised machine learning algorithms (linear and radial support vector machines, and random forests). Considering the automated coding of pronunciation variables like (ING) raises broader questions for sociolinguistic methods, such as how much different human analysts agree in their impressionistic codes for such variables and what data might act as the "gold standard" for training and testing of automated procedures. This paper explores several of these considerations in automated, and manual, coding of sociolinguistic variables and provides baseline performance data for automated and manual coding methods. We consider multiple ways of assessing algorithms' performance, including agreement with human coders, as well as the impact on the outcome of an analysis of (ING) that includes linguistic and social factors. Our results show promise for automated coding methods but also highlight that variability in results should be expected even with carefully human-coded data. All data for our study come from the public Corpus of Regional African American Language, and code and derivative datasets (including our hand-coded data) are available with the paper.
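
    A minimal sketch of the supervised-coding comparison described above, using scikit-learn's linear and radial SVMs and a random forest, scored against hand codes both by raw accuracy and by chance-corrected agreement; the feature set and the synthetic data are placeholders, not the paper's actual features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))    # placeholder acoustic features per (ING) token
y = rng.integers(0, 2, size=500)  # hand-coded variant: 0 = -in', 1 = -ing

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("linear SVM", SVC(kernel="linear")),
                  ("radial SVM", SVC(kernel="rbf")),
                  ("random forest", RandomForestClassifier(random_state=0))]:
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    # Agreement with the human "gold standard": accuracy plus Cohen's
    # kappa, the usual chance-corrected inter-coder agreement measure.
    print(f"{name}: acc={accuracy_score(y_te, pred):.2f} "
          f"kappa={cohen_kappa_score(y_te, pred):.2f}")
```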

  • Article type: Journal Article
    This paper discusses how the transcription hurdle in dialect corpus building can be cleared. While corpus analysis has strongly gained in popularity in linguistic research, dialect corpora are still relatively scarce. This scarcity can be attributed to several factors, one of which is the challenging nature of transcribing dialects, given a lack of both orthographic norms for many dialects and speech technological tools trained on dialect data. This paper addresses the questions (i) how dialects can be transcribed efficiently and (ii) whether speech technological tools can lighten the transcription work. These questions are tackled using the Southern Dutch dialects (SDDs) as case study, for which the usefulness of automatic speech recognition (ASR), respeaking, and forced alignment is considered. Tests with these tools indicate that dialects still constitute a major speech technological challenge. In the case of the SDDs, the decision was made to use speech technology only for the word-level segmentation of the audio files, as the transcription itself could not be sped up by ASR tools. The discussion does however indicate that the usefulness of ASR and other related tools for a dialect corpus project is strongly determined by the sound quality of the dialect recordings, the availability of statistical dialect-specific models, the degree of linguistic differentiation between the dialects and the standard language, and the goals the transcripts have to serve.
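
    A minimal sketch of the word-level segmentation use case described above: given (start, end, word) spans exported from a forced aligner's output, cut the dialect recording into one clip per word using only the standard-library wave module. The file names, timings, and words are hypothetical, not from the SDD project.

```python
import wave

def cut_words(wav_path, word_spans, out_prefix):
    """Slice wav_path into per-word clips from aligner (start, end, word) spans."""
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end, word) in enumerate(word_spans):
            src.setpos(int(start * rate))                    # seek to word onset
            frames = src.readframes(int((end - start) * rate))
            with wave.open(f"{out_prefix}_{i:04d}_{word}.wav", "wb") as dst:
                dst.setparams(params)   # header is patched to the clip length on close
                dst.writeframes(frames)

# Hypothetical aligner output for one dialect utterance:
cut_words("dialect_recording.wav",
          [(0.31, 0.58, "goeie"), (0.58, 0.94, "morgen")],
          "utt001")
```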

  • Article type: Journal Article
    Vowel durations are most often utilized in studies addressing specific issues in phonetics. Thus far, this has been hampered by a reliance on subjective, labor-intensive manual annotation. Our goal is to build an algorithm for automatic, accurate measurement of vowel duration, where the input to the algorithm is a speech segment containing one vowel preceded and followed by consonants (CVC). Our algorithm is based on a deep neural network trained at the frame level on manually annotated data from a phonetic study. Specifically, we try two deep-network architectures: a convolutional neural network (CNN) and a deep belief network (DBN), and compare their accuracy to that of an HMM-based forced aligner. Results suggest that the CNN is better than the DBN, and that the CNN and the HMM-based forced aligner are comparable in their results, but neither of them yielded the same predictions as models fit to manually annotated data.
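
    A minimal sketch of the frame-level approach described above, in PyTorch: a small 1-D CNN scores each spectral frame of a CVC token as vowel or not, and the duration is read off the predicted vowel run. The architecture, feature shapes, and untrained weights are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

N_MELS, FRAME_SEC = 40, 0.01  # e.g. 40 mel bands at a 10 ms frame hop

model = nn.Sequential(        # padding=2 with kernel 5 keeps frames aligned
    nn.Conv1d(N_MELS, 64, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Conv1d(64, 1, kernel_size=5, padding=2),  # one vowel logit per frame
)

feats = torch.randn(1, N_MELS, 120)      # one CVC token, 120 frames (placeholder)
logits = model(feats).squeeze(1)         # shape (1, 120)
is_vowel = (torch.sigmoid(logits) > 0.5)[0]

# Duration = span between the first and last frame tagged as vowel.
idx = is_vowel.nonzero().flatten()
if len(idx):
    duration = (idx[-1] - idx[0] + 1).item() * FRAME_SEC
    print(f"estimated vowel duration: {duration:.3f} s")
```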
