automatic speech recognition

  • Article type: Journal Article
    Given an orthographic transcription, forced alignment systems automatically determine boundaries between segments in speech, facilitating the use of large corpora. In the present paper, we introduce a neural network-based forced alignment system, the Mason-Alberta Phonetic Segmenter (MAPS). MAPS serves as a testbed for two possible improvements we pursue for forced alignment systems. The first is treating the acoustic model as a tagger, rather than a classifier, motivated by the common understanding that segments are not truly discrete and often overlap. The second is an interpolation technique that allows more precise boundaries than the typical 10 ms limit in modern systems. During testing, all system configurations we trained significantly outperformed the state-of-the-art Montreal Forced Aligner at the 10 ms boundary placement tolerance threshold, with the greatest difference being a 28.13% relative performance increase. The Montreal Forced Aligner began to slightly outperform our models at around a 30 ms tolerance. We also reflect on the training process for acoustic modeling in forced alignment, highlighting how the output targets for these models do not match phoneticians' conception of similarity between phones, and that reconciling this tension may require rethinking the task and output targets, or how speech itself should be segmented.
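    The abstract does not spell out the interpolation method MAPS uses, but the idea of placing boundaries below the 10 ms frame resolution can be illustrated with a minimal sketch, assuming per-frame phone posteriors: place the boundary where the outgoing phone's posterior curve crosses the incoming phone's, locating the crossing between frames by linear interpolation. All names and values below are illustrative, not the paper's implementation.

```python
import numpy as np

FRAME_SHIFT_MS = 10.0  # typical acoustic-model frame rate noted in the abstract

def interpolate_boundary(post_a: np.ndarray, post_b: np.ndarray) -> float:
    """Place a boundary between two phones at sub-frame resolution.

    post_a, post_b: per-frame posterior probabilities (one value per 10 ms
    frame) for the outgoing and incoming phone. The boundary is taken where
    the two posterior curves cross, located by linear interpolation between
    the frames on either side of the crossing. Returns time in milliseconds.
    """
    diff = post_a - post_b                     # positive while phone A dominates
    sign = np.signbit(diff)
    flips = np.nonzero(sign[1:] != sign[:-1])[0]
    if len(flips) == 0:                        # no crossing: boundary at the end
        return len(diff) * FRAME_SHIFT_MS
    i = flips[0]
    # linear interpolation of the zero crossing between frames i and i + 1
    frac = diff[i] / (diff[i] - diff[i + 1])
    return (i + frac) * FRAME_SHIFT_MS

# toy posteriors: phone A fades out as phone B fades in
a = np.array([0.9, 0.8, 0.6, 0.3, 0.1])
b = np.array([0.1, 0.2, 0.4, 0.7, 0.9])
print(interpolate_boundary(a, b))  # crossing between frames 2 and 3: ~23.3 ms
```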

  • Article type: Journal Article
    PURPOSE: Cutting-edge automatic speech recognition (ASR) technology holds significant promise in transcribing and recognizing medical information during patient encounters, thereby enabling automatic and real-time clinical documentation, which could significantly alleviate care clinicians' burdens. Nevertheless, the performance of current-generation ASR technology in analyzing conversations in noisy and dynamic medical settings, such as prehospital or Emergency Medical Services (EMS), lacks sufficient validation. This study explores the current technological limitations and future potential of deploying ASR technology for clinical documentation in fast-paced and noisy medical settings such as EMS.
    METHODS: In this study, we evaluated four ASR engines: Google Speech-to-Text Clinical Conversation, OpenAI Speech-to-Text, Amazon Transcribe Medical, and the Azure Speech-to-Text engine. The empirical data used for evaluation were 40 EMS simulation recordings. The transcribed texts were analyzed for accuracy against 23 EMS Electronic Health Record (EHR) categories. The common types of transcription errors were also analyzed.
    RESULTS: Among all four ASR engines, Google Speech-to-Text Clinical Conversation performed the best. Among all EHR categories, better performance was observed in "mental state" (F1 = 1.0), "allergies" (F1 = 0.917), "past medical history" (F1 = 0.804), "electrolytes" (F1 = 1.0), and "blood glucose level" (F1 = 0.813). However, all four ASR engines demonstrated low performance in transcribing certain critical categories, such as "treatment" (F1 = 0.650) and "medication" (F1 = 0.577).
    CONCLUSION: Current ASR solutions fall short of fully automating clinical documentation in the EMS setting. Our findings highlight the need for further improvement and development of automated clinical documentation technology to improve recognition accuracy in time-critical and dynamic medical settings.
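    The abstract reports per-category F1 scores without detailing the matching procedure. As a minimal sketch, F1 per EHR category can be computed from true-positive/false-positive/false-negative counts; the hypothetical counts below were chosen to reproduce the reported scores for these three categories.

```python
# Hypothetical per-category counts; the paper's matching procedure is not
# described in the abstract.
counts = {
    "mental state": {"tp": 10, "fp": 0, "fn": 0},
    "allergies":    {"tp": 11, "fp": 1, "fn": 1},
    "medication":   {"tp": 15, "fp": 9, "fn": 13},
}

def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

for category, c in counts.items():
    print(f"{category}: F1 = {f1(**c):.3f}")  # 1.000, 0.917, 0.577
```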

  • Article type: Journal Article
    BACKGROUND: Speech analysis is expected to help as a screening tool for early detection of Alzheimer's disease (AD) and mild cognitive impairment (MCI). Acoustic features and linguistic features are usually used in speech analysis. However, no studies have yet determined which type of features provides better screening effectiveness, especially in the large aging population of China.
    OBJECTIVES: Firstly, to compare the screening effectiveness of acoustic features, linguistic features, and their combination using the same dataset. Secondly, to develop a Chinese automated diagnosis model using self-collected natural discourse data obtained from native Chinese speakers.
    METHODS: A total of 92 participants from communities in Shanghai completed the MoCA-B and a picture description task based on the Cookie Theft picture under the guidance of trained operators, and were divided into three groups, AD, MCI, and healthy control (HC), based on their MoCA-B scores. Acoustic features (pitch, jitter, shimmer, MFCCs, formants) and linguistic features (part of speech, type-token ratio, information words, information units) were extracted. The machine learning algorithms used in this study included logistic regression, random forest (RF), support vector machines (SVM), Gaussian Naive Bayes (GNB), and k-nearest neighbors (kNN). The validation accuracies of the same ML model using acoustic features, linguistic features, and their combination were compared.
    RESULTS: The accuracy with linguistic features was generally higher than with acoustic features in training. The highest accuracy in differentiating HC and AD was 80.77%, achieved by SVM based on all the features extracted from the speech data, while the highest accuracy in differentiating HC from AD or MCI was 80.43%, achieved by RF based only on linguistic features.
    CONCLUSIONS: Our results suggest the utility and validity of linguistic features in the automated diagnosis of cognitive impairment and validate the applicability of automated diagnosis to Chinese language data.
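    As a rough illustration of the study's feature-set comparison (not its actual data or pipeline), the sketch below cross-validates two of the named classifiers on acoustic, linguistic, and combined feature matrices with scikit-learn; the random matrices are placeholders for the real extracted features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 92                                   # participants, as in the study
X_acoustic = rng.normal(size=(n, 20))    # placeholder for pitch, jitter, MFCCs, ...
X_linguistic = rng.normal(size=(n, 10))  # placeholder for POS, TTR, info units, ...
y = rng.integers(0, 2, size=n)           # 0 = HC, 1 = AD (binary case)

feature_sets = {
    "acoustic": X_acoustic,
    "linguistic": X_linguistic,
    "combined": np.hstack([X_acoustic, X_linguistic]),
}
models = {"SVM": SVC(), "RF": RandomForestClassifier(random_state=0)}

for feat_name, X in feature_sets.items():
    for model_name, model in models.items():
        acc = cross_val_score(model, X, y, cv=5).mean()  # validation accuracy
        print(f"{model_name} on {feat_name}: {acc:.3f}")
```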

  • Article type: Journal Article
    This study presents an automatic speech recognition (ASR) model designed to diagnose pronunciation issues in children with speech sound disorders (SSDs), replacing manual transcription in clinical procedures. Because ASR models trained for general purposes mainly map input speech to standard spelling words, well-known high-performance ASR models are not suitable for evaluating pronunciation in children with SSDs. We fine-tuned the wav2vec2.0 XLS-R model to recognise words as children actually pronounce them, rather than converting the speech into their standard spelling words. The model was fine-tuned on a speech dataset of 137 children with SSDs pronouncing 73 Korean words selected for actual clinical diagnosis. The model's phoneme error rate (PER) was only 10% when its predictions of children's pronunciations were compared to human annotations of the pronunciations as heard. In contrast, despite its robust performance on general tasks, the state-of-the-art ASR model Whisper showed limitations in recognising the speech of children with SSDs, with a PER of approximately 50%. While the model still requires improvement in recognising unclear pronunciation, this study demonstrates that ASR models can streamline complex pronunciation error diagnostic procedures in clinical fields.
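    The phoneme error rate used here is conventionally a Levenshtein edit rate over phoneme sequences, i.e., word error rate computed on phoneme symbols. A minimal sketch with the jiwer library, on invented space-separated phoneme strings rather than the paper's data:

```python
import jiwer  # pip install jiwer

reference  = "k o m a p s u m n i d a"   # human annotation of what was heard
hypothesis = "k o m a p t u m n i d a"   # model output, as pronounced

per = jiwer.wer(reference, hypothesis)   # one substitution out of 12 symbols
print(f"PER = {per:.3f}")                # 0.083
```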

  • Article type: Journal Article
    In air traffic control (ATC), speech communication over radio transmission is the primary way to exchange information between the controller and the pilot. The integration of automatic speech recognition (ASR) systems therefore holds immense potential for reducing controllers' workload and plays a crucial role in various ATC scenarios, making it particularly significant for ATC research. This article provides a comprehensive review of ASR technology's applications in the ATC communication system. First, it surveys current research, including ATC corpora, ASR models, evaluation measures, and application scenarios. A more comprehensive and accurate evaluation methodology tailored to ATC is then proposed, considering advancements in communication sensing systems and deep learning techniques. This methodology helps researchers enhance ASR systems and improve the overall performance of ATC systems. Finally, future research recommendations are identified based on the primary challenges and issues. The authors sincerely hope this work will serve as a clear technical roadmap for ASR endeavors within the ATC domain and make a valuable contribution to the research community.

  • Article type: Journal Article
    The Web has become an essential resource but is not yet accessible to everyone. Assistive technologies and innovative, intelligent frameworks, for example, those using conversational AI, help overcome some exclusions. However, some users still experience barriers. This paper shows how a human-centered approach can shed light on technology limitations and gaps. It reports on a three-step process (focus group, co-design, and preliminary validation) that we adopted to investigate how people with speech impairments, e.g., dysarthria, browse the Web and how barriers can be reduced. The methodology helped us identify challenges and create new solutions, i.e., patterns for Web browsing, by combining voice-based conversational AI, customized for impaired speech, with techniques for the visual augmentation of web pages. While current trends in AI research focus on more and more powerful large models, participants remarked how current conversational systems do not meet their needs, and how important it is to consider each one's specificity for a technology to be called inclusive.

  • Article type: Journal Article
    OBJECTIVE: To compare verbal fluency scores derived from manual transcriptions to those obtained using automatic speech recognition enhanced with machine learning classifiers.
    METHODS: Using Amazon Web Services, we automatically transcribed verbal fluency recordings from 1400 individuals who performed both animal and letter F verbal fluency tasks. We manually adjusted the timings and contents of the automatic transcriptions to obtain "gold standard" transcriptions. To make automatic scoring possible, we trained machine learning classifiers to discern between valid and invalid utterances. We then calculated and compared verbal fluency scores from the manual and automatic transcriptions.
    RESULTS: For both animal and letter fluency tasks, we achieved good separation of valid versus invalid utterances. Verbal fluency scores calculated from the automatic transcriptions showed high correlation with those calculated after manual correction.
    CONCLUSIONS: Many techniques for scoring verbal fluency word lists require accurate transcriptions with word timings. We show that machine learning methods can be applied to improve off-the-shelf ASR for this purpose. These automatically derived scores may be satisfactory for some applications. Low correlations among some of the scores indicate the need for improvement in automatic speech recognition before a fully automatic approach can be reliably implemented.
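    As a toy stand-in for the trained classifiers (the paper's models, not reproduced here), a letter F fluency transcript can be scored by filtering invalid utterances with two simple validity rules: the word must start with the target letter, and repetitions do not count. The rules and transcript below are hypothetical.

```python
def score_letter_f(words: list[str]) -> int:
    """Count valid utterances: starts with 'f' and not previously produced."""
    seen = set()
    valid = 0
    for w in (w.lower() for w in words):
        if not w.startswith("f"):   # rule violation: wrong initial letter
            continue
        if w in seen:               # perseveration: repeated word
            continue
        seen.add(w)
        valid += 1
    return valid

transcript = ["fish", "fog", "dog", "fish", "fancy"]
print(score_letter_f(transcript))  # 3 valid: fish, fog, fancy
```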

  • Article type: Journal Article
    BACKGROUND: In today's world, numerous applications integral to various facets of daily life include automatic speech recognition methods. Thus, the development of a successful automatic speech recognition system can significantly augment the convenience of people's daily routines. While many automatic speech recognition systems have been established for widely spoken languages like English, there has been insufficient progress in developing such systems for less common languages such as Turkish. Moreover, due to its agglutinative structure, designing a speech recognition system for Turkish presents greater challenges compared to other language groups. Therefore, our study focused on proposing deep learning models for automatic speech recognition in Turkish, complemented by the integration of a language model.
    METHODS: In our study, deep learning models were formulated by incorporating convolutional neural networks, gated recurrent units, long short-term memories, and transformer layers. The Zemberek library was employed to craft the language model to improve system performance. Furthermore, the Bayesian optimization method was applied to fine-tune the hyperparameters of the deep learning models. To evaluate the models' performance, standard metrics widely used in automatic speech recognition systems, specifically word error rate and character error rate, were employed.
    RESULTS: The experimental results show that, with optimal hyperparameters applied to the models developed with various layers, the scores are as follows: without a language model, the Turkish Microphone Speech Corpus dataset yields a word error rate of 22.2 and a character error rate of 14.05, while the Turkish Speech Corpus dataset yields a word error rate of 11.5 and a character error rate of 4.15. Upon incorporating the language model, notable improvements were observed. Specifically, for the Turkish Microphone Speech Corpus dataset, the word error rate decreased to 9.85 and the character error rate lowered to 5.35. Similarly, for the Turkish Speech Corpus dataset, the word error rate improved to 8.4 and the character error rate decreased to 2.7. These results demonstrate that our model outperforms the studies found in the existing literature.
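    Word error rate, the study's main metric, is the Levenshtein edit distance between the reference and hypothesis word sequences divided by the reference length; character error rate is the same computation over characters. A minimal self-contained implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference length, via the
    standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# one deleted word out of four in a toy Turkish sentence
print(word_error_rate("bu bir deneme cümlesi", "bu deneme cümlesi"))  # 0.25
```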

  • Article type: Journal Article
    This research presents the development of a cutting-edge real-time multilingual speech recognition and speaker diarization system that leverages OpenAI's Whisper model. The system specifically addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and managing frequent speaker switches. Traditional speech recognition systems often fall short in such complex multilingual and multispeaker contexts, particularly in SD. This study, therefore, integrates advanced speech recognition with speaker diarization techniques optimized for real-time applications. These optimizations include handling model outputs efficiently and incorporating speaker embedding technology. The system was evaluated using data from Taiwanese talk shows and political commentary programs, featuring 46 diverse speakers. The results showed a promising word diarization error rate (WDER) of 2.68% in two-speaker scenarios and 11.65% in three-speaker scenarios, with an overall WDER of 6.96%. This performance is comparable to that of non-real-time baseline models, highlighting the system's ability to adapt to various complex conversational dynamics, a significant advancement in the field of real-time multilingual speech processing.
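    Under the simplifying assumption that hypothesis words have already been aligned to reference words, the word diarization error rate can be illustrated as the fraction of aligned words attributed to the wrong speaker; the aligned triples below are invented.

```python
aligned = [
    # (word, reference_speaker, hypothesized_speaker)
    ("大家", "A", "A"),
    ("好",   "A", "A"),
    ("歡迎", "A", "B"),   # speaker switch missed by the system
    ("收看", "B", "B"),
]

wrong = sum(ref != hyp for _, ref, hyp in aligned)
wder = wrong / len(aligned)
print(f"WDER = {wder:.2%}")  # 25.00%
```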

  • Article type: Journal Article
    BACKGROUND: Prior research has associated spoken language use with depression, yet studies often involve small or non-clinical samples and face challenges in the manual transcription of speech. This paper aimed to automatically identify depression-related topics in speech recordings collected from clinical samples.
    METHODS: The data included 3919 English free-response speech recordings collected via smartphones from 265 participants with a depression history. We transcribed the speech recordings via automatic speech recognition (Whisper tool, OpenAI) and identified principal topics from the transcriptions using a deep learning topic model (BERTopic). To identify depression risk topics and understand their context, we compared participants' depression severity and behavioral (extracted from wearable devices) and linguistic (extracted from transcribed texts) characteristics across the identified topics.
    RESULTS: Of the 29 topics identified, 6 were risk topics for depression: 'No Expectations', 'Sleep', 'Mental Therapy', 'Haircut', 'Studying', and 'Coursework'. Participants mentioning depression risk topics exhibited higher sleep variability, later sleep onset, and fewer daily steps, and used fewer words, more negative language, and fewer leisure-related words in their speech recordings.
    LIMITATIONS: Our findings were derived from a depressed cohort with a specific speech task, potentially limiting their generalizability to non-clinical populations or other speech tasks. Additionally, some topics had small sample sizes, necessitating further validation in larger datasets.
    CONCLUSIONS: This study demonstrates that specific speech topics can indicate depression severity. The employed data-driven workflow provides a practical approach for analyzing large-scale speech data collected from real-world settings.
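    A minimal sketch of the two-stage pipeline described here (Whisper transcription, then BERTopic topic extraction) using the public openai-whisper and bertopic packages; the file paths are placeholders, and a real run needs a corpus large enough for BERTopic's clustering step.

```python
import whisper                 # pip install openai-whisper
from bertopic import BERTopic  # pip install bertopic

asr = whisper.load_model("base")
recordings = ["resp_001.wav", "resp_002.wav"]        # hypothetical file paths
docs = [asr.transcribe(path)["text"] for path in recordings]

topic_model = BERTopic(language="english")
topics, _ = topic_model.fit_transform(docs)          # one topic id per document
print(topic_model.get_topic_info())                  # summary table of topics
```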