Speech recognition

  • Article type: Journal Article
    The teaching of Chinese as a second language has become increasingly crucial for promoting cross-cultural exchange and mutual learning worldwide. However, traditional approaches to international Chinese language teaching have limitations that hinder their effectiveness, such as outdated teaching materials, a lack of qualified instructors, and limited access to learning facilities. To overcome these challenges, it is imperative to develop intelligent and visually engaging methods for teaching international Chinese language learners. In this article, we propose leveraging speech recognition technology within artificial intelligence to create an oral assistance platform that provides visualized pinyin-formatted feedback to learners. Additionally, this system can identify accent errors and provide vocational skills training to improve learners' communication abilities. To achieve this, we propose the Attention-Connectionist Temporal Classification (CTC) model, which utilizes a specific temporal convolutional neural network to capture the location information necessary for accurate speech recognition. Our experimental results demonstrate that this model outperforms similar approaches, with significant reductions in error rates on both the validation and test sets; compared with the original Attention model, the character error rate (CER) is reduced by 0.67%. Overall, our proposed approach has significant potential for enhancing the efficiency and effectiveness of vocational skills training for international Chinese language learners.
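
    The abstract above does not spell out implementation details of the Attention-CTC model. As a rough, hedged illustration only (assuming PyTorch and hypothetical tensor shapes, not the authors' code), the sketch below shows how a CTC loss and an attention-decoder cross-entropy loss are commonly combined into a joint training objective.

```python
# Illustrative sketch only: a joint CTC + attention objective as commonly used in
# hybrid end-to-end ASR. Module names, shapes, and the weighting are assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridCTCAttentionLoss(nn.Module):
    def __init__(self, blank_id=0, ctc_weight=0.3):
        super().__init__()
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.ctc_weight = ctc_weight

    def forward(self, ctc_log_probs, input_lengths, ctc_targets, target_lengths,
                decoder_logits, decoder_targets):
        # ctc_log_probs: (T, B, V) log-probabilities from the encoder's CTC head
        # decoder_logits: (B, L, V) logits from the attention decoder
        # decoder_targets: (B, L) token ids, padded with -100 where ignored
        loss_ctc = self.ctc_loss(ctc_log_probs, ctc_targets,
                                 input_lengths, target_lengths)
        loss_att = F.cross_entropy(
            decoder_logits.reshape(-1, decoder_logits.size(-1)),
            decoder_targets.reshape(-1),
            ignore_index=-100,
        )
        # Weighted interpolation of the two objectives (the weight is a tuning choice)
        return self.ctc_weight * loss_ctc + (1.0 - self.ctc_weight) * loss_att
```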

  • Article type: Journal Article
    Background: In cochlear implant (CI) treatment, there is a large variability in outcome. The aim of our study was to identify the independent audiometric measures that are most directly relevant for describing this variability in outcome characteristics of CI recipients. An extended audiometric test battery was used with selected adult patients in order to characterize the full range of CI outcomes. Methods: CI users were recruited for this study on the basis of their postoperative results and divided into three groups: low (1st quartile), moderate (median decile), and high hearing performance (4th quartile). Speech recognition was measured in quiet by using (i) monosyllabic words (40-80 dB SPL), (ii) speech reception threshold (SRT) for numbers, and (iii) the German matrix test in noise. In order to reconstruct demanding everyday listening situations in the clinic, the temporal characteristics of the background noise and the spatial arrangements of the signal sources were varied for tests in noise. In addition, a survey was conducted using the Speech, Spatial, and Qualities (SSQ) questionnaire and the Listening Effort (LE) questionnaire. Results: Fifteen subjects per group were examined (total N = 45), who did not differ significantly in terms of age, time after CI surgery, or CI use behavior. The groups differed mainly in the results of speech audiometry. For speech recognition, significant differences were found between the three groups for the monosyllabic tests in quiet and for the sentences in stationary (S0°N0°) and fluctuating (S0°NCI) noise. Word comprehension and sentence comprehension in quiet were both strongly correlated with the SRT in noise. This observation was also confirmed by a factor analysis. No significant differences were found between the three groups for the SSQ questionnaire and the LE questionnaire results. The results of the factor analysis indicate that speech recognition in noise provides information highly comparable to information from speech intelligibility in quiet. Conclusions: The factor analysis highlighted three components describing the postoperative outcome of CI patients. These were (i) the audiometrically measured supra-threshold speech recognition and (ii) near-threshold audibility, as well as (iii) the subjective assessment of the relationship to real life as determined by the questionnaires. These parameters appear well suited to setting up a framework for a test battery to assess CI outcomes.
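
    The reported factor analysis is only summarized above; purely as an illustration of that kind of analysis (with invented data and column names, not the study's measures), a minimal sketch using scikit-learn might look like this.

```python
# Illustration only: exploratory factor analysis over hypothetical CI outcome
# measures (the columns below are invented for the example, not the study's data).
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Hypothetical per-subject measures (45 subjects x 5 outcome variables)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(45, 5)),
    columns=["monosyllables_quiet", "srt_numbers", "srt_noise_s0n0",
             "ssq_score", "le_score"],
)

# Standardize, then extract three components as in the reported analysis
X = StandardScaler().fit_transform(df)
fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0).fit(X)

loadings = pd.DataFrame(fa.components_.T, index=df.columns,
                        columns=["factor_1", "factor_2", "factor_3"])
print(loadings.round(2))  # which measures load on which component
```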

  • Article type: Journal Article
    Speech comprehension can be challenging due to multiple factors, causing inconvenience for both the speaker and the listener. In such situations, using a humanoid robot, Pepper, can be beneficial as it can display the corresponding text on its screen. However, prior to that, it is essential to carefully assess the accuracy of the audio recordings captured by Pepper. Therefore, in this study, an experiment was conducted with eight participants with the primary objective of examining Pepper's speech recognition system with the help of audio features such as Mel-Frequency Cepstral Coefficients, spectral centroid, spectral flatness, zero-crossing rate, pitch, and energy. Furthermore, the K-means algorithm was employed to create clusters based on these features, with the aim of selecting the most suitable cluster with the help of the speech-to-text conversion tool Whisper. The best cluster was selected by finding the cluster containing the most high-accuracy data points; to achieve this, data points with a WER above 0.3 were discarded. The findings of this study suggest that a distance of up to one meter from the humanoid robot Pepper is suitable for capturing the best speech recordings, whereas age and gender do not influence the accuracy of the recorded speech. The proposed system will provide significant benefits in settings where subtitles are required to improve the comprehension of spoken statements.
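
    As a hedged sketch of the described pipeline (audio features, K-means clustering, and Whisper-based WER filtering), the code below uses librosa, scikit-learn, openai-whisper, and jiwer with hypothetical file paths and reference transcripts; it is not the study's implementation.

```python
# Illustrative sketch (not the study's code): extract the listed audio features,
# cluster recordings with K-means, and keep the cluster whose Whisper transcripts
# stay under a WER threshold. File paths and transcripts below are hypothetical.
import numpy as np
import librosa
import whisper
from jiwer import wer
from sklearn.cluster import KMeans

def audio_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    flatness = librosa.feature.spectral_flatness(y=y).mean()
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)       # pitch track
    energy = librosa.feature.rms(y=y).mean()
    return np.concatenate([mfcc, [centroid, flatness, zcr, np.nanmean(f0), energy]])

recordings = ["rec_01.wav", "rec_02.wav"]             # hypothetical Pepper recordings
references = ["hello pepper", "show the subtitles"]   # hypothetical ground truth

X = np.vstack([audio_features(p) for p in recordings])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

model = whisper.load_model("base")
errors = [wer(ref, model.transcribe(p)["text"].lower().strip())
          for p, ref in zip(recordings, references)]

# Keep only data points with WER <= 0.3, then pick the cluster holding most of them
good = [i for i, e in enumerate(errors) if e <= 0.3]
best_cluster = np.bincount(labels[good]).argmax() if good else None
print("best cluster:", best_cluster)
```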

  • Article type: Journal Article
    Speech-recognition tests are widely used in both clinical and research audiology. The purpose of this study was to develop a novel speech-recognition test that combines concepts from different speech-recognition tests to reduce training effects and to allow for a large set of speech material. The new test consists of four different words per trial in a meaningful construct with a fixed structure, the so-called phrases. Various free databases were used to select the words and to determine their frequency. Highly frequent nouns were grouped into thematic categories and combined with related adjectives and infinitives. After discarding inappropriate and unnatural combinations and eliminating duplications of (sub-)phrases, a total of 772 phrases remained. Subsequently, the phrases were synthesized using a text-to-speech system. The synthesis significantly reduces the effort compared to recordings with a real speaker. After excluding outliers, speech-recognition scores measured for the phrases with 31 normal-hearing participants at fixed signal-to-noise ratios (SNR) revealed speech-recognition thresholds (SRT) for each phrase varying by up to 4 dB. The median SRT was -9.1 dB SNR and thus comparable to existing sentence tests. The slope of the psychometric function, 15 percentage points per dB, is also comparable and enables efficient use in audiology. In summary, the principle of creating speech material in a modular system has many potential applications.
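
    To make the modular phrase principle concrete, here is a small, purely illustrative sketch of generating fixed-structure phrases from thematic word lists; the word lists and the phrase template are invented, not taken from the test material.

```python
# Illustration only: building phrases from thematic word lists in the spirit of
# the modular system described above. Words and structure are invented examples.
from itertools import product

categories = {
    "food": {
        "nouns": ["apples", "bread"],
        "adjectives": ["fresh", "sweet"],
        "infinitives": ["to buy", "to eat"],
    },
}

phrases = set()
for topic, words in categories.items():
    for adj, noun, inf in product(words["adjectives"], words["nouns"],
                                  words["infinitives"]):
        phrase = f"{inf} {adj} {noun}"   # fixed structure: infinitive + adjective + noun
        phrases.add(phrase)              # the set eliminates duplicate (sub)phrases

print(sorted(phrases))
# Inappropriate or unnatural combinations would still need manual screening, and
# each remaining phrase would then be synthesized with a text-to-speech system.
```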

  • Article type: Journal Article
    This pilot study addresses the pervasive issue of burnout among nurses and health disciplines, often exacerbated by the use of electronic health record (EHR) systems. Recognizing the potential of dictation to alleviate documentation burden, the study focuses on the adoption of speech recognition technology (SRT) in a large Canadian urban mental health and addiction teaching hospital. Clinicians who participated in the pilot provided feedback on their experiences via a survey, and analytics data were examined to measure usage and adoption patterns. Preliminary feedback reveals a subset of participants rapidly embracing the technology, reporting decreased documentation times and increased efficiency. However, some clinicians experienced challenges related to initial setup time and the effort of adjusting to a novel documentation approach.

  • Article type: Journal Article
    In recent years, embedded system technologies and products for sensor networks and wearable devices used for monitoring people's activities and health have become the focus of the global IT industry. In order to enhance the speech recognition capabilities of wearable devices, this article discusses the implementation of audio positioning and enhancement in embedded systems using embedded algorithms for direction detection and mixed source separation. The two algorithms are implemented using different embedded systems: direction detection developed using the TI TMS320C6713 DSK and mixed source separation developed using the Raspberry Pi 2. For mixed source separation, in the first experiment, the average signal-to-interference ratio (SIR) at distances of 1 m and 2 m was 16.72 and 15.76, respectively. In the second experiment, when evaluated using speech recognition, the algorithm improved speech recognition accuracy to 95%.
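
    The paper's embedded implementation is not described in the abstract; as a generic illustration of how a signal-to-interference ratio can be computed for a separated output against known target and interference signals, consider the following sketch with synthetic data.

```python
# Generic SIR computation sketch: ratio of the target-signal energy to the energy
# of the interference remaining in a separated output. Signals here are synthetic
# placeholders, not data from the paper.
import numpy as np

def sir_db(separated, target, interference):
    """SIR = 10 * log10(||target component||^2 / ||interference component||^2)."""
    # Project the separated output onto the known target and interference signals
    t = np.dot(separated, target) / np.dot(target, target) * target
    i = np.dot(separated, interference) / np.dot(interference, interference) * interference
    return 10.0 * np.log10(np.sum(t ** 2) / np.sum(i ** 2))

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interference = rng.standard_normal(16000)
separated = target + 0.1 * interference      # an imperfectly separated output
print(f"SIR: {sir_db(separated, target, interference):.2f} dB")
```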

  • Article type: Journal Article
    Introduction: In clinical practice, patients with the same degree and configuration of hearing loss, or even with normal audiometric thresholds, present substantially different performances in terms of speech perception. This probably happens because other factors, in addition to auditory sensitivity, interfere with speech perception. Thus, studies are needed to investigate the performance of listeners in unfavorable listening conditions to identify the processes that interfere in the speech perception of these subjects. Objective: To verify the influence of age, temporal processing, and working memory on speech recognition in noise. Methods: Thirty-eight adult and elderly individuals with normal hearing thresholds participated in the study. Participants were divided into two groups: the adult group (G1), composed of 10 individuals aged 21 to 33 years, and the elderly group (G2), with 28 participants aged 60 to 81 years. They underwent audiological assessment with the Portuguese Sentence List Test, Gaps-in-Noise test, Digit Span Memory test, Running Span Task, Corsi Block-Tapping test, and Visual Pattern test. Results: The Running Span Task score proved to be a statistically significant predictor of the listening-in-noise variable. This result showed that the difference in performance between groups G1 and G2 in relation to listening in noise is due not only to aging, but also to changes in working memory. Conclusion: The study showed that working memory is a predictor of listening performance in noise in individuals with normal hearing, and that this task can provide important information for investigation in individuals who have difficulty hearing in unfavorable environments.
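
    The predictor analysis is only summarized above; as an illustration of that type of model (with invented data, not the study's), a multiple regression of a listening-in-noise score on age and working-memory measures could be sketched as follows.

```python
# Illustration with invented data: regressing a listening-in-noise score on age
# and working-memory measures, in the spirit of the predictor analysis above.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 38
df = pd.DataFrame({
    "age": rng.integers(21, 81, size=n),
    "running_span": rng.normal(20, 5, size=n),
    "digit_span": rng.normal(6, 1.5, size=n),
})
# Hypothetical listening-in-noise threshold (lower = better), driven here mostly
# by the running-span score to mirror the reported finding
df["lisn_srt"] = (2.0 - 0.15 * df["running_span"] + 0.02 * df["age"]
                  + rng.normal(0, 0.5, size=n))

X = sm.add_constant(df[["age", "running_span", "digit_span"]])
model = sm.OLS(df["lisn_srt"], X).fit()
print(model.summary())   # coefficients and p-values for each predictor
```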

  • Article type: Journal Article
    Existing end-to-end speech recognition methods typically employ hybrid decoders based on CTC and Transformer. However, the issue of error accumulation in these hybrid decoders hinders further improvements in accuracy. Additionally, most existing models are built upon the Transformer architecture, which tends to be complex and unfriendly to small datasets. Hence, we propose a Nonlinear Regularization Decoding Method for Speech Recognition. First, we introduce a nonlinear Transformer decoder, breaking away from traditional left-to-right or right-to-left decoding orders and enabling associations between any characters, which mitigates the limitations of Transformer architectures on small datasets. Second, we propose a novel regularization attention module to optimize the attention score matrix, reducing the impact of early errors on later outputs. Finally, we introduce a tiny model to address the challenge of overly large model parameters. The experimental results indicate that our model performs well: compared to the baseline, it achieves recognition improvements of 0.12%, 0.54%, 0.51%, and 1.2% on the Aishell1, Primewords, Free ST Chinese Corpus, and Uyghur Common Voice 16.1 datasets, respectively.
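
    The abstract does not define the regularization attention module, so the following is only a generic, hedged example of penalizing an attention score matrix during training (here with an entropy term); it is not the paper's method.

```python
# Generic illustration (not the paper's actual module): adding a penalty on the
# attention score matrix to the training loss, here an entropy term that
# discourages overly diffuse attention distributions.
import torch
import torch.nn.functional as F

def attention_entropy_penalty(scores, mask=None):
    """scores: (batch, heads, query_len, key_len) raw attention scores."""
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    probs = F.softmax(scores, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)
    return entropy.mean()

# Usage inside a training step (names and the 0.01 weight are placeholders):
# loss = ce_loss + 0.01 * attention_entropy_penalty(attn_scores, key_padding_mask)
```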

  • Article type: Journal Article
    Enhancing the sensitivity of capacitive pressure sensors through microstructure design may compromise the reliability of the device and rely on intricate manufacturing processes. Balancing the intrinsic properties (elastic modulus and dielectric constant) of the dielectric layer material is an effective way to solve this issue. Here, we introduce a liquid metal (LM) hybrid elastomer prepared from a chain-extension-free polyurethane (PU) and LM. The synergistic strategy of chain-extender-free synthesis and LM doping effectively reduces the elastic modulus of the LM hybrid elastomers (from 7.6 ± 0.2 to 2.1 ± 0.3 MPa) and enhances their dielectric constant (from 5.12 to 8.17 at 1 kHz). Interestingly, the LM hybrid elastomer combines reprocessability, recyclability, and photothermal conversion. The obtained flexible pressure sensor can be used to detect hand and throat muscle movements, and high-precision speech recognition of seven words has been achieved with a convolutional neural network (CNN) in deep learning. This work provides an approach for designing and manufacturing wearable, recyclable, and intelligently controlled pressure sensors.
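
    The CNN used for the seven-word recognition task is not specified in the abstract; the sketch below is a hypothetical, minimal 1D CNN classifier for single-channel pressure-sensor waveforms, illustrating the general idea rather than the reported network.

```python
# Illustration only: a small 1D CNN for classifying seven spoken words from a
# single-channel pressure-sensor waveform. Architecture and input length are
# assumptions for the sketch, not the paper's reported network.
import torch
import torch.nn as nn

class SevenWordCNN(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):            # x: (batch, 1, signal_length) sensor signal
        return self.classifier(self.features(x).squeeze(-1))

model = SevenWordCNN()
dummy = torch.randn(8, 1, 1024)      # batch of hypothetical sensor traces
print(model(dummy).shape)            # torch.Size([8, 7])
```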

  • Article type: Journal Article
    Speech disorders profoundly impact the overall quality of life by impeding social functioning and hindering effective communication. This study addresses the gap in systematic reviews concerning machine learning-based assistive technology for individuals with speech disorders. The overarching purpose is to offer a comprehensive overview of the field through a Systematic Literature Review (SLR) and provide valuable insights into the landscape of ML-based solutions and related studies.
    The research employs a systematic approach, utilizing a Systematic Literature Review (SLR) methodology. The study extensively examines the existing literature on machine learning-based assistive technology for speech disorders. Specific attention is given to ML techniques, the characteristics of the datasets exploited in the training phase, speaker languages, feature extraction techniques, and the features employed by ML algorithms.
    This study contributes to the existing literature by systematically exploring the machine learning landscape in assistive technology for speech disorders. The originality lies in the focused investigation of ML-based speech recognition for users with impaired speech over ten years (2014-2023). The emphasis on systematic research questions related to ML techniques, dataset characteristics, languages, feature extraction techniques, and feature sets adds a unique and comprehensive perspective to the current discourse.
    The systematic literature review identifies significant trends and critical studies published between 2014 and 2023. In the analysis of 65 papers from prestigious journals, support vector machines and neural networks (CNN, DNN) were the most utilized ML techniques (20% and 16.92%, respectively), and the most studied disorder was dysarthria (35/65 studies, 54%). Furthermore, an upsurge in the use of neural network-based architectures, mainly CNN and DNN, was observed after 2018. Almost half of the included studies were published between 2021 and 2022.
