Speech Acoustics

  • Article type: Journal Article
    BACKGROUND: Digital speech assessment has potential relevance in the earliest, preclinical stages of Alzheimer's disease (AD). We evaluated the feasibility, test-retest reliability, and association with AD-related amyloid-beta (Aβ) pathology of speech acoustics measured over multiple assessments in a remote setting.
    METHODS: Fifty cognitively unimpaired adults (age 68 ± 6.2 years, 58% female, 46% Aβ-positive) completed remote, tablet-based speech assessments (i.e., picture description, journal-prompt storytelling, and verbal fluency tasks) on five days. The testing paradigm was repeated after 2-3 weeks. Acoustic speech features were automatically extracted from the voice recordings, and mean scores were calculated over the 5-day period. We assessed feasibility by adherence rates and usability ratings on the System Usability Scale (SUS) questionnaire. Test-retest reliability was examined with intraclass correlation coefficients (ICCs). We investigated the associations between acoustic features and Aβ pathology using linear regression models adjusted for age, sex, and education.
    RESULTS: The speech assessment was feasible, indicated by 91.6% adherence and usability scores of 86.0 ± 9.9. High reliability (ICC ≥ 0.75) was found across averaged speech samples. Aβ-positive individuals displayed a higher pause-to-word ratio in picture description (B = -0.05, p = 0.040) and journal-prompt storytelling (B = -0.07, p = 0.032) than Aβ-negative individuals, although this effect lost significance after correction for multiple testing.
    CONCLUSIONS: Our findings support the feasibility and reliability of multi-day remote assessment of speech acoustics in cognitively unimpaired individuals with and without Aβ-pathology, which lays the foundation for the use of speech biomarkers in the context of early AD.
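    The test-retest step described in the methods can be illustrated with a small sketch. The abstract does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-measure form, ICC(2,1), is an assumption here, and the function name `icc_2_1` and the example values are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `scores` holds one row per participant; each column is one test period
    (for example, the 5-day mean of a feature in the test and in the retest).
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 participants and >= 2 equal-length columns")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for participants (rows), periods (columns), and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical pause-to-word ratios: [test period mean, retest period mean].
example = [
    [0.30, 0.31],
    [0.42, 0.40],
    [0.25, 0.27],
    [0.36, 0.35],
    [0.50, 0.48],
]
reliability = icc_2_1(example)
print(f"ICC(2,1) = {reliability:.3f}")
```

    Under the threshold quoted in the results, a value of at least 0.75 would count as high reliability.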

  • Article type: Journal Article
    A computational neuromuscular control system that generates lung pressure and three intrinsic laryngeal muscle activations (cricothyroid, thyroarytenoid, and lateral cricoarytenoid) to control the vocal source was developed. In the current study, LeTalker, a biophysical computational model of the vocal system, was used as the physical plant. In LeTalker, a three-mass vocal fold model was used to simulate self-sustained vocal fold oscillation. A constant /ə/ vowel was used for the vocal tract shape. The trachea was modeled after MRI measurements. The neuromuscular control system generates control parameters to achieve four acoustic targets (fundamental frequency, sound pressure level, normalized spectral centroid, and signal-to-noise ratio) and four somatosensory targets (vocal fold length and longitudinal fiber stress in the three vocal fold layers). The deep-learning-based control system comprises one acoustic feedforward controller and two feedback (acoustic and somatosensory) controllers. Fifty thousand steady speech signals were generated using LeTalker for training the control system. The results demonstrated that the control system was able to generate the lung pressure and the three muscle activations such that the four acoustic and four somatosensory targets were reached with high accuracy. After training, the motor command corrections from the feedback controllers were minimal compared to the feedforward controller, except for thyroarytenoid muscle activation.

  • Article type: Journal Article
    Abnormal speech prosody has been widely reported in individuals with autism. Many studies on children and adults with autism spectrum disorder speaking a non-tonal language showed deficits in using prosodic cues to mark focus. However, focus marking by autistic children speaking a tonal language is rarely examined. Cantonese-speaking children may face additional difficulties because tonal languages require them to use prosodic cues to achieve multiple functions simultaneously, such as lexical contrasting and focus marking. This study bridges this research gap by acoustically evaluating the use of Cantonese speech prosody to mark information structure by Cantonese-speaking children with and without autism spectrum disorder. We designed speech production tasks to elicit natural broad and narrow focus production among these children in sentences with different tone combinations. Acoustic correlates of prosodic focus marking, such as f0, duration, and intensity of each syllable, were analyzed to examine the effects of participant group, focus condition, and lexical tones. Our results showed differences in focus marking patterns between Cantonese-speaking children with and without autism spectrum disorder. The autistic children not only showed insufficient on-focus expansion in terms of f0 range and duration when marking focus, but also produced less distinctive tone shapes in general. There was no evidence that prosodic complexity (i.e., sentences with single tones or combinations of tones) significantly affected focus marking in these autistic children and their typically developing (TD) peers.

  • Article type: Journal Article
    How we produce and perceive voice is constrained by laryngeal physiology and biomechanics. Such constraints may present themselves as principal dimensions in the voice outcome space that are shared among speakers. This study attempts to identify such principal dimensions in the voice outcome space and the underlying laryngeal control mechanisms in a three-dimensional computational model of voice production. A large-scale voice simulation was performed with parametric variations in vocal fold geometry and stiffness, glottal gap, vocal tract shape, and subglottal pressure. Principal component analysis was applied to data combining both the physiological control parameters and voice outcome measures. The results showed three dominant dimensions accounting for at least 50% of the total variance. The first two dimensions describe respiratory-laryngeal coordination in controlling the energy balance between low- and high-frequency harmonics in the produced voice, and the third dimension describes control of the fundamental frequency. The dominance of these three dimensions suggests that voice changes along these principal dimensions are likely to be more consistently produced and perceived by most speakers than other voice changes, and thus are more likely to have emerged during evolution and be used to convey important personal information, such as emotion and larynx size.
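    The principal component step in this abstract can be sketched on synthetic data. This is only an illustration: the study's simulation output is not available here, so the random data, the standardization of each variable, and the helper name `explained_variance_ratio` are all assumptions. NumPy is assumed to be installed.

```python
import numpy as np


def explained_variance_ratio(data):
    """Share of total variance carried by each principal component.

    Columns are variables (control parameters and outcome measures side by
    side); rows are simulated conditions. Each column is standardized first,
    which is an assumption, not something the abstract specifies.
    """
    x = np.asarray(data, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Singular values of the standardized matrix give component variances.
    _, singular_values, _ = np.linalg.svd(x, full_matrices=False)
    variances = singular_values ** 2 / (x.shape[0] - 1)
    return variances / variances.sum()


# Synthetic stand-in: 8 observed variables driven by 3 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 8))
observed = latent @ mixing + 0.1 * rng.normal(size=(500, 8))

ratios = explained_variance_ratio(observed)
print("variance share of first three components:", ratios[:3].sum())
```

    A statement such as "three dominant dimensions accounting for at least 50% of the total variance" corresponds to the cumulative sum of the first three ratios.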

  • Article type: Journal Article
    Advancing age is associated with decreased sensitivity to temporal cues in word segments, particularly when target words follow non-informative carrier sentences or are spectrally degraded (e.g., vocoded to simulate cochlear-implant stimulation). This study investigated whether age, carrier sentences, and spectral degradation interacted to cause undue difficulty in processing speech temporal cues. Younger and older adults with normal hearing performed phonemic categorization tasks on two continua: a Buy/Pie contrast with voice onset time changes for the word-initial stop and a Dish/Ditch contrast with silent interval changes preceding the word-final fricative. Target words were presented in isolation or after non-informative carrier sentences, and were unprocessed or degraded via sinewave vocoding (2, 4, and 8 channels). Older listeners exhibited reduced sensitivity to both temporal cues compared to younger listeners. For the Buy/Pie contrast, age, carrier sentence, and spectral degradation interacted such that the largest age effects were seen for unprocessed words in the carrier sentence condition. This pattern differed from the Dish/Ditch contrast, where reducing spectral resolution exaggerated age effects, but introducing carrier sentences largely left the patterns unchanged. These results suggest that certain temporal cues are particularly susceptible to aging when placed in sentences, likely contributing to the difficulties of older cochlear-implant users in everyday environments.

  • Article type: Journal Article
    Unilateral vocal cord paralysis is frequently observed in patients who undergo thyroid surgery. This study explored the correlation between acoustic voice analysis (an objective measure) and the Voice Handicap Index (VHI, a self-assessment tool). One hundred and forty patients who had thyroid surgery with or without postoperative unilateral vocal cord paralysis (PVCP and NPVCP) were included. The patients were evaluated with the VHI and Dysphonia Severity Index (DSI) tools. VHI scores were significantly higher in PVCP patients than in NPVCP patients. Jitter (%) and shimmer (%) were significantly increased, whereas DSI was significantly decreased, in PVCP patients. Receiver operating characteristic curve analysis revealed that VHI scores were associated with the diagnosis of PVCP; the VHI total score yielded an area under the curve (AUC) of 0.81. Among acoustic parameters, DSI was highly associated with PVCP (AUC = 0.82, 95% CI = 0.75 to 0.89). Moreover, we found a correlation between VHI scores and voice acoustic parameters. Among them, DSI had a moderate correlation with functional and VHI scores, with R values of 0.41 and 0.49, respectively. VHI scores and acoustic parameters were associated with the diagnosis of PVCP.
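    As a rough illustration of what the reported AUC values measure, the sketch below computes AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the sample values are hypothetical and are not the study's data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    `labels` uses 1 for the positive class (e.g., PVCP) and 0 otherwise.
    Higher `scores` are assumed to point toward the positive class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical VHI total scores; 1 = paralysis present, 0 = absent.
vhi_total = [12, 35, 8, 41, 20, 15, 50, 5]
has_pvcp = [0, 1, 0, 1, 1, 0, 1, 0]
print("AUC for VHI total:", roc_auc(vhi_total, has_pvcp))
```

    Because DSI is lower in PVCP patients, a measure like it would be negated (or the labels flipped) before being passed to a function written this way.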

  • Article type: Journal Article
    This study aims to investigate the effect of detoxification on acoustic features of Mandarin speech. Speech recordings were collected from 66 male abstinent heroin users with different durations of drug detoxification, specifically early abstinent users with a detoxification duration of less than 2 years, sustained abstinent users with 2 years of detoxification, and long-term abstinent users with a detoxification duration of more than 2 years. The acoustic analyses showed that, compared to the sustained and long-term abstinent users, early abstinent users exhibited lower loudness and relative energies of F1, F2, and F3, higher H1-A3, fewer loudness peaks per second, and a longer average duration of unvoiced segments. The findings suggest that detoxification may lead to a rehabilitation process in the speech production of abstinent heroin users (e.g., less vocal hoarseness). This study not only provides valuable insights into the effect of detoxification on speech production but also provides a theoretical basis for the speech rehabilitation and detoxification treatment of heroin users.

  • Article type: Journal Article
    Although different factors and voice measures have been associated with phonotraumatic vocal hyperfunction (PVH), it is unclear what percentage of individuals with PVH exhibit such differences during their daily lives. This study used a machine learning approach to quantify the consistency with which PVH manifests according to ambulatory voice measures. Analyses included acoustic parameters of phonation as well as temporal aspects of phonation and rest, with the goal of determining optimally consistent signatures of PVH.
    Ambulatory neck-surface acceleration signals were recorded over 1 week from 116 female participants diagnosed with PVH and age-, sex-, and occupation-matched vocally healthy controls. The consistency of the manifestation of PVH was defined as the percentage of participants in each group that exhibited an atypical signature based on a target voice measure. Evaluation of each machine learning model used nested 10-fold cross-validation to improve the generalizability of findings. In Experiment 1, we trained separate logistic regression models based on the distributional characteristics of 14 voice measures and durations of voicing and resting segments. In Experiments 2 and 3, features of voicing and resting duration augmented the existing distributional characteristics to examine whether more consistent signatures would result.
    Experiment 1 showed that the difference in the magnitude of the first two harmonics (H1-H2) exhibited the most consistent signature (69.4% of participants with PVH and 20.4% of controls had an atypical H1-H2 signature), followed by spectral tilt over eight harmonics (73.6% of participants with PVH and 32.1% of controls had an atypical spectral tilt signature) and estimated sound pressure level (SPL; 66.9% of participants with PVH and 27.6% of controls had an atypical SPL signature). Additionally, 77.6% of participants with PVH had atypical resting duration, and 68.9% exhibited atypical voicing duration. Experiments 2 and 3 showed that augmenting the best-performing voice measures with univariate features of voicing or resting durations yielded only incremental improvement in the classifier's performance.
    Females with PVH were more likely to use more abrupt vocal fold closure (lower H1-H2), phonate louder (higher SPL), and take shorter vocal rests. They were also less likely to use higher fundamental frequency during their daily activities. The difference in the voicing duration signature between participants with PVH and controls had a large effect size, providing strong empirical evidence regarding the role of voice use in the development of PVH.
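    The evaluation scheme named in this abstract, nested cross-validation around a logistic regression, can be sketched as follows. Everything beyond those two ingredients is an assumption for illustration: the synthetic features, the 5-fold inner loop, the tuned regularization strength `C`, and the balanced-accuracy metric are not taken from the study. scikit-learn is assumed to be installed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for distributional features of one voice measure.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# Inner loop picks hyperparameters; outer loop estimates generalization.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipeline, param_grid, cv=inner_cv, scoring="balanced_accuracy"
)

# Each outer fold refits the whole search on its own training split, so the
# held-out fold never influences hyperparameter selection.
outer_scores = cross_val_score(
    search, X, y, cv=outer_cv, scoring="balanced_accuracy"
)
print("mean outer-fold balanced accuracy:", outer_scores.mean())
```

    Keeping the scaler inside the pipeline means it is fit only on training data in every fold, which avoids leaking test-fold statistics into the model.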

  • Article type: Journal Article
    Cochlear implant users experience difficulties controlling their vocalizations compared to normal-hearing peers. However, less is known about their voice quality. The primary aim of the present study was to determine if cochlear implant users' voice quality would be categorized as dysphonic by the Acoustic Voice Quality Index (AVQI) and smoothed cepstral peak prominence (CPPS). A secondary aim was to determine if vocal quality is further impacted when using bilateral implants compared to using only one implant. The final aim was to determine how residual hearing impacts voice quality. Twenty-seven cochlear implant users participated in the present study and were recorded while sustaining a vowel and while reading a standardized passage. These recordings were analyzed to calculate the AVQI and CPPS. The results indicate that CI users' voice quality was detrimentally affected by using their CI, rising to the level of a dysphonic voice. Specifically, when using their CI, mean AVQI scores were 4.0 and mean CPPS values were 11.4 dB, which indicates dysphonia. There were no significant differences in voice quality when comparing participants with bilateral implants to those with one implant. Finally, for participants with residual hearing, as hearing thresholds worsened, the likelihood of a dysphonic voice decreased.

  • Article type: Journal Article
    Speech is produced by a nonlinear, dynamical Vocal Tract (VT) system, and is transmitted through multiple (air, bone, and skin conduction) modes, as captured by the air, bone, and throat microphones, respectively. Speaker-specific characteristics that capture this nonlinearity are rarely used as stand-alone features for speaker modeling, and at best have been used in tandem with well-known linear spectral features to produce tangible results. This paper proposes Recurrent Plot (RP) embeddings as stand-alone, nonlinear speaker-discriminating features. Two datasets, the continuous multimodal TIMIT speech corpus and the consonant-vowel unimodal syllable dataset, are used in this study for conducting closed-set speaker identification experiments. Experiments with unimodal speaker recognition systems show that RP embeddings capture the nonlinear dynamics of the VT system, which are unique to every speaker, in all the modes of speech. The Air (A), Bone (B), and Throat (T) microphone systems, trained purely on RP embeddings, perform with an accuracy of 95.81%, 98.18%, and 99.74%, respectively. Experiments using the joint feature space of combined RP embeddings for bimodal (A-T, A-B, B-T) and trimodal (A-B-T) systems show that the best trimodal system (99.84% accuracy) performs on par with trimodal systems using spectrogram (99.45%) and MFCC (99.98%). The 98.84% performance of the B-T bimodal system shows the efficacy of a speaker recognition system based entirely on alternate (bone and throat) speech, in the absence of the standard (air) speech. The results underscore the significance of the RP embedding as a nonlinear feature representation of the dynamical VT system that can act independently for speaker recognition. It is envisaged that speech recognition too will benefit from this nonlinear feature.
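    For readers unfamiliar with the underlying representation, the sketch below builds a basic binary recurrence plot from a short signal using time-delay embedding and a distance threshold. It shows only the plot itself, not the embedding network or features used in the paper; the embedding dimension, delay, threshold, and function names are assumptions chosen for illustration.

```python
import math


def delay_embed(signal, dim, delay):
    """Time-delay embedding: each state is `dim` samples spaced `delay` apart."""
    count = len(signal) - (dim - 1) * delay
    if count <= 0:
        raise ValueError("signal too short for this dim and delay")
    return [[signal[i + j * delay] for j in range(dim)] for i in range(count)]


def recurrence_plot(signal, dim=3, delay=2, threshold=0.2):
    """Binary matrix with 1 where two embedded states lie within `threshold`."""
    states = delay_embed(signal, dim, delay)
    size = len(states)
    return [
        [1 if math.dist(states[i], states[j]) <= threshold else 0
         for j in range(size)]
        for i in range(size)
    ]


# A pure tone with a period of 20 samples stands in for a voiced speech frame.
tone = [math.sin(2 * math.pi * i / 20) for i in range(60)]
plot = recurrence_plot(tone)
print(len(plot), "x", len(plot[0]), "recurrence matrix")
```

    For a periodic input like this tone, states one period apart coincide, so the matrix shows diagonal lines parallel to the main diagonal; less regular dynamics produce more broken textures.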