text to speech

  • Article Type: Journal Article
    BACKGROUND: The digital era has witnessed an escalating dependence on digital platforms for news and information, coupled with the advent of "deepfake" technology. Deepfakes, leveraging deep learning models on extensive data sets of voice recordings and images, pose substantial threats to media authenticity, potentially leading to unethical misuse such as impersonation and the dissemination of false information.
    OBJECTIVE: To counteract this challenge, this study aims to introduce the concept of innate biological processes to discern between authentic human voices and cloned voices. We propose that the presence or absence of certain perceptual features, such as pauses in speech, can effectively distinguish between cloned and authentic audio.
    METHODS: A total of 49 adult participants representing diverse ethnic backgrounds and accents were recruited. Each participant contributed voice samples for the training of up to 3 distinct voice cloning text-to-speech models and 3 control paragraphs. Subsequently, the cloning models generated synthetic versions of the control paragraphs, resulting in a data set consisting of up to 9 cloned audio samples and 3 control samples per participant. We analyzed the speech pauses caused by biological actions such as respiration, swallowing, and cognitive processes. Five audio features corresponding to speech pause profiles were calculated. Differences between authentic and cloned audio for these features were assessed, and 5 classical machine learning algorithms were implemented using these features to create a prediction model. The generalization capability of the optimal model was evaluated through testing on unseen data, incorporating a model-naive generator, a model-naive paragraph, and model-naive participants.
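    For illustration, the pause-profile features described in the methods could be computed roughly as follows. This is a minimal sketch, not the authors' published code: the use of librosa, the silence threshold (top_db), and the 0.5-second micro/macropause cutoff are all assumptions.

      # Sketch of pause-profile feature extraction (assumed thresholds).
      import numpy as np
      import librosa

      def pause_features(path, micro_cutoff=0.5, top_db=30):
          y, sr = librosa.load(path, sr=None)
          # Non-silent [start, end] intervals in samples; the gaps between
          # consecutive intervals are the speech pauses.
          speech = librosa.effects.split(y, top_db=top_db)
          seg_lens = (speech[:, 1] - speech[:, 0]) / sr   # speech segment lengths (s)
          gaps = (speech[1:, 0] - speech[:-1, 1]) / sr    # pause lengths (s)
          total = len(y) / sr
          return {
              "time_between_pauses": float(seg_lens.mean()),
              "segment_length_sd": float(seg_lens.std()),
              "proportion_speaking": float(seg_lens.sum() / total),
              "micropause_rate": float((gaps < micro_cutoff).sum() / total),
              "macropause_rate": float((gaps >= micro_cutoff).sum() / total),
          }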
    RESULTS: Cloned audio exhibited significantly increased time between pauses (P<.001), decreased variation in speech segment length (P=.003), an increased overall proportion of time speaking (P=.04), and decreased rates of micro- and macropauses in speech (both P=.01). Five machine learning models were implemented using these features, with the AdaBoost model demonstrating the highest performance, achieving a 5-fold cross-validation balanced accuracy of 0.81 (SD 0.05). The other models were the support vector machine (balanced accuracy 0.79, SD 0.03), random forest (balanced accuracy 0.78, SD 0.04), logistic regression (balanced accuracy 0.76, SD 0.10), and decision tree (balanced accuracy 0.72, SD 0.06). When evaluated on unseen data, the optimal AdaBoost model achieved an overall test accuracy of 0.79.
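    The reported model comparison could be reproduced in outline as below, a sketch assuming scikit-learn with default hyperparameters; the placeholder data stand in for the real pause features and labels.

      # Sketch of the five-model comparison via 5-fold cross-validated
      # balanced accuracy; X and y below are placeholders, not study data.
      import numpy as np
      from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.svm import SVC
      from sklearn.tree import DecisionTreeClassifier

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 5))      # replace with the 5 pause features
      y = rng.integers(0, 2, size=100)   # assumed encoding: 1 = cloned, 0 = authentic

      models = {
          "AdaBoost": AdaBoostClassifier(),
          "Support vector machine": SVC(),
          "Random forest": RandomForestClassifier(),
          "Logistic regression": LogisticRegression(max_iter=1000),
          "Decision tree": DecisionTreeClassifier(),
      }
      for name, model in models.items():
          scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
          print(f"{name}: mean {scores.mean():.2f} (SD {scores.std():.2f})")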
    CONCLUSIONS: The incorporation of perceptual, biological features into machine learning models demonstrates promising results in distinguishing between authentic human voices and cloned audio.

  • Article Type: Journal Article
    BACKGROUND: Students with dyslexia may be at a disadvantage on timed assessments that require reading skills compared with their non-dyslexic peers, even though they are not necessarily less intelligent or less prepared than those peers.
    OBJECTIVE: The study aims to analyze the possible benefits in reading comprehension and reading fluency for students with dyslexia when using an assistance tool in an evaluation that requires reading skills.
    METHODS: Each participant with dyslexia completed a reading comprehension assessment in two different reading interfaces, on plain paper and in our previously developed tool, so that their reading comprehension and reading fluency could be assessed and compared.
    RESULTS: The results show that students with dyslexia could benefit, in terms of fluency and reading comprehension, from compensation software that helps them read in anxiety-inducing situations.
    CONCLUSIONS: Properly designed assistive tools can enhance the reading skills of young people with dyslexia. The developed tool should be refined to incorporate the improvements proposed in the present study, and a follow-up study with a larger number of participants, evaluated individually, is suggested.

  • Article Type: Journal Article
    When blind and deaf people are passengers in fully autonomous vehicles, an intuitive and accurate visualization screen should be provided for the deaf, and an audification system with speech-to-text (STT) and text-to-speech (TTS) functions should be provided for the blind. However, such systems cannot access the fault self-diagnosis information and the instrument cluster information that indicate the current state of the vehicle while driving. This paper proposes a deep learning-based audification and visualization system (AVS) of an autonomous vehicle for blind and deaf people to solve this problem. The AVS consists of three modules. The data collection and management module (DCMM) stores and manages the data collected from the vehicle. The audification conversion module (ACM) has a speech-to-text submodule (STS) that recognizes a user's speech and converts it to text data, and a text-to-wave submodule (TWS) that converts text data to voice. The data visualization module (DVM) visualizes the collected sensor data, fault self-diagnosis data, etc., and places the visualized data according to the size of the vehicle's display. The experiment shows that the time taken to adjust visualization graphic components in on-board diagnostics (OBD) was approximately 2.5 times faster than in a cloud server. In addition, the overall computational time of the AVS was approximately 2 ms faster than that of the existing instrument cluster. Therefore, because the proposed AVS enables blind and deaf people to select only what they want to hear and see, it reduces transmission overload and greatly increases the safety of the vehicle. If the AVS is introduced in a real vehicle, it could help prevent accidents involving disabled and other passengers.
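    As a structural reading of the abstract, the three AVS modules could be organized as below. The paper does not publish an API, so every name and signature beyond the module acronyms is a hypothetical placeholder.

      # Hypothetical skeleton of the AVS modules named in the abstract
      # (DCMM; ACM with STS/TWS submodules; DVM). All signatures are assumptions.
      class DCMM:
          """Data collection and management: stores data collected from the vehicle."""
          def __init__(self):
              self.records = []

          def store(self, reading):
              self.records.append(reading)

      class ACM:
          """Audification conversion between speech and text."""
          def speech_to_text(self, audio):   # STS submodule
              raise NotImplementedError("plug in an STT engine here")

          def text_to_wave(self, text):      # TWS submodule
              raise NotImplementedError("plug in a TTS engine here")

      class DVM:
          """Data visualization: lays out sensor and self-diagnosis data for the display."""
          def render(self, data, display_size):
              return {"display": display_size, "items": list(data)}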