Keywords: artificial intelligence; audio; cloned; cloning; deep learning; deepfake; deepfakes; machine learning; model-naive; sound; sounds; speech; text to speech; vocal; vocal biomarkers; voice

Source: DOI: 10.2196/56245   PDF (PubMed)

Abstract:
BACKGROUND: The digital era has witnessed an escalating dependence on digital platforms for news and information, coupled with the advent of "deepfake" technology. Deepfakes, leveraging deep learning models on extensive data sets of voice recordings and images, pose substantial threats to media authenticity, potentially leading to unethical misuse such as impersonation and the dissemination of false information.
OBJECTIVE: To counteract this challenge, this study aims to introduce the concept of innate biological processes to discern between authentic human voices and cloned voices. We propose that the presence or absence of certain perceptual features, such as pauses in speech, can effectively distinguish between cloned and authentic audio.
METHODS: A total of 49 adult participants representing diverse ethnic backgrounds and accents were recruited. Each participant contributed voice samples for the training of up to 3 distinct voice cloning text-to-speech models and 3 control paragraphs. Subsequently, the cloning models generated synthetic versions of the control paragraphs, resulting in a data set consisting of up to 9 cloned audio samples and 3 control samples per participant. We analyzed the speech pauses caused by biological actions such as respiration, swallowing, and cognitive processes. Five audio features corresponding to speech pause profiles were calculated. Differences between authentic and cloned audio for these features were assessed, and 5 classical machine learning algorithms were implemented using these features to create a prediction model. The generalization capability of the optimal model was evaluated through testing on unseen data, incorporating a model-naive generator, a model-naive paragraph, and model-naive participants.
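The pause-profile pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame sizes, energy threshold, and the micropause/macropause boundary are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def pause_features(signal, sr, frame_ms=25, hop_ms=10,
                   energy_thresh=0.02, micro_max_s=0.5):
    """Sketch of five pause-profile features from a mono audio signal.

    All thresholds (energy_thresh, micro_max_s, frame/hop sizes) are
    illustrative assumptions. Pauses no longer than micro_max_s are
    counted as micropauses; longer pauses as macropauses.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    # Frame-level RMS energy as a crude voice-activity detector
    rms = np.array([
        np.sqrt(np.mean(signal[i:i + frame] ** 2))
        for i in range(0, len(signal) - frame, hop)
    ])
    voiced = rms > energy_thresh  # True where speech is present

    # Collapse frames into contiguous voiced/unvoiced runs (seconds)
    runs, start = [], 0
    for i in range(1, len(voiced) + 1):
        if i == len(voiced) or voiced[i] != voiced[start]:
            runs.append((bool(voiced[start]), (i - start) * hop / sr))
            start = i
    seg_lens = [d for v, d in runs if v]        # speech segments
    pause_lens = [d for v, d in runs if not v]  # pauses
    total = len(voiced) * hop / sr

    return {
        "mean_time_between_pauses": float(np.mean(seg_lens)) if seg_lens else 0.0,
        "speech_segment_length_sd": float(np.std(seg_lens)) if seg_lens else 0.0,
        "proportion_speaking": sum(seg_lens) / total if total else 0.0,
        "micropause_rate": sum(d <= micro_max_s for d in pause_lens) / total,
        "macropause_rate": sum(d > micro_max_s for d in pause_lens) / total,
    }
```

In a study like this, one such feature vector would be computed per authentic and per cloned recording before fitting the classifiers.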
RESULTS: Cloned audio exhibited significantly increased time between pauses (P<.001), decreased variation in speech segment length (P=.003), increased overall proportion of time speaking (P=.04), and decreased rates of micro- and macropauses in speech (both P=.01). Five machine learning models were implemented using these features, with the AdaBoost model demonstrating the highest performance, achieving a 5-fold cross-validation balanced accuracy of 0.81 (SD 0.05). Other models included support vector machine (balanced accuracy 0.79, SD 0.03), random forest (balanced accuracy 0.78, SD 0.04), logistic regression (balanced accuracy 0.76, SD 0.10), and decision tree (balanced accuracy 0.72, SD 0.06). When evaluated on unseen data, the optimal AdaBoost model achieved an overall test accuracy of 0.79.
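The evaluation protocol reported above (AdaBoost scored by 5-fold cross-validated balanced accuracy) can be reproduced in outline with scikit-learn. The feature matrix here is synthetic stand-in data, not the study's data set; only the model choice and scoring metric follow the abstract.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data: one row per audio sample, five
# pause-profile features; label 1 = cloned, 0 = authentic.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)
X[y == 1, 0] += 1.0  # cloned samples: longer time between pauses

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
print(f"5-fold balanced accuracy: {scores.mean():.2f} (SD {scores.std():.2f})")
```

Balanced accuracy (the mean of per-class recalls) is the appropriate metric here because each participant contributes up to three times as many cloned samples as authentic ones.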
CONCLUSIONS: The incorporation of perceptual, biological features into machine learning models demonstrates promising results in distinguishing between authentic human voices and cloned audio.