Keywords: amino acid substitution models; maximum likelihood estimation methods; simulated amino acid data; time-nonreversible models; time-reversible models

MeSH: Algorithms; Amino Acid Substitution; Phylogeny; Computer Simulation; Genome; Models, Genetic

Source: DOI:10.1093/jeb/voad017

Abstract:
Estimating the parameters of amino acid substitution models is a crucial task in bioinformatics. The maximum likelihood (ML) approach has been proposed to estimate amino acid substitution models from large datasets. The quality of newly estimated models is normally assessed by comparing them with existing models in building ML trees. Two important questions remain: how well the estimated models correlate with the true models, and how large the training datasets must be to estimate reliable models. In this article, we performed a simulation study to answer these two questions. We simulated genome datasets with different numbers of genes/alignments based on predefined models (called true models) and predefined trees (called true trees). The simulated datasets were used to estimate amino acid substitution models using the ML estimation methods. Our experiments showed that models estimated by the ML methods from simulated datasets with more than 100 genes have high correlations with the true models. The estimated models performed well in building ML trees in comparison with the true models. The results suggest that amino acid substitution models estimated by the ML methods from large genome datasets are a reliable tool for analyzing amino acid sequences.
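The abstract does not say which statistic was used to measure the correlation between estimated and true models. As a minimal sketch, assuming a time-reversible model (whose 20 x 20 exchangeability matrix is symmetric, giving 190 free off-diagonal entries) and assuming Pearson correlation as the measure, the comparison could look like the following. All function names are hypothetical, and the randomly generated matrices are placeholders for illustration, not data from the study.

```python
import math
import random

N_STATES = 20  # number of amino acids


def upper_triangle(matrix):
    """Flatten the strictly upper-triangular entries of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def model_correlation(true_matrix, estimated_matrix):
    """Correlate the exchangeability entries of two symmetric matrices.

    Assumption: Pearson correlation over the upper triangle; the source
    abstract does not specify the metric.
    """
    return pearson(upper_triangle(true_matrix), upper_triangle(estimated_matrix))


def random_symmetric_matrix(rng, n=N_STATES):
    """Placeholder 'true' model: random positive symmetric exchangeabilities."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = rng.uniform(0.01, 10.0)
    return m


def perturb(matrix, rng, noise=0.05):
    """Placeholder 'estimated' model: the true matrix with relative noise."""
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            out[i][j] = out[j][i] = matrix[i][j] * (1.0 + rng.uniform(-noise, noise))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    true_q = random_symmetric_matrix(rng)
    est_q = perturb(true_q, rng)
    print(f"correlation: {model_correlation(true_q, est_q):.4f}")
```

A time-nonreversible model has a non-symmetric matrix, so the same idea would need to compare all 380 off-diagonal entries rather than only the upper triangle.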