time-reversible models

  • Article type: Journal Article
    Single-matrix amino acid (AA) substitution models are widely used in phylogenetic analyses; however, they cannot properly model the heterogeneity of AA substitution rates among sites. Multi-matrix mixture models can handle this site rate heterogeneity and outperform single-matrix models. Estimating multi-matrix mixture models is a complex process, and no computer program is available for this task. In this study, we implemented a computer program called QMix, based on the algorithm of LG4X and LG4M with several enhancements, to automatically estimate multi-matrix mixture models from large datasets. QMix employs the QMaker algorithm instead of the XRATE algorithm to estimate model parameters accurately and rapidly. It can estimate mixture models with different numbers of matrices and supports multi-threaded computing to efficiently estimate models from thousands of genes. We re-estimated the mixture models LG4X and LG4M from 1471 HSSP alignments. The re-estimated models (HP4X and HP4M) are slightly better than LG4X and LG4M in building maximum likelihood trees from HSSP and TreeBASE datasets. The QMix program required about 10 hours on a computer with 18 cores to estimate a mixture model with four matrices from 200 HSSP alignments. It is easy to use and freely available to researchers.
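
The way a multi-matrix mixture model combines its components at a single alignment site can be sketched as follows. This is a minimal illustration, not the QMix implementation: the function names are hypothetical, and the per-component site likelihoods are assumed to have been computed elsewhere (for example by a tree likelihood calculation under each matrix).

```python
import math


def site_log_likelihood(component_likelihoods, weights):
    """Return log(sum_k w_k * L_k) for one site.

    component_likelihoods[k] is the likelihood of the site under
    matrix k (assumed to be precomputed); weights[k] is the mixture
    weight of matrix k.
    """
    if len(component_likelihoods) != len(weights):
        raise ValueError("one weight is needed per component")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    mixed = sum(w * l for w, l in zip(weights, component_likelihoods))
    return math.log(mixed)


def posterior_weights(component_likelihoods, weights):
    """Posterior probability that the site belongs to each component."""
    joint = [w * l for w, l in zip(weights, component_likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


def alignment_log_likelihood(per_site_likelihoods, weights):
    """Sum the site log-likelihoods, treating sites as independent."""
    return sum(site_log_likelihood(site, weights)
               for site in per_site_likelihoods)
```

Because each site is a weighted sum over all matrices, sites that evolve under different substitution patterns can each be explained mostly by the component that fits them best, which is what a single matrix cannot do.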

  • Article type: Journal Article
    Estimating the parameters of amino acid substitution models is a crucial task in bioinformatics. The maximum likelihood (ML) approach has been proposed to estimate amino acid substitution models from large datasets. The quality of newly estimated models is normally assessed by comparing them with existing models in building ML trees. Two important questions remain: how well the estimated models correlate with the true models, and how large the training datasets must be to estimate reliable models. In this article, we performed a simulation study to answer these two questions. We simulated genome datasets with different numbers of genes/alignments based on predefined models (called true models) and predefined trees (called true trees). The simulated datasets were used to estimate amino acid substitution models using the ML estimation methods. Our experiments showed that models estimated by the ML methods from simulated datasets with more than 100 genes have high correlations with the true models. The estimated models performed well in building ML trees in comparison with the true models. The results suggest that amino acid substitution models estimated by the ML methods from large genome datasets are a reliable tool for analyzing amino acid sequences.
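
One way to quantify how closely an estimated model matches the true model is sketched below. The abstract does not say which correlation measure was used; this sketch assumes the Pearson correlation over the off-diagonal entries of two symmetric exchangeability matrices, and the function names are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def model_correlation(true_matrix, estimated_matrix):
    """Correlate the exchangeabilities of a true and an estimated model."""
    return pearson(upper_triangle(true_matrix),
                   upper_triangle(estimated_matrix))
```

For a 20-state amino acid model the flattened vectors would each hold 190 values. Pearson correlation is insensitive to overall scaling, so two matrices that differ only by a normalization constant are treated as identical.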

  • Article type: Journal Article
    Amino acid substitution models represent the substitution rates among amino acids during the evolution of protein sequences. These models are a prerequisite for maximum likelihood or Bayesian methods that analyse the phylogenetic relationships among species based on their protein sequences. Estimating amino acid substitution models requires large protein datasets and intensive computation. In this paper, we present the estimation of both a time-reversible model (Q.met) and a time non-reversible model (NQ.met) for multicellular animals (Metazoa). Analyses showed that the Q.met and NQ.met models were significantly better than existing models in analysing metazoan protein sequences. Moreover, the time non-reversible model NQ.met enables us to reconstruct the rooted phylogenetic tree for Metazoa. We recommend that researchers employ the Q.met and NQ.met models when analysing metazoan protein sequences.
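
The difference between a time-reversible and a time non-reversible rate matrix can be illustrated with the detailed-balance condition, pi_i * q_ij = pi_j * q_ji. The sketch below uses the standard construction q_ij = r_ij * pi_j from symmetric exchangeabilities r_ij and stationary frequencies pi. It uses a toy three-state alphabet rather than 20 amino acids, does not reproduce Q.met or NQ.met, and its function names are hypothetical.

```python
def build_reversible_q(exchangeabilities, frequencies):
    """Build Q with q_ij = r_ij * pi_j (i != j) and rows summing to 0.

    exchangeabilities must be a symmetric matrix; frequencies are the
    stationary state frequencies.
    """
    n = len(frequencies)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchangeabilities[i][j] * frequencies[j]
        # The diagonal is still 0.0 here, so this is minus the
        # sum of the off-diagonal rates in row i.
        q[i][i] = -sum(q[i])
    return q


def is_time_reversible(q, frequencies, tol=1e-12):
    """Check detailed balance: pi_i * q_ij == pi_j * q_ji for all i < j."""
    n = len(frequencies)
    return all(
        abs(frequencies[i] * q[i][j] - frequencies[j] * q[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )


# Toy reversible model built from symmetric exchangeabilities.
toy_r = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
toy_pi = [0.2, 0.3, 0.5]
toy_q = build_reversible_q(toy_r, toy_pi)

# Toy non-reversible model: states only move in the cycle 0 -> 1 -> 2 -> 0.
# Its stationary distribution is uniform, yet detailed balance fails.
cyclic_q = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0], [1.0, 0.0, -1.0]]
uniform_pi = [1 / 3, 1 / 3, 1 / 3]
```

A reversible 20-state model has 190 free exchangeabilities, whereas a general non-reversible matrix has 380 off-diagonal rates. Dropping the symmetry constraint makes the likelihood depend on the root position, which is what allows a non-reversible model to infer a rooted tree.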