Multiple sequence alignment

  • Article type: Journal Article
    The enzyme 4-hydroxyphenylpyruvate dioxygenase (4-HPPD) is involved in the catabolism of the amino acid tyrosine in organisms such as bacteria, plants, and animals. It catalyzes the conversion of 4-hydroxyphenylpyruvate to homogentisate in the presence of molecular oxygen and Fe(II) as a cofactor. This enzyme represents a key step in the biosynthesis of important compounds, and its deficiency leads to severe, rare autosomal recessive disorders, such as tyrosinemia type III and hawkinsinuria, for which no cure is currently available. The 4-HPPD C-terminal tail plays a crucial role in the enzyme catalysis/gating mechanism, ensuring the integrity of the active site for catalysis through fine regulation of the C-terminal tail conformation. However, despite growing interest in the 4-HPPD catalytic mechanism and structure, the gating mechanism remains unclear. Furthermore, the absence of a full-length 3D structure makes a bioinformatic approach the only feasible way to define the enzyme structure/molecular mechanism. Here, wild-type 4-HPPD and its mutants were dissected in depth by applying a comprehensive bioinformatics/evolution study, and we showed for the first time the entire molecular mechanism and regulation of the enzyme gating process, proposing the full-length 3D structure of human 4-HPPD and two novel key residues involved in the 4-HPPD C-terminal tail conformational change.

  • Article type: Journal Article
    Enzymes play a crucial role in various industrial production and pharmaceutical developments, serving as catalysts for numerous biochemical reactions. Determining the optimal catalytic temperature (Topt) of enzymes is crucial for optimizing reaction conditions, enhancing catalytic efficiency, and accelerating industrial processes. However, due to the limited availability of experimentally determined Topt data and the insufficient accuracy of existing computational methods in predicting Topt, there is an urgent need for a computational approach to predict the Topt values of enzymes accurately. In this study, using phosphatase (EC 3.1.3.X) as an example, we constructed a machine learning model utilizing amino acid frequency and protein molecular weight information as features and employing the K-nearest neighbors regression algorithm to predict the Topt of enzymes. Usually, when conducting engineering for enzyme thermostability, researchers tend not to modify conserved amino acids. Therefore, we utilized this machine learning model to predict the Topt of phosphatase sequences after removing conserved amino acids. We found that the predictive model's mean coefficient of determination (R²) value increased from 0.599 to 0.755 compared to the model based on the complete sequences. Subsequently, experimental validation on 10 phosphatase enzymes with undetermined optimal catalytic temperatures showed that, for most of them, the values predicted from the sequence without conserved amino acids are closer to the experimental optimal catalytic temperature values. This study lays the foundation for the rapid selection of enzymes suitable for industrial conditions.
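The feature construction and K-nearest neighbors regression described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `featurize` and `knn_predict` are hypothetical, the residue masses are approximate average values, the features are left unscaled, and the step of removing conserved amino acids is not shown.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Approximate average residue masses in daltons (illustrative values).
RESIDUE_MASS = {
    "A": 71.08, "C": 103.14, "D": 115.09, "E": 129.12, "F": 147.18,
    "G": 57.05, "H": 137.14, "I": 113.16, "K": 128.17, "L": 113.16,
    "M": 131.19, "N": 114.10, "P": 97.12, "Q": 128.13, "R": 156.19,
    "S": 87.08, "T": 101.10, "V": 99.13, "W": 186.21, "Y": 163.18,
}
WATER_MASS = 18.02  # one water per chain (free N- and C-termini)


def featurize(sequence):
    """Return 20 amino acid frequencies followed by the mass in kDa."""
    residues = [aa for aa in sequence.upper() if aa in RESIDUE_MASS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    freqs = [counts[aa] / len(residues) for aa in AMINO_ACIDS]
    mass_kda = (sum(RESIDUE_MASS[aa] for aa in residues) + WATER_MASS) / 1000.0
    return freqs + [mass_kda]


def knn_predict(training, query, k=3):
    """Average the targets of the k training points nearest to `query`.

    `training` is a list of (feature_vector, target) pairs; distance is
    Euclidean.
    """
    if k < 1 or k > len(training):
        raise ValueError("k must be between 1 and the training set size")
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    return sum(target for _, target in nearest) / k
```

In practice the features would normally be standardized before computing distances, since the molecular weight is on a different scale from the frequencies.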

  • Article type: Journal Article
    Despite having important biological implications, insertion and deletion (indel) events are often disregarded or mishandled during phylogenetic inference. In multiple sequence alignment, indels are represented as gaps and are estimated without considering the distinct evolutionary history of insertions and deletions. Consequently, indels are usually excluded from subsequent inference steps, such as ancestral sequence reconstruction and phylogenetic tree search. Here, we introduce indel-aware parsimony (indelMaP), a novel way to treat gaps under the parsimony criterion by considering insertions and deletions as separate evolutionary events and accounting for long indels. By identifying the precise location of an evolutionary event on the tree, we can separate overlapping indel events and use affine gap penalties for long indel modeling. Our indel-aware approach harnesses the phylogenetic signal from indels, incorporating them into all inference stages. Validation and comparison to state-of-the-art inference tools on simulated data show that indelMaP is most suitable for densely sampled datasets with closely to moderately related sequences, where it can reach alignment quality comparable to probabilistic methods and accurately infer ancestral sequences, including indel patterns. Due to its remarkable speed, our method is well suited for epidemiological datasets, eliminating the need for downsampling and enabling the exploitation of the additional information provided by dense taxonomic sampling. Moreover, indelMaP offers new insights into the indel patterns of biologically significant sequences and advances our understanding of genetic variability by considering gaps as crucial evolutionary signals rather than mere artefacts.
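The affine gap penalty mentioned in this abstract can be sketched for a single aligned row as follows. This only illustrates the scoring idea (one opening cost per gap run plus a smaller cost per additional gap character); it is not indelMaP's tree-aware algorithm, and the function name and default penalty values are assumptions chosen for the example.

```python
import re


def affine_gap_cost(aligned_row, gap_open=2.0, gap_extend=0.5):
    """Sum affine penalties over every maximal run of '-' in one row.

    A run of length L costs gap_open + gap_extend * (L - 1), so one long
    indel is cheaper than several short ones of the same total length.
    """
    total = 0.0
    for run in re.finditer(r"-+", aligned_row):
        length = run.end() - run.start()
        total += gap_open + gap_extend * (length - 1)
    return total


# One 4-column gap versus four separate 1-column gaps.
print(affine_gap_cost("AC----GT"))      # single long indel
print(affine_gap_cost("A-C-G-T-"))      # four independent indels
```

Under these assumed defaults the single long gap is scored lower than the four scattered gaps, which is the behavior that makes affine penalties suitable for modeling long indels.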

  • Article type: Journal Article
    The field of viral genomic studies has experienced an unprecedented increase in data volume. New strains of known viruses are constantly being added to the GenBank database, and so are completely new species with little or no resemblance to our databases of sequences. In addition, metagenomic techniques have the potential to further increase the number and rate of sequenced genomes. Moreover, it is important to consider that viruses have a set of unique features that often break molecular biology dogmas, e.g., the flow of information from RNA to DNA in retroviruses and the use of RNA molecules as genomes. As a result, extracting meaningful information from viral genomes remains a challenge, and standard methods for comparing unknown sequences against our databases of characterized sequences may need adaptations. Thus, several bioinformatic approaches and tools have been created to address the challenge of analyzing viral data. This chapter offers descriptions and protocols of some of the most important bioinformatic techniques for comparative analysis of viruses. The authors also provide comments and discussion on how viruses' unique features can affect standard analyses and how to overcome some of the major sources of problems. Protocols and topics emphasize online tools (which are more accessible to users) and give the real experience of what most bioinformaticians do in day-to-day work with command-line pipelines. The topics discussed include (1) clustering related genomes, (2) whole genome multiple sequence alignments for small RNA viruses, (3) protein alignment for marker genes and species affiliation, (4) variant calling and annotation, and (5) virome analyses and pathogen identification.

  • Article type: Journal Article
    Effective homology search for non-coding RNAs is frequently not possible via sequence similarity alone. Current methods leverage evolutionary information like structure conservation or covariance scores to identify homologs in organisms that are phylogenetically more distant. In this chapter, we introduce the theoretical background of evolutionary structure conservation and covariance scores, and we show hands-on how current methods in the field are applied to example datasets.
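One simple way to see what a covariation signal looks like is to compute the mutual information between two alignment columns. The sketch below uses that plain information-theoretic measure as a stand-in; it is not the covariance score used by any particular tool covered in the chapter, and the function name and toy alignment are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j.

    `alignment` is a list of equal-length aligned sequences. Rows with a
    gap ('-') in either column are ignored.
    """
    pairs = [(s[i], s[j]) for s in alignment if s[i] != "-" and s[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    pair_counts = Counter(pairs)
    left_counts = Counter(a for a, _ in pairs)
    right_counts = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = left_counts[a] / n
        p_b = right_counts[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 2 change together (G-C, C-G, A-U, U-A),
# as expected for a conserved base pair; column 1 is invariant.
toy = ["GAC", "CAG", "AAU", "UAA"]
print(mutual_information(toy, 0, 2))
print(mutual_information(toy, 0, 1))
```

Compensatory changes that preserve base pairing give high mutual information even when neither column is conserved on its own, which is why covariation helps where sequence similarity alone fails.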

  • Article type: Journal Article
    Generating accurate alignments of non-coding RNA sequences is indispensable in the quest for understanding RNA function. Nevertheless, aligning RNAs remains a challenging computational task. In the twilight zone of RNA sequences with low sequence similarity, sequence homologies and compatible, favorable (a priori unknown) structures can be inferred only in dependence on each other. Thus, simultaneous alignment and folding (SA&F) remains the gold standard of comparative RNA analysis, even if this method is computationally highly demanding. This text introduces the recent release 2.0 of the software package LocARNA, focusing on its practical application. The package enables versatile, fast, and accurate analysis of multiple RNAs. For this purpose, it implements SA&F algorithms in a specific, lightweight flavor that makes them routinely applicable at large scale. Its high performance is achieved by combining ensemble-based sparsification of the structure space and banding strategies. Probabilistic banding strongly improves the performance of LocARNA 2.0 even over previous releases, while simplifying its effective use. Enabling flexible application to various use cases, LocARNA provides tools to globally and locally compare, cluster, and multiply align RNAs based on optimization and probabilistic variants of SA&F, which optionally integrate prior knowledge, expressible by anchor and structure constraints.
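The general idea behind banding can be shown on a much simpler problem than SA&F. The sketch below computes an edit distance while filling only the dynamic programming cells within a fixed distance of the main diagonal. It is an illustration under stated assumptions, not LocARNA's algorithm: LocARNA 2.0 uses probabilistic banding on structure-aware alignment, whereas this example uses a fixed-width band, unit costs, and a hypothetical function name.

```python
def banded_edit_distance(a, b, band):
    """Edit distance restricted to cells (i, j) with |i - j| <= band.

    Returns None when the length difference exceeds the band, because no
    alignment path can then stay inside it. Otherwise the result is an
    upper bound on the true edit distance, and it is exact whenever an
    optimal alignment path lies within the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    # Row 0: aligning the empty prefix of `a` against b[:j] costs j.
    prev = {j: j for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i  # delete the first i characters of `a`
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]


print(banded_edit_distance("GGGAAACCC", "GGGAACCC", band=1))
```

Only about `(2 * band + 1) * len(a)` cells are filled instead of the full `len(a) * len(b)` matrix, which is where the speed-up from banding comes from.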

  • Article type: Journal Article
    DNA Subway makes bioinformatic analysis of DNA barcodes classroom friendly, eliminating the need for software installations or command line tools. Subway bundles research-grade bioinformatics software into workflows with an easy-to-use interface. This chapter covers DNA Subway's DNA barcoding analysis workflow (Blue Line) starting with one or more Sanger sequence reads. During analysis, users can view trace files and sequence quality, pair and align forward and reverse reads, create and trim consensus sequences, perform BLAST searches, select reference data, align multiple sequences, and compute phylogenetic trees. High-quality sequences with the required metadata can also be submitted as barcode sequences to NCBI GenBank.

  • Article type: Journal Article
    CONCLUSIONS: SIMSApiper is a Nextflow pipeline that creates reliable, structure-informed MSAs of thousands of protein sequences faster than standard structure-based alignment methods. Structural information can be provided by the user or collected by the pipeline from online resources. Parallelization with sequence identity-based subsets can be activated to significantly speed up the alignment process. Finally, the number of gaps in the final alignment can be reduced by leveraging the position of conserved secondary structure elements.
    METHODS: The pipeline is implemented using Nextflow, Python3, and Bash. It is publicly available on github.com/Bio2Byte/simsapiper.
    BACKGROUND: All data are available on GitHub.
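Splitting sequences into identity-based subsets, so that each subset can be aligned in parallel, can be sketched with a greedy grouping. This is a simplified illustration and not how SIMSApiper forms its subsets: the function names are hypothetical, the 0.8 threshold is arbitrary, and `difflib.SequenceMatcher.ratio()` is used only as a crude stand-in for a real pairwise sequence identity.

```python
from difflib import SequenceMatcher


def similarity(seq_a, seq_b):
    """Crude similarity in [0, 1]; a proxy for pairwise sequence identity."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def greedy_subsets(sequences, threshold=0.8):
    """Group sequence IDs by similarity to each subset's first member.

    `sequences` maps IDs to sequences. Each sequence joins the first
    subset whose representative (first member) is at least `threshold`
    similar; otherwise it starts a new subset.
    """
    subsets = []  # list of (representative_sequence, [ids])
    for seq_id, seq in sequences.items():
        for representative, members in subsets:
            if similarity(representative, seq) >= threshold:
                members.append(seq_id)
                break
        else:
            subsets.append((seq, [seq_id]))
    return [members for _, members in subsets]


example = {
    "a": "ACDEFGHIK",
    "b": "ACDEFGHIR",
    "c": "WWWWYYYY",
}
print(greedy_subsets(example))
```

Each resulting subset could then be aligned independently before the sub-alignments are merged, which is the general motivation for this kind of partitioning.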

  • Article type: Journal Article
    In the rapidly evolving field of computational biology, accurate prediction of protein secondary structures is crucial for understanding protein functions, facilitating drug discovery, and advancing disease diagnostics. In this paper, we propose MFTrans, a deep learning-based multi-feature fusion network aimed at enhancing the precision and efficiency of Protein Secondary Structure Prediction (PSSP). This model employs a Multiple Sequence Alignment (MSA) Transformer in combination with a multi-view deep learning architecture to effectively capture both global and local features of protein sequences. MFTrans integrates diverse features generated from protein sequences, including MSA, sequence information, evolutionary information, and hidden state information, using a multi-feature fusion strategy. The MSA Transformer is utilized to interleave row and column attention across the input MSA, while a Transformer encoder and decoder are introduced to enhance the extracted high-level features. A hybrid network architecture, combining a convolutional neural network with a bidirectional Gated Recurrent Unit (BiGRU) network, is used to further extract high-level features after feature fusion. In independent tests, our experimental results show that MFTrans has superior generalization ability, outperforming other state-of-the-art PSSP models by 3% on average on public benchmarks including CASP12, CASP13, CASP14, TEST2016, TEST2018, and CB513. Case studies further highlight its advanced performance in predicting mutation sites. MFTrans contributes significantly to the protein science field, opening new avenues for drug discovery, disease diagnosis, and protein.

  • Article type: Journal Article
    Myosins, a superfamily of motor proteins, obtain the energy they require for movement from ATP hydrolysis and perform various functions by binding to actin filaments. Extensive studies have clarified the diverse functions performed by the different isoforms of myosin. However, the unavailability of resolved structures has made it difficult to understand the way in which their mechanochemical cycle and structural diversity give rise to distinct functional properties. With this study, we seek to further our understanding of the structural organization of the myosin 7A motor domain by modeling the tertiary structure of myosin 7A based on its primary sequence. Multiple sequence alignment and a comparison of the models of different myosin isoforms and myosin 7A not only enabled us to identify highly conserved nucleotide binding sites but also to predict actin binding sites. In addition, the actomyosin-7A complex was predicted from the protein-protein interaction model, from which the core interface sites of actin and the myosin 7A motor domain were defined. Finally, sequence alignment and the comparison of models were used to suggest the possibility of a pliant region existing between the converter domain and lever arm of myosin 7A. The results of this study provide insights into the structure of myosin 7A that could serve as a framework for higher resolution studies in the future.