Multiple sequence alignment

  • Article type: Journal Article
    The enzyme 4-hydroxyphenylpyruvate dioxygenase (4-HPPD) is involved in the catabolism of the amino acid tyrosine in organisms such as bacteria, plants, and animals. It catalyzes the conversion of 4-hydroxyphenylpyruvate to homogentisate in the presence of molecular oxygen, with Fe(II) as a cofactor. This enzyme represents a key step in the biosynthesis of important compounds, and its deficiency leads to severe, rare autosomal recessive disorders, such as tyrosinemia type III and hawkinsinuria, for which no cure is currently available. The 4-HPPD C-terminal tail plays a crucial role in the enzyme catalysis/gating mechanism, ensuring the integrity of the active site for catalysis through fine regulation of the C-terminal tail conformation. However, despite growing interest in the 4-HPPD catalytic mechanism and structure, the gating mechanism remains unclear. Furthermore, the absence of a complete 3D structure makes a bioinformatic approach the only feasible way to define the enzyme structure and molecular mechanism. Here, wild-type 4-HPPD and its mutants were dissected in depth through a comprehensive bioinformatics/evolution study. We show for the first time the entire molecular mechanism and regulation of the enzyme gating process, and propose the full-length 3D structure of human 4-HPPD and two novel key residues involved in the 4-HPPD C-terminal tail conformational change.






  • Article type: Journal Article
    Enzymes play a crucial role in industrial production and pharmaceutical development, serving as catalysts for numerous biochemical reactions. Determining the optimal catalytic temperature (Topt) of enzymes is crucial for optimizing reaction conditions, enhancing catalytic efficiency, and accelerating industrial processes. However, because experimentally determined Topt data are limited and existing computational methods predict Topt with insufficient accuracy, there is an urgent need for a computational approach that predicts the Topt values of enzymes accurately. In this study, using phosphatase (EC 3.1.3.X) as an example, we constructed a machine learning model that uses amino acid frequency and protein molecular weight as features and employs the K-nearest neighbors regression algorithm to predict the Topt of enzymes. When engineering enzymes for thermostability, researchers usually avoid modifying conserved amino acids. Therefore, we used this machine learning model to predict the Topt of phosphatase sequences after removing conserved amino acids. We found that the predictive model's mean coefficient of determination (R2) increased from 0.599 to 0.755 compared to the model based on the complete sequences. Subsequently, experimental validation on 10 phosphatases with previously undetermined optimal catalytic temperatures showed that, for most of them, the values predicted from the sequences without conserved amino acids were closer to the experimentally determined optimal catalytic temperatures. This study lays the foundation for the rapid selection of enzymes suitable for industrial conditions.
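
    The feature construction and regression described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the value of k, the scaling of the molecular weight feature, and the toy training pairs are all assumptions made for illustration. The residue masses are standard average values, rounded to two decimals.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Average residue masses in Da (amino acid minus one water), rounded.
RESIDUE_MASS = {
    "A": 71.08, "R": 156.19, "N": 114.10, "D": 115.09, "C": 103.14,
    "E": 129.12, "Q": 128.13, "G": 57.05, "H": 137.14, "I": 113.16,
    "L": 113.16, "K": 128.17, "M": 131.19, "F": 147.18, "P": 97.12,
    "S": 87.08, "T": 101.11, "W": 186.21, "Y": 163.18, "V": 99.13,
}
WATER_MASS = 18.02  # one water is added back for the free chain termini


def features(seq):
    """Return 20 amino acid frequencies plus a scaled molecular weight."""
    seq = seq.upper()
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    mol_weight = sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS
    # Assumption: express the weight in units of 100 kDa so that it does
    # not dominate the Euclidean distance against the frequency features.
    return freqs + [mol_weight / 100_000.0]


def knn_predict_topt(train, query_seq, k=3):
    """Predict Topt as the mean Topt of the k nearest training sequences.

    `train` is a list of (sequence, topt) pairs; distance is Euclidean
    in the feature space built by `features`.
    """
    query = features(query_seq)
    ranked = sorted((math.dist(features(s), query), t) for s, t in train)
    nearest = ranked[:k]
    return sum(t for _, t in nearest) / len(nearest)


# Hypothetical toy data: made-up sequences and temperatures.
toy_train = [("MKKLAV", 37.0), ("MKRLAV", 40.0), ("GGSGGS", 65.0)]
predicted = knn_predict_topt(toy_train, "MKKLAI", k=2)
```

    To mimic the "conserved residues removed" variant, one would delete the conserved alignment positions from each sequence before calling `features`; how those positions are identified is not specified in the abstract.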






  • Article type: Journal Article
    Despite having important biological implications, insertion and deletion (indel) events are often disregarded or mishandled during phylogenetic inference. In multiple sequence alignment, indels are represented as gaps and are estimated without considering the distinct evolutionary history of insertions and deletions. Consequently, indels are usually excluded from subsequent inference steps, such as ancestral sequence reconstruction and phylogenetic tree search. Here, we introduce indel-aware parsimony (indelMaP), a novel way to treat gaps under the parsimony criterion by considering insertions and deletions as separate evolutionary events and accounting for long indels. By identifying the precise location of an evolutionary event on the tree, we can separate overlapping indel events and use affine gap penalties for long indel modeling. Our indel-aware approach harnesses the phylogenetic signal from indels, incorporating them into all inference stages. Validation and comparison to state-of-the-art inference tools on simulated data show that indelMaP is most suitable for densely sampled datasets with closely to moderately related sequences, where it can reach alignment quality comparable to probabilistic methods and accurately infer ancestral sequences, including indel patterns. Due to its remarkable speed, our method is well suited for epidemiological datasets, eliminating the need for downsampling and enabling the exploitation of the additional information provided by dense taxonomic sampling. Moreover, indelMaP offers new insights into the indel patterns of biologically significant sequences and advances our understanding of genetic variability by considering gaps as crucial evolutionary signals rather than mere artefacts.
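
    The affine gap penalty mentioned in this abstract can be illustrated with a short sketch. It covers only the scoring idea (one opening cost per gap run plus a smaller cost per additional gap position) and is not the indelMaP algorithm, which places insertions and deletions on a tree. The function name and the penalty values are assumptions chosen for illustration.

```python
import re


def affine_gap_cost(aligned_row, gap_open=2.0, gap_extend=0.5):
    """Score the gaps in one aligned row under an affine penalty.

    Each maximal run of '-' of length L costs
    gap_open + gap_extend * (L - 1), so a single long indel is cheaper
    than several short ones covering the same number of positions.
    """
    total = 0.0
    for run in re.finditer(r"-+", aligned_row):
        length = run.end() - run.start()
        total += gap_open + gap_extend * (length - 1)
    return total


# One run of three gaps versus three separate single-position gaps.
one_long_indel = affine_gap_cost("AC---GTA")
three_short_indels = affine_gap_cost("A-C-G-TA")
```

    With these example penalties, the single three-position gap costs less than the three one-position gaps, which is the behaviour that lets a long indel be modeled as one event.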






  • Article type: Journal Article
    The field of viral genomic studies has experienced an unprecedented increase in data volume. New strains of known viruses are constantly being added to the GenBank database, as are completely new species with little or no resemblance to the sequences in our databases. In addition, metagenomic techniques have the potential to further increase the number and rate of sequenced genomes. Moreover, viruses have a set of unique features that often break molecular biology dogmas, e.g., the flow of information from RNA to DNA in retroviruses and the use of RNA molecules as genomes. As a result, extracting meaningful information from viral genomes remains a challenge, and standard methods for comparing unknown sequences with our databases of characterized sequences may need adaptation. Thus, several bioinformatic approaches and tools have been created to address the challenge of analyzing viral data. This chapter offers descriptions and protocols of some of the most important bioinformatic techniques for comparative analysis of viruses. The authors also provide comments and discussion on how viruses' unique features can affect standard analyses and how to overcome some of the major sources of problems. Protocols and topics emphasize online tools (which are more accessible to users) and give the real experience of what most bioinformaticians do in day-to-day work with command-line pipelines. The topics discussed include (1) clustering related genomes, (2) whole genome multiple sequence alignments for small RNA viruses, (3) protein alignment for marker genes and species affiliation, (4) variant calling and annotation, and (5) virome analyses and pathogen identification.






  • Article type: Journal Article
    Effective homology search for non-coding RNAs is frequently not possible via sequence similarity alone. Current methods leverage evolutionary information such as structure conservation or covariance scores to identify homologs in phylogenetically more distant organisms. In this chapter, we introduce the theoretical background of evolutionary structure conservation and covariance scores, and we show hands-on how current methods in the field are applied to example datasets.
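
    As a rough illustration of what a covariation signal between two alignment columns looks like, the sketch below computes their mutual information. This is one commonly used measure and is an assumption here: the abstract does not state which covariance score the chapter uses, and the function name and example columns are made up for illustration.

```python
import math
from collections import Counter


def column_mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    `col_i` and `col_j` hold the residues of every sequence at two
    alignment positions, in the same sequence order. High values mean
    the two positions vary together, as compensating base-pair changes
    in a conserved RNA helix would.
    """
    n = len(col_i)
    single_i = Counter(col_i)
    single_j = Counter(col_j)
    pairs = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        p_a = single_i[a] / n
        p_b = single_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical columns: G-C pairs in two sequences, C-G pairs in two others.
covarying = column_mutual_information("GGCC", "CCGG")
# Fully conserved columns carry no covariation signal.
conserved = column_mutual_information("GGGG", "CCCC")
```

    Note that mutual information alone does not check whether the co-varying residues can actually form a base pair; scores used in practice typically add that constraint.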






  • Article type: Journal Article
    Generating accurate alignments of non-coding RNA sequences is indispensable in the quest for understanding RNA function. Nevertheless, aligning RNAs remains a challenging computational task. In the twilight zone of RNA sequences with low sequence similarity, sequence homologies and compatible, favorable (a priori unknown) structures can be inferred only in dependence on each other. Thus, simultaneous alignment and folding (SA&F) remains the gold standard of comparative RNA analysis, even though this method is computationally highly demanding. This text introduces the recent release 2.0 of the software package LocARNA, focusing on its practical application. The package enables versatile, fast and accurate analysis of multiple RNAs. For this purpose, it implements SA&F algorithms in a specific, lightweight flavor that makes them routinely applicable at large scale. Its high performance is achieved by combining ensemble-based sparsification of the structure space and banding strategies. Probabilistic banding strongly improves the performance of LocARNA 2.0 even over previous releases, while simplifying its effective use. Enabling flexible application to various use cases, LocARNA provides tools to globally and locally compare, cluster, and multiply align RNAs based on optimization and probabilistic variants of SA&F, which optionally integrate prior knowledge, expressible by anchor and structure constraints.






  • Article type: Journal Article
    DNA Subway makes bioinformatic analysis of DNA barcodes classroom friendly, eliminating the need for software installations or command line tools. Subway bundles research-grade bioinformatics software into workflows with an easy-to-use interface. This chapter covers DNA Subway's DNA barcoding analysis workflow (Blue Line) starting with one or more Sanger sequence reads. During analysis, users can view trace files and sequence quality, pair and align forward and reverse reads, create and trim consensus sequences, perform BLAST searches, select reference data, align multiple sequences, and compute phylogenetic trees. High-quality sequences with the required metadata can also be submitted as barcode sequences to NCBI GenBank.






  • Article type: Journal Article
    CONCLUSIONS: SIMSApiper is a Nextflow pipeline that creates reliable, structure-informed MSAs of thousands of protein sequences faster than standard structure-based alignment methods. Structural information can be provided by the user or collected by the pipeline from online resources. Parallelization with sequence identity-based subsets can be activated to significantly speed up the alignment process. Finally, the number of gaps in the final alignment can be reduced by leveraging the position of conserved secondary structure elements.
    METHODS: The pipeline is implemented using Nextflow, Python3, and Bash. It is publicly available on






  • Article type: Journal Article
    In the rapidly evolving field of computational biology, accurate prediction of protein secondary structures is crucial for understanding protein functions, facilitating drug discovery, and advancing disease diagnostics. In this paper, we propose MFTrans, a deep learning-based multi-feature fusion network aimed at enhancing the precision and efficiency of Protein Secondary Structure Prediction (PSSP). This model employs a Multiple Sequence Alignment (MSA) Transformer in combination with a multi-view deep learning architecture to effectively capture both global and local features of protein sequences. MFTrans integrates diverse features generated by protein sequences, including MSA, sequence information, evolutionary information, and hidden state information, using a multi-feature fusion strategy. The MSA Transformer is utilized to interleave row and column attention across the input MSA, while a Transformer encoder and decoder are introduced to enhance the extracted high-level features. A hybrid network architecture, combining a convolutional neural network with a bidirectional Gated Recurrent Unit (BiGRU) network, is used to further extract high-level features after feature fusion. In independent tests, our experimental results show that MFTrans has superior generalization ability, outperforming other state-of-the-art PSSP models by 3 % on average on public benchmarks including CASP12, CASP13, CASP14, TEST2016, TEST2018, and CB513. Case studies further highlight its advanced performance in predicting mutation sites. MFTrans contributes significantly to the protein science field, opening new avenues for drug discovery, disease diagnosis, and protein.






  • Article type: Journal Article
    Myosins, a superfamily of motor proteins, obtain the energy they require for movement from ATP hydrolysis and perform various functions by binding to actin filaments. Extensive studies have clarified the diverse functions performed by the different isoforms of myosin. However, the unavailability of resolved structures has made it difficult to understand how their mechanochemical cycle and structural diversity give rise to distinct functional properties. With this study, we seek to further our understanding of the structural organization of the myosin 7A motor domain by modeling the tertiary structure of myosin 7A based on its primary sequence. Multiple sequence alignment and a comparison of the models of different myosin isoforms and myosin 7A enabled us not only to identify highly conserved nucleotide binding sites but also to predict actin binding sites. In addition, the actomyosin-7A complex was predicted from the protein-protein interaction model, from which the core interface sites of actin and the myosin 7A motor domain were defined. Finally, sequence alignment and the comparison of models were used to suggest the possibility of a pliant region existing between the converter domain and lever arm of myosin 7A. The results of this study provide insights into the structure of myosin 7A that could serve as a framework for higher-resolution studies in the future.





