bioinformatics tool

  • Article type: Journal Article
    The rapid expansion of biological sequence databases, driven by high-throughput genomic and proteomic sequencing, has left a considerable number of identified protein sequences with unclear or incomplete functional annotations. Domains of unknown function (DUFs) are protein domains that lack functional annotations but are present in numerous proteins. To address the challenge of annotating DUFs, we have developed a computational method that efficiently identifies and annotates these enigmatic protein domains using the position-specific iterative basic local alignment search tool (PSI-BLAST) and data mining techniques. Our pipeline identifies putative functions of DUFs, thereby narrowing the gap between known sequences and known functions. The tool can also annotate user-supplied sequences. We executed our pipeline on 5111 unique DUF sequences obtained from Pfam, resulting in putative annotations for 2007 of these. These annotations were subsequently incorporated into a comprehensive database and interfaced with a web-based server named "AnnoDUF". AnnoDUF is freely accessible to both academic and industrial users via the World Wide Web. All scripts used in this study are available in a GitHub repository.






  • Article type: Journal Article
    Gene expression profiling technologies have revolutionized cell biology, enabling researchers to identify gene signatures linked to various biological attributes of melanomas, such as pigmentation status, differentiation state, proliferative versus invasive capacity, and disease progression. Although the discovery of gene signatures has significantly enhanced our understanding of melanocytic phenotypes, reconciling the numerous signatures reported across independent studies and different profiling platforms remains a challenge. Current methods for classifying melanocytic gene signatures depend on exact gene overlap and comparison with unstandardized baseline transcriptomes. In this study, we aimed to categorize published gene signatures into clusters based on their similar patterns of expression across clinical cutaneous melanoma specimens. We analyzed nearly 800 melanoma samples from six gene expression repositories and developed a classification framework for gene signatures that is resilient against biases in gene identification across profiling platforms and inconsistencies in baseline standards. Using 39 frequently cited published gene signatures, our analysis revealed seven principal classes of gene signatures that correlate with previously identified phenotypes: Differentiated, Mitotic/MYC, AXL, Amelanotic, Neuro, Hypometabolic, and Invasive. Each class is consistent with the phenotypes that the constituent gene signatures represent, and our classification method does not rely on overlapping genes between signatures. To facilitate broader application, we created WIMMS (What Is My Melanocytic Signature), a user-friendly web application. WIMMS allows users to categorize any gene signature, determining its relationship to predominantly cited signatures and its representation within the seven principal classes.






  • Article type: Journal Article
    CONCLUSIONS: Mfind is a tool to analyze the impact of microsatellite presence on DNA barcode specificity. We found a significant correlation between barcode entropy and microsatellite count in angiosperms. Genetic barcodes and microsatellites are among the identification methods used in taxonomy and biodiversity research. It is important to establish a relationship between microsatellite quantification and the genetic information in barcodes. To clarify the association between the genetic information in barcodes (expressed as Shannon's Measure of Information, SMI) and microsatellite count, a total of 330,809 DNA barcodes from the BOLD database (Barcode of Life Data System) were analyzed. A parallel sliding-window algorithm was developed to compute the Shannon entropy of the barcodes, and this was compared with the quantification of microsatellites such as (AT)n, (AC)n, and (AG)n. The microsatellite search method used an algorithm developed in the Java programming language, which systematically examined the genetic barcodes from an angiosperm database. For this purpose, a computational tool named Mfind was developed, and its search methodology is detailed. This comprehensive study provided a broad overview of microsatellites within barcodes, revealing an inverse correlation between the sum of microsatellite counts and barcode information. The use of the Mfind tool demonstrated that the presence of microsatellites affects barcode information when entropy is used as the metric. This effect might be attributed to the short length of DNA barcodes and the repetitive nature of microsatellites, which directly influences the entropy of the barcodes.
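    The two quantities compared in this abstract can be illustrated with a minimal Python sketch. It is not Mfind (which the abstract describes as a Java tool with a parallel algorithm); it is a serial illustration, and the function names, the window size, and the `min_repeats` threshold are assumptions made for the example.

```python
import math
import re
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the symbol frequencies in seq."""
    if not seq:
        return 0.0
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def windowed_entropy(seq: str, window: int, step: int = 1) -> list[float]:
    """Entropy of each window of length `window`, advancing by `step`."""
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


def count_microsatellites(seq: str, motif: str, min_repeats: int = 3) -> int:
    """Count non-overlapping runs where `motif` repeats at least `min_repeats` times.

    The threshold of 3 repeats is an assumption for illustration only.
    """
    pattern = re.compile(f"(?:{re.escape(motif)}){{{min_repeats},}}")
    return len(pattern.findall(seq))


if __name__ == "__main__":
    barcode = "GGATATATATCCACACACGG"  # toy sequence, not a real barcode
    print(windowed_entropy(barcode, window=8, step=4))
    print(count_microsatellites(barcode, "AT"), count_microsatellites(barcode, "AC"))
```

    A window that falls entirely inside a repeat such as (AT)n contains only two symbols in equal proportion, so its entropy is capped at 1 bit, compared with 2 bits for a window that uses all four bases evenly. This is the intuition behind the inverse relationship the abstract reports.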






  • Article type: Journal Article
    With the explosion of available genomic information, comparative genomics has become a central approach to understanding microbial ecology and evolution. We developed DiGAlign, a web server that provides versatile functionality for comparative genomics with an intuitive interface. It allows the user to produce a highly customizable visualization of a synteny map by simply uploading nucleotide sequences of interest, ranging from a specific region to the whole genome landscape of microorganisms and viruses. DiGAlign will serve a wide range of biological researchers, particularly experimental biologists, with multifaceted features that allow the rapid characterization of genomic sequences of interest and the generation of a publication-ready figure.






  • Article type: Journal Article
    A macrohaplotype combines multiple types of phased DNA variants, increasing forensic discrimination power. High-quality long sequencing reads, for example, PacBio HiFi reads, provide data to detect macrohaplotypes in multiploidy and DNA mixtures. However, bioinformatics tools for detecting macrohaplotypes are lacking. In this study, we developed a bioinformatics tool, MacroHapCaller, in which targeted loci (i.e., short tandem repeats [STRs], single nucleotide polymorphisms, and insertions and deletions) are genotyped and combined with novel algorithms to call macrohaplotypes from long reads. MacroHapCaller uses physical phasing (i.e., read-backed phasing) to identify macrohaplotypes, and thus it can detect multi-allelic macrohaplotypes for a given sample. MacroHapCaller was validated with data generated from our designed targeted PacBio HiFi sequencing pipeline, which sequenced ∼8-kb amplicon regions harboring 20 core forensic STR loci in human benchmark samples HG002 and HG003. MacroHapCaller was also validated on whole-genome long-read sequencing data. Robust and accurate genotyping and phased macrohaplotypes were obtained with MacroHapCaller compared with the known ground truth. MacroHapCaller achieved comparable or higher genotyping accuracy and faster speed than the existing tools HipSTR and DeepVar. MacroHapCaller enables efficient macrohaplotype analysis from high-throughput sequencing data and supports applications using discriminating macrohaplotypes.
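    The idea behind read-backed (physical) phasing can be sketched in a few lines of Python. This is not MacroHapCaller's algorithm, which the abstract does not detail. It assumes that per-read alleles at each targeted locus have already been genotyped, and the function name, input layout, and `min_support` threshold are hypothetical.

```python
from collections import Counter


def call_macrohaplotypes(
    reads: list[dict[str, str]],
    loci: list[str],
    min_support: int = 2,
) -> list[tuple[tuple[str, ...], int]]:
    """Group alleles observed on the same read into candidate macrohaplotypes.

    reads: one dict per long read, mapping locus name -> allele seen on that read.
    loci:  ordered locus names that together define the macrohaplotype.
    Only reads spanning every locus contribute; combinations seen on fewer
    than `min_support` reads are discarded as likely sequencing errors.
    """
    counts: Counter = Counter()
    for read in reads:
        if all(locus in read for locus in loci):
            counts[tuple(read[locus] for locus in loci)] += 1
    return [(hap, n) for hap, n in counts.most_common() if n >= min_support]


if __name__ == "__main__":
    # Toy data: an STR allele (repeat count) and a flanking SNP per read.
    reads = (
        [{"STR1": "12", "SNP1": "A"}] * 5
        + [{"STR1": "14", "SNP1": "G"}] * 4
        + [{"STR1": "12", "SNP1": "G"}]  # single discordant read
        + [{"STR1": "14"}]               # read that does not span SNP1
    )
    for haplotype, support in call_macrohaplotypes(reads, ["STR1", "SNP1"]):
        print(haplotype, support)
```

    Because each supported allele combination is reported separately rather than being forced into two haplotypes, this kind of counting can return more than two macrohaplotypes per sample, which is what makes read-backed phasing applicable to mixtures.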






  • Article type: Journal Article
    Detecting copy number variations (CNVs) and alterations (CNAs) in the BRCA1 and BRCA2 genes is essential for testing patients for targeted therapy applicability. However, the available bioinformatics tools were initially designed for identifying CNVs/CNAs in whole-genome or -exome (WES) NGS data or targeted NGS data without adaptation to the BRCA1/2 genes. Most of these tools were tested on sample cohorts of limited size, with their use restricted to specific library preparation kits or sequencing platforms. We developed BRACNAC, a new tool for detecting CNVs and CNAs in the BRCA1 and BRCA2 genes in NGS data of different origin. The underlying mechanism of this tool involves various coverage normalization steps complemented by CNV probability evaluation. We estimated the sensitivity and specificity of our tool to be 100% and 94%, respectively, with an area under the curve (AUC) of 94%. The estimation was performed using the NGS data obtained from 213 ovarian and prostate cancer samples tested with in-house and commercially available library preparation kits and additionally using multiplex ligation-dependent probe amplification (MLPA) (12 CNV-positive samples). Using freely available WES and targeted NGS data from other research groups, we demonstrated that BRACNAC could also be used for these two types of data, with an AUC of up to 99.9%. In addition, we determined the limitations of the tool in terms of the minimum number of samples per NGS run (≥20 samples) and the minimum expected percentage of CNV-negative samples (≥80%). We expect that our findings will improve the efficacy of BRCA1/2 diagnostics. BRACNAC is freely available at the GitHub server.
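    The abstract does not specify BRACNAC's normalization steps, so the following Python sketch shows only one common, generic form of coverage normalization for targeted NGS data: scaling by library size, then comparing each amplicon with the run-wide median. The function name and the input layout are assumptions, and every sample is assumed to have non-zero total coverage.

```python
import math
from statistics import median


def log2_coverage_ratios(run: dict[str, list[float]]) -> dict[str, list[float]]:
    """Return log2 coverage ratios per sample and amplicon.

    run: sample name -> read counts per amplicon (same amplicon order for all).
    Step 1: divide each count by the sample's total (library-size normalization).
    Step 2: divide by the median of that amplicon's fraction across the run,
            which serves as the copy-neutral reference.
    """
    fractions = {
        name: [count / sum(counts) for count in counts]
        for name, counts in run.items()
    }
    n_amplicons = len(next(iter(fractions.values())))
    reference = [
        median(sample[i] for sample in fractions.values())
        for i in range(n_amplicons)
    ]
    return {
        name: [
            math.log2(frac / ref) if frac > 0 and ref > 0 else float("-inf")
            for frac, ref in zip(sample, reference)
        ]
        for name, sample in fractions.items()
    }


if __name__ == "__main__":
    # Toy run: four copy-neutral samples and one with reduced coverage
    # on the second amplicon (values are illustrative, not study data).
    run = {f"s{i}": [100.0, 100.0, 100.0] for i in range(1, 5)}
    run["del"] = [100.0, 50.0, 100.0]
    for name, ratios in log2_coverage_ratios(run).items():
        print(name, [round(r, 2) for r in ratios])
```

    A median-based reference is only meaningful when the run contains enough samples and most of them are copy-neutral at a given amplicon. Under that assumption, requirements like the minimum run size and minimum share of CNV-negative samples reported in the abstract are the kind of limits one would expect.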






  • Article type: Journal Article
    Genome-wide association studies (GWAS) have been widely used to identify genetic variation associated with complex traits. Despite its success and popularity, the traditional GWAS approach comes with a variety of limitations. For this reason, newer methods for GWAS have been developed, including the use of pan-genomes instead of a reference genome and the use of markers beyond single-nucleotide polymorphisms, such as structural variations and k-mers. The k-mer-based GWAS approach has especially gained attention from researchers in recent years. However, these new methodologies can be complicated and challenging to implement. Here, we present kGWASflow, a modular, user-friendly, and scalable workflow to perform GWAS using k-mers. We adapted the existing kmersGWAS method into an easier and more accessible workflow using management tools like Snakemake and Conda, and eliminated the challenges caused by missing dependencies and version conflicts. kGWASflow increases the reproducibility of the kmersGWAS method by automating each step with Snakemake and using containerization tools like Docker. The workflow encompasses supplemental components such as quality control, read-trimming procedures, and the generation of summary statistics. kGWASflow also offers post-GWAS analysis options to identify the genomic location and context of trait-associated k-mers. kGWASflow can be applied to any organism and requires minimal programming skills. kGWASflow is freely available on GitHub and Bioconda.
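    For readers unfamiliar with k-mers as genetic markers, the Python sketch below shows how canonical k-mers can be counted and turned into a per-sample presence/absence table, the kind of input a k-mer-based association test starts from. It is not part of kGWASflow or kmersGWAS. The function names are hypothetical, and the toy value of k is far smaller than would be used in practice.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement,
    so that both strands map to the same key."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(seq: str, k: int) -> Counter:
    """Count canonical k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


def presence_absence(samples: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each canonical k-mer to the set of samples in which it occurs."""
    table: dict[str, set[str]] = {}
    for name, seq in samples.items():
        for kmer in count_kmers(seq, k):
            table.setdefault(kmer, set()).add(name)
    return table


if __name__ == "__main__":
    samples = {"s1": "ACGT", "s2": "CGCG"}  # toy sequences
    for kmer, present_in in sorted(presence_absence(samples, k=2).items()):
        print(kmer, sorted(present_in))
```

    A k-mer present in some samples and absent in others can then be tested for association with a trait without reference to a genome assembly, which is why this marker type can capture variation beyond single-nucleotide polymorphisms.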






  • Article type: Journal Article
    Bioinformatics has played a crucial role in the scientific effort against the pandemic of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Advances in novel algorithms, mega data technology, artificial intelligence, and deep learning have assisted the development of novel bioinformatics tools to analyze the daily increasing volume of SARS-CoV-2 data in recent years. These tools were applied in genomic analyses, evolutionary tracking, epidemiological analyses, protein structure interpretation, studies of virus-host interaction, and clinical performance. To promote in-silico analysis in the future, we conducted a review that summarizes the databases, web services, and software applied in SARS-CoV-2 research. These digital resources may also contribute to research on other coronaviruses and non-coronavirus viruses.






  • Article type: Journal Article
    BACKGROUND: The visual sequence logo has been a hot area in the development of bioinformatics tools. ggseqlogo, written in R, has been the most popular API since it was published. With the popularity of artificial intelligence and deep learning, Python is currently the most popular programming language, and bioinformaticians have begun to shift to it. Providing APIs in Python that are similar to those in R can reduce the cost of learning a new programming language. Compared with ggplot2 in R, drawing frameworks in Python are not as easy to use. The appearance of plotnine (a Python version of ggplot2) makes it possible to unify the programming methods of bioinformatics visualization tools between R and Python.
    RESULTS: Here, we introduce plotnineSeqSuite, a new plotnine-based Python package that provides a ggseqlogo-like API for programmatic drawing of sequence logos, sequence alignment diagrams, and sequence histograms. More precisely, it supports custom letters, color themes, and fonts. Moreover, the class for drawing layers is based on object-oriented design, so users can easily encapsulate and extend it.
    CONCLUSIONS: plotnineSeqSuite is the first ggplot2-style package to implement visualization of sequence-related graphs in Python. It enhances the uniformity of programmatic plotting between R and Python. Compared with existing tools, the categories of plots supported by plotnineSeqSuite are much more complete. The source code of plotnineSeqSuite can be obtained on GitHub and PyPI, and the documentation homepage is freely available on GitHub.
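    As background on what a sequence logo displays, the Python sketch below computes the per-column information content (in bits) and the resulting letter heights for a set of aligned DNA sequences. It deliberately does not call plotnineSeqSuite, whose API is not described in the abstract. The function name is hypothetical, the small-sample correction used by some logo tools is omitted, and sequences are assumed to contain only letters from the given alphabet.

```python
import math
from collections import Counter


def logo_letter_heights(
    sequences: list[str], alphabet: str = "ACGT"
) -> list[dict[str, float]]:
    """Letter heights (bits) for each column of an ungapped alignment.

    Column information = log2(len(alphabet)) - Shannon entropy of the column.
    Each letter's height = its frequency * column information.
    """
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    max_bits = math.log2(len(alphabet))
    heights = []
    for column in zip(*sequences):
        counts = Counter(column)
        n = len(column)
        freqs = {a: counts[a] / n for a in alphabet if counts[a]}
        entropy = -sum(p * math.log2(p) for p in freqs.values())
        information = max_bits - entropy
        heights.append({a: p * information for a, p in freqs.items()})
    return heights


if __name__ == "__main__":
    aligned = ["ACGT", "ACGA", "AGGT", "ACCT"]  # toy alignment
    for position, column in enumerate(logo_letter_heights(aligned), start=1):
        print(position, {a: round(h, 2) for a, h in column.items()})
```

    A fully conserved DNA column reaches the maximum of 2 bits, while a column split evenly between two bases carries 1 bit shared between the two letters. A plotting layer then only needs to stack the letters of each column at these heights.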






  • Article type: Journal Article
    Mucopolysaccharidosis VI (Maroteaux-Lamy syndrome) is a metabolic disorder caused by the loss of enzyme activity of N-acetylgalactosamine-4-sulphatase arising from mutations in the ARSB gene. Mutated ARSB leads to the accumulation of GAGs within the lysosome, resulting in severe growth deformities and lysosomal storage disease. The main focus of this study is to identify deleterious variants by applying bioinformatics tools to predict the conservation, pathogenicity, stability, and effect of ARSB variants. We examined 170 missense variants, of which G137V and G144R were predicted to be detrimental and to contribute to the progression of the disease. The native, G137V, and G144R structures were fixed as receptors and subjected to molecular docking with the small molecule Odiparcil to analyze the binding efficiency and the varied interactions of the receptors with the drug. The docking produced similar scores of -7.3 kcal/mol, indicating effective binding, with consistent interactions between the drug and residues CYS117, GLN118, THR182, and GLN517 in the native, G137V, and G144R structures. Molecular dynamics simulations were conducted to validate the stability and flexibility of the native and variant structures upon ligand binding. Overall, the study indicates that the drug has similar therapeutic potential towards the native and variant structures based on the high binding affinity, and the complexes show stability with an average RMS value of 0.2 nm. This can aid the future development of therapeutics for Maroteaux-Lamy syndrome.





