
  • Article type: Journal Article
    Assembled genome sequences are being generated at an exponential rate. Here we present FCS-GX, part of NCBI's Foreign Contamination Screen (FCS) tool suite, optimized to identify and remove contaminant sequences in new genomes. FCS-GX screens most genomes in 0.1-10 min. Testing FCS-GX on artificially fragmented genomes demonstrates high sensitivity and specificity for diverse contaminant species. We used FCS-GX to screen 1.6 million GenBank assemblies and identified 36.8 Gbp of contamination, comprising 0.16% of total bases, with half from 161 assemblies. We updated assemblies in NCBI RefSeq to reduce detected contamination to 0.01% of bases. FCS-GX is available at or .






  • Article type: Preprint
    Assembled genome sequences are being generated at an exponential rate. Here we present FCS-GX, part of NCBI's Foreign Contamination Screen (FCS) tool suite, optimized to identify and remove contaminant sequences in new genomes. FCS-GX screens most genomes in 0.1-10 minutes. Testing FCS-GX on artificially fragmented genomes demonstrates sensitivity >95% for diverse contaminant species and specificity >99.93%. We used FCS-GX to screen 1.6 million GenBank assemblies and identified 36.8 Gbp of contamination (0.16% of total bases), with half from 161 assemblies. We updated assemblies in NCBI RefSeq to reduce detected contamination to 0.01% of bases. FCS-GX is available at






  • Article type: Journal Article
    The present research aimed to evaluate the diversity of all monkeypox virus strains with a special focus on recently isolated ones by a comprehensive phylogenetic analysis of all available sequences, based on the concatenate of four viral genes. Almost all current strains from 2022 showed a high level of similarity to each other on the analyzed stretches: 218 strains shared identical sequence. Among all analyzed strains, the highest number of differences was counted compared to a RefSeq strain (Zaire-96-I-16) on the whole concatenate. Our analysis supported the distinction between Clade I (formerly Congo Basin clade), IIa and IIb (together formerly West African clade) strains and classified all 2022 strains in the last one. The high number of differences and long branch observable concerning strain Zaire-96-I-16 is most probably caused by a sequencing error. As this strain represents one of the two available reference sequences in GenBank, it is recommendable to confirm or exclude the concerning mutation. The developed method, based on four gene sequences, reflected the established whole-genome-based intraspecies classification. Although this method provides significantly less information about the strains compared to whole genome analyses, since its resolution is much lower, it still enables the rapid subspecies classification of the strains into the established clades. The genes in the analyzed concatenate are so conserved that further differentiation of contemporary strains is impossible; these strains are identical in the analyzed sections. On the other hand, since whole genome analyses are compute-intensive, the described method offers a simpler and more accessible alternative for monitoring and preliminary typing of newly sequenced monkeypox virus strains.






  • Article type: Journal Article
    Accelerating breeding efforts for developing biofortified bread wheat varieties necessitates understanding the genetic control of grain zinc concentration (GZnC) and grain iron concentration (GFeC). Hence, the major objective of this study was to perform genome-wide association mapping to identify consistently significant genotyping-by-sequencing markers associated with GZnC and GFeC using a large panel of 5,585 breeding lines from the International Maize and Wheat Improvement Center. These lines were grown between 2018 and 2021 in an optimally irrigated environment at Obregon, Mexico, while some of them were also grown in a water-limiting drought-stressed environment and a space-limiting small plot environment and evaluated for GZnC and GFeC. The lines showed a large and continuous variation for GZnC ranging from 27 to 74.5 ppm and GFeC ranging from 27 to 53.4 ppm. We performed 742,113 marker-traits association tests in 73 datasets and identified 141 markers consistently associated with GZnC and GFeC in three or more datasets, which were located on all wheat chromosomes except 3A and 7D. Among them, 29 markers were associated with both GZnC and GFeC, indicating a shared genetic basis for these micronutrients and the possibility of simultaneously improving both. In addition, several significant GZnC and GFeC associated markers were common across the irrigated, water-limiting drought-stressed, and space-limiting small plots environments, thereby indicating the feasibility of indirect selection for these micronutrients in either of these environments. Moreover, the many significant markers identified had minor effects on GZnC and GFeC, suggesting a quantitative genetic control of these traits. Our findings provide important insights into the complex genetic basis of GZnC and GFeC in bread wheat while implying limited prospects for marker-assisted selection and the need for using genomic selection.






  • Article type: Journal Article
    Publicly available and validated DNA reference sequences useful for phylogeny estimation and identification of fungal pathogens are an increasingly important resource in the efforts of plant protection organizations to facilitate safe international trade of agricultural commodities. Colletotrichum species are among the most frequently encountered and regulated plant pathogens at U.S. ports-of-entry. The RefSeq Targeted Loci (RTL) project at NCBI (BioProject no. PRJNA177353) contains a database of curated fungal internal transcribed spacer (ITS) sequences that interact extensively with NCBI Taxonomy, resulting in verified name-strain-sequence type associations for >12,000 species. We present a publicly available dataset of verified and curated name-type strain-sequence associations for all available Colletotrichum species. This includes an updated GenBank Taxonomy for 238 species associated with up to 11 protein coding loci and an updated RTL ITS dataset for 226 species. We demonstrate that several marker loci are well suited for phylogenetic inference and identification. We improve understanding of phylogenetic relationships among verified species, verify or improve phylogenetic circumscriptions of 14 species complexes, and reveal that determining relationships among these major clades will require additional data. We present detailed comparisons between phylogenetic and similarity-based approaches to species identification, revealing complex patterns among single marker loci that often lead to misidentification when based on single-locus similarity approaches. We also demonstrate that species-level identification is elusive for a subset of samples regardless of analytical approach, which may be explained by novel species diversity in our dataset and incomplete lineage sorting and lack of accumulated synapomorphies at these loci.






  • Article type: Journal Article
    This paper describes the microbial community composition and the genes for key metabolic functions, particularly nitrogen fixation, in the mucous-enveloped gut digesta of green (Lytechinus variegatus) and purple (Strongylocentrotus purpuratus) sea urchins, using a shotgun metagenomics approach. Both green and purple urchins showed high relative abundances of Gammaproteobacteria at 30% and 60%, respectively. However, Alphaproteobacteria had a higher relative abundance in the green urchins (20%) than in the purple urchins (2%). At the genus level, Vibrio was dominant in both green (~9%) and purple (~10%) urchins, whereas Psychromonas was prevalent only in purple urchins (~24%). An enrichment of Roseobacter and Ruegeria was found in the green urchins, whereas purple urchins revealed a higher abundance of Shewanella, Photobacterium, and Bacteroides (q-value < 0.01). Analysis of key metabolic genes at the KEGG-Level-2 categories revealed genes for the metabolism of amino acids (~20%), nucleotides (~5%), cofactors and vitamins (~6%), energy (~5%), and carbohydrates (~13%), and an abundance of genes for the assimilatory nitrogen reduction pathway in both urchins. Overall, the results from this study revealed the differences in the microbial community and genes designated for the metabolic processes in the nutrient-rich sea urchin gut digesta, suggesting their likely importance to the host and their environment.






  • Article type: Journal Article
    Whole genome sequencing has become a powerful tool in modern microbiology. Bacterial genomes in particular are sequenced in high numbers. Whole genome sequencing is used not only in research projects but also in surveillance projects and outbreak investigations. Many whole genome analysis workflows begin with the production of a genome assembly. To accomplish this, a number of different sequencing technologies and assembly methods are available. Here, a summary is provided of the most frequently used sequencing technologies and genome assembly approaches reported for the bacterial RefSeq genomes and for the bacterial genomes submitted as belonging to a surveillance project. The data are presented both in total and broken down on a per-year basis. Information associated with over 400,000 publicly available genomes dated April 2020 and prior was used. The information summarized includes (i) the most frequently used sequencing technologies, (ii) the most common combinations of sequencing technologies, (iii) the most reported sequencing depth, and (iv) the most frequently used assembly software solutions. In all, this mini review provides an overview of the currently most common workflows for producing bacterial whole genome sequence assemblies.







  • Article type: Journal Article
    Continued influx of metagenome-derived proteins with misannotated taxonomy into conventional databases, including RefSeq, threatens to eliminate the value of taxonomy identifiers. To prevent this, urgent efforts should be undertaken by submitters of metagenomic data sets as well as by database managers.







  • Article type: Journal Article
    BACKGROUND: It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of "incremental learning" addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data.
    RESULTS: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model's knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in one quarter of the non-incremental time with no accuracy loss.
    CONCLUSIONS: It is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. The incremental learning classifier can be efficiently updated without the cost of reprocessing or the need for access to the existing database, and therefore saves storage as well as computation resources.
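
    The incremental update idea in this abstract can be sketched with a toy k-mer naive Bayes classifier. This is only an illustration under stated assumptions, not the authors' implementation: the class name `IncrementalNB`, the k-mer length, the uniform class prior, and the add-one smoothing are all hypothetical choices. The point it shows is that adding a new genome only increments stored counts, so earlier training sequences never need to be reprocessed.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class IncrementalNB:
    """Toy multinomial naive Bayes over k-mer counts (hypothetical sketch)."""

    def __init__(self, k=3):
        self.k = k
        self.counts = defaultdict(Counter)  # label -> k-mer counts
        self.totals = Counter()             # label -> total k-mers seen
        self.vocab = set()                  # all k-mers seen so far

    def update(self, label, seq):
        """Fold one new reference sequence into the model.

        Only the stored counts change; previously seen sequences
        are not needed again.
        """
        km = kmers(seq, self.k)
        self.counts[label].update(km)
        self.totals[label] += len(km)
        self.vocab.update(km)

    def classify(self, read):
        """Return the label with the highest smoothed log-likelihood."""
        v = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, label_counts in self.counts.items():
            # Add-one smoothing; a uniform prior over labels is assumed.
            score = sum(
                math.log((label_counts[km] + 1) / (self.totals[label] + v))
                for km in kmers(read, self.k)
            )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

    A model trained on an early snapshot can later receive `update("new_taxon", sequence)` calls for newly released genomes, and reads from earlier taxa are still classified from the retained counts.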







  • Article type: Journal Article
    Mycobacterium avium comprises four subspecies that contain both human and veterinary pathogens. At the inception of this study, twenty-eight M. avium genomes had been annotated as RefSeq genomes, facilitating direct comparisons. These genomes represent strains from around the world and provided a unique opportunity to examine genome dynamics in this species. Each genome was confirmed to be classified correctly based on SNP genotyping, nucleotide identity and presence/absence of repetitive elements or other typing methods. The Mycobacterium avium subspecies paratuberculosis (Map) genome size and organization were remarkably consistent, averaging 4.8 Mb with a variance of only 29.6 kb among the 13 strains. Comparing recombination events along with the larger genome size and variance observed among Mycobacterium avium subspecies avium (Maa) and Mycobacterium avium subspecies hominissuis (Mah) strains (collectively termed non-Map) suggests horizontal gene transfer occurs in non-Map, but not in Map strains. Overall, M. avium subspecies could be divided into two major sub-divisions, with the Map type II (bovine strains) clustering tightly on one end of a phylogenetic spectrum and Mah strains clustering more loosely together on the other end. The most evolutionarily distinct Map strain was an ovine strain, designated Telford, which had >1,000 SNPs and showed large rearrangements compared to the bovine type II strains. The Telford strain clustered with Maa strains as an intermediate between Map type II and Mah. SNP analysis and genome organization analyses repeatedly demonstrated the conserved nature of Map versus the mosaic nature of non-Map M. avium strains. Finally, core and pangenomes were developed for Map and non-Map strains. A total of 80% of Map genes belonged to the Map core genome, while only 40% of non-Map genes belonged to the non-Map core genome. These genomes provide a more complete and detailed comparison of these subspecies strains as well as a blueprint for how genetic diversity originated.
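
    The core and pangenome percentages reported in this abstract can be illustrated with a small set-based sketch. The abstract does not say how genes were clustered or how the percentages were computed, so the function names, the toy strain and gene names, and the definition "core fraction = core genes / pangenome genes" are assumptions made purely for illustration.

```python
def core_and_pan(genomes):
    """Return (core, pan) gene sets for a dict of strain -> gene identifiers.

    core: genes present in every strain; pan: genes present in any strain.
    """
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        raise ValueError("at least one genome is required")
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    return core, pan


def core_fraction(genomes):
    """Fraction of pangenome genes that are core (assumed definition)."""
    core, pan = core_and_pan(genomes)
    return len(core) / len(pan) if pan else 0.0


# Hypothetical toy input: two strains sharing two of four gene families.
toy = {
    "strain_1": {"geneA", "geneB", "geneC"},
    "strain_2": {"geneA", "geneB", "geneD"},
}
print(core_fraction(toy))
```

    Under this definition, a conserved group of strains gives a fraction near 1, while a mosaic group with many strain-specific genes gives a much lower value.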






