RefSeq

  • Article type: Journal Article
    Assembled genome sequences are being generated at an exponential rate. Here we present FCS-GX, part of NCBI's Foreign Contamination Screen (FCS) tool suite, optimized to identify and remove contaminant sequences in new genomes. FCS-GX screens most genomes in 0.1-10 min. Testing FCS-GX on artificially fragmented genomes demonstrates high sensitivity and specificity for diverse contaminant species. We used FCS-GX to screen 1.6 million GenBank assemblies and identified 36.8 Gbp of contamination, comprising 0.16% of total bases, with half from 161 assemblies. We updated assemblies in NCBI RefSeq to reduce detected contamination to 0.01% of bases. FCS-GX is available at https://github.com/ncbi/fcs/ or https://doi.org/10.5281/zenodo.10651084.

  • Article type: Preprint
    Assembled genome sequences are being generated at an exponential rate. Here we present FCS-GX, part of NCBI's Foreign Contamination Screen (FCS) tool suite, optimized to identify and remove contaminant sequences in new genomes. FCS-GX screens most genomes in 0.1-10 minutes. Testing FCS-GX on artificially fragmented genomes demonstrates sensitivity >95% for diverse contaminant species and specificity >99.93%. We used FCS-GX to screen 1.6 million GenBank assemblies and identified 36.8 Gbp of contamination (0.16% of total bases), with half from 161 assemblies. We updated assemblies in NCBI RefSeq to reduce detected contamination to 0.01% of bases. FCS-GX is available at https://github.com/ncbi/fcs/.

  • Article type: Journal Article
    The present research aimed to evaluate the diversity of all monkeypox virus strains, with a special focus on recently isolated ones, by a comprehensive phylogenetic analysis of all available sequences, based on the concatenate of four viral genes. Almost all current strains from 2022 showed a high level of similarity to each other on the analyzed stretches: 218 strains shared an identical sequence. Among all analyzed strains, the highest number of differences across the whole concatenate was counted relative to a RefSeq strain (Zaire-96-I-16). Our analysis supported the distinction between Clade I (formerly Congo Basin clade), IIa and IIb (together formerly West African clade) strains and classified all 2022 strains in the last one. The high number of differences and the long branch observed for strain Zaire-96-I-16 are most probably caused by a sequencing error. As this strain represents one of the two available reference sequences in GenBank, it is advisable to confirm or exclude the mutation in question. The developed method, based on four gene sequences, reflected the established whole-genome-based intraspecies classification. Although this method provides significantly less information about the strains than whole genome analyses, since its resolution is much lower, it still enables the rapid subspecies classification of the strains into the established clades. The genes in the analyzed concatenate are so conserved that further differentiation of contemporary strains is impossible; these strains are identical in the analyzed sections. On the other hand, since whole genome analyses are compute-intensive, the described method offers a simpler and more accessible alternative for monitoring and preliminary typing of newly sequenced monkeypox virus strains.
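The concatenate-based comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the gene names, the toy sequences, and the assumption that every gene is already aligned to equal length across strains are all hypothetical. The sketch joins each strain's genes in a fixed order, groups strains whose concatenates are identical, and counts positional differences against a reference.

```python
from collections import defaultdict


def concatenate(genes_by_strain, gene_order):
    """Join each strain's (pre-aligned) gene sequences in a fixed order."""
    return {
        strain: "".join(genes[name] for name in gene_order)
        for strain, genes in genes_by_strain.items()
    }


def group_identical(concats):
    """Group strain names by identical concatenate, largest group first."""
    groups = defaultdict(list)
    for strain, seq in concats.items():
        groups[seq].append(strain)
    return sorted(groups.values(), key=len, reverse=True)


def count_differences(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy data: two genes, three strains.
toy = {
    "strain1": {"geneA": "ACGT", "geneB": "TT"},
    "strain2": {"geneA": "ACGT", "geneB": "TT"},
    "strain3": {"geneA": "ACGA", "geneB": "TT"},
}
concats = concatenate(toy, ["geneA", "geneB"])
groups = group_identical(concats)
```

Grouping by exact string identity is the simplest possible criterion; a real analysis would start from a multiple sequence alignment and build a tree rather than count raw mismatches.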

  • Article type: Journal Article
    Accelerating breeding efforts for developing biofortified bread wheat varieties necessitates understanding the genetic control of grain zinc concentration (GZnC) and grain iron concentration (GFeC). Hence, the major objective of this study was to perform genome-wide association mapping to identify consistently significant genotyping-by-sequencing markers associated with GZnC and GFeC using a large panel of 5,585 breeding lines from the International Maize and Wheat Improvement Center. These lines were grown between 2018 and 2021 in an optimally irrigated environment at Obregon, Mexico, while some of them were also grown in a water-limiting drought-stressed environment and a space-limiting small-plot environment and evaluated for GZnC and GFeC. The lines showed a large and continuous variation for GZnC ranging from 27 to 74.5 ppm and GFeC ranging from 27 to 53.4 ppm. We performed 742,113 marker-trait association tests in 73 datasets and identified 141 markers consistently associated with GZnC and GFeC in three or more datasets, which were located on all wheat chromosomes except 3A and 7D. Among them, 29 markers were associated with both GZnC and GFeC, indicating a shared genetic basis for these micronutrients and the possibility of simultaneously improving both. In addition, several significant GZnC- and GFeC-associated markers were common across the irrigated, water-limiting drought-stressed, and space-limiting small-plot environments, thereby indicating the feasibility of indirect selection for these micronutrients in any of these environments. Moreover, the many significant markers identified had minor effects on GZnC and GFeC, suggesting a quantitative genetic control of these traits. Our findings provide important insights into the complex genetic basis of GZnC and GFeC in bread wheat while implying limited prospects for marker-assisted selection and the need for using genomic selection.

  • Article type: Journal Article
    Publicly available and validated DNA reference sequences useful for phylogeny estimation and identification of fungal pathogens are an increasingly important resource in the efforts of plant protection organizations to facilitate safe international trade of agricultural commodities. Colletotrichum species are among the most frequently encountered and regulated plant pathogens at U.S. ports-of-entry. The RefSeq Targeted Loci (RTL) project at NCBI (BioProject no. PRJNA177353) contains a database of curated fungal internal transcribed spacer (ITS) sequences that interact extensively with NCBI Taxonomy, resulting in verified name-strain-sequence type associations for >12,000 species. We present a publicly available dataset of verified and curated name-type strain-sequence associations for all available Colletotrichum species. This includes an updated GenBank Taxonomy for 238 species associated with up to 11 protein coding loci and an updated RTL ITS dataset for 226 species. We demonstrate that several marker loci are well suited for phylogenetic inference and identification. We improve understanding of phylogenetic relationships among verified species, verify or improve phylogenetic circumscriptions of 14 species complexes, and reveal that determining relationships among these major clades will require additional data. We present detailed comparisons between phylogenetic and similarity-based approaches to species identification, revealing complex patterns among single marker loci that often lead to misidentification when based on single-locus similarity approaches. We also demonstrate that species-level identification is elusive for a subset of samples regardless of analytical approach, which may be explained by novel species diversity in our dataset and incomplete lineage sorting and lack of accumulated synapomorphies at these loci.

  • Article type: Journal Article
    This paper describes the microbial community composition and key metabolic genes, particularly those for nitrogen fixation, of the mucous-enveloped gut digesta of green (Lytechinus variegatus) and purple (Strongylocentrotus purpuratus) sea urchins, using a shotgun metagenomics approach. Both green and purple urchins showed high relative abundances of Gammaproteobacteria, at 30% and 60%, respectively. However, Alphaproteobacteria had a higher relative abundance in the green urchins (20%) than in the purple urchins (2%). At the genus level, Vibrio was dominant in both green (~9%) and purple (~10%) urchins, whereas Psychromonas was prevalent only in purple urchins (~24%). An enrichment of Roseobacter and Ruegeria was found in the green urchins, whereas purple urchins revealed a higher abundance of Shewanella, Photobacterium, and Bacteroides (q-value < 0.01). Analysis of key metabolic genes at the KEGG-Level-2 categories revealed genes for the metabolism of amino acids (~20%), nucleotides (~5%), cofactors and vitamins (~6%), energy (~5%), and carbohydrates (~13%), and an abundance of genes for the assimilatory nitrogen reduction pathway in both urchins. Overall, the results from this study revealed the differences in the microbial community and genes designated for the metabolic processes in the nutrient-rich sea urchin gut digesta, suggesting their likely importance to the host and their environment.

  • Article type: Journal Article
    Whole genome sequencing has become a powerful tool in modern microbiology. Bacterial genomes in particular are sequenced in high numbers. Whole genome sequencing is used not only in research projects, but also in surveillance projects and outbreak investigations. Many whole genome analysis workflows begin with the production of a genome assembly. To accomplish this, a number of different sequencing technologies and assembly methods are available. Here, a summary is provided of the most frequently used sequencing technologies and genome assembly approaches reported for the bacterial RefSeq genomes and for the bacterial genomes submitted as belonging to a surveillance project. The data are presented both in total and broken down on a per-year basis. Information associated with over 400,000 publicly available genomes dated April 2020 and prior was used. The information summarized includes (i) the most frequently used sequencing technologies, (ii) the most common combinations of sequencing technologies, (iii) the most reported sequencing depth, and (iv) the most frequently used assembly software solutions. In all, this mini review provides an overview of the currently most common workflows for producing bacterial whole genome sequence assemblies.

  • Article type: Journal Article
    Continued influx of metagenome-derived proteins with misannotated taxonomy into conventional databases, including RefSeq, threatens to eliminate the value of taxonomy identifiers. To prevent this, urgent efforts should be undertaken by submitters of metagenomic data sets as well as by database managers.

  • Article type: Journal Article
    BACKGROUND: It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of "incremental learning" addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data.
    RESULTS: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model's knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in 1/4th of the non-incremental time with no accuracy loss.
    CONCLUSIONS: It is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. The incremental learning classifier can be efficiently updated without the cost of reprocessing or access to the existing database, and therefore saves storage as well as computation resources.
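The idea of incremental training summarized above can be sketched with a toy multinomial naive Bayes model over k-mer counts. This is not the authors' implementation: the class name, the k-mer length, the Laplace smoothing constant, and the uniform class prior are all assumptions made for illustration. The point is only that naive Bayes keeps additive sufficient statistics, so a new reference sequence is folded in by incrementing counts, without revisiting earlier training data.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=3):
    """Return the overlapping k-mers of a DNA string (uppercased)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class IncrementalKmerNB:
    """Toy multinomial naive Bayes over k-mer counts with additive updates."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k
        self.alpha = alpha                  # Laplace smoothing (assumed value)
        self.counts = defaultdict(Counter)  # label -> k-mer counts
        self.totals = Counter()             # label -> total k-mers seen
        self.vocab = set()                  # all k-mers seen in training

    def update(self, label, seq):
        """Fold one new reference sequence into the model.

        Only counters are incremented; previously seen data is not reread.
        """
        for km in kmers(seq, self.k):
            self.counts[label][km] += 1
            self.totals[label] += 1
            self.vocab.add(km)

    def classify(self, read):
        """Return the label with the highest smoothed log-likelihood.

        A uniform prior over labels is assumed; returns None if untrained.
        """
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, label_counts in self.counts.items():
            denom = self.totals[label] + self.alpha * vocab_size
            score = sum(
                math.log((label_counts[km] + self.alpha) / denom)
                for km in kmers(read, self.k)
            )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical usage: two taxa first, a third added in a later "snapshot".
model = IncrementalKmerNB(k=3)
model.update("taxonA", "AAAAAAAAAAAA")
model.update("taxonB", "GCGCGCGCGCGC")
snapshot_a = dict(model.counts["taxonA"])
model.update("taxonC", "TTTTTTTTTTTT")
```

Adding `taxonC` leaves the stored counts for `taxonA` and `taxonB` untouched, which is what makes the yearly update cheap relative to retraining on the full database.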

  • Article type: Journal Article
    Mycobacterium avium comprises four subspecies that contain both human and veterinary pathogens. At the inception of this study, twenty-eight M. avium genomes had been annotated as RefSeq genomes, facilitating direct comparisons. These genomes represent strains from around the world and provided a unique opportunity to examine genome dynamics in this species. Each genome was confirmed to be classified correctly based on SNP genotyping, nucleotide identity and presence/absence of repetitive elements or other typing methods. The Mycobacterium avium subspecies paratuberculosis (Map) genome size and organization were remarkably consistent, averaging 4.8 Mb with a variance of only 29.6 kb among the 13 strains. Comparing recombination events along with the larger genome size and variance observed among Mycobacterium avium subspecies avium (Maa) and Mycobacterium avium subspecies hominissuis (Mah) strains (collectively termed non-Map) suggests horizontal gene transfer occurs in non-Map, but not in Map strains. Overall, M. avium subspecies could be divided into two major sub-divisions, with the Map type II (bovine strains) clustering tightly on one end of a phylogenetic spectrum and Mah strains clustering more loosely together on the other end. The most evolutionarily distinct Map strain was an ovine strain, designated Telford, which had >1,000 SNPs and showed large rearrangements compared to the bovine type II strains. The Telford strain clustered with Maa strains as an intermediate between Map type II and Mah. SNP analysis and genome organization analyses repeatedly demonstrated the conserved nature of Map versus the mosaic nature of non-Map M. avium strains. Finally, core and pangenomes were developed for Map and non-Map strains. A total of 80% of Map genes belonged to the Map core genome, while only 40% of non-Map genes belonged to the non-Map core genome. These genomes provide a more complete and detailed comparison of these subspecies strains as well as a blueprint for how genetic diversity originated.
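The core-genome percentages quoted in this abstract can be made concrete with a small sketch. It assumes one common reading, in which the core genome is the set of gene families present in every strain, the pangenome is their union, and the reported fraction is core size over pangenome size; the abstract does not state the exact denominator, and the strain and gene-family names below are hypothetical.

```python
def core_and_pan(gene_sets):
    """Return (core, pan) gene-family sets for a dict of strain -> set of IDs."""
    sets = list(gene_sets.values())
    if not sets:
        return set(), set()
    core = set.intersection(*sets)  # families present in every strain
    pan = set.union(*sets)          # families present in at least one strain
    return core, pan


def core_fraction(gene_sets):
    """Fraction of pangenome families that are core (0.0 for empty input)."""
    core, pan = core_and_pan(gene_sets)
    return len(core) / len(pan) if pan else 0.0


# Hypothetical toy data: three strains, five gene families in total.
toy_strains = {
    "strain1": {"g1", "g2", "g3"},
    "strain2": {"g1", "g2", "g4"},
    "strain3": {"g1", "g2", "g3", "g5"},
}
core, pan = core_and_pan(toy_strains)
fraction = core_fraction(toy_strains)
```

Under this reading, a conserved group such as Map would show a high fraction, while a mosaic group with frequent horizontal gene transfer would show a lower one.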