GenBank

  • Article type: Journal Article
    The rapid advancement of sequencing technologies makes it challenging to manage the large and exponentially growing volume of sequence data efficiently and in a timely manner. To address this issue, we present GenBase (https://ngdc.cncb.ac.cn/genbase), an open-access data repository that follows the International Nucleotide Sequence Database Collaboration (INSDC) data standards and structures, for efficient nucleotide sequence archiving, searching, and sharing. As a core resource of the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn), GenBase offers a bilingual submission pipeline and services, as well as local submission assistance in China. GenBase also provides a unique Excel format for metadata description and feature annotation of nucleotide sequences, along with a real-time data validation system to streamline sequence submissions. As of April 23, 2024, GenBase had received 68,251 nucleotide sequences and 689,574 annotated protein sequences across 414 species from 2319 submissions. Of these, 63,614 (93%) nucleotide sequences and 620,640 (90%) annotated protein sequences have been released and are publicly accessible through GenBase's web search system, File Transfer Protocol (FTP), and Application Programming Interface (API). Additionally, in collaboration with INSDC, GenBase has established an effective data exchange mechanism with GenBank and has started sharing released nucleotide sequences. Furthermore, GenBase integrates all sequences from GenBank with daily updates, demonstrating its commitment to actively contributing to global sequence data management and sharing.

  • Article type: Journal Article
    This chapter describes procedures for using DNA sequence data to obtain and compare taxonomic identifications from the public databases GenBank and the Barcode of Life Data System (BOLD). The chapter begins by describing procedures used to prepare quality sequences for upload to GenBank and BOLD. Next, the steps for querying DNA sequences against the public databases are described using the GenBank BLAST and BOLD identification engines. Interpretation guidelines for the taxonomic identification assignments are presented. Finally, a procedure for evaluating the accuracy and reliability of sequences from GenBank and BOLD is provided.

  • Article type: Journal Article
    BACKGROUND: The reconstruction of the evolutionary history of organisms has been greatly influenced by the advent of molecular techniques, leading to a significant increase in studies utilizing genomic data from different species. However, the lack of standardization in gene nomenclature poses a challenge for database searches and evolutionary analyses, affecting the accuracy of the results obtained.
    RESULTS: To address this issue, a Python class for standardizing gene nomenclature, SynGenes, has been developed. It automatically recognizes different nomenclature variations and converts them into a standardized form, facilitating comprehensive and accurate searches. Additionally, SynGenes offers a web form for individual searches using different names associated with the same gene. The SynGenes database contains a total of 545 gene name variations for mitochondrial genes and 2485 for chloroplast genes, providing a valuable resource for researchers.
    CONCLUSIONS: The SynGenes platform offers a solution for standardizing the nomenclature of mitochondrial and chloroplast genes and provides a standardized search solution for specific markers in GenBank. An evaluation of SynGenes' effectiveness, based on searches conducted on GenBank and PubMed Central, demonstrated its ability to yield more results than conventional searches, ensuring more comprehensive and accurate retrieval. This tool is crucial for accurate database searches and, consequently, for evolutionary analyses, addressing the challenges posed by non-standardized gene nomenclature.
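
The idea of mapping nomenclature variants to one standard name, then expanding a standard name back into a search over all variants, can be sketched in a few lines of Python. This is a minimal illustration and not the SynGenes class or its database: the synonym table, the function names `standardize` and `build_query`, and the `Gene Name` field tag are assumptions introduced here.

```python
import re

# Hypothetical synonym table with a few widely used mitochondrial gene aliases.
# It is NOT the SynGenes database, only a small stand-in for illustration.
SYNONYMS = {
    "COI": ["COI", "COX1", "CO1", "COXI", "cytochrome c oxidase subunit I"],
    "CYTB": ["CYTB", "COB", "cytochrome b"],
    "ND1": ["ND1", "NAD1", "NADH dehydrogenase subunit 1"],
}


def _key(name: str) -> str:
    """Normalize a name for lookup: lowercase, letters and digits only."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Reverse index from every normalized variant to its standard name.
LOOKUP = {_key(variant): standard
          for standard, variants in SYNONYMS.items()
          for variant in variants}


def standardize(name: str):
    """Return the standard gene name, or None if the variant is unknown."""
    return LOOKUP.get(_key(name))


def build_query(standard_name: str, field: str = "Gene Name") -> str:
    """Join all known variants into one OR query (field tag is an assumption)."""
    variants = SYNONYMS[standard_name]
    return " OR ".join(f'"{v}"[{field}]' for v in variants)


print(standardize("cox-1"))
print(build_query("CYTB"))
```

Normalizing before lookup lets spelling differences such as `cox-1`, `Cox1`, and `COX1` resolve to the same entry, while the OR query is what allows a single search to retrieve records annotated under any of the variants.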

  • Article type: Journal Article
    Assembled genome sequences are being generated at an exponential rate. Here we present FCS-GX, part of NCBI's Foreign Contamination Screen (FCS) tool suite, optimized to identify and remove contaminant sequences in new genomes. FCS-GX screens most genomes in 0.1-10 min. Testing FCS-GX on artificially fragmented genomes demonstrates high sensitivity and specificity for diverse contaminant species. We used FCS-GX to screen 1.6 million GenBank assemblies and identified 36.8 Gbp of contamination, comprising 0.16% of total bases, with half from 161 assemblies. We updated assemblies in NCBI RefSeq to reduce detected contamination to 0.01% of bases. FCS-GX is available at https://github.com/ncbi/fcs/ or https://doi.org/10.5281/zenodo.10651084.
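
The scale implied by these figures can be worked out with a little arithmetic. The sketch below uses only the rounded numbers quoted in the abstract, so the derived values are approximations rather than results reported by the authors; the variable names are chosen here for illustration.

```python
# Figures quoted in the abstract (rounded as published).
CONTAMINATION_GBP = 36.8          # contamination identified across GenBank assemblies
CONTAMINATION_FRACTION = 0.0016   # 0.16% of total bases
ASSEMBLIES_SCREENED = 1_600_000   # 1.6 million assemblies
TOP_ASSEMBLIES = 161              # assemblies that hold half of the contamination

# Total bases screened, implied by 36.8 Gbp being 0.16% of the total.
total_gbp = CONTAMINATION_GBP / CONTAMINATION_FRACTION

# Half of the contamination sits in 161 assemblies.
top_share_gbp = CONTAMINATION_GBP / 2
mean_top_gbp = top_share_gbp / TOP_ASSEMBLIES

# Share of all screened assemblies that those 161 represent.
top_assembly_fraction = TOP_ASSEMBLIES / ASSEMBLIES_SCREENED

print(f"Implied total bases screened: ~{total_gbp / 1000:.0f} Tbp")
print(f"Mean contamination per top assembly: ~{mean_top_gbp * 1000:.0f} Mbp")
print(f"Top assemblies as a share of all screened: {top_assembly_fraction:.4%}")
```

Under these rounded inputs, the screened assemblies total roughly 23 Tbp, and about 0.01% of the assemblies account for half of the detected contamination.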

  • Article type: Journal Article
    Staphylinidae, or rove beetles, are one of the mega-diverse and abundant families of ground-living terrestrial arthropods, yet they are taxonomically poorly known even in the regions adjacent to Europe, where the fauna has been investigated for the longest time. Since DNA barcoding is a tool to accelerate biodiversity research, here we explored whether the currently available COI barcode libraries are representative enough for the study of rove beetles of West Siberia. This is a vast region adjacent to Europe with a poorly known rove beetle fauna, from which not a single DNA barcode has hitherto been produced for Staphylinidae. First, we investigated the faunal similarity between the rove beetle faunas of the climatically compatible West Siberia in Asia, Fennoscandia in Europe, and Canada and Alaska in North America. Second, we investigated the barcodes available for Staphylinidae from the latter two regions in BOLD and GenBank, the world's largest DNA barcode libraries. We conclude that the rather different rove beetle faunas of Fennoscandia on the one hand, and of Canada and Alaska on the other, are well covered in both barcode libraries, which complement each other. We also find that, even without any barcodes originating from specimens collected in West Siberia, this coverage is helpful for the study of rove beetles there because of the significant number of widespread species shared between West Siberia and Fennoscandia and the even larger number of genera shared among all three investigated regions. For the first time, we compiled a literature-based checklist of 726 species of West Siberian Staphylinidae, supplemented by their occurrence dataset submitted to GBIF. Our script for mining unique (i.e. non-redundant) barcodes for a given geographic area across global libraries is made available here and can be adapted for any other region.
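
Mining non-redundant barcodes for a region across two libraries might look like the sketch below. It is not the authors' script: the record layout (`id`, `country`, `accession`), the function name `unique_barcodes`, and the rule that two records are redundant when they share an accession are all assumptions made for illustration, and the accessions in the example are invented placeholders.

```python
def unique_barcodes(bold_records, genbank_records, countries):
    """Collect barcode records from a set of countries, dropping redundant ones.

    Assumption: a record mirrored in both libraries carries the same
    'accession' value; records without an accession are keyed by
    (library, id) and are therefore never merged across libraries.
    """
    seen = set()
    unique = []
    for source, records in (("BOLD", bold_records), ("GenBank", genbank_records)):
        for rec in records:
            if rec["country"] not in countries:
                continue  # outside the geographic area of interest
            key = rec.get("accession") or (source, rec["id"])
            if key in seen:
                continue  # already collected from the other library
            seen.add(key)
            unique.append({**rec, "source": source})
    return unique


# Invented placeholder records, for demonstration only.
bold = [
    {"id": "B1", "country": "Finland", "accession": "ACC001"},
    {"id": "B2", "country": "Canada", "accession": None},
    {"id": "B3", "country": "Germany", "accession": "ACC003"},
]
genbank = [
    {"id": "ACC001", "country": "Finland", "accession": "ACC001"},
    {"id": "ACC002", "country": "Norway", "accession": "ACC002"},
]

region = {"Finland", "Norway", "Canada"}
result = unique_barcodes(bold, genbank, region)
print([r["id"] for r in result])
```

A real pipeline would also need to reconcile taxon names and sequence quality between the libraries; the sketch covers only the geographic filter and the de-duplication step.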

  • Article type: Journal Article
    Genetic component assembly is key to the simulation and implementation of genetic circuits. Automating this process, and thus accelerating prototyping, is a necessity. We present pyBrick-DNA, software written in Python that assembles components for the construction of genetic circuits. pyBrick-DNA (colab.pyBrick.com) is a user-friendly environment where scientists can select genetic sequences or input custom sequences to build genetic assemblies. All components are modularly fused to generate a ready-to-go single DNA fragment. It includes Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and plant gene-editing components. Hence, pyBrick-DNA can generate a functional CRISPR construct composed of a single-guide RNA integrated with Cas9, promoter, and terminator elements. The outcome is a DNA sequence, along with a graphical representation, composed of user-selected genetic parts and ready to be synthesized and cloned in vivo. Moreover, the sequence can be exported as a GenBank file, allowing its use with other synthetic biology tools.
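
Modular fusion of parts into a single fragment can be illustrated as concatenation with per-part coordinates. This is a sketch under stated assumptions and not pyBrick-DNA's code: the `Part` class and the `assemble` function are hypothetical, and the sequences are toy placeholders rather than real promoters or terminators. Coordinates are 1-based and inclusive, the convention used in GenBank feature tables.

```python
from dataclasses import dataclass


@dataclass
class Part:
    """A hypothetical genetic part: a name, a feature type, and a DNA sequence."""
    name: str
    kind: str
    sequence: str


def assemble(parts):
    """Fuse parts in order and record where each one lands.

    Returns the concatenated sequence and a list of
    (name, kind, start, end) tuples with 1-based inclusive coordinates.
    """
    pieces, features, position = [], [], 0
    for part in parts:
        seq = part.sequence.upper()
        invalid = set(seq) - set("ACGTN")
        if invalid:
            raise ValueError(f"{part.name}: unexpected characters {sorted(invalid)}")
        start = position + 1
        position += len(seq)
        features.append((part.name, part.kind, start, position))
        pieces.append(seq)
    return "".join(pieces), features


# Toy placeholder sequences, for demonstration only.
construct = [
    Part("demo_promoter", "promoter", "ttgaca"),
    Part("demo_cds", "CDS", "ATGAAATAA"),
    Part("demo_terminator", "terminator", "GGCC"),
]

sequence, features = assemble(construct)
print(sequence)
for name, kind, start, end in features:
    print(f"{kind:<12}{start}..{end}  {name}")
```

Keeping the coordinates alongside the sequence is what makes a later export to an annotated format such as a GenBank file straightforward, since each part maps directly to one feature location.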

  • Article type: Journal Article
    Scientific names permit humans and search engines to access knowledge about the biodiversity that surrounds us, and names linked to DNA sequences are playing an ever-greater role in search-and-match identification procedures. Here, we analyze how users and curators of the National Center for Biotechnology Information (NCBI) are flagging and curating sequences derived from nomenclatural type material, which is the only way to improve the quality of DNA-based identification in the long run. For prokaryotes, 18,281 genome assemblies from type strains have been curated by NCBI staff and improve the quality of prokaryote naming. For Fungi, type-derived sequences representing over 21,000 species are now essential for fungus naming and identification. For the remaining eukaryotes, however, the numbers of sequences identifiable as type-derived are minuscule, representing only 1000 species of arthropods, 8,441 vertebrates, and 430 embryophytes. An increase in the production and curation of such sequences will come from (i) sequencing of types or topotypic specimens in museum collections, (ii) the March 2023 rule changes at the International Nucleotide Sequence Database Collaboration requiring more metadata for specimens, and (iii) efforts by data submitters to facilitate curation, including informing NCBI curators about a specimen's type status. We illustrate different type-data submission journeys and provide best-practice examples from a range of organisms. Expanding the number of type-derived sequences in DNA databases, especially of eukaryotes, is crucial for capturing, documenting, and protecting biodiversity.

  • Article type: Journal Article
    Collaborations between scientists from the global north and global south (N-S collaborations) are a key driver of the "fourth paradigm of science" and have proven crucial to addressing global crises like COVID-19 and climate change. However, despite their critical role, N-S collaborations on datasets are not well understood. Science of science studies tend to rely on publications and patents to examine N-S collaboration patterns. To this end, the rise of global crises requiring N-S collaborations to produce and share data presents an urgent need to understand the prevalence, dynamics, and political economy of N-S collaborations on research datasets. In this paper, we employ a mixed-methods case study approach to analyze the frequency of, and division of labor in, N-S collaborations on datasets submitted to GenBank over 29 years (1992-2021). We find that: (1) there is a low representation of N-S collaborations over the 29-year period. When they do occur, N-S collaborations display "burstiness" patterns, suggesting that N-S collaborations on datasets are formed and maintained reactively in the wake of global health crises such as infectious disease outbreaks; (2) the division of labor between datasets and publications is disproportionate to the global south in the early years, but becomes more overlapping after 2003. An exception is the case of countries with lower S&T capacity but high income, which have a higher prevalence on datasets (e.g., United Arab Emirates). We qualitatively inspect a sample of N-S dataset collaborations to identify leadership patterns in dataset and publication authorship. The findings lead us to argue that N-S dataset collaborations need to be included in measures of research outputs to nuance the current models and assessment tools of equity in N-S collaborations. The paper contributes to the SDGs objectives to develop data-driven metrics that can inform scientific collaborations on research datasets.

  • Article type: Journal Article
    In biological science, the study of DNA sequences is considered important because these sequences carry the genomic details that researchers and doctors can use for the early prediction of disease through DNA classification. The NCBI has the world's largest database of genetic sequences, but the security of this massive amount of data is currently the greatest issue. One option is to encrypt these genetic sequences using blockchain technology. Accordingly, this paper presents a survey of healthcare data breaches, the necessity for blockchain in healthcare, and the number of research studies done in this area. In addition, the paper suggests DNA sequence classification for earlier disease identification and evaluates previous work in the field.

  • Article type: Preprint
    Assembled genome sequences are being generated at an exponential rate. Here we present FCS-GX, part of NCBI's Foreign Contamination Screen (FCS) tool suite, optimized to identify and remove contaminant sequences in new genomes. FCS-GX screens most genomes in 0.1-10 minutes. Testing FCS-GX on artificially fragmented genomes demonstrates sensitivity >95% for diverse contaminant species and specificity >99.93%. We used FCS-GX to screen 1.6 million GenBank assemblies and identified 36.8 Gbp of contamination (0.16% of total bases), with half from 161 assemblies. We updated assemblies in NCBI RefSeq to reduce detected contamination to 0.01% of bases. FCS-GX is available at https://github.com/ncbi/fcs/.