short read

  • Article type: Journal Article
    The identification of structural variants (SVs) in genomic data represents an ongoing challenge because of difficulties in reliable SV calling leading to reduced sensitivity and specificity. We prepared high-quality DNA from 9 parent-child trios, who had previously undergone short-read whole-genome sequencing (Illumina platform) as part of the Genomics England 100,000 Genomes Project. We reanalysed the genomes using both Bionano optical genome mapping (OGM; 8 probands and one trio) and Nanopore long-read sequencing (Oxford Nanopore Technologies [ONT] platform; all samples). To establish a "truth" dataset, we asked whether rare proband SV calls (n = 234) made by the Bionano Access (version 1.6.1)/Solve software (version 3.6.1_11162020) could be verified by individual visualisation using the Integrative Genomics Viewer with either or both of the Illumina and ONT raw sequence. Of these, 222 calls were verified, indicating that Bionano OGM calls have high precision (positive predictive value 95%). We then asked what proportion of the 222 true Bionano SVs had been identified by SV callers in the other two datasets. In the Illumina dataset, sensitivity varied according to variant type, being high for deletions (115/134; 86%) but poor for insertions (13/58; 22%). In the ONT dataset, sensitivity was generally poor using the original Sniffles variant caller (48% overall) but improved substantially with use of Sniffles2 (36/40; 90% and 17/23; 74% for deletions and insertions, respectively). In summary, we show that the precision of OGM is very high. In addition, when applying the Sniffles2 caller, the sensitivity of SV calling using ONT long-read sequence data outperforms Illumina sequencing for most SV types.
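
The percentages quoted in this abstract follow directly from the reported counts. The short sketch below recomputes them; the counts are copied from the abstract, while the function and variable names are illustrative assumptions and not part of the study's software.

```python
def proportion(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, rejecting an empty denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def as_percent(value: float) -> int:
    """Express a proportion as a whole-number percentage."""
    return round(value * 100)


# Precision (positive predictive value): verified calls / all calls made.
ppv = proportion(222, 234)

# Sensitivity: true SVs recovered by a caller / true SVs of that type.
sensitivity = {
    ("Illumina", "deletion"): proportion(115, 134),
    ("Illumina", "insertion"): proportion(13, 58),
    ("ONT Sniffles2", "deletion"): proportion(36, 40),
    ("ONT Sniffles2", "insertion"): proportion(17, 23),
}

if __name__ == "__main__":
    print(f"Bionano OGM PPV: {as_percent(ppv)}%")
    for (dataset, sv_type), value in sensitivity.items():
        print(f"{dataset} {sv_type} sensitivity: {as_percent(value)}%")
```

Rounded to whole percentages, these reproduce the figures given in the abstract (95% precision; 86%, 22%, 90% and 74% sensitivity).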

  • Article type: Journal Article
    Paired-end short reads for the Illumina HiSeq, MiSeq, and NovaSeq platforms were generated in silico, at various sequencing depths, from simulated bacterial communities of fresh spinach and surface water. Multidrug-resistant Salmonella enterica serotype Indiana was included in the spinach community, while the water community contained multidrug-resistant Pseudomonas aeruginosa.

  • Article type: Journal Article
    Digestion is driven by digestive enzymes, and digestive enzyme gene copy number can provide insights into the genomic underpinnings of dietary specialization. The "Adaptive Modulation Hypothesis" (AMH) proposes that digestive enzyme activity, which increases with increased gene copy number, should correlate with substrate quantity in the diet. To test the AMH and reveal some of the genetics of herbivory vs carnivory, we sequenced, assembled, and annotated the genome of Anoplarchus purpurescens, a carnivorous prickleback fish in the family Stichaeidae, and compared the gene copy number for key digestive enzymes to that of Cebidichthys violaceus, a herbivorous fish from the same family. A highly contiguous genome assembly of high quality (N50 = 10.6 Mb) was produced for A. purpurescens, using combined long-read and short-read technology, with an estimated 33,842 protein-coding genes. The digestive enzymes that we examined include pancreatic α-amylase, carboxyl ester lipase, alanyl aminopeptidase, trypsin, and chymotrypsin. Anoplarchus purpurescens had fewer copies of pancreatic α-amylase (carbohydrate digestion) than C. violaceus (1 vs. 3 copies). Moreover, A. purpurescens had one fewer copy of carboxyl ester lipase (plant lipid digestion) than C. violaceus (4 vs. 5). We observed an expansion in copy number for several protein digestion genes in A. purpurescens compared to C. violaceus, including trypsin (5 vs. 3) and total aminopeptidases (6 vs. 5). Collectively, these genomic differences coincide with measured digestive enzyme activities (phenotypes) in the two species and they support the AMH. Moreover, this genomic resource is now available to better understand fish biology and dietary specialization.
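
The N50 statistic quoted above summarizes assembly contiguity: it is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of the standard calculation follows; the contig lengths in the example are made up for illustration and have nothing to do with the A. purpurescens assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # hypothetical lengths, total 32
    print(n50(toy_contigs))
```

For the toy input, the two contigs of length 8 already cover half of the total length, so the N50 is 8.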

  • Article type: Journal Article
    Genomic structural variation (SV) affects genetic and phenotypic characteristics in diverse organisms, but the lack of reliable methods to detect SV has hindered genetic analysis. We developed a computational algorithm (MOPline) that includes missing call recovery combined with high-confidence SV call selection and genotyping using short-read whole-genome sequencing (WGS) data. Using 3,672 high-coverage WGS datasets, MOPline stably detected ∼16,000 SVs per individual, which is ∼1.7- to 3.3-fold higher than previous large-scale projects while exhibiting a comparable level of statistical quality metrics. We imputed SVs from 181,622 Japanese individuals for 42 diseases and 60 quantitative traits. A genome-wide association study with the imputed SVs revealed 41 top-ranked or nearly top-ranked genome-wide significant SVs, including 8 exonic SVs with 5 novel associations and enriched mobile element insertions. This study demonstrates that short-read WGS data can be used to identify rare and common SVs associated with a variety of traits.

  • Article type: Video-Audio Media
    Amplicon sequencing is an established and cost-efficient method for profiling microbiomes. However, many available tools to process this data require both bioinformatics skills and high computational power to process big datasets. Furthermore, there are only a few tools that allow for long-read amplicon data analysis. To bridge this gap, we developed the LotuS2 (less OTU scripts 2) pipeline, enabling user-friendly, resource-friendly, and versatile analysis of raw amplicon sequences.
    In LotuS2, six different sequence clustering algorithms as well as extensive pre- and post-processing options allow for flexible data analysis by both experts, where parameters can be fully adjusted, and novices, where defaults are provided for different scenarios. We benchmarked three independent gut and soil datasets, where LotuS2 was on average 29 times faster compared to other pipelines, yet could better reproduce the alpha- and beta-diversity of technical replicate samples. Further benchmarking a mock community with known taxon composition showed that, compared to the other pipelines, LotuS2 recovered a higher fraction of correctly identified taxa and a higher fraction of reads assigned to true taxa (48% and 57% at species; 83% and 98% at genus level, respectively). At ASV/OTU level, precision and F-score were highest for LotuS2, as was the fraction of correctly reported 16S sequences.
    LotuS2 is a lightweight and user-friendly pipeline that is fast, precise, and streamlined, using extensive pre- and post-ASV/OTU clustering steps to further increase data quality. High data usage rates and reliability enable high-throughput microbiome analysis in minutes.
    LotuS2 is available from GitHub, conda, or via a Galaxy web interface, documented at http://lotus2.earlham.ac.uk/. Video Abstract.

  • Article type: Journal Article
    As a common type of structural variation, an insertion refers to the addition of a DNA sequence into an individual genome and is usually associated with some inherited diseases. In recent years, many methods have been proposed for detecting insertions. However, the accurate calling of insertions remains a challenging task. In this study, we propose a novel insertion detection approach based on soft-clipped reads, which is called SIns. First, based on the alignments between paired reads and the reference genome, SIns extracts breakpoints from soft-clipped reads and determines insertion locations. The insert size information about paired reads is then further clustered to determine the genotype, and SIns subsequently adopts Minia to assemble the insertion sequences. Experimental results show that SIns can achieve better performance than other methods in terms of the F-score value for simulated and real datasets.
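
The first step described above, locating breakpoints from soft-clipped reads, can be illustrated with a toy sketch. This is not the SIns implementation: it assumes each read is given only as a 1-based leftmost mapping position plus a SAM CIGAR string, and the function names, the example reads, and the `min_support` threshold are all hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def softclip_breakpoints(pos: int, cigar: str) -> List[int]:
    """Return reference positions where a soft clip meets aligned bases.

    A left clip reports the first aligned base (pos); a right clip reports
    the position just after the last aligned base, so reads clipped on
    either side of the same insertion site report the same coordinate.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return []
    found = []
    if ops[0][1] == "S":
        found.append(pos)
    if len(ops) > 1 and ops[-1][1] == "S":
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        found.append(pos + ref_len)
    return found


def candidate_sites(reads: Iterable[Tuple[int, str]], min_support: int = 2) -> List[int]:
    """Return positions supported by at least min_support clipped reads."""
    support = Counter(bp for pos, cigar in reads for bp in softclip_breakpoints(pos, cigar))
    return sorted(site for site, count in support.items() if count >= min_support)


if __name__ == "__main__":
    toy_reads = [(100, "50M25S"), (150, "30S45M"), (200, "75M")]  # made-up reads
    print(candidate_sites(toy_reads))
```

In the toy data, one read is clipped on its right end and another on its left end at the same reference coordinate, so a single candidate site is reported; the fully aligned read contributes nothing. Real callers additionally handle alignment noise, nearby breakpoints that differ by a few bases, and mapping quality.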

  • Article type: Journal Article
    Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or the type of adaptors used, that is needed to make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced "shizen"), which aims to simplify quality control of short-read data for the end user by predicting presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control. IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.

  • Article type: Journal Article
    The name Alview is a contraction of the term Alignment Viewer. Alview is a software tool, compiled to native architecture, for visualizing the alignment of sequencing data. Inputs are files of short-read sequences aligned to a reference genome in the SAM/BAM format and files containing reference genome data. Outputs are visualizations of these aligned short reads. Alview is written in portable C with optional graphical user interface (GUI) code written in C, C++, and Objective-C. The application can run in three different ways: as a web server, as a command line tool, or as a native GUI program. Alview is compatible with Microsoft Windows, Linux, and Apple OS X. It is available as a web demo at https://cgwb.nci.nih.gov/cgi-bin/alview. The source code and Windows/Mac/Linux executables are available via https://github.com/NCIP/alview.

  • Article type: Journal Article
    The assembly of multiple genomes from mixed sequence reads is a bottleneck in metagenomic analysis. A single-genome assembly program (assembler) is not capable of resolving metagenome sequences, so assemblers designed specifically for metagenomics have been developed. MetaVelvet is an extension of the single-genome assembler Velvet. It has been proved to generate assemblies with higher N50 scores and higher quality than single-genome assemblers such as Velvet and SOAPdenovo when applied to metagenomic sequence reads and is frequently used in this research community. One important open problem for MetaVelvet is its low accuracy and sensitivity in detecting chimeric nodes in the assembly (de Bruijn) graph, which prevents the generation of longer contigs and scaffolds. We have tackled this problem of classifying chimeric nodes using supervised machine learning to significantly improve the performance of MetaVelvet and developed a new tool, called MetaVelvet-SL. A Support Vector Machine is used for learning the classification model based on 94 features extracted from candidate nodes. In extensive experiments, MetaVelvet-SL outperformed the original MetaVelvet and other state-of-the-art metagenomic assemblers, IDBA-UD, Ray Meta and Omega, to reconstruct accurate longer assemblies with higher N50 scores for both simulated data sets and real data sets of human gut microbial sequences.