Linked Reads

  • Article type: Journal Article
    Long-range sequencing grants insight into additional genetic information beyond that which can be accessed by both short reads and modern long-read technology. Several new sequencing technologies are available for long-range datasets such as "Hi-C" and "Linked Reads" with high-throughput and high-resolution genome analysis, and are rapidly advancing the field of genome assembly, genome scaffolding, and more comprehensive variant identification. In this article, we focused on five major long-range sequencing technologies: high-throughput chromosome conformation capture (Hi-C), 10x Genomics Linked Reads, haplotagging, transposase enzyme linked long-read sequencing (TELL-seq), and single tube long fragment read (stLFR). We detailed the mechanisms and data products of the five platforms, introduced several of the most important applications, evaluated the quality of sequencing data from different platforms, and discussed the currently available bioinformatics tools. We hope this work will benefit the selection of appropriate long-range technology for specific biological studies.

  • Article type: Journal Article
    Asparagus kiusianus is a disease-resistant dioecious plant species and a wild relative of garden asparagus (Asparagus officinalis). To enhance A. kiusianus genomic resources, advance plant science, and facilitate asparagus breeding, we determined the genome sequences of the male and female lines of A. kiusianus. Genome sequence reads obtained with a linked-read technology were assembled into four haplotype-phased contig sequences (∼1.6 Gb each) for the male and female lines. The contig sequences were aligned onto the chromosome sequences of garden asparagus to construct pseudomolecule sequences. Approximately 55,000 potential protein-encoding genes were predicted in each genome assembly, and ∼70% of the genome sequence was annotated as repetitive. Comparative analysis of the genomes of the two species revealed structural and sequence variants between the two species as well as between the male and female lines of each species. Genes with high sequence similarity to the male-specific sex determinant gene in A. officinalis, MSE1/AoMYB35/AspTDF1, were present in the genomes of the male line but absent from the female genome assemblies. Overall, the genome sequence assemblies, gene sequences, and structural and sequence variants determined in this study will reveal the genetic mechanisms underlying sexual differentiation in plants, and will accelerate disease-resistance breeding in garden asparagus.

  • Article type: Journal Article
    The recent emergence of "third-generation" sequencing platforms which address shortcomings of standard short reads has allowed the resolution of complex genomic regions during genome assembly. However, sequencing costs for third-generation platforms continue to be high. Novel approaches that leverage the low cost of short-read sequencing while capturing long-range information have been developed. In this chapter, we focus on one such approach, the 10x Genomics' Chromium system. We demonstrate the assembly of the B73 maize reference genome using the Supernova assembler. We also offer suggestions on how one might improve the resulting assembly through analysis of assembly metrics.

  • Article type: Journal Article
    The Puma lineage within the family Felidae consists of 3 species that last shared a common ancestor around 4.9 million years ago. Whole-genome sequences of 2 species from the lineage were previously reported: the cheetah (Acinonyx jubatus) and the mountain lion (Puma concolor). The present report describes a whole-genome assembly of the remaining species, the jaguarundi (Puma yagouaroundi). We sequenced the genome of a male jaguarundi with 10X Genomics linked reads and assembled the whole-genome sequence. The assembled genome contains a series of scaffolds that reach the length of chromosome arms and is similar in scaffold contiguity to the genome assemblies of cheetah and puma, with a contig N50 = 100.2 kbp and a scaffold N50 = 49.27 Mbp. We assessed the assembled sequence of the jaguarundi genome using BUSCO, aligned reads of the sequenced individual and another published female jaguarundi to the assembled genome, annotated protein-coding genes, repeats, genomic variants and their effects with respect to the protein-coding genes, and analyzed differences of the 2 jaguarundis from the reference mitochondrial genome. The jaguarundi genome assembly and its annotation were compared in quality, variants, and features to the previously reported genome assemblies of puma and cheetah. Computational analyses used in the study were implemented in a transparent and reproducible way to allow their further reuse and modification.
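
As a point of reference for the contig and scaffold N50 figures quoted above: N50 is conventionally the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below is a minimal illustration in Python, not code from the study; the function name `n50` and the toy lengths are assumptions made for the example.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest, accumulate them, and return
    the length at which the running total first reaches half of the
    total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(toy_scaffolds))  # prints 70: 80 + 70 = 150 covers half of 290
```

A larger N50 means more of the assembly sits in long sequences, which is why the abstract reports it as a contiguity measure.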

  • Article type: Journal Article
    Chiococca alba (L.) Hitchc. (snowberry), a member of the Rubiaceae, has been used as a folk remedy for a range of health issues including inflammation and rheumatism and produces a wealth of specialized metabolites including terpenes, alkaloids, and flavonoids. We generated a 558 Mb draft genome assembly for snowberry which encodes 28,707 high-confidence genes. Comparative analyses with other angiosperm genomes revealed enrichment in snowberry of lineage-specific genes involved in specialized metabolism. Synteny between snowberry and Coffea canephora Pierre ex A. Froehner (coffee) was evident, including the chromosomal region encoding caffeine biosynthesis in coffee, albeit syntelogs of N-methyltransferase were absent in snowberry. A total of 27 putative terpene synthase genes were identified, including 10 that encode diterpene synthases. Functional validation of a subset of putative terpene synthases revealed that combinations of diterpene synthases yielded access to products of both general and specialized metabolism. Specifically, we identified plausible intermediates in the biosynthesis of merilactone and ribenone, structurally unique antimicrobial diterpene natural products. Access to the C. alba genome will enable additional characterization of biosynthetic pathways responsible for health-promoting compounds in this medicinal species.

  • Article type: Journal Article
    The sweet cherry (Prunus avium) is one of the most economically important fruit species in the world. However, there is a limited amount of genetic information available for this species, which hinders breeding efforts at a molecular level. We were able to describe a high-quality reference genome assembly and annotation of the diploid sweet cherry (2n = 2x = 16) cv. Tieton using linked-read sequencing technology. We generated over 750 million clean reads, representing 112.63 GB of raw sequencing data. The Supernova assembler produced a more highly-ordered and continuous genome sequence than the current P. avium draft genome, with a contig N50 of 63.65 KB and a scaffold N50 of 2.48 MB. The final scaffold assembly was 280.33 MB in length, representing 82.12% of the estimated Tieton genome. Eight chromosome-scale pseudomolecules were constructed, completing a 214 MB sequence of the final scaffold assembly. De novo, homology-based, and RNA-seq methods were used together to predict 30,975 protein-coding loci. 98.39% of core eukaryotic genes and 97.43% of single copy orthologues were identified in the embryo plant, indicating the completeness of the assembly. Linked-read sequencing technology was effective in constructing a high-quality reference genome of the sweet cherry, which will benefit the molecular breeding and cultivar identification in this species.

  • Article type: Case Reports
    Populations of closely related microbial strains can be simultaneously present in bacterial communities such as the human gut microbiome. We recently developed a de novo genome assembly approach that uses read cloud sequencing to provide more complete microbial genome drafts, enabling precise differentiation and tracking of strain-level dynamics across metagenomic samples. In this case study, we present a proof-of-concept using read cloud sequencing to describe bacterial strain diversity in the gut microbiome of one hematopoietic cell transplantation patient over a 2-month time course and highlight temporal strain variation of gut microbes during therapy. The treatment was accompanied by diet changes and administration of multiple immunosuppressants and antimicrobials.
    We conducted short-read and read cloud metagenomic sequencing of DNA extracted from four longitudinal stool samples collected during the course of treatment of one hematopoietic cell transplantation (HCT) patient. After applying read cloud metagenomic assembly to discover strain-level sequence variants in these complex microbiome samples, we performed metatranscriptomic analysis to investigate differential expression of antibiotic resistance genes. Finally, we validated predictions from the genomic and metatranscriptomic findings through in vitro antibiotic susceptibility testing and whole genome sequencing of isolates derived from the patient stool samples.
    During the 56-day longitudinal time course that was studied, the patient's microbiome was profoundly disrupted and eventually dominated by Bacteroides caccae. Comparative analysis of B. caccae genomes obtained using read cloud sequencing together with metagenomic RNA sequencing allowed us to identify differences in substrain populations over time. Based on this, we predicted that particular mobile element integrations likely resulted in increased antibiotic resistance, which we further supported using in vitro antibiotic susceptibility testing.
    We find read cloud assembly to be useful in identifying key structural genomic strain variants within a metagenomic sample. These strains have fluctuating relative abundance over relatively short time periods in human microbiomes. We also find specific structural genomic variations that are associated with increased antibiotic resistance over the course of clinical treatment.

  • Article type: Journal Article
    BACKGROUND: The olive fruit fly, Bactrocera oleae, is the most important pest in the olive fruit agribusiness industry. This is because female flies lay their eggs in the unripe fruits and upon hatching the larvae feed on the fruits thus destroying them. The lack of a high-quality genome and other genomic and transcriptomic data has hindered progress in understanding the fly's biology and proposing alternative control methods to pesticide use.
    RESULTS: Genomic DNA was sequenced from male and female Demokritos strain flies, maintained in the laboratory for over 45 years. We used short-, mate-pair-, and long-read sequencing technologies to generate a combined male-female genome assembly (GenBank accession GCA_001188975.2). Genomic DNA sequencing from male insects using 10x Genomics linked-reads technology followed by mate-pair and long-read scaffolding and gap-closing generated a highly contiguous 489 Mb genome with a scaffold N50 of 4.69 Mb and L50 of 30 scaffolds (GenBank accession GCA_001188975.4). RNA-seq data generated from 12 tissues and/or developmental stages allowed for genome annotation. Short reads from both males and females and the chromosome quotient method enabled identification of Y-chromosome scaffolds which were extensively validated by PCR.
    CONCLUSIONS: The high-quality genome generated represents a critical tool in olive fruit fly research. We provide an extensive RNA-seq data set, and genome annotation, critical towards gaining an insight into the biology of the olive fruit fly. In addition, elucidation of Y-chromosome sequences will advance our understanding of the Y-chromosome's organization, function and evolution and is poised to provide avenues for sterile insect technique approaches.
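
The chromosome quotient (CQ) idea mentioned in the results can be pictured roughly as follows: for each scaffold, divide the number of female read alignments by the number of male read alignments, and treat scaffolds with a quotient near zero as candidates for Y linkage. This is a simplified sketch and not the authors' pipeline. The function name, the 0.3 cutoff, the minimum male count, and the toy counts are all hypothetical, and the counts are assumed to be already normalized for sequencing depth.

```python
def candidate_y_scaffolds(counts, max_cq=0.3, min_male=20):
    """Return scaffold names whose female/male alignment ratio is low.

    counts   : dict mapping scaffold name -> (female_alignments, male_alignments)
    max_cq   : hypothetical upper bound on the quotient for a Y candidate
    min_male : hypothetical minimum male alignments, to skip scaffolds
               with too little data for a meaningful ratio
    """
    hits = []
    for name, (female, male) in counts.items():
        if male < min_male:
            continue  # too few male alignments to judge
        quotient = female / male
        if quotient <= max_cq:
            hits.append(name)
    return sorted(hits)


# Invented alignment counts, for illustration only.
toy_counts = {
    "scaffold_1": (95, 100),  # similar in both sexes: autosomal-like
    "scaffold_2": (2, 80),    # almost male-only: Y candidate
    "scaffold_3": (0, 5),     # too little data to call
}
print(candidate_y_scaffolds(toy_counts))  # prints ['scaffold_2']
```

Candidates found this way would still need independent confirmation, which the abstract reports was done by PCR.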

  • Article type: Journal Article
    BACKGROUND: Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity. As a result, assembly errors are common. In the absence of a reference genome, these misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembled sequence. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint to address this gap.
    RESULTS: To demonstrate the effectiveness of Tigmint, we applied it to assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216 (27%). While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies. We demonstrate the utility of Tigmint in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing.
    CONCLUSIONS: Scaffolding an assembly that has been corrected with Tigmint yields a final assembly that is both more correct and substantially more contiguous than an assembly that has not been corrected. Using single-molecule sequencing in combination with linked reads enables a genome sequence assembly that achieves both a high sequence contiguity as well as high scaffold contiguity, a feat not currently achievable with either technology alone.
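
One way to picture how linked-read molecules can expose a misassembly is to count, at regular positions along a contig, how many molecules span each position. A position spanned by few or no molecules is a candidate breakpoint. The Python sketch below illustrates only that general idea and is not Tigmint's implementation. The function name, the (start, end) molecule representation, and the `step` and `min_spanning` parameters are assumptions for the example.

```python
def low_support_positions(molecules, contig_length, step=1000, min_spanning=2):
    """Return positions along a contig that few molecules span.

    molecules     : list of (start, end) extents of inferred molecules,
                    in contig coordinates
    contig_length : length of the contig
    step          : spacing between the positions that are checked
    min_spanning  : positions spanned by fewer molecules are flagged
    """
    flagged = []
    for pos in range(step, contig_length, step):
        # A molecule spans a position if it starts before and ends after it.
        spanning = sum(1 for start, end in molecules if start < pos < end)
        if spanning < min_spanning:
            flagged.append(pos)
    return flagged


# Invented molecule extents on a 10 kbp contig: two molecules cover the
# left half, two cover the right half, and none bridge the middle.
toy_molecules = [(0, 4500), (200, 4800), (5200, 9900), (5500, 10000)]
print(low_support_positions(toy_molecules, 10000))  # prints [5000]
```

In this toy case the unsupported position at 5000 suggests the two halves may have been joined in error, so a corrector could cut the contig there before scaffolding.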

  • Article type: Journal Article
    The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time.
    Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~ 10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13).
    ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performances on experimental human genome datasets. Furthermore, the novel distance estimator in ARKS utilizes barcoding information from linked reads to estimate gap sizes. It accomplishes this by modeling the relationship between known distances of a region within contigs and calculating associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.
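
The Jaccard index mentioned above measures how much two sets overlap: the size of their intersection divided by the size of their union. The following is a minimal sketch, assuming each contig end is represented by the set of barcodes whose k-mers map to it. The barcode sets and variable names are invented for illustration, and this is not the ARKS distance estimator itself.

```python
def jaccard(a, b):
    """Return the Jaccard index |a & b| / |a | b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # define the index of two empty sets as 0
    return len(a & b) / len(union)


# Hypothetical barcode sets observed at three contig ends.
contig1_tail = {"BX01", "BX02", "BX03"}
contig2_head = {"BX02", "BX03", "BX04"}
contig3_head = {"BX09"}

print(jaccard(contig1_tail, contig2_head))  # prints 0.5 (2 shared of 4 total)
print(jaccard(contig1_tail, contig3_head))  # prints 0.0 (no shared barcodes)
```

Contig ends that share many barcodes have a higher index and are more likely to lie close together in the genome, which is the intuition behind relating the index to distance.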