BWA

  • Article type: Journal Article
    Advancements in Next-Generation Sequencing (NGS) have significantly reduced the cost of generating DNA sequence data and increased the speed of data production. However, such high-throughput data production has increased the need for efficient data analysis programs. One of the most computationally demanding steps in analyzing sequencing data is mapping short reads produced by NGS to a reference DNA sequence, such as a human genome. The mapping program BWA-MEM and its newer version BWA-MEM2, optimized for CPUs, are some of the most popular choices for this task. In this study, we discuss the implementation of BWA-MEM on GPUs. This is a challenging task because many algorithms and data structures in BWA-MEM do not execute efficiently on the GPU architecture. This paper identifies major challenges in developing efficient GPU code on all major stages of the BWA-MEM program, including seeding, seed chaining, Smith-Waterman alignment, memory management, and I/O handling. We conduct comparison experiments against BWA-MEM and BWA-MEM2 running on a 64-thread CPU. The results show that our implementation achieved up to 3.2x speedup over BWA-MEM2 and up to 5.8x over BWA-MEM when using an NVIDIA A40. Using an NVIDIA A6000 and an NVIDIA A100, we achieved a wall-time speedup of up to 3.4x/3.8x over BWA-MEM2 and up to 6.1x/6.8x over BWA-MEM, respectively. In stage-wise comparison, the A40/A6000/A100 GPUs respectively achieved up to 3.7/3.8/4x, 2/2.3/2.5x, and 3.1/5/7.9x speedup on the three major stages of BWA-MEM: seeding and seed chaining, Smith-Waterman, and making SAM output. To the best of our knowledge, this is the first study that attempts to implement the entire BWA-MEM program on GPUs.
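The Smith-Waterman stage named in the abstract above can be illustrated with a minimal scoring sketch. This is not the paper's GPU code: it runs on the CPU, uses a linear gap penalty rather than the affine penalties BWA-MEM applies, and the function name and default scores are assumptions chosen for illustration.

```python
def smith_waterman_score(query: str, target: str,
                         match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score (linear gap penalty, illustrative defaults)."""
    # prev holds the previous row of the dynamic-programming matrix;
    # row 0 and column 0 are all zeros by definition.
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)  # align q with t
            up = prev[j] + gap                                    # gap in target
            left = curr[j - 1] + gap                              # gap in query
            cell = max(0, diag, up, left)  # clamping at 0 makes the alignment local
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Each cell depends only on its upper, left, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. Parallel implementations commonly exploit this property.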

  • Article type: Journal Article
    Coupled with the reduction in sequencing costs, the number of RAD-seq analyses has been surging, generating vast genetic knowledge about many crops. Despite growing interest in using datasets obtained by high-throughput sequencing, specialized platforms can be intimidating to non-expert users and difficult to set up on every computer. Therefore, RAD-R scripts were developed on Windows 10 for RAD-seq analysis, allowing users who are not familiar with bioinformatics to easily analyze large sequence datasets. These RAD-R scripts run a workflow from the raw sequence reads of an F2 population of self-fertilizing plants through linkage map construction and QTL analysis, and can also be useful to users with limited experience because running them is as simple as copying Excel cells into the R console. In a comparison of linkage maps constructed by RAD-R scripts and Stacks, the RAD-R scripts constructed a linkage map with less missing genotype data and a shorter total genetic distance. QTL analysis results can be easily obtained by selecting, from the genotype data files created by the RAD-R scripts, the reliable genotype data that is visually judged to be appropriate for error correction.

  • Article type: Journal Article
    The highly challenging hexaploid wheat (Triticum aestivum) genome is becoming ever more accessible due to the continued development of multiple reference genomes, a factor which aids the effort to better understand variation in important traits. Although the process of variant calling is relatively straightforward, selecting the best combination of computational tools for the read alignment and variant calling stages of the analysis, and efficiently filtering false variant calls, are not always easy tasks. Previous studies have analyzed the impact of methods on quality metrics in diploid organisms. Given that variant identification in wheat largely relies on accurate mining of exome data, there is a critical need to better understand how different methods affect the analysis of whole exome sequencing (WES) data in polyploid species. This study aims to address this by performing whole exome sequencing of 48 wheat cultivars and assessing the performance of various variant calling pipelines at their suggested settings. The results show that all the pipelines require filtering to eliminate false-positive calls. The high consensus among the reference SNPs called by the best-performing pipelines suggests that filtering provides accurate and reproducible results. This study also provides detailed comparisons for high sensitivity and precision at individual and population levels for the raw and filtered SNP calls.

  • Article type: Journal Article
    As sequencing technologies have evolved, the tools to analyze these sequences have made similar advances. However, for multi-species samples, we observed important and adverse differences in alignment specificity and computation time for bwa-mem (Burrows-Wheeler aligner-maximum exact matches) relative to bwa-aln. Therefore, we sought to optimize bwa-mem for alignment of data from multi-species samples in order to reduce alignment time and increase the specificity of alignments. In the multi-species cases examined, there was one majority member (i.e. Plasmodium falciparum or Brugia malayi) and one minority member (i.e. human or the Wolbachia endosymbiont wBm) of the sequence data. Increasing the bwa-mem seed length from the default value reduced the number of read pairs from the majority sequence member that incorrectly aligned to the reference genome of the minority sequence member. Combining both source genomes into a single reference genome increased the specificity of mapping, while also reducing the central processing unit (CPU) time. In Plasmodium, at a seed length of 18 nt, 24.1 % of reads mapped to the human genome using 1.7±0.1 CPU hours, while 83.6 % of reads mapped to the Plasmodium genome using 0.2±0.0 CPU hours (total: 107.7 % reads mapping; in 1.9±0.1 CPU hours). In contrast, 97.1 % of the reads mapped to a combined Plasmodium-human reference in only 0.7±0.0 CPU hours. Overall, the results suggest that combining all references into a single reference database and using a 23 nt seed length reduces the computational time, while maximizing specificity. Similar results were found for simulated sequence reads from a mock metagenomic data set. We found similar improvements to computation time in a publicly available human-only data set.
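The recommendation in the abstract above (one combined reference, 23 nt seed length) could be set up roughly as follows. This is a sketch under stated assumptions, not the authors' scripts: the helper names, file paths, and thread count are hypothetical. The only bwa features relied on are the `bwa index` subcommand and the `-k` (minimum seed length) and `-t` (threads) options of `bwa mem`. The commands are built as argument lists and are not executed here.

```python
from typing import List, Sequence


def combine_fastas(fasta_paths: Sequence[str], combined_path: str) -> None:
    """Concatenate several FASTA files into one combined reference file."""
    with open(combined_path, "wb") as out:
        for path in fasta_paths:
            last = b"\n"
            with open(path, "rb") as src:
                # Copy in 1 MiB chunks so large genomes are not loaded into memory.
                while chunk := src.read(1 << 20):
                    out.write(chunk)
                    last = chunk[-1:]
            # Make sure the next file's header starts on its own line.
            if last != b"\n":
                out.write(b"\n")


def bwa_commands(combined_path: str, reads_1: str, reads_2: str,
                 seed_length: int = 23, threads: int = 4) -> List[List[str]]:
    """Build (but do not run) the index and alignment commands."""
    return [
        ["bwa", "index", combined_path],
        # bwa mem writes SAM to standard output; the caller decides where it goes.
        ["bwa", "mem", "-k", str(seed_length), "-t", str(threads),
         combined_path, reads_1, reads_2],
    ]
```

Each returned list can be passed to `subprocess.run`, with the standard output of the `bwa mem` call redirected to a SAM file.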

  • Article type: Journal Article
    Bioinformatic analysis of genomic sequencing data to identify somatic mutations in cancer samples is far from achieving the required robustness and standardisation. In this study we generated a whole exome sequencing benchmark dataset using the platinum genome sample NA12878 and developed an intersect-then-combine (ITC) approach to increase the accuracy in calling single nucleotide variants (SNVs) and indels in tumour-normal pairs. We evaluated the effect of alignment, base quality recalibration, mutation caller and filtering on sensitivity and false positive rate. The ITC approach increased the sensitivity up to 17.1%, without increasing the false positive rate per megabase (FPR/Mb) and its validity was confirmed in a set of clinical samples.

  • Article type: Journal Article
    Photo-activatable ribonucleoside cross-linking and immunoprecipitation (PAR-CLIP) is a method to detect binding sites of RNA-binding proteins (RBPs) transcriptome-wide. This chapter covers the computational analysis of the high-throughput sequencing reads generated from PAR-CLIP experiments. It explains how the reads are mutated due to UV cross-linking and how to appropriately pre-process and align them to a reference sequence. Aligned reads are then aggregated into clusters which represent putative RBP-binding sites. Mapping artifacts are a source of false positives, which can be controlled by means of a mapping decoy and adaptive quality filtering of the read clusters. A step-by-step explanation of this procedure is given. All necessary tools are open source, including the scripts presented and used in this chapter.

  • Article type: Journal Article
    Among the potential biological agents suitable as a weapon, Ebola virus represents a major concern. Classified by the CDC as a category A biological agent, Ebola virus causes severe hemorrhagic fever characterized by a high case-fatality rate; to date, no vaccine or approved therapy is available. The EVD epidemic, which broke out in West Africa in late 2013, has once again brought to the fore the possible use of Ebola virus as a biological warfare agent (BWA). Indeed, because of its high case-fatality rate, the public currently perceives this pathogen as a real and tangible threat. Therefore, its use as a biological agent by terrorist groups for offensive purposes could have serious repercussions at the psychosocial level as well as at the strictly health-related level. In this paper, after an initial study of the main characteristics of Ebola virus, its potential as a BWA was evaluated. Furthermore, given the spread of the epidemic in West Africa in 2014 and 2015, the potential dissemination of the virus from an urban setting was evaluated. Finally, the actual possibility of using this agent as a BWA in different scenarios, and the potential effects on the stability of one or more nations, were considered.

  • Article type: Journal Article
    Sequence alignment, which maps raw sequencing data to a reference genome, is the central process of sequence analysis. The large amount of data generated by NGS is far beyond the processing capabilities of existing alignment tools. Consequently, sequence alignment becomes the bottleneck of sequence analysis, and intensive computing power is required to address this challenge. Intel recently announced the MIC coprocessor, which can provide massive computing power. The Tianhe-2, now the world's fastest supercomputer, is equipped with three MIC coprocessors per compute node. A key feature of sequence alignment is that different reads are independent. Considering this property, we proposed a MIC-oriented three-level parallelization strategy to speed up BWA, a widely used sequence alignment tool, and developed our ultrafast parallel sequence aligner: B-MIC. B-MIC contains three levels of parallelization: firstly, parallelization of data I/O and read alignment by a three-stage parallel pipeline; secondly, parallelization enabled by MIC coprocessor technology; thirdly, inter-node parallelization implemented by MPI. In this paper, we demonstrate that B-MIC outperforms BWA through a combination of these techniques, using an Inspur NF5280M server and the Tianhe-2 supercomputer. To the best of our knowledge, B-MIC is the first sequence alignment tool to run on Intel MIC, and it can achieve more than fivefold speedup over the original BWA while maintaining the alignment precision.
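The first level of parallelization described in the abstract above, a three-stage pipeline that overlaps I/O with alignment, might look roughly like the sketch below. It assumes the three stages are read, align, and write, which the abstract does not spell out. The function name, batch size, and queue sizes are hypothetical, and the MIC offload and MPI levels are not modelled. Error handling is omitted for brevity.

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()  # sentinel that tells the next stage no more batches will arrive


def run_pipeline(reads: Iterable[str], align: Callable[[str], str],
                 batch_size: int = 2) -> List[str]:
    """Run reader, aligner, and writer stages concurrently, connected by queues."""
    to_align: queue.Queue = queue.Queue(maxsize=4)  # bounded: reader waits if aligner lags
    to_write: queue.Queue = queue.Queue(maxsize=4)
    output: List[str] = []

    def reader() -> None:
        batch: List[str] = []
        for read in reads:
            batch.append(read)
            if len(batch) == batch_size:
                to_align.put(batch)
                batch = []
        if batch:
            to_align.put(batch)
        to_align.put(_DONE)

    def aligner() -> None:
        while True:
            batch = to_align.get()
            if batch is _DONE:
                to_write.put(_DONE)
                return
            # Reads are independent, so each one is aligned on its own.
            to_write.put([align(read) for read in batch])

    def writer() -> None:
        while True:
            batch = to_write.get()
            if batch is _DONE:
                return
            output.extend(batch)

    stages = [threading.Thread(target=stage) for stage in (reader, aligner, writer)]
    for thread in stages:
        thread.start()
    for thread in stages:
        thread.join()
    return output
```

With a single aligner stage the output order matches the input order. A call such as `run_pipeline(["r1", "r2", "r3"], str.upper)` uses `str.upper` as a stand-in for a real alignment function.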

  • Article type: Case Reports
    Type II citrullinaemia, also known as citrin deficiency, is an autosomal recessive metabolic disorder caused by pathogenic mutations in the SLC25A13 gene on chromosome 7q21.3. One of the clinical manifestations of type II citrullinaemia is neonatal intrahepatic cholestatic hepatitis caused by citrin deficiency (NICCD, OMIM# 605814). In this study, a 5-month-old female Chinese infant diagnosed with type II citrullinaemia was examined. The diagnosis was based on biochemical and clinical findings, including organic acid profiling using gas chromatography-mass spectrometry (GC/MS), and the patient's parents were unaffected. Approximately 14 kb of the exon sequences of SLC25A13 and two related genes (ASS1 and FAH) from the proband and 100 case-unrelated controls were captured by an array-based capture method followed by high-throughput next-generation sequencing. Two single-nucleotide mutations were detected in the proband: the previously reported c.1177+1G>A mutation and a novel c.754 G>A mutation in the SLC25A13 gene. Sanger sequencing results showed that the patient was a compound heterozygote for the two mutations. The novel mutation (c.754 G>A), which is predicted to affect the normal structure and function of citrin, is a candidate pathogenic mutation. Target sequence capture combined with high-throughput next-generation sequencing is shown to be an effective method for molecular genetic testing of type II citrullinaemia.

  • Article type: Journal Article
    Macrophages, as phagocytes and professional antigen-presenting cells, play critical roles in both innate and adaptive immunity. The main transcription factors acting during their differentiation and function are known, but the behavior and co-operation of these factors remain unexplored. We introduce a new approach to map nucleosome-free regions using exclusively ChIP-seq data for histone modifications that mark active enhancers and core promoters. We could detect approximately 56,000 potential active enhancers/promoters showing different lengths and histone modification shapes. Besides the highly enriched PU.1 and C/EBP sites, we could also detect binding sites for RUNX and AP-1, as well as for the MiT (MITF-TFE) family and MEF2 proteins. The PU.1 and C/EBP transcription factors are known for transforming cells into macrophages. The other transcription factors found in this study can play a role in macrophages as well, since it is known that the MiT family proteins are responsible for phagocytic activity and the MEF2 proteins specify monocytic differentiation over the granulocyte direction. Our results imply that this method can provide novel information about transcription factor organization at enhancers and core promoters, as well as about the histone modifications surrounding regulatory regions, in any immune or other cell type.