k-mers

  • Article type: Journal Article
    BACKGROUND: Metagenomic binning, the clustering of assembled contigs that belong to the same genome, is a crucial step for recovering metagenome-assembled genomes (MAGs). Contigs are linked by exploiting consistent signatures along a genome, such as read coverage patterns. Using coverage from multiple samples leads to higher-quality MAGs; however, standard pipelines require all-to-all read alignments for multiple samples to compute coverage, becoming a key computational bottleneck.
    RESULTS: We present fairy (https://github.com/bluenote-1577/fairy), an approximate coverage calculation method for metagenomic binning. Fairy is a fast k-mer-based alignment-free method. For multi-sample binning, fairy can be >250× faster than read alignment and accurate enough for binning. Fairy is compatible with several existing binners on host and non-host-associated datasets. Using MetaBAT2, fairy recovers 98.5% of MAGs with >50% completeness and <5% contamination relative to alignment with BWA. Notably, multi-sample binning with fairy is always better than single-sample binning using BWA (>1.5× more >50% complete MAGs on average) while still being faster. For a public sediment metagenome project, we demonstrate that multi-sample binning recovers higher quality Asgard archaea MAGs than single-sample binning and that fairy's results are indistinguishable from read alignment.
    CONCLUSIONS: Fairy is a new tool for approximately and quickly calculating multi-sample coverage for binning, resolving a computational bottleneck for metagenomics.
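
    The idea of alignment-free coverage can be illustrated with a minimal sketch. This is not fairy's actual algorithm (the abstract does not describe its internals); it only assumes that a contig's coverage in a sample can be approximated by the median number of times the contig's k-mers occur in that sample's reads. The function names and the tiny value of k are hypothetical choices for illustration.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for simplicity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def approx_coverage(contig, reads, k=5):
    """Approximate contig coverage as the median read count of its k-mers.

    Illustrative assumption only: no reverse complements, no subsampling,
    no error handling, and no correction for read length.
    """
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    contig_counts = [read_counts[km] for km in kmers(contig, k)]
    return median(contig_counts) if contig_counts else 0

contig = "ACGTTGCATGGA"
reads = ["ACGTTGCA", "GTTGCATG", "TGCATGGA", "ACGTTGCA"]
print(approx_coverage(contig, reads, k=5))
```

    Running the same function once per sample gives one coverage value per contig per sample, which is the kind of multi-sample coverage table a binner consumes, without aligning any read.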

  • Article type: Journal Article
    The use of k-mers to capture genetic variation in bacterial genome-wide association studies (bGWAS) has demonstrated its effectiveness in overcoming the plasticity of bacterial genomes by providing a comprehensive array of genetic variants in a genome set that is not confined to a single reference genome. However, little attempt has been made to interpret k-mers in the context of genome rearrangements, partly due to challenges in the exhaustive and high-throughput identification of genome structure and individual rearrangement events. Here, we present GWarrange, a pre- and post-bGWAS processing methodology that leverages the unique properties of k-mers to facilitate bGWAS for genome rearrangements. Repeat sequences are common instigators of genome rearrangements through intragenomic homologous recombination, and they are commonly found at rearrangement boundaries. Using whole-genome sequences, repeat sequences are replaced by short placeholder sequences, allowing the regions flanking repeats to be incorporated into relatively short k-mers. Then, locations of flanking regions in significant k-mers are mapped back to complete genome sequences to visualise genome rearrangements. Four case studies based on two bacterial species (Bordetella pertussis and Enterococcus faecium) and a simulated genome set are presented to demonstrate the ability to identify phenotype-associated rearrangements. GWarrange is available at https://github.com/DorothyTamYiLing/GWarrange.
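
    The placeholder step can be sketched as follows. This is a toy illustration, not GWarrange's implementation: the repeat sequence, the `NNN` placeholder, and the function names are all hypothetical, and real repeats would first have to be detected. The point is only that, once a long repeat is collapsed, a short k-mer can span both of its flanks, so two genome arrangements yield different k-mers.

```python
def collapse_repeat(genome, repeat, placeholder="NNN"):
    """Replace every exact copy of `repeat` with a short placeholder."""
    return genome.replace(repeat, placeholder)

def kmer_set(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

repeat = "GATTACA" * 3  # hypothetical 21-bp repeat element

# Two arrangements: the segment between the repeat copies differs in orientation.
genome_a = "AACC" + repeat + "GGTT" + repeat + "CCAA"
genome_b = "AACC" + repeat + "AACC" + repeat + "CCAA"

k = 11  # shorter than the repeat, but long enough to span flank + placeholder + flank
kmers_a = kmer_set(collapse_repeat(genome_a, repeat), k)
kmers_b = kmer_set(collapse_repeat(genome_b, repeat), k)

# k-mers that span a placeholder and occur in only one arrangement
print(sorted(km for km in kmers_a - kmers_b if "NNN" in km))
```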

  • Article type: Journal Article
    The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
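
    As a small illustration of the two notions mentioned in this abstract, the sketch below enumerates the k-mers observed in a sequence and the nullomers, that is, the k-mers over the alphabet that never occur in it. The function names are hypothetical, and exhaustive enumeration like this is only practical for small k.

```python
from itertools import product

def observed_kmers(seq, k):
    """Return the set of k-mers that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def nullomers(seq, k, alphabet="ACGT"):
    """Return, in sorted order, all k-mers over `alphabet` absent from seq."""
    seen = observed_kmers(seq, k)
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(km for km in candidates if km not in seen)

seq = "ACGTACGT"
print(sorted(observed_kmers(seq, 2)))
print(len(nullomers(seq, 2)))
```

    Passing the 20-letter amino-acid alphabet instead of `"ACGT"` gives the analogous absent peptides for a protein sequence.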

  • Article type: Journal Article
    Antimicrobial peptides (AMPs) are promising candidates for new antibiotics due to their broad-spectrum activity against pathogens and reduced susceptibility to resistance development. Deep-learning techniques, such as deep generative models, offer a promising avenue to expedite the discovery and optimization of AMPs. A remarkable example is the Feedback Generative Adversarial Network (FBGAN), a deep generative model that incorporates a classifier during its training phase. Our study aims to explore the impact of enhanced classifiers on the generative capabilities of FBGAN. To this end, we introduce two alternative classifiers for the FBGAN framework, both surpassing the accuracy of the original classifier. The first classifier utilizes the k-mers technique, while the second applies transfer learning from the large protein language model Evolutionary Scale Modeling 2 (ESM2). Integrating these classifiers into FBGAN not only yields notable performance enhancements compared to the original FBGAN but also enables the proposed generative models to achieve comparable or even superior performance to established methods such as AMPGAN and HydrAMP. This achievement underscores the effectiveness of leveraging advanced classifiers within the FBGAN framework, enhancing its computational robustness for AMP de novo design and making it comparable to existing literature.

  • Article type: Journal Article
    Traditional alignment-based methods face serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then, for each DNA or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model, and uses these vectors to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences from 8 virus datasets to construct phylogenetic trees. For each dataset, we compared the Robinson-Foulds (RF) distance between the reference tree and the phylogenetic trees constructed by CGRWDL and by other advanced methods. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and that the RF scores between these trees and the reference trees are smaller than those obtained with other methods.
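
    The CGR half of this approach can be sketched as a frequency chaos game representation (FCGR): each DNA k-mer is mapped to one cell of a 2^k × 2^k grid, and the cell counts form a feature vector. This is a minimal sketch under stated assumptions, not CGRWDL itself: the corner assignment (A, C, G, T) is one common convention, the DL-model weighting is omitted, and the function name is hypothetical.

```python
# One common corner convention (an assumption, not taken from the abstract).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr(seq, k):
    """Count k-mers of seq on a 2^k x 2^k chaos-game grid.

    Each nucleotide contributes one bit to the column (x) and one bit to
    the row (y); later positions in the k-mer select larger quadrants.
    k-mers containing characters other than A, C, G, T are skipped.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(c not in CORNERS for c in kmer):
            continue
        x = y = 0
        for pos, c in enumerate(kmer):
            cx, cy = CORNERS[c]
            x |= cx << pos
            y |= cy << pos
        grid[y][x] += 1
    return grid

grid = fcgr("ACGTACGTTG", 2)
feature_vector = [count for row in grid for count in row]  # flatten to 16 values
print(feature_vector)
```

    Feature vectors built this way for each sequence could then be compared with any vector distance to build a distance matrix for tree construction.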

  • Article type: Journal Article
    BACKGROUND: KaMRaT is designed for processing large k-mer count tables derived from multi-sample RNA-seq data. Its primary objective is to identify condition-specific or differentially expressed sequences, regardless of gene or transcript annotation.
    RESULTS: KaMRaT is implemented in C++. Major functions include scoring k-mers based on count statistics, merging overlapping k-mers into contigs, and selecting k-mers based on their occurrence across specific samples.
    AVAILABILITY: Source code and documentation are available via https://github.com/Transipedia/KaMRaT.
    SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

  • Article type: Journal Article
    The problem of sequence identification or matching, that is, determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence, is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto, the strongest competitor in terms of the index space vs. query time trade-off, Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6× faster to construct.
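
    The combination of a k-mer dictionary with an inverted index of colors can be illustrated with a toy sketch. It is not Fulgor's data structure: there are no unitigs, no compression, and no succinct encoding, and all names are hypothetical. It only shows the two-level mapping, where each k-mer points to a color identifier and each distinct color (set of references) is stored once.

```python
def build_index(references, k):
    """Build a toy colored k-mer index.

    Returns (kmer_to_color, colors): a dict from k-mer to color id, and a
    list where colors[id] is the sorted tuple of reference ids sharing it.
    """
    kmer_to_refs = {}
    for ref_id, seq in enumerate(references):
        for i in range(len(seq) - k + 1):
            kmer_to_refs.setdefault(seq[i:i + k], set()).add(ref_id)

    color_ids = {}      # frozenset of reference ids -> color id
    colors = []         # color id -> sorted tuple of reference ids
    kmer_to_color = {}  # k-mer -> color id
    for kmer, refs in kmer_to_refs.items():
        key = frozenset(refs)
        if key not in color_ids:
            color_ids[key] = len(colors)
            colors.append(tuple(sorted(refs)))
        kmer_to_color[kmer] = color_ids[key]
    return kmer_to_color, colors

def query(seq, k, kmer_to_color, colors):
    """Return the references that contain every k-mer of the query."""
    result = None
    for i in range(len(seq) - k + 1):
        color_id = kmer_to_color.get(seq[i:i + k])
        refs = set(colors[color_id]) if color_id is not None else set()
        result = refs if result is None else result & refs
    return sorted(result or [])

references = ["ACGTAC", "CGTACG", "TTTTTT"]
kmer_to_color, colors = build_index(references, 3)
print(colors)
print(query("CGTA", 3, kmer_to_color, colors))
```

    In this toy, many k-mers share one stored color, which is the redundancy the abstract exploits; requiring every query k-mer to match is a simplification, and real tools typically tolerate some missing k-mers.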

  • Article type: Journal Article
    Early detection of human disease is associated with improved clinical outcomes. However, many diseases are often detected at an advanced, symptomatic stage, when patients are past efficacious treatment periods, which can result in less favorable outcomes. Therefore, methods that can accurately detect human disease at a presymptomatic stage are urgently needed. Here, we introduce "frequentmers": short sequences that are specific and recurrently observed in either patient or healthy control samples, but not in both. We showcase the utility of frequentmers for the detection of liver cirrhosis using metagenomic Next Generation Sequencing data from stool samples of patients and controls. We develop classification models for the detection of liver cirrhosis and achieve an AUC score of 0.91 using ten-fold cross-validation. A small subset of 200 frequentmers can achieve comparable results in detecting liver cirrhosis. Finally, we identify the microbial organisms in liver cirrhosis samples that are associated with the most predictive frequentmer biomarkers.
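
    The definition of frequentmers can be sketched with set operations. This is an illustration under stated assumptions, not the authors' pipeline: each sample is reduced to a single sequence, and the recurrence threshold `min_fraction` is a hypothetical parameter that the abstract does not specify.

```python
from collections import Counter

def kmer_set(seq, k):
    """Return the set of k-mers present in one sample's sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def frequentmers(group, other_group, k, min_fraction=0.5):
    """k-mers seen in at least `min_fraction` of `group` samples
    and in none of the `other_group` samples."""
    if not group:
        return set()
    excluded = set().union(*(kmer_set(s, k) for s in other_group))
    recurrence = Counter(km for s in group for km in kmer_set(s, k))
    return {
        km for km, n in recurrence.items()
        if n / len(group) >= min_fraction and km not in excluded
    }

patients = ["ACGTT", "ACGTA", "GGGAC"]
controls = ["CGTTT", "TTTAC"]
print(frequentmers(patients, controls, 3))  # patient-specific
print(frequentmers(controls, patients, 3))  # control-specific
```

    The presence or absence of each selected k-mer in a new sample could then serve as a binary feature for a classifier.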

  • Article type: Journal Article
    Aim: Comparative metagenomic analysis requires measuring a pairwise similarity between metagenomes in the dataset. Reference-based methods that compute a beta-diversity distance between two metagenomes are highly dependent on the quality and completeness of the reference database, and their application to less studied microbiota can be challenging. On the other hand, de novo comparative metagenomic methods rely only on the sequence composition of metagenomes to compare datasets. While each of these approaches has its strengths and limitations, their comparison is currently limited.
    Methods: We developed sets of simulated short-read metagenomes to (1) compare k-mer-based and taxonomy-based distances and evaluate the impact of technical and biological variables on these metrics and (2) evaluate the effect of k-mer sketching and filtering. We used a real-world metagenomic dataset to provide an overview of the currently available tools for de novo metagenomic comparative analysis.
    Results: Using simulated metagenomes of known composition and controlled error rate, we showed that k-mer-based distance metrics were well correlated with the taxonomic distance metric for quantitative beta-diversity metrics, but the correlation was low for presence/absence distances. The community complexity in terms of taxa richness and the sequencing depth significantly affected the quality of the k-mer-based distances, while the impact of low amounts of sequence contamination and sequencing error was limited. Finally, we benchmarked currently available de novo comparative metagenomic tools, compared their output on two datasets of fecal metagenomes, and showed that most k-mer-based tools were able to recapitulate the data structure observed using taxonomic approaches.
    Conclusion: This study expands our understanding of the strengths and limitations of k-mer-based de novo comparative metagenomic approaches and aims to provide concrete guidelines for researchers interested in applying these approaches to their metagenomic datasets.
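
    The distinction between presence/absence and quantitative k-mer distances can be made concrete with a short sketch. It assumes Jaccard distance as the presence/absence metric and Bray-Curtis dissimilarity as the quantitative one; these are common choices, but the abstract does not name the specific metrics, and the function names are hypothetical. Raw counts are used without depth normalization or sketching.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads of one metagenome."""
    return Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )

def jaccard_distance(a, b):
    """Presence/absence distance: 1 - |A ∩ B| / |A ∪ B| over k-mer sets."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return 1 - len(keys_a & keys_b) / len(union) if union else 0.0

def bray_curtis(a, b):
    """Quantitative dissimilarity: 1 - 2 * sum(min counts) / total counts."""
    total = sum(a.values()) + sum(b.values())
    shared = sum(min(a[km], b[km]) for km in a.keys() & b.keys())
    return 1 - 2 * shared / total if total else 0.0

sample_a = kmer_counts(["ACGT", "ACGT"], 2)
sample_b = kmer_counts(["ACGA"], 2)
print(jaccard_distance(sample_a, sample_b))
print(bray_curtis(sample_a, sample_b))
```

    Because Bray-Curtis uses counts, it is sensitive to differences in sequencing depth between samples, whereas Jaccard ignores abundance entirely; this is one reason the two families of metrics can behave differently.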

  • Article type: Journal Article
    Genome-wide association studies (GWAS) have been widely used to identify genetic variation associated with complex traits. Despite its success and popularity, the traditional GWAS approach comes with a variety of limitations. For this reason, newer methods for GWAS have been developed, including the use of pan-genomes instead of a reference genome and the utilization of markers beyond single-nucleotide polymorphisms, such as structural variations and k-mers. The k-mers-based GWAS approach has especially gained attention from researchers in recent years. However, these new methodologies can be complicated and challenging to implement. Here, we present kGWASflow, a modular, user-friendly, and scalable workflow to perform GWAS using k-mers. We adopted an existing kmersGWAS method into an easier and more accessible workflow using management tools like Snakemake and Conda and eliminated the challenges caused by missing dependencies and version conflicts. kGWASflow increases the reproducibility of the kmersGWAS method by automating each step with Snakemake and using containerization tools like Docker. The workflow encompasses supplemental components such as quality control, read-trimming procedures, and generating summary statistics. kGWASflow also offers post-GWAS analysis options to identify the genomic location and context of trait-associated k-mers. kGWASflow can be applied to any organism and requires minimal programming skills. kGWASflow is freely available on GitHub (https://github.com/akcorut/kGWASflow) and Bioconda (https://anaconda.org/bioconda/kgwasflow).