k-mer

  • Article type: Journal Article
    Enhancers are crucial to the regulation of gene expression: they dictate the specificity and timing of transcriptional activity, so identifying them is important for unravelling the intricacies of genetic regulation. It is therefore critical to identify enhancers and their strengths. Repeated sequences in the genome are repeats of the same or symmetrical fragments, and a great deal of evidence indicates that repetitive sequences contain enormous amounts of genetic information. We therefore introduce the W2V-Repeated Index, designed to identify enhancer sequence fragments and evaluate their strength by analyzing repeated k-mer sequences in enhancer regions. Using the word2vec algorithm for numerical conversion and Manta Ray Foraging Optimization for feature selection, the method effectively captures the frequency and distribution of k-mer sequences. By concentrating on repeated k-mer sequences, it reduces computational complexity and facilitates the analysis of larger k values. Experiments indicate that our method outperforms all other advanced methods on almost all metrics.
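    The idea of restricting attention to repeated k-mers can be illustrated with a minimal sketch. It only counts k-mers and keeps those seen at least twice; the word2vec embedding and Manta Ray Foraging Optimization steps are not shown, and the function name `repeated_kmers` is hypothetical rather than taken from the paper.

```python
from collections import Counter


def repeated_kmers(seq: str, k: int) -> dict[str, int]:
    """Return the k-mers that occur at least twice in seq, with their counts.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    seq = seq.upper()
    # Slide a window of width k over the sequence and count every k-mer.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Keep only k-mers seen more than once.
    return {kmer: n for kmer, n in counts.items() if n >= 2}
```

    For example, `repeated_kmers("ACGTACGT", 4)` returns `{"ACGT": 2}`, because ACGT is the only 4-mer that occurs twice.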

  • Article type: Journal Article
    BACKGROUND: Using next-generation sequencing technologies, scientists can sequence complex microbial communities directly from the environment. The study of metagenomics has yielded significant insights into the structure, diversity, and ecology of microbial communities. A crucial step in metagenomic analysis is the assembly of reads into longer contigs, which are then binned into groups of contigs that correspond to different species in the metagenomic sample. These contigs must be organized into operational taxonomic units (OTUs) for further taxonomic profiling and functional analysis. For binning, which is synonymous with the clustering of OTUs, the tetra-nucleotide frequency (TNF) is typically used as a compositional feature for each OTU.
    RESULTS: In this paper, we present AFIT, a new l-mer statistic vector for each contig, and AFITBin, a novel method for metagenomic binning based on AFIT and a matrix factorization method. To evaluate the performance of the AFIT vector, the t-SNE algorithm is used to compare species clustering based on AFIT and TNF information. In addition, the efficacy of AFITBin is demonstrated on both simulated and real datasets in comparison to state-of-the-art binning methods such as MetaBAT 2, MaxBin 2.0, CONCOCT, MetaCon, SolidBin, BusyBee Web, and MetaBinner. To further analyze the performance of the proposed AFIT vector, we compare the barcodes of the AFIT vector and the TNF vector.
    CONCLUSIONS: The results demonstrate that AFITBin shows superior performance in taxonomic identification compared to existing methods, leveraging the AFIT vector for improved results in metagenomic binning. This approach holds promise for advancing the analysis of metagenomic data, providing more reliable insights into microbial community composition and function.
    AVAILABILITY: A Python package is available at: https://github.com/SayehSobhani/AFITBin.
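    As a point of reference for the TNF baseline mentioned above, the following is a minimal sketch of a tetranucleotide frequency vector for one contig. It is not the AFIT statistic, which the abstract does not define. The names `TETRAMERS`, `INDEX`, and `tnf` are hypothetical, and for simplicity the sketch does not merge reverse-complement tetramers.

```python
from itertools import product

ALPHABET = "ACGT"
# All 4^4 = 256 tetranucleotides in lexicographic order.
TETRAMERS = ["".join(p) for p in product(ALPHABET, repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tnf(contig: str) -> list[float]:
    """Return the 256-dimensional tetranucleotide frequency vector of a contig.

    Windows containing characters outside ACGT (for example N) are skipped.
    """
    counts = [0] * len(TETRAMERS)
    contig = contig.upper()
    for i in range(len(contig) - 3):
        idx = INDEX.get(contig[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    # Normalize to frequencies; an all-zero vector is returned for empty input.
    return [c / total for c in counts] if total else [0.0] * len(TETRAMERS)
```

    Vectors of this kind, one per contig, are what a clustering or matrix factorization step would then take as input.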

  • Article type: Journal Article
    Phylogenetic tree estimation using conventional approaches usually requires pairwise or multiple sequence alignment. However, sequence alignment has difficulties with scalability and accuracy in the case of long sequences such as whole genomes, low sequence identity, and the presence of genomic rearrangements. To address these issues, alignment-free approaches have been proposed. While these methods have demonstrated promising results, many of them produce errors when regions are missing from the sequences of one or more species, a situation that is trivially detected by alignment-based methods. Here, we present an alignment-free method for detecting missing regions in the sequences of species for which a phylogeny is to be estimated. It is based on counts of k-mers and can be used to filter out k-mers belonging to regions in one species that are missing in one or more of the other species. We perform experiments with real and simulated datasets containing missing regions and find that the method can successfully detect a large fraction of such k-mers and can lead to improvements in the estimated phylogenies. Our method can be used in k-mer-based alignment-free phylogeny estimation methods to filter out k-mers corresponding to missing regions.
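    To make the filtering idea concrete, here is a deliberately naive sketch that separates k-mers shared by all species from k-mers absent in at least one species. The abstract does not describe the authors' count-based detection procedure, so this is not their method: it cannot tell a genuinely missing region from ordinary sequence divergence. The names `kmer_set` and `split_kmers` are hypothetical.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_kmers(genomes: dict[str, str], k: int) -> tuple[set[str], dict[str, set[str]]]:
    """Split k-mers into those shared by every genome and those that are not.

    Returns (shared, partial), where partial[name] holds the k-mers of that
    genome that are absent from at least one other genome.
    """
    sets = {name: kmer_set(seq.upper(), k) for name, seq in genomes.items()}
    if not sets:
        return set(), {}
    shared = set.intersection(*sets.values())
    partial = {name: kmers - shared for name, kmers in sets.items()}
    return shared, partial
```

    A downstream distance computation could then be restricted to the `shared` set, which is the effect that filtering out missing-region k-mers aims for.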

  • Article type: Journal Article
    The increasing number and volume of genomic and metagenomic data necessitate scalable and robust computational models for precise analysis. Sketching techniques that use k-mers from a biological sample have proven useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash provided unbiased estimators for the containment and Jaccard indices. However, theoretical investigations of other metrics, such as the cosine similarity, are still lacking.
    In this paper, we present a theoretical framework for estimating cosine similarity from FracMinHash sketches. We establish conditions under which this estimation is sound and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings.
    We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise cosine similarity quickly and accurately on real data. frac-kmc is freely available here: https://github.com/KoslickiLab/frac-kmc/.
    2012 ACM Subject Classification: Applied computing → Computational biology.
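    The sketch below illustrates the general shape of the approach: retain the k-mers whose hash falls below a fraction s of the hash space, then compute cosine similarity on the retained k-mer abundances. Several points are assumptions not confirmed by the abstract: that s is a fraction in (0, 1], that cosine similarity is taken over abundance vectors, and that a plain plug-in estimate on the sketches is used. The BLAKE2b hash and the names `kmer_hash`, `frac_minhash`, and `cosine` are illustrative choices, and this is not the frac-kmc implementation.

```python
import hashlib
import math
from collections import Counter

MAX_HASH = 2 ** 64  # size of the hash space for an 8-byte digest


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to an integer in [0, 2^64) using BLAKE2b (illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq: str, k: int, s: float) -> Counter:
    """Keep k-mers whose hash is below s * MAX_HASH, together with their abundances."""
    if not 0 < s <= 1:
        raise ValueError("s must be in (0, 1]")
    threshold = MAX_HASH * s
    sketch = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer_hash(kmer) < threshold:
            sketch[kmer] += 1
    return sketch


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse abundance vectors; 0.0 if either is empty."""
    dot = sum(n * b[kmer] for kmer, n in a.items() if kmer in b)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

    With s = 1 every k-mer is retained and `cosine` returns the exact similarity; smaller s trades accuracy for sketch size, which is why a minimum scale factor matters.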

  • Article type: Journal Article
    Despite the widespread adoption of k-mer-based methods in bioinformatics, a fundamental question persists: how can we quantify the influence of k sizes in applications? With no universal answer available, choosing an optimal k size or employing multiple k sizes remains application-specific, arbitrary, and computationally expensive. The primary parameter k is typically assessed empirically, based on the end products of applications that pass through complex processes of genome analysis, comparison, assembly, alignment, and error correction. The elusiveness of the problem stems from a limited understanding of how k-mers transition as k changes. Indeed, there is considerable room for improving both practice and theory by exploring k-mer-specific quantities across multiple k sizes. This paper introduces an algorithmic framework built upon a novel substring representation: the Prokrustean graph. The primary functionality of this framework is to extract various k-mer-based quantities across a range of k sizes, yet its computational complexity depends only on maximal repeats, not on the k range. For example, counting maximal unitigs of de Bruijn graphs for k = 10, …, 100 takes just a few seconds with a Prokrustean graph built on a read set of gigabases in size. This efficiency sets the graph apart from other substring indices, such as the FM-index, which are normally optimized for string pattern searching rather than for depicting the substring structure across varying lengths. However, the Prokrustean graph is expected to close this gap, as it can be built using the extended Burrows-Wheeler Transform (eBWT) in a space-efficient manner. The framework is particularly useful in pangenome and metagenome analyses, where the demand for precise multi-k approaches is increasing due to the complex and diverse nature of the information being managed. We introduce four applications implemented with the framework that extract key quantities actively used in modern pangenomics and metagenomics. Code implementing our data structure and algorithms (along with correctness tests) is available at https://github.com/KoslickiLab/prokrustean.
    2012 ACM Subject Classification: Applied computing → Computational biology.
    DOI: 10.4230/LIPIcs.WABI.2024.YY

  • Article type: Journal Article
    Wheat (Triticum aestivum L.) is a staple food for more than 35% of the world's population, with wheat flour used to make hundreds of baked goods. Superior end-use quality is a major breeding target; however, improving it is especially time-consuming and expensive. Furthermore, genes encoding seed-storage proteins (SSPs) form multi-gene families and are repetitive, with gaps commonplace in several genome assemblies. To overcome these barriers and efficiently identify superior wheat SSP alleles, we developed "PanSK" (Pan-SSP k-mer) for genotype-to-phenotype prediction based on an SSP-based pangenome resource. PanSK uses 29-mer sequences that represent each SSP gene at the pangenomic level to reveal untapped diversity across landraces and modern cultivars. Genome-wide association studies with k-mers identified 23 SSP genes associated with end-use quality that represent novel targets for improvement. We evaluated the effect of rye secalin genes on end-use quality and found that removal of ω-secalins from 1BL/1RS wheat translocation lines is associated with enhanced end-use quality. Finally, using machine-learning-based prediction inspired by PanSK, we predicted the quality phenotypes with high accuracy from genotypes alone. This study provides an effective approach for genome design based on SSP genes, enabling the breeding of wheat varieties with superior processing capabilities and improved end-use quality.

  • Article type: Journal Article
    BACKGROUND: The identity of cancerous cells is determined by a mixture of factors, such as genomic variations, epigenetics, and the regulatory variations involved in transcription. Differences in transcriptome expression, as well as abnormal structures in peptides, determine phenotypic differences. Thus, bulk RNA-seq and more recent single-cell RNA-seq (scRNA-seq) data are important for identifying pathogenic differences. Here, we rely on k-mer decomposition of sequences to identify pathogenic variations in detail. This approach does not need a reference, so it outperforms more traditional next-generation sequencing (NGS) analysis techniques that depend on aligning the sequences to a reference.
    RESULTS: Using our alignment-free analysis on esophageal cancer and glioblastoma patients, we could discover high-frequency variations at multiple different locations (repeats, intergenic regions, exons, introns) and in multiple different forms (fusion, polyadenylation, splicing, etc.). Additionally, we systematically analyzed the importance of less-studied events in a classic transcriptome analysis pipeline, where these events are considered as indicators for tumor prognosis, tumor prediction, and tumor neoantigen inference, as well as their connection to the immune microenvironment.
    CONCLUSIONS: Our results suggest that esophageal cancer (ESCA) and glioblastoma processes can be explained by pathogenic microbial RNA, repeated sequences, novel splicing variants, and long intergenic non-coding RNAs (lincRNAs). We expect our reference-free process and analysis to be helpful in differential scRNA-seq analysis of tumor and normal samples, which in turn offers a more comprehensive scheme for major cancer-associated events.

  • Article type: Journal Article
    The decrease in sequencing costs has facilitated the creation of reference genomes and proteomes for an expanding array of organisms. Nevertheless, to our knowledge, there is no established repository that details organism-specific genomic and proteomic sequences of specific lengths, referred to as kmers. In this article, we present kmerDB, a database accessible through an interactive web interface that provides kmer-based information from genomic and proteomic sequences in a systematic way. kmerDB currently contains 202,340,859,107 base pairs and 19,304,903,356 amino acids, spanning 54,039 reference genomes and 21,865 reference proteomes, respectively, as well as 6,905,362 genomic and 149,305,183 proteomic species-specific sequences, termed quasi-primes. Additionally, we provide access to 5,186,757 nucleic and 214,904,089 peptide sequences absent from every genome and proteome, termed primes. kmerDB features a user-friendly interface offering various search options and filters for easy parsing and searching. The service is available at: www.kmerdb.com.
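    The notion of sequences absent from a set of genomes can be illustrated for small k by enumerating every possible k-mer and subtracting those observed. This is a brute-force sketch only: the function name `absent_kmers` is hypothetical, reverse complements are not considered, and kmerDB's exact definitions of primes and quasi-primes may involve additional criteria not stated in the abstract.

```python
from itertools import product


def absent_kmers(sequences: list[str], k: int, alphabet: str = "ACGT") -> set[str]:
    """Return all k-mers over the alphabet that occur in none of the sequences.

    Enumerates len(alphabet) ** k candidates, so it is only practical for small k.
    """
    present: set[str] = set()
    for seq in sequences:
        seq = seq.upper()
        present.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    universe = {"".join(p) for p in product(alphabet, repeat=k)}
    return universe - present
```

    Passing a 20-letter amino acid alphabet instead of "ACGT" gives the proteomic analogue, subject to the same size limit on k.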

  • Article type: Journal Article
    RNA pseudouridine modification exists in different RNA types of many species and plays a significant role in regulating the expression of biological processes. To understand the functional mechanisms of RNA pseudouridine sites, the accurate identification of pseudouridine sites in RNA sequences is essential. Although several fast and inexpensive computational methods have been proposed, improving recognition accuracy and generalization remains a challenge. This study proposes a novel ensemble predictor called PseUpred-ELPSO for improved RNA pseudouridine site prediction. After analyzing the nucleotide composition preferences between RNA pseudouridine site sequences, two feature representations were determined and fed into the stacking ensemble framework. Then, using five tree-based machine learning classifiers as base classifiers, 30-dimensional RNA profiles were constructed to represent RNA sequences, and the PSO algorithm was used to search for weights of the RNA profiles to further enhance the representation. A logistic regression classifier was used as a meta-classifier to complete the final predictions. Compared to the most advanced predictors, PseUpred-ELPSO performs better in both cross-validation and the independent test. Based on the PseUpred-ELPSO predictor, a free and easy-to-operate web server has been established, which will be a powerful tool for pseudouridine site identification.

  • Article type: Journal Article
    Long Interspersed Element-1 (LINE-1 or L1) is an autonomous transposable element that accounts for 17% of the human genome. Strong correlations between abnormal L1 expression and diseases, particularly cancer, have been documented by numerous studies. L1PD (LINE-1 Pattern Detection) was previously created to detect L1s by using a fixed, pre-determined set of 50-mer probes and a pattern-matching algorithm. L1PD uses a novel seed-and-pattern-match strategy as opposed to the well-known seed-and-extend strategy employed by other tools. This study discusses an improved version of L1PD and shows that increasing the size of the k-mer probes from 50 to 75 or 100 yields better results, as evidenced by experiments showing higher precision and recall than with the 50-mers. The probe-generation process was updated, and the corresponding software is now shared so that users may generate probes for other reference genomes (with certain limitations). Additionally, L1PD was applied to non-human genomes, such as those of dogs, horses, and cows, to further validate the pattern-matching strategy. The improved version of L1PD proves to be an efficient and promising approach for L1 detection.