short reads

  • Article type: Journal Article
    Many questions in biology benefit greatly from the use of a variety of model systems. High-throughput sequencing methods have been a triumph in the democratization of diverse model systems. They allow for the economical sequencing of an entire genome or transcriptome of interest, and with technical variations can even provide insight into genome organization and the expression and regulation of genes. The analysis and biological interpretation of such large datasets can present significant challenges that depend on the 'scientific status' of the model system. While high-quality genome and transcriptome references are readily available for well-established model systems, the establishment of such references for an emerging model system often requires extensive resources such as finances, expertise and computational capacity. The de novo assembly of a transcriptome represents an excellent entry point for genetic and molecular studies in emerging model systems, as it can efficiently assess gene content while also serving as a reference for differential gene expression studies. However, the process of de novo transcriptome assembly is non-trivial, and as a rule must be empirically optimized for every dataset. For the researcher working with an emerging model system, and with little to no experience with assembling and quantifying short-read data from the Illumina platform, these processes can be daunting. In this guide we outline the major challenges faced when establishing a reference transcriptome de novo and we provide advice on how to approach such an endeavor. We describe the major experimental and bioinformatic steps and provide some broad recommendations and cautions for the newcomer to de novo transcriptome assembly and differential gene expression analyses. Moreover, we provide an initial selection of tools that can assist in the journey from raw short-read data to assembled transcriptome and lists of differentially expressed genes.

  • Article type: Journal Article
    Background: Accurate genome sequences form the basis for genomic surveillance programs, the added value of which was impressively demonstrated during the COVID-19 pandemic by tracing transmission chains, discovering new viral lineages and mutations, and assessing them for infectiousness and resistance to available treatments. Amplicon strategies employing Illumina sequencing have become widely established for variant detection and reference-based reconstruction of SARS-CoV-2 genomes, which are now routine bioinformatics tasks. Yet, specific challenges arise when analyzing amplicon data, for example, when crucial and even lineage-determining mutations occur near primer sites.
    Methods: We present CoVpipe2, a bioinformatics workflow developed at the Public Health Institute of Germany to accurately reconstruct SARS-CoV-2 genomes from short-read sequencing data. The decisive factor here is the reliable, accurate, and rapid reconstruction of genomes, considering the specifics of the sequencing protocol used. Besides fundamental tasks like quality control, mapping, variant calling, and consensus generation, we also implemented additional features to ease the detection of mixed samples and recombinants.
    Results: We highlight common pitfalls in primer clipping, detecting heterozygote variants, and dealing with low-coverage regions and deletions. We introduce CoVpipe2 to address the above challenges and have compared and successfully validated the pipeline against selected publicly available benchmark datasets. CoVpipe2 features high usability, reproducibility, and a modular design that specifically addresses the characteristics of short-read amplicon protocols but can also be used for whole-genome short-read sequencing data.
    Conclusions: CoVpipe2 has seen multiple improvement cycles and is continuously maintained alongside frequently updated primer schemes and new developments in the scientific community. Our pipeline is easy to set up and use and can serve as a blueprint for other pathogens in the future due to its flexibility and modularity, providing a long-term perspective for continuous support. CoVpipe2 is written in Nextflow and is freely accessible from https://github.com/rki-mf1/CoVpipe2 under the GPL3 license.
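
One way to picture the handling of low-coverage regions during consensus generation is to mask every position whose read depth falls below a threshold. The sketch below is only an illustration of that idea and is not taken from CoVpipe2: the function name `mask_low_coverage` and the default threshold of 20 reads are assumptions made for the example.

```python
from typing import Sequence


def mask_low_coverage(consensus: str, depth: Sequence[int], min_depth: int = 20) -> str:
    """Return the consensus with positions below min_depth replaced by 'N'.

    Hypothetical helper for illustration; min_depth=20 is an assumed value,
    not a setting taken from CoVpipe2.
    """
    if len(consensus) != len(depth):
        raise ValueError("consensus and depth must have the same length")
    return "".join(
        base if d >= min_depth else "N"
        for base, d in zip(consensus, depth)
    )


if __name__ == "__main__":
    # Toy data: the third and fourth positions are covered by fewer than 20 reads.
    print(mask_low_coverage("ACGTAC", [35, 40, 5, 0, 22, 30]))
```

Masking with `N` keeps the consensus the same length as the reference, so downstream coordinates stay comparable; a real workflow would derive the depth vector from the alignment rather than from a hand-written list.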

  • Article type: Journal Article
    Cancer is a multifaceted disease arising from numerous genomic aberrations that have been identified as a result of advancements in sequencing technologies. While next-generation sequencing (NGS), which uses short reads, has transformed cancer research and diagnostics, it is limited by read length. Third-generation sequencing (TGS), led by the Pacific Biosciences and Oxford Nanopore Technologies platforms, employs long-read sequences, which have marked a paradigm shift in cancer research. Cancer genomes often harbour complex events, and TGS, with its ability to span large genomic regions, has facilitated their characterisation, providing a better understanding of how complex rearrangements affect cancer initiation and progression. TGS has also characterised the entire transcriptome of various cancers, revealing cancer-associated isoforms that could serve as biomarkers or therapeutic targets. Furthermore, TGS has advanced cancer research by improving genome assemblies, detecting complex variants, and providing a more complete picture of transcriptomes and epigenomes. This review focuses on TGS and its growing role in cancer research. We investigate its advantages and limitations, providing a rigorous scientific analysis of its use in detecting previously hidden aberrations missed by NGS. This promising technology holds immense potential for both research and clinical applications, with far-reaching implications for cancer diagnosis and treatment.

  • Article type: Journal Article
    Comprehensive characterization of structural variation in natural populations has only become feasible in the last decade. To investigate the population genomic nature of structural variation, reproducible and high-confidence structural variation callsets are first required. We created a population-scale reference of the genome-wide landscape of structural variation across 33 Nordic house sparrows (Passer domesticus). To produce a consensus callset across all samples using short-read data, we compare heuristic-based quality filtering and visual curation (Samplot/PlotCritic and Samplot-ML) approaches. We demonstrate that curation of structural variants is important for reducing putative false positives and that the time invested in this step outweighs the potential costs of analyzing short-read-discovered structural variation data sets that include many potential false positives. We find that even a lenient manual curation strategy (e.g. applied by a single curator) can reduce the proportion of putative false positives by up to 80%, thus enriching the proportion of high-confidence variants. Crucially, in applying a lenient manual curation strategy with a single curator, nearly all (>99%) variants rejected as putative false positives were also classified as such by a more stringent curation strategy using three additional curators. Furthermore, variants rejected by manual curation failed to reflect the expected population structure from SNPs, whereas variants passing curation did. Combining heuristic-based quality filtering with rapid manual curation of structural variants in short-read data can therefore become a time- and cost-effective first step for functional and population genomic studies requiring high-confidence structural variation callsets.

  • Article type: Journal Article
    Accurate reconstruction of Escherichia coli antibiotic resistance gene (ARG) plasmids from Illumina sequencing data has proven to be a challenge with current bioinformatic tools. In this work, we present an improved method to reconstruct E. coli plasmids using short reads. We developed plasmidEC, an ensemble classifier that identifies plasmid-derived contigs by combining the output of three different binary classification tools. We showed that plasmidEC is especially suited to classify contigs derived from ARG plasmids with a high recall of 0.941. Additionally, we optimized gplas, a graph-based tool that bins plasmid-predicted contigs into distinct plasmid predictions. Gplas2 is more effective at recovering plasmids with large sequencing coverage variations and can be combined with the output of any binary classifier. The combination of plasmidEC with gplas2 showed a high completeness (median=0.818) and F1-Score (median=0.812) when reconstructing ARG plasmids and exceeded the binning capacity of the reference-based method MOB-suite. In the absence of long-read data, our method offers an excellent alternative to reconstruct ARG plasmids in E. coli.
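
The abstract says that plasmidEC combines the output of three binary classifiers but does not state the combination rule. As a minimal sketch, assuming a simple majority vote, the example below labels each contig and then computes recall and F1 with their standard definitions. The function names, contig identifiers, and toy labels are all hypothetical.

```python
from typing import Dict, List, Tuple


def majority_vote(predictions: Dict[str, List[bool]]) -> Dict[str, bool]:
    """Label a contig as plasmid-derived when more than half of the tools say so.

    Majority voting is an assumption for this sketch; the abstract does not
    describe how plasmidEC combines its three classifiers.
    """
    return {
        contig: sum(votes) > len(votes) / 2
        for contig, votes in predictions.items()
    }


def recall_and_f1(predicted: Dict[str, bool], truth: Dict[str, bool]) -> Tuple[float, float]:
    """Compute recall and F1 for the 'plasmid' class using the standard formulas."""
    tp = sum(predicted[c] and truth[c] for c in truth)
    fp = sum(predicted[c] and not truth[c] for c in truth)
    fn = sum(not predicted[c] and truth[c] for c in truth)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1


if __name__ == "__main__":
    # Toy votes from three hypothetical tools (True = plasmid).
    votes = {
        "contig_1": [True, True, False],
        "contig_2": [False, False, True],
        "contig_3": [True, True, True],
        "contig_4": [True, False, True],
    }
    truth = {"contig_1": True, "contig_2": True, "contig_3": True, "contig_4": False}
    print(recall_and_f1(majority_vote(votes), truth))
```

With three voters a tie is impossible, which is one practical reason an odd number of classifiers is convenient for this kind of ensemble.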

  • Article type: Journal Article
    Low-coverage whole-genome sequencing (also known as "genome skimming") is becoming an increasingly affordable approach to large-scale phylogenetic analyses. While already routinely used to recover organellar genomes, genome skimming is rarely used for recovering single-copy nuclear markers. One reason might be that only a few tools exist to work with this data type within a phylogenomic context, especially to deal with fragmented genome assemblies. Here we present a new software tool called Patchwork for mining phylogenetic markers from highly fragmented short-read assemblies as well as directly from sequence reads. Patchwork is an alignment-based tool that utilizes the sequence aligner DIAMOND and is written in the programming language Julia. Homologous regions are obtained via a sequence similarity search, followed by a "hit stitching" phase, in which adjacent or overlapping regions are merged into a single unit. The novel sliding window algorithm trims away any noncoding regions from the resulting sequence. We demonstrate the utility of Patchwork by recovering near-universal single-copy orthologs within a benchmarking study, and we additionally assess the performance of Patchwork in comparison with other programs. We find that Patchwork allows for accurate retrieval of (putatively) single-copy genes from genome skimming data sets at different sequencing depths with high computational speed, outperforming existing software targeting similar tasks. Patchwork is released under the GNU General Public License version 3. Installation instructions, additional documentation, and the source code itself are all available via GitHub at https://github.com/fethalen/Patchwork.
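
The "hit stitching" phase described above, where adjacent or overlapping regions are merged into a single unit, resembles classic interval merging. The sketch below shows only that general idea in Python; Patchwork itself is written in Julia and operates on DIAMOND alignments, and the function name `stitch_hits`, the coordinate convention, and the `max_gap` parameter are assumptions made for this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) with an exclusive end coordinate


def stitch_hits(hits: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge hits that overlap or lie within max_gap positions of each other.

    Hypothetical illustration of interval merging, not Patchwork's algorithm.
    """
    merged: List[Interval] = []
    for start, end in sorted(hits):
        if merged and start - merged[-1][1] <= max_gap:
            # The hit touches or overlaps the previous unit: extend that unit.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # The hit is too far away: start a new unit.
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    hits = [(30, 40), (0, 10), (10, 20), (15, 25)]
    print(stitch_hits(hits))
    print(stitch_hits(hits, max_gap=5))
```

Sorting first means each hit only has to be compared with the most recently built unit, so the merge runs in a single pass after the sort.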

  • Article type: Journal Article
    In this chapter, we describe a computational pipeline for the in silico detection of plant viruses by high-throughput sequencing (HTS) from total RNA samples. The pipeline is designed for the analysis of short reads generated on an Illumina platform and relies on freely available software tools. First, we provide advice for high-quality total RNA purification, library preparation, and sequencing. The bioinformatics pipeline begins with the raw reads obtained from the sequencing machine and performs some curation steps to obtain long contigs. Contigs are blasted against a local database of reference nucleotide viral sequences to identify the viruses in the samples. Then, the search is refined by applying specific filters. We also provide the code to re-map the short reads against the viruses found to get information on sequencing depth and read coverage for each virus. No previous bioinformatics background is required, but basic knowledge of the Unix command line and R language is recommended.
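
To make the two re-mapping statistics concrete, the sketch below computes mean sequencing depth and the fraction of the genome covered by at least one read from a list of mapped read intervals. It is an illustration only: the chapter's own code uses the Unix command line and R, whereas this example is in Python, and the function name `depth_and_coverage` and the 0-based, end-exclusive coordinates are assumptions.

```python
from typing import List, Tuple


def depth_and_coverage(reads: List[Tuple[int, int]], genome_length: int) -> Tuple[float, float]:
    """Return (mean depth, fraction of positions covered by at least one read).

    Reads are hypothetical (start, end) intervals, 0-based with exclusive end.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    depth = [0] * genome_length
    for start, end in reads:
        # Clip each read to the genome boundaries before counting.
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    mean_depth = sum(depth) / genome_length
    covered_fraction = sum(1 for d in depth if d > 0) / genome_length
    return mean_depth, covered_fraction


if __name__ == "__main__":
    # Two overlapping toy reads on a 10-nt genome.
    print(depth_and_coverage([(0, 5), (3, 8)], genome_length=10))
```

Reporting both numbers matters because a virus can show a high mean depth from reads piled onto a short region while most of its genome remains uncovered.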

  • Article type: Published Erratum
    [This corrects the article DOI: 10.3389/fgene.2022.816825.].

  • Article type: Journal Article
    The cfr genes encode a 23S rRNA methyltransferase, conferring a multiresistance phenotype to phenicol, lincosamide, oxazolidinone, pleuromutilin, and streptogramin A antibiotics. These genes have been described in staphylococci, including methicillin-resistant Staphylococcus aureus (MRSA). In this study, we retrospectively performed an in-depth genomic characterisation of three cfr-positive, multidrug-resistant (MDR) livestock-associated (LA) MRSA clonal complexes (CCs) 1 and 398 detected in different Italian pig holdings (2008-2011) during population studies on Italian livestock (2008-2014). We used a combined Illumina and Oxford Nanopore Technologies (ONT) whole genome sequencing (WGS) approach on two isolates (the 2008 CC1 and the 2010 CC398 isolates, but not the 2011 CC1 isolate). Interestingly, the three isolates presented different cfr variants, with only one displaying a linezolid-resistant phenotype. In isolate 2008 CC1, the cfr gene was identified within a Tn558 composite transposon-like structure flanked by IS elements located on a novel 44,826 bp plasmid. This represents the first report of CC1 LA-MRSA harbouring the cfr gene in its functional variant. In contrast, cfr was chromosomally located in isolate 2010 CC398. Our findings have significant public health implications, confirm the need for the continuous genomic surveillance of cfr-positive zoonotic LA-MRSA, and backdate cfr presence in LA-MRSA from Italian pigs to at least 2008.

  • Article type: Journal Article
    LINE-1 retrotransposons have the potential to cause DNA damage, contribute to genome instability, and induce an interferon response. Thus, accurate measurements of their expression, especially in disease contexts where genome instability and the interferon response are relevant, are of particular importance. Illumina-based bulk RNA sequencing remains the most abundant datatype for measuring gene expression. However, "active" expression from its own internal promoter is only one source of LINE-1 aligning reads in an RNA-seq experiment. With about half a million LINE-1 sequences scattered throughout the genome, many are incorporated into other transcripts that have nothing to do with LINE-1 activity. We call this "passive" co-transcription. Here we will describe how to use L1EM, a computational method that separates active from passive LINE-1 expression at the locus-specific level.