short reads

  • Article type: Journal Article
    Many questions in biology benefit greatly from the use of a variety of model systems. High-throughput sequencing methods have been a triumph in the democratization of diverse model systems. They allow for the economical sequencing of an entire genome or transcriptome of interest, and with technical variations can even provide insight into genome organization and the expression and regulation of genes. The analysis and biological interpretation of such large datasets can present significant challenges that depend on the 'scientific status' of the model system. While high-quality genome and transcriptome references are readily available for well-established model systems, the establishment of such references for an emerging model system often requires extensive resources such as finances, expertise and computation capabilities. The de novo assembly of a transcriptome represents an excellent entry point for genetic and molecular studies in emerging model systems as it can efficiently assess gene content while also serving as a reference for differential gene expression studies. However, the process of de novo transcriptome assembly is non-trivial, and as a rule must be empirically optimized for every dataset. For the researcher working with an emerging model system, and with little to no experience with assembling and quantifying short-read data from the Illumina platform, these processes can be daunting. In this guide we outline the major challenges faced when establishing a reference transcriptome de novo and we provide advice on how to approach such an endeavor. We describe the major experimental and bioinformatic steps and provide some broad recommendations and cautions for the newcomer to de novo transcriptome assembly and differential gene expression analyses. Moreover, we provide an initial selection of tools that can assist in the journey from raw short-read data to assembled transcriptome and lists of differentially expressed genes.






  • Article type: Journal Article
    Accurate genome sequences form the basis for genomic surveillance programs, the added value of which was impressively demonstrated during the COVID-19 pandemic by tracing transmission chains, discovering new viral lineages and mutations, and assessing them for infectiousness and resistance to available treatments. Amplicon strategies employing Illumina sequencing have become widely established for variant detection and reference-based reconstruction of SARS-CoV-2 genomes, and are routine bioinformatics tasks. Yet, specific challenges arise when analyzing amplicon data, for example, when crucial and even lineage-determining mutations occur near primer sites.
    We present CoVpipe2, a bioinformatics workflow developed at the Public Health Institute of Germany to accurately reconstruct SARS-CoV-2 genomes from short-read sequencing data. The decisive factor here is the reliable, accurate, and rapid reconstruction of genomes, considering the specifics of the sequencing protocol used. Besides fundamental tasks like quality control, mapping, variant calling, and consensus generation, we also implemented additional features to ease the detection of mixed samples and recombinants.
    We highlight common pitfalls in primer clipping, detecting heterozygote variants, and dealing with low-coverage regions and deletions. We introduce CoVpipe2 to address the above challenges and have compared and successfully validated the pipeline against selected publicly available benchmark datasets. CoVpipe2 features high usability, reproducibility, and a modular design that specifically addresses the characteristics of short-read amplicon protocols but can also be used for whole-genome short-read sequencing data.
    CoVpipe2 has seen multiple improvement cycles and is continuously maintained alongside frequently updated primer schemes and new developments in the scientific community. Our pipeline is easy to set up and use and can serve as a blueprint for other pathogens in the future due to its flexibility and modularity, providing a long-term perspective for continuous support. CoVpipe2 is written in Nextflow and is freely accessible under the GPL3 license.






  • Article type: Journal Article
    Cancer is a multifaceted disease arising from numerous genomic aberrations that have been identified as a result of advancements in sequencing technologies. While next-generation sequencing (NGS), which uses short reads, has transformed cancer research and diagnostics, it is limited by read length. Third-generation sequencing (TGS), led by the Pacific Biosciences and Oxford Nanopore Technologies platforms, employs long-read sequences, which have marked a paradigm shift in cancer research. Cancer genomes often harbour complex events, and TGS, with its ability to span large genomic regions, has facilitated their characterisation, providing a better understanding of how complex rearrangements affect cancer initiation and progression. TGS has also characterised the entire transcriptome of various cancers, revealing cancer-associated isoforms that could serve as biomarkers or therapeutic targets. Furthermore, TGS has advanced cancer research by improving genome assemblies, detecting complex variants, and providing a more complete picture of transcriptomes and epigenomes. This review focuses on TGS and its growing role in cancer research. We investigate its advantages and limitations, providing a rigorous scientific analysis of its use in detecting previously hidden aberrations missed by NGS. This promising technology holds immense potential for both research and clinical applications, with far-reaching implications for cancer diagnosis and treatment.






  • Article type: Journal Article
    Comprehensive characterization of structural variation in natural populations has only become feasible in the last decade. To investigate the population genomic nature of structural variation, reproducible and high-confidence structural variation callsets are first required. We created a population-scale reference of the genome-wide landscape of structural variation across 33 Nordic house sparrows (Passer domesticus). To produce a consensus callset across all samples using short-read data, we compare heuristic-based quality filtering and visual curation (Samplot/PlotCritic and Samplot-ML) approaches. We demonstrate that curation of structural variants is important for reducing putative false positives and that the time invested in this step outweighs the potential costs of analyzing short-read-discovered structural variation data sets that include many potential false positives. We find that even a lenient manual curation strategy (e.g. applied by a single curator) can reduce the proportion of putative false positives by up to 80%, thus enriching the proportion of high-confidence variants. Crucially, in applying a lenient manual curation strategy with a single curator, nearly all (>99%) variants rejected as putative false positives were also classified as such by a more stringent curation strategy using three additional curators. Furthermore, variants rejected by manual curation failed to reflect the expected population structure from SNPs, whereas variants passing curation did. Combining heuristic-based quality filtering with rapid manual curation of structural variants in short-read data can therefore become a time- and cost-effective first step for functional and population genomic studies requiring high-confidence structural variation callsets.






  • Article type: Journal Article
    Accurate reconstruction of Escherichia coli antibiotic resistance gene (ARG) plasmids from Illumina sequencing data has proven to be a challenge with current bioinformatic tools. In this work, we present an improved method to reconstruct E. coli plasmids using short reads. We developed plasmidEC, an ensemble classifier that identifies plasmid-derived contigs by combining the output of three different binary classification tools. We showed that plasmidEC is especially suited to classify contigs derived from ARG plasmids with a high recall of 0.941. Additionally, we optimized gplas, a graph-based tool that bins plasmid-predicted contigs into distinct plasmid predictions. Gplas2 is more effective at recovering plasmids with large sequencing coverage variations and can be combined with the output of any binary classifier. The combination of plasmidEC with gplas2 showed a high completeness (median=0.818) and F1-Score (median=0.812) when reconstructing ARG plasmids and exceeded the binning capacity of the reference-based method MOB-suite. In the absence of long-read data, our method offers an excellent alternative to reconstruct ARG plasmids in E. coli.
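One way to picture an ensemble of three binary contig classifiers is a simple majority vote, sketched below. This is only an illustration: the abstract does not state how plasmidEC combines the three outputs, so the voting rule is an assumption, and the tool names `tool_a`, `tool_b`, `tool_c` and the contig identifiers are hypothetical.

```python
def majority_vote(labels):
    """Return 'plasmid' if more than half of the labels say 'plasmid'."""
    plasmid_votes = sum(1 for label in labels if label == "plasmid")
    return "plasmid" if plasmid_votes * 2 > len(labels) else "chromosome"


def classify_contigs(tool_calls):
    """Combine per-tool calls into one call per contig.

    tool_calls maps a tool name to a dict of {contig_id: label}.
    Only contigs classified by every tool are kept (an assumption
    made here to keep the sketch simple).
    """
    shared = set.intersection(*(set(calls) for calls in tool_calls.values()))
    return {
        contig: majority_vote([calls[contig] for calls in tool_calls.values()])
        for contig in sorted(shared)
    }


# Hypothetical outputs of three binary classifiers.
example_calls = {
    "tool_a": {"contig_1": "plasmid", "contig_2": "chromosome"},
    "tool_b": {"contig_1": "plasmid", "contig_2": "plasmid"},
    "tool_c": {"contig_1": "chromosome", "contig_2": "chromosome"},
}
ensemble = classify_contigs(example_calls)
```

With these made-up inputs, `contig_1` receives two plasmid votes out of three and `contig_2` only one, so the ensemble labels them plasmid and chromosome respectively.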






  • Article type: Journal Article
    Low-coverage whole-genome sequencing (also known as "genome skimming") is becoming an increasingly affordable approach to large-scale phylogenetic analyses. While already routinely used to recover organellar genomes, genome skimming is rarely utilized for recovering single-copy nuclear markers. One reason might be that only a few tools exist to work with this data type within a phylogenomic context, especially to deal with fragmented genome assemblies. We here present a new software tool called Patchwork for mining phylogenetic markers from highly fragmented short-read assemblies as well as directly from sequence reads. Patchwork is an alignment-based tool that utilizes the sequence aligner DIAMOND and is written in the programming language Julia. Homologous regions are obtained via a sequence similarity search, followed by a "hit stitching" phase, in which adjacent or overlapping regions are merged into a single unit. The novel sliding window algorithm trims away any noncoding regions from the resulting sequence. We demonstrate the utility of Patchwork by recovering near-universal single-copy orthologs within a benchmarking study, and we additionally assess the performance of Patchwork in comparison with other programs. We find that Patchwork allows for accurate retrieval of (putatively) single-copy genes from genome skimming data sets at different sequencing depths with high computational speed, outperforming existing software targeting similar tasks. Patchwork is released under the GNU General Public License version 3. Installation instructions, additional documentation, and the source code itself are all available via GitHub.
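The "hit stitching" idea described above, merging adjacent or overlapping regions into a single unit, can be sketched as a plain interval merge. This is not Patchwork's implementation (which is written in Julia and works on DIAMOND hits); the function name `stitch_hits`, the `max_gap` parameter, and the use of 1-based inclusive coordinates are assumptions made for illustration.

```python
def stitch_hits(hits, max_gap=0):
    """Merge hit intervals that overlap or are adjacent.

    hits: iterable of (start, end) pairs, 1-based and inclusive.
    max_gap: number of uncovered positions still tolerated between
             two hits (0 means only overlapping or directly adjacent
             hits are merged).
    """
    merged = []
    for start, end in sorted(hits):
        # Adjacent means the new hit starts right after the previous end.
        if merged and start <= merged[-1][1] + max_gap + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Two adjacent hits, one overlapping hit, and one separate hit.
stitched = stitch_hits([(11, 20), (1, 10), (15, 30), (40, 50)])
```

Here the first three hits collapse into one region spanning positions 1 to 30, while the hit at 40 to 50 stays separate because nine positions lie between it and the merged region.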






  • Article type: Journal Article
    In this chapter, we describe a computational pipeline for the in silico detection of plant viruses by high-throughput sequencing (HTS) from total RNA samples. The pipeline is designed for the analysis of short reads generated using an Illumina platform and uses freely available software tools. First, we provide advice for high-quality total RNA purification, library preparation, and sequencing. The bioinformatics pipeline begins with the raw reads obtained from the sequencing machine and performs some curation steps to obtain long contigs. Contigs are blasted against a local database of reference nucleotide viral sequences to identify the viruses in the samples. Then, the search is refined by applying specific filters. We also provide the code to re-map the short reads against the viruses found to get information on sequencing depth and read coverage for each virus. No previous bioinformatics background is required, but basic knowledge of the Unix command line and R language is recommended.
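The re-mapping step reports sequencing depth and read coverage per virus. As a rough sketch of what those two numbers mean, the snippet below computes mean depth and breadth of coverage from a list of alignment intervals. It is not the chapter's code (which relies on Unix tools and R); the function name and the 0-based, half-open coordinate convention are assumptions.

```python
def depth_and_coverage(ref_length, alignments):
    """Return (mean depth, breadth of coverage) for one reference.

    ref_length: length of the viral reference sequence in bases.
    alignments: iterable of (start, end) read alignments, 0-based
                and half-open, i.e. a read covers start..end-1.
    """
    depth = [0] * ref_length
    for start, end in alignments:
        # Clip alignments to the reference boundaries.
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    mean_depth = sum(depth) / ref_length
    breadth = sum(1 for d in depth if d > 0) / ref_length
    return mean_depth, breadth


# Two overlapping reads on a 10-base toy reference.
mean_depth, breadth = depth_and_coverage(10, [(0, 5), (3, 8)])
```

In this toy case the reads cover 8 of the 10 bases and contribute 10 aligned bases in total, giving a mean depth of 1.0 and a breadth of 0.8.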






  • Article type: Published Erratum
    [This corrects the article DOI: 10.3389/fgene.2022.816825.].






  • Article type: Journal Article
    The cfr genes encode a 23S rRNA methyltransferase, conferring a multiresistance phenotype to phenicol, lincosamide, oxazolidinone, pleuromutilin, and streptogramin A antibiotics. These genes have been described in staphylococci, including methicillin-resistant Staphylococcus aureus (MRSA). In this study, we retrospectively performed an in-depth genomic characterisation of three cfr-positive, multidrug-resistant (MDR) livestock-associated (LA) MRSA clonal complexes (CCs) 1 and 398 detected in different Italian pig holdings (2008-2011) during population studies on Italian livestock (2008-2014). We used a combined Illumina and Oxford Nanopore Technologies (ONT) whole genome sequencing (WGS) approach on two isolates (the 2008 CC1 and the 2010 CC398 isolates, but not the 2011 CC1 isolate). Interestingly, the three isolates presented different cfr variants, with only one displaying a linezolid-resistant phenotype. In isolate 2008 CC1, the cfr gene was identified within a Tn558 composite transposon-like structure flanked by IS elements located on a novel 44,826 bp plasmid. This represents the first report of CC1 LA-MRSA harbouring the cfr gene in its functional variant. In contrast, cfr was chromosomally located in isolate 2010 CC398. Our findings have significant public health implications, confirm the need for the continuous genomic surveillance of cfr-positive zoonotic LA-MRSA, and backdate cfr presence in LA-MRSA from Italian pigs to at least 2008.






  • Article type: Journal Article
    LINE-1 retrotransposons have the potential to cause DNA damage, contribute to genome instability, and induce an interferon response. Thus, accurate measurements of their expression, especially in disease contexts where genome instability and the interferon response are relevant, are of particular importance. Illumina-based bulk RNA sequencing remains the most abundant datatype for measuring gene expression. However, "active" expression from its own internal promoter is only one source of LINE-1-aligning reads in an RNA-seq experiment. With about half a million LINE-1 sequences scattered throughout the genome, many are incorporated into other transcripts that have nothing to do with LINE-1 activity. We call this "passive" co-transcription. Here we describe how to use L1EM, a computational method that separates active from passive LINE-1 expression at the locus-specific level.





