%0 Journal Article %T MarkerScan: Separation and assembly of cobionts sequenced alongside target species in biodiversity genomics projects. %A Vancaester E %A Blaxter ML %J Wellcome Open Res %V 9 %N 0 %D 2024 %M 38617467 %R 10.12688/wellcomeopenres.20730.1 %X Contamination of public databases by mislabelled sequences has been highlighted for many years and the avalanche of novel sequencing data now being deposited has the potential to make databases difficult to use effectively. It is therefore crucial that sequencing projects and database curators perform pre-submission checks to remove obvious contamination and avoid propagating erroneous taxonomic relationships. However, it is important also to recognise that biological contamination of a target sample with unexpected species' DNA can also lead to the discovery of fascinating biological phenomena through the identification of environmental organisms or endosymbionts. Here, we present a novel, integrated method for detection and generation of high-quality genomes of all non-target genomes co-sequenced in eukaryotic genome sequencing projects. After performing taxonomic profiling of an assembly from the raw data, and leveraging the identity of small rRNA sequences discovered therein as markers, a targeted classification approach retrieves and assembles high-quality genomes. The genomes of these cobionts are then not only removed from the target species' genome but also available for further interrogation. Source code is available from https://github.com/CobiontID/MarkerScan. MarkerScan is written in Python and is deployed as a Docker container.
This article addresses a common issue in genetic research: the accidental mixing of genetic information from different species in public databases, often due to mislabelling or contamination. Interestingly, this ‘contamination’ can sometimes lead to exciting discoveries, like identifying DNA from unexpected species in a sample, revealing insights about organisms that live in the environment of the target organism or inside it. In our study, we developed a tool called MarkerScan for identifying these additional species found alongside the target species in eukaryotic genome sequencing projects. The method also assembles the whole genomes of these additional species from the same sequencing data. It works by sorting through the genetic data to find small ribosomal RNA (rRNA) sequences, which we then use as markers of which species are present. These markers guide the classification of the remaining data and the assembly of high-quality genomes for the additional species. This not only cleans up the main target species’ genome data but also provides new, valuable genomes for further exploration.
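The marker-then-classify idea described above can be illustrated with a minimal Python sketch. This is not MarkerScan's code: the function names, the family-level grouping, the input shapes, and the example contig and family labels are all assumptions made up for illustration. It assumes rRNA marker hits and per-contig taxonomic classifications have already been produced by other tools, and shows only how marker-supported non-target families could be separated from the target assembly.

```python
from collections import defaultdict


def marker_supported_families(marker_hits, target_family):
    """Return the non-target families that have at least one rRNA marker hit.

    marker_hits: iterable of (contig_id, family) pairs (hypothetical input shape).
    """
    return {family for _, family in marker_hits if family != target_family}


def separate_cobionts(contig_families, cobiont_families):
    """Bin contigs by cobiont family; everything else stays with the target.

    contig_families: dict of contig_id -> family label, or None if unclassified.
    Only families supported by a marker hit are binned as cobionts.
    """
    bins = defaultdict(list)
    remaining = []
    for contig_id, family in contig_families.items():
        if family in cobiont_families:
            bins[family].append(contig_id)
        else:
            remaining.append(contig_id)
    return dict(bins), remaining


# Made-up example data: one target family and one marker-supported cobiont family.
marker_hits = [("ctg1", "Nymphalidae"), ("ctg7", "Anaplasmataceae")]
contig_families = {
    "ctg1": "Nymphalidae",
    "ctg2": None,                   # unclassified contig, kept with the target
    "ctg7": "Anaplasmataceae",
    "ctg8": "Anaplasmataceae",
    "ctg9": "Enterobacteriaceae",   # classified, but no marker supports this family
}

cobionts = marker_supported_families(marker_hits, target_family="Nymphalidae")
bins, remaining = separate_cobionts(contig_families, cobionts)
print(bins)
print(remaining)
```

In this sketch a contig is only moved out of the target set when its family is backed by a marker hit, which is one conservative way to read the abstract's "targeted classification"; the published tool may apply different or additional criteria, and each resulting bin would still need to be assembled separately.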