Keywords: De novo annotation; Dfam and Repbase; Orthoptera genome; TE database; Transposable elements

Source: DOI:10.1186/s13100-024-00316-x

Abstract:
Transposable elements (TEs) are a major component of eukaryotic genomes and are present in almost all eukaryotic organisms. TEs are highly dynamic between and within species, which significantly limits the general applicability of TE databases. Orthoptera is the only known group in the class Insecta with significantly enlarged genomes (0.93-21.48 Gb). When such large genomes are analyzed with the existing public TE databases, the efficiency of TE annotation is unsatisfactory. To address this limitation, it is imperative to continually update the available TE resource libraries, and the need for an Orthoptera-specific library grows as more insect genomes become publicly available. Here, we used the complete genome data of 12 Orthoptera species to annotate TEs de novo, then manually re-annotated the unclassified TEs to construct a non-redundant Orthoptera-specific TE library: Orthoptera-TElib. Orthoptera-TElib contains 24,021 TE entries, including the re-annotated results of 13,964 unknown TEs. TE entries in Orthoptera-TElib follow the same naming convention as RepeatMasker and Dfam and are encoded in the three-level form "level1/level2-level3". Orthoptera-TElib can be used directly as an input reference database and is compatible with mainstream repetitive sequence analysis software such as RepeatMasker and dnaPipeTE. When TEs of Orthoptera species are analyzed, Orthoptera-TElib yields better TE annotation than Dfam and Repbase, whether low-coverage sequencing data or genome assembly data are used. The greatest improvement is for Angaracris rhodopa, whose annotated TE content increased from 7.89% of the genome to 53.28%. Finally, Orthoptera-TElib is stored in SQLite3 for convenient data updates and user access.
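To make the three-level naming and the SQLite3 storage concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that entry headers follow the RepeatMasker-style "name#level1/level2-level3" pattern and that missing levels are simply absent. The entry names, the table name `te` and its columns are hypothetical and chosen only for illustration.

```python
import sqlite3
from typing import Iterable, NamedTuple, Optional, Tuple


class TEClass(NamedTuple):
    level1: str
    level2: Optional[str]
    level3: Optional[str]


def parse_te_class(label: str) -> TEClass:
    """Split a 'level1/level2-level3' label; absent levels become None."""
    level1, _, rest = label.partition("/")
    level2, _, level3 = rest.partition("-")  # split on the first hyphen only
    return TEClass(level1, level2 or None, level3 or None)


def parse_header(header: str) -> Tuple[str, TEClass]:
    """Split an assumed FASTA header '>name#label' into name and classification."""
    first_token = header.lstrip(">").split()[0]
    name, _, label = first_token.partition("#")
    return name, parse_te_class(label or "Unknown")


def build_db(headers: Iterable[str]) -> sqlite3.Connection:
    """Load parsed headers into an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE te (name TEXT PRIMARY KEY, level1 TEXT, level2 TEXT, level3 TEXT)"
    )
    rows = [(name, *cls) for name, cls in map(parse_header, headers)]
    conn.executemany("INSERT INTO te VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


if __name__ == "__main__":
    # Hypothetical entry names; the classification labels follow RepeatMasker usage.
    demo = [">TE_0001#DNA/TcMar-Tc1", ">TE_0002#LTR/Gypsy", ">TE_0003#Unknown"]
    db = build_db(demo)
    for row in db.execute("SELECT level1, COUNT(*) FROM te GROUP BY level1 ORDER BY level1"):
        print(row)
```

Storing each level in its own column lets a query group or filter entries at any depth of the classification without re-parsing the label.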