Keywords: convolutional neural networks; deep learning; gene expression; methylation; promoter; subgenome dominance

Source: DOI:10.1111/tpj.16979

Abstract:
Deep learning offers new approaches to investigate the mechanisms underlying complex biological phenomena, such as subgenome dominance. Subgenome dominance refers to the dominant expression and/or biased fractionation of genes in one subgenome of allopolyploids, which has shaped the evolution of a large group of plants. However, the underlying cause of subgenome dominance remains elusive. Here, we adopt deep learning to construct two convolutional neural network (CNN) models, binary expression model (BEM) and homoeolog contrast model (HCM), to investigate the mechanism underlying subgenome dominance using DNA sequence and methylation sites. We apply these CNN models to analyze three representative polyploidization systems, Brassica, Gossypium, and Cucurbitaceae, each with available ancient and neo/synthetic polyploidized genomes. The BEM shows that DNA sequence of the promoter region can accurately predict whether a gene is expressed or not. More importantly, the HCM shows that the DNA sequence of the promoter region predicts dominant expression status between homoeologous gene pairs retained from ancient polyploidizations, thus predicting subgenome dominance associated with these events. However, HCM fails to predict gene expression dominance between new homoeologous gene pairs arising from the neo/synthetic polyploidizations. These results are consistent across the three plant polyploidization systems, indicating broad applicability of our models. Furthermore, the two models based on methylation sites produce similar results. These results show that subgenome dominance is associated with long-term sequence differentiation between the promoters of homoeologs, suggesting that subgenome expression dominance precedes and is the driving force or even the determining factor for sequence divergence between subgenomes following polyploidization.
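As a rough illustration of how the inputs and labels for the two models might be prepared, the sketch below one-hot encodes promoter sequences and derives a binary label (BEM-style) and a pairwise label (HCM-style). The abstract does not specify the encoding, the expression threshold, or how the two promoters of a homoeolog pair are combined, so the function names, the threshold of 1.0, and the concatenation step are all assumptions made for illustration, not the authors' pipeline.

```python
# Illustrative sketch only: encoding, threshold, and pairing are assumptions,
# not the published BEM/HCM implementation.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order.

    Bases outside ACGT (for example N) become all-zero rows.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def bem_label(expression, threshold=1.0):
    """BEM-style label: 1 if the gene counts as expressed, else 0.

    The threshold is hypothetical; the abstract does not give one.
    """
    return 1 if expression >= threshold else 0


def hcm_label(expr_a, expr_b):
    """HCM-style label: 1 if homoeolog A is more highly expressed than B, else 0."""
    return 1 if expr_a > expr_b else 0


def hcm_input(promoter_a, promoter_b):
    """Combine a homoeolog pair by concatenating the two encoded promoters.

    Concatenation is one possible way to present a pair to a CNN (assumed here).
    """
    return one_hot(promoter_a) + one_hot(promoter_b)
```

Under these assumptions, a matrix from `one_hot` would be the input for predicting the `bem_label` target, while `hcm_input` would pair with `hcm_label` for the contrast between homoeologs.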