Keywords: AutoML; DNA synthesis; cloud platform; feature reduction; machine learning

MeSH: Base Sequence; Machine Learning; Escherichia coli / genetics; Base Composition; DNA / genetics

Source: DOI: 10.3390/genes14030605

Abstract:
DNA synthesis is widely used in synthetic biology to construct and assemble sequences ranging from short ribosome binding sites (RBSs) to ultra-long synthetic genomes. Many sequence features, such as GC content and repeat sequences, are known to affect synthesis difficulty and, in turn, synthesis cost. Latent sequence features, especially local characteristics of the sequence, may affect the DNA synthesis process as well. Reliable prediction of the synthesis difficulty of a given sequence is important for reducing cost, but it remains a challenge. In this study, we propose a new automated machine learning (AutoML) approach to predict DNA synthesis difficulty; it achieves an F1 score of 0.930 and outperforms the current state-of-the-art model. We identified local sequence features, neglected by previous methods, that may also affect the difficulty of DNA synthesis. Experimental validation on ten genes of Escherichia coli strain MG1655 shows that our model achieves 80% accuracy, which is also better than the state of the art. Finally, for the convenience of end users, we developed the cloud platform SCP4SSD using an entirely cloud-based serverless architecture.
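As a rough illustration of the sequence features and the metric named in the abstract, the sketch below computes global GC content, a sliding-window (local) GC profile, the longest single-base repeat, and an F1 score from confusion counts. It is not the authors' feature set or model: the function names, the window size, and the choice of homopolymer runs as a stand-in for "repeat sequences" are all assumptions made for illustration.

```python
from typing import List


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def windowed_gc(seq: str, window: int = 20, step: int = 1) -> List[float]:
    """GC content of each full-length window: a simple local feature.

    The default window size is an arbitrary, hypothetical choice.
    """
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base (a crude repeat proxy)."""
    best = run = 0
    prev = None
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        prev = base
        best = max(best, run)
    return best


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

For example, `windowed_gc(seq, window=50, step=10)` would give a coarse local GC profile for a gene-length sequence; a model could then use summary statistics of that profile, such as its minimum and maximum, as inputs alongside the global value.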