With the exponential growth of digital data, there is a pressing need for innovative storage media and techniques. Owing to their stability, storage capacity, and density, DNA molecules offer a promising solution for information storage. However, DNA storage also faces numerous challenges, such as complex biochemical constraints and limited encoding efficiency. This paper presents Explorer, a high-efficiency DNA coding algorithm based on the De Bruijn graph, which leverages the graph's capability to characterize local sequences. Explorer enables coding under various biochemical constraints, such as limits on homopolymers, GC content, and undesired motifs. This paper also introduces Codeformer, a fast decoding algorithm based on the transformer architecture, to further enhance decoding efficiency. Numerical experiments indicate that, compared with other advanced algorithms, Explorer not only achieves stable encoding and decoding under various biochemical constraints but also increases the encoding efficiency and bit rate by more than 10%. Additionally, Codeformer demonstrates the ability to efficiently decode large quantities of DNA sequences. Under different parameter settings, its decoding efficiency exceeds that of traditional algorithms by more than twofold. When Codeformer is combined with a Reed-Solomon code, its decoding accuracy exceeds 99%, making it a good choice for high-speed decoding applications. These advancements are expected to contribute to the development of DNA-based data storage systems and the broader exploration of DNA as a novel information storage medium.
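To make the three constraint types named above concrete, the following is a minimal illustrative sketch, not Explorer itself. It checks a sequence against a homopolymer limit, a GC-content window, and a list of undesired motifs, and then builds a De Bruijn graph restricted to k-mers that pass a given check. All function names, the default thresholds (runs of at most 3, GC content between 40% and 60%), and the graph construction are assumptions chosen for illustration; the abstract does not specify them.

```python
from itertools import groupby, product


def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical bases in seq (0 if empty)."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def satisfies_constraints(seq, max_run=3, gc_range=(0.4, 0.6), motifs=()):
    """Check the three constraint types; the thresholds are assumed values."""
    low, high = gc_range
    return (
        max_homopolymer_run(seq) <= max_run
        and low <= gc_content(seq) <= high
        and not any(motif in seq for motif in motifs)
    )


def constrained_de_bruijn(k, is_valid):
    """Map each valid k-mer to the valid k-mers reachable by a one-base shift.

    Hypothetical helper: nodes are k-mers accepted by is_valid, and an edge
    u -> v exists when v equals u[1:] plus one appended base.
    """
    nodes = {"".join(p) for p in product("ACGT", repeat=k)}
    nodes = {n for n in nodes if is_valid(n)}
    return {
        n: sorted(n[1:] + b for b in "ACGT" if n[1:] + b in nodes)
        for n in nodes
    }


if __name__ == "__main__":
    print(satisfies_constraints("ACGTACGT"))
    print(satisfies_constraints("ACGTACGT", motifs=("GTAC",)))
    graph = constrained_de_bruijn(2, lambda s: max_homopolymer_run(s) <= 1)
    print(graph["AC"])
```

In this sketch, any walk through the restricted graph spells a sequence whose every k-mer window passes the local check, which is one way a De Bruijn graph can characterize local sequence structure. A window-level check of this kind does not by itself guarantee constraints that span more than k bases.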