DNA storage

  • Article type: Journal Article
    Polymerase chain reaction (PCR) amplification is widely used to retrieve information from DNA storage. During PCR amplification, nonspecific pairing between the 3' end of a primer and the DNA sequence can cause cross-talk in the amplification reaction, generating interfering sequences and reducing amplification accuracy. To address this issue, we propose an efficient coding algorithm for PCR amplification information retrieval (ECA-PCRAIR). The algorithm uses variable-length scanning and pruning optimization to construct a codebook that maximizes storage density while satisfying traditional biological constraints. A codeword search tree is then built from the primer library to optimize the codebook, and a variable-length interleaver is used for constraint detection and correction, minimizing the likelihood of nonspecific pairing. Experimental results demonstrate that ECA-PCRAIR can reduce the probability of nonspecific pairing between the 3' end of the primer and the DNA sequence to 2-25%, enhancing the robustness of the DNA sequences. Additionally, ECA-PCRAIR achieves a storage density of 2.14-3.67 bits per nucleotide (bits/nt), significantly improving storage capacity.
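    The nonspecific-pairing constraint described in this abstract can be illustrated with a minimal sketch. It is not the ECA-PCRAIR algorithm itself: the function names, the exact-match criterion, and the tail length `k=6` are assumptions chosen for illustration only.

```python
# Hypothetical check: does the 3' tail of a primer find an exact match
# (on either strand) inside a stored DNA sequence?
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def has_3prime_match(primer: str, sequence: str, k: int = 6) -> bool:
    """True if the last k bases of the primer, or their reverse
    complement, occur anywhere in the sequence."""
    tail = primer[-k:]
    return tail in sequence or reverse_complement(tail) in sequence

def nonspecific_fraction(primers, sequences, k: int = 6) -> float:
    """Fraction of (primer, sequence) pairs with a 3'-tail match."""
    pairs = [(p, s) for p in primers for s in sequences]
    hits = sum(has_3prime_match(p, s, k) for p, s in pairs)
    return hits / len(pairs)
```

    A real screen would also score partial and thermodynamically stable mismatched pairings; exact matching is only the simplest possible proxy.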

  • Article type: Journal Article
    The exponential growth in data volume has necessitated the adoption of alternative storage solutions, and DNA storage stands out as the most promising one. However, the exorbitant costs of synthesis and sequencing have impeded its development. Pre-compressing the data is recognized as one of the most effective ways to reduce storage costs. However, different compression methods yield different compression ratios for the same file, and compressing a large number of files with a single method may not achieve the maximum compression ratio. This study proposes a multi-file dynamic compression method based on machine learning classification algorithms that selects the appropriate compression method for each file, minimizing the amount of data stored in DNA. First, four different compression methods are applied to the collected files. The optimal compression method for each file is then used as the label, and the file type and size are used as features, to train seven machine learning classification algorithms. The results demonstrate that k-nearest neighbor outperforms the other machine learning algorithms on the validation and test sets most of the time, achieving an accuracy of over 85% with less volatility. Additionally, the k-nearest neighbor model achieves a compression rate of 30.85%, an improvement of more than 4.5% over the traditional single compression method, resulting in significant cost savings for DNA storage in the range of $0.48 to 3 billion/TB. Compared with the traditional compression method, the multi-file dynamic compression method shows a more significant compression effect when compressing multiple files. It can therefore considerably decrease the cost of DNA storage and facilitate the widespread implementation of DNA storage technology.
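    The labeling and classification steps in this abstract can be sketched as follows. The abstract does not name its four compression methods, so the three standard-library compressors here are stand-ins, and the distance function (type mismatch plus log-size difference) is an assumption made purely for illustration.

```python
import bz2
import lzma
import math
import zlib
from collections import Counter

# Stand-in compressors; the study's actual four methods are not named.
COMPRESSORS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

def best_method(data: bytes) -> str:
    """Label a file with the compressor that yields the smallest output."""
    sizes = {name: len(fn(data)) for name, fn in COMPRESSORS.items()}
    return min(sizes, key=sizes.get)

def knn_predict(train, file_type: str, size: int, k: int = 3) -> str:
    """Majority vote among the k nearest training rows.

    train is a list of (file_type, size_in_bytes, label) tuples.
    The distance is a hypothetical choice: 1 for a type mismatch
    plus the absolute difference of log10 sizes.
    """
    def distance(row):
        row_type, row_size, _ = row
        mismatch = 0.0 if row_type == file_type else 1.0
        return mismatch + abs(math.log10(row_size) - math.log10(size))

    nearest = sorted(train, key=distance)[:k]
    return Counter(label for _, _, label in nearest).most_common(1)[0][0]
```

    In use, `best_method` would label each collected file once, and `knn_predict` would then choose a compressor for a new file from its type and size alone, without trying every method.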

  • Article type: Journal Article
    Robust encapsulation and controllable release of biomolecules have wide biomedical applications, ranging from biosensing and drug delivery to information storage. However, conventional biomolecule encapsulation strategies are limited by complicated operations, optical instability, and difficulty in decapsulation. Here, we report a simple, robust, and solvent-free biomolecule encapsulation strategy based on gallium liquid metal, which features a low-temperature phase transition, self-healing, high hermetic sealing, and intrinsic resistance to optical damage. We sandwiched the biomolecules between solid gallium films and then welded the films at low temperature for direct sealing. The gallium not only protects DNA and enzymes from various kinds of physical and chemical damage but also allows on-demand release of biomolecules by applying vibration to break the liquid gallium. We demonstrated that a DNA-coded image file can be recovered with up to 99.9% sequence retention after an accelerated aging test. We also showed a practical application of the controllable release of bioreagents in a one-pot RPA-CRISPR/Cas12a reaction for SARS-CoV-2 screening, with a low detection limit of 10 copies within 40 min. This work may facilitate the development of robust and stimuli-responsive biomolecule capsules by using low-melting metals for biotechnology.

  • Article type: Journal Article
    Today's digital data storage systems typically offer advanced data recovery solutions to address catastrophic data loss, such as software-based disk sector analysis or physical-level data retrieval methods for conventional hard disk drives. However, DNA-based data storage currently relies solely on the inherent error correction properties of the methods used to encode digital data into strands of DNA. Any error that cannot be corrected using the redundancy added by DNA encoding methods results in permanent data loss. To provide data recovery for DNA storage systems, we present a method to automatically reconstruct corrupted or missing data stored in DNA using fountain codes. Our method exploits the relationships between packets encoded with fountain codes to identify and rectify corrupted or lost data. Furthermore, we present file type-specific and content-based data recovery methods for three file types, illustrating how a fusion of fountain encoding-specific redundancy and knowledge about the data can effectively recover information in a corrupted DNA storage system, in both an automatic and a guided manual manner. To demonstrate our approach, we introduce DR4DNA, a software toolkit that contains all methods presented. We evaluate DR4DNA using both in-silico and in-vitro experiments.
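    The packet relationships that fountain codes provide can be illustrated with a minimal sketch. This is not DR4DNA's implementation: the function names are hypothetical, the chunk subsets are chosen by hand rather than drawn from a degree distribution, and only a basic peeling decoder is shown.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_packet(chunks, indices):
    """A packet is the XOR of the chosen chunks plus the set of their indices."""
    payload = bytes(len(chunks[0]))
    for i in indices:
        payload = xor_bytes(payload, chunks[i])
    return (frozenset(indices), payload)

def peel_decode(packets, n_chunks):
    """Peeling decoder: repeatedly resolve any packet that covers
    exactly one chunk that is still unknown."""
    known = {}
    progress = True
    while progress and len(known) < n_chunks:
        progress = False
        for indices, payload in packets:
            unknown = [i for i in indices if i not in known]
            if len(unknown) != 1:
                continue
            # Strip the already-known chunks out of the payload.
            for i in indices:
                if i in known:
                    payload = xor_bytes(payload, known[i])
            known[unknown[0]] = payload
            progress = True
    return known
```

    Because every packet is an XOR of source chunks, a chunk that is lost or corrupted in one packet can often be rebuilt from other packets that cover it, which is the kind of redundancy the abstract refers to.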

  • Article type: Journal Article
    Environmental DNA (eDNA) workflows contain many familiar molecular-lab techniques, but also employ several unique methodologies. When working with eDNA, it is essential to avoid contamination from the point of collection through preservation and to select a meaningful negative control. As eDNA can be obtained from a variety of samples and habitats (e.g., soil, water, air, or tissue), protocols will vary depending on usage. Samples may require additional steps to dilute, block, or remove inhibitors, or to physically break up samples or filters. Thereafter, standard DNA isolation techniques (kit-based or phenol:chloroform:isoamyl alcohol [PCI]) are employed. Once DNA is extracted, it is typically quantified using a fluorometer. Yields vary greatly, but are important to know prior to amplification of the gene(s) of interest. Long-term storage of both the sampled material and the extracted DNA is encouraged, as it provides a backup for spilled or contaminated samples, lost data, reanalysis, and future studies using newer technology. Storage in a freezer is often ideal; however, some storage buffers (e.g., Longmire's) require that filters or swabs be kept at room temperature to prevent precipitation of buffer-related solutes. These baseline methods for eDNA isolation, validation, and preservation are detailed in this protocol chapter. In addition, we outline a cost-effective, homebrew extraction protocol optimized to extract eDNA.

  • Article type: Journal Article
    In the absence of a DNA template, the ab initio production of long double-stranded DNA molecules of predefined sequence is particularly challenging. The DNA synthesis step remains a bottleneck for many applications, such as functional assessment of ancestral genes, analysis of alternative splicing, or DNA-based data storage. In this report, we propose a fully in vitro protocol that uses Golden Gate assembly to generate very long double-stranded DNA molecules from commercially available short DNA blocks in less than 3 days. This application allowed us to streamline the production of a 24 kb-long DNA molecule storing part of the Declaration of the Rights of Man and of the Citizen of 1789. The DNA molecule produced can be readily cloned into a suitable host/vector system for amplification and selection.

  • Article type: Journal Article
    Due to its high information density, DNA is very attractive as a data storage system. However, a major obstacle is the high cost and long turnaround time for retrieving DNA data with next-generation sequencing. Herein, the use of a microfluidic very large-scale integration (mVLSI) platform is described to perform highly parallel and rapid readout of data stored in DNA. Additionally, it is demonstrated that multi-state data encoded in DNA can be deciphered with on-chip melt-curve analysis, thereby further increasing the data content that can be analyzed. The pairing of mVLSI network architecture with exquisitely specific DNA recognition gives rise to a scalable platform for rapid DNA data reading.

  • Article type: Journal Article
    DNA, as the storage medium in organisms, can address the shortcomings of existing electromagnetic storage media, such as low information density, high maintenance power consumption, and short storage time. Current research on DNA storage mainly focuses on designing encoders that convert binary data into DNA base data meeting biological constraints. We have created a new Chinese character code table that enables exceptionally high information storage density for Chinese characters (compared to traditional UTF-8 encoding). To meet biological constraints, we have devised a DNA shift coding scheme with low algorithmic complexity, which can encode any DNA strand, even one containing excessively long homopolymers. The designed DNA sequence will be stored in a double-stranded plasmid of 744 bp, ensuring high reliability during storage. Additionally, the plasmid's resistance to environmental interference ensures long-term stable information storage. Moreover, it can be replicated at a lower cost.
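    Two quantities mentioned in this abstract can be made concrete with a short sketch. It does not reproduce the authors' code table or shift coding scheme, whose details are not given here; it only shows the UTF-8 baseline (assuming a plain 2 bits per nucleotide mapping) and a homopolymer-length check. Both function names are hypothetical.

```python
from itertools import groupby

def utf8_nucleotides(text: str) -> int:
    """Nucleotides needed to store text as UTF-8 at 2 bits per nucleotide
    (4 nucleotides per byte). Common Chinese characters take 3 bytes each."""
    return len(text.encode("utf-8")) * 4

def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of identical bases in a DNA string."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)
```

    Under these assumptions a common Chinese character costs 12 nt in UTF-8, which is the baseline any dedicated character code table has to beat, and `longest_homopolymer` gives the value an encoder would compare against its homopolymer limit.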

  • Article type: Journal Article
    BACKGROUND: In single-stranded DNAs/RNAs, secondary structures are very common, especially in long sequences. It has been recognized that a high degree of secondary structure in DNA sequences can interfere with the correct writing and reading of information in DNA storage. However, how to circumvent this side effect has seldom been studied.
    METHODS: As the degree of secondary structure of a DNA sequence is closely related to the magnitude of the free energy released in the complicated folding process, we first investigate the free-energy distribution at different encoding lengths based on randomly generated DNA sequences. We then construct a bidirectional long short-term memory (BiLSTM)-attention deep learning model to predict the free energy of sequences.
    RESULTS: Our simulation results indicate that the free energy of DNA sequences of a specific length follows a right-skewed distribution and that the mean increases as the length increases. Given a tolerable free-energy threshold of 20 kcal/mol, we could control the ratio of serious secondary structures in the encoding sequences to within a significance level of 1% by selecting a feasible encoding length of 100 nt. Compared with traditional deep learning models, the proposed model achieves better prediction performance in both the mean relative error (MRE) and the coefficient of determination (R²), reaching MRE = 0.109 and R² = 0.918 in the simulation experiment. The combination of the BiLSTM and the attention module can handle long-term dependencies and capture the features of base pairing. Further, the prediction has linear time complexity, which makes it suitable for detecting sequences with severe secondary structures in future large-scale applications. Finally, 70 of 94 predicted free energies can be screened out on a real dataset. This demonstrates that the proposed model can screen out some highly suspicious sequences that are prone to produce more errors and low sequencing copies.

  • Article type: Journal Article
    DNA is a high-density, long-term stable, and scalable storage medium that can meet the increased demands on storage media resulting from the exponential growth of data. Existing DNA storage encoding schemes tend to achieve high-density storage but do not fully consider the local and global stability of DNA sequences or the read and write accuracy of the stored information. To address these problems, this article presents a graph-based De Bruijn Trim Rotation Graph (DBTRG) encoding scheme. Through an XOR between the proposed dynamic binary sequence and the original binary sequence, k-mers can be divided into the De Bruijn Trim graph, and the stored information can be compressed according to the overlapping relationship. The simulated experimental results show that DBTRG ensures base balance and diversity, reduces the likelihood of undesired motifs, and improves the stability of DNA storage and data recovery. Furthermore, DBTRG maintains an encoding rate of 1.92 while storing 510 KB images and introduces novel approaches and concepts for DNA storage encoding methods.
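    The role of the XOR step and of base balance can be illustrated with a minimal sketch. It is not the DBTRG scheme: the fixed repeating mask stands in for the paper's dynamic binary sequence, the 2 bits per nucleotide mapping is a generic assumption, and no De Bruijn graph is built.

```python
BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T

def xor_mask(data: bytes, mask: bytes) -> bytes:
    """XOR data with a repeating mask; applying it twice restores the data."""
    return bytes(b ^ mask[i % len(mask)] for i, b in enumerate(data))

def to_dna(data: bytes) -> str:
    """Map each byte to four bases, most significant bit pair first."""
    return "".join(BASES[(b >> shift) & 3] for b in data for shift in (6, 4, 2, 0))

def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    return sum(c in "GC" for c in seq) / len(seq)
```

    For example, eight zero bytes map directly to a 32 nt run of `A` with no G or C at all, while XORing them with the one-byte mask `0x1B` first yields `ACGT` repeated eight times, with balanced GC content and no homopolymers. The XOR is reversible, so no information is lost.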