DiDAX is a consortium of eight research groups from academia and industry, working together under EU EIC funding to innovate DNA-based data applications. Our work is based on novel encoding and decoding algorithms; reduced-cost, flexible synthesis; novel chemistries; new protection and embedding technologies; and material science innovations.
By innovating in these disciplines, DiDAX expands the applicability of long-term archival DNA-based data storage and develops new applications to protect and verify object authenticity.
DNA has many advantages as an information storage medium. First, density: it can store extraordinary amounts of data in an extremely small volume. To illustrate the scale, if all the information currently hosted on YouTube were encoded in DNA, it would, in theory, fit in a single shoe box. Second, DNA is inherently stable, and we understand how to preserve it for extremely long periods. Third, DNA is environmentally friendly: it has little associated energy cost compared to magnetic media in data centers, and it reduces electronic waste. Finally, longevity and universality: DNA is the fundamental building block of life. As long as humans (and biology) exist, we will possess the tools and knowledge to read, write, and interpret DNA, unlike legacy storage technologies such as magnetic tapes, VCRs, or obsolete disk formats that have become unreadable.
Together, these properties make DNA an attractive medium for long-term, archival data storage.
Here is how data can be stored in DNA, as we do in DiDAX.
A DNA strand is a sequence built from four nucleotides, which together form a four-letter alphabet (A, C, G, T). Since modern synthesis technologies allow us to chemically create almost any DNA sequence we choose, we can treat DNA as a programmable storage medium. The process begins by taking a digital file and converting it into a binary representation (e.g., 00110010101). We then encode the binary string into DNA letters using a predefined mapping such as 00 → A, 01 → C, 10 → G, 11 → T. After encoding, the DNA sequences are chemically synthesized and stored in a physical container. For better sustainability, encapsulation chemistry may also be applied. Error correction codes are also used to address corruption that can occur in the process.
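The encoding step can be illustrated with a minimal Python sketch. It uses only the example mapping given above (00 → A, 01 → C, 10 → G, 11 → T); the function names are hypothetical and chosen for illustration. It is not the DiDAX encoder: a practical encoder would also add error correction and respect synthesis constraints, which this sketch leaves out.

```python
# Example mapping from the text: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def bytes_to_bits(data: bytes) -> str:
    """Return the binary representation of data, 8 bits per byte."""
    return "".join(f"{byte:08b}" for byte in data)


def encode_dna(data: bytes) -> str:
    """Encode bytes as a DNA string, two bits per nucleotide.

    Illustrative only: no error correction and no synthesis constraints.
    """
    bits = bytes_to_bits(data)
    # Each byte contributes 8 bits, so the bit string always splits
    # evenly into 2-bit pairs.
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


if __name__ == "__main__":
    # "H" is 01001000 and "i" is 01101001 in ASCII.
    print(encode_dna(b"Hi"))  # prints CAGACGGC
```

Each nucleotide carries two bits under this mapping, so one byte becomes four nucleotides.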
To retrieve the information, the stored DNA is read using a sequencing technology (such as Illumina or Oxford Nanopore). Sequencing reveals the nucleotide order of the strands, which is then decoded by applying the same mapping in reverse, together with algorithmic error correction, to reconstruct the original binary data and, from it, the original file.
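The retrieval step can be sketched in the same spirit. The Python below reverses the example mapping (A → 00, C → 01, G → 10, T → 11). As a simple stand-in for error correction it takes a per-position majority vote over several reads of the same strand. That vote is an assumption made for illustration, not the error correction used in DiDAX, and it only handles substitution errors in reads of equal length. The function names are hypothetical.

```python
from collections import Counter

# Reverse of the example mapping from the text.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}


def consensus(reads):
    """Majority vote per position over equal-length reads.

    A simplified stand-in for error correction: it assumes substitution
    errors only (no insertions or deletions).
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must all have the same length")
    # zip(*reads) yields one column of bases per strand position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def decode_dna(sequence):
    """Decode a DNA string back to bytes, two bits per nucleotide."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    if len(bits) % 8 != 0:
        raise ValueError("sequence does not encode a whole number of bytes")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    # Three simulated reads of one strand; the third has a G -> T substitution.
    reads = ["CAGACGGC", "CAGACGGC", "CATACGGC"]
    print(decode_dna(consensus(reads)))  # prints b'Hi'
```

Real sequencing reads also contain insertions and deletions, so practical decoders align or cluster reads before correcting them and rely on dedicated error correction codes rather than a plain vote.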