
Since the discovery of the double helix, researchers have been searching for the knowledge encoded in DNA. Seventy years later, it is clear that the information hidden in DNA is multi-layered. Only 1 to 2 percent of the genome consists of genes, the sequences that code for proteins.
“DNA has many functions that go beyond protein coding. Some sequences regulate genes, others serve structural purposes, and most sequences fulfill several functions simultaneously. At present, we do not understand the significance of most of the DNA. For the areas outside of genes, we seem to have only scratched the surface. This is where AI and large language models can help,” says Dr. Anna Poetsch, research group leader at BIOTEC.
DNA as language
Large language models such as GPT have changed our understanding of language. Trained exclusively with text, the language models developed the ability to use language in many contexts.
“DNA is the code of life. Why shouldn’t it be treated like a language?” asks Dr. Poetsch. The Poetsch team trained a large language model on a reference human genome. The resulting tool, called GROVER (“Genome Rules Obtained via Extracted Representations”), can be used to extract biological meaning from DNA.
“GROVER has learned the rules of DNA. In terms of language, we talk about grammar, syntax and semantics. For DNA, this means learning the rules of sequences, the order of nucleotides and sequences and their meaning. Similar to how GPT models learn human languages, GROVER has basically learned to ‘speak DNA’,” explains Dr. Melissa Sanabria, the researcher behind the project.
The team showed that GROVER can not only accurately predict the DNA sequence that follows a given stretch, but can also be used to extract biologically meaningful information from context. For example, it can identify the start of genes or protein binding sites on DNA. GROVER also learns processes that are generally considered “epigenetic”, i.e. those that take place on the DNA and were not previously considered to be “coded”.
“It is fascinating that by training GROVER with just the DNA sequence, without any additional functional data, we can actually extract information about biological function. For us, this shows that function, including some epigenetic information, is also encoded in the sequence,” says Dr. Sanabria.
The DNA dictionary
“DNA is similar to language. It consists of four letters that form sequences, and the sequences carry a meaning. However, unlike a language, there is no concept of words,” says Dr. Poetsch. DNA is written in four letters (A, T, G and C), but it has no predefined “words”: there are no fixed letter sequences of different lengths that combine to form genes or other meaningful sequences.
To train GROVER, the team first had to create a DNA dictionary. They used a trick from compression algorithms. “This step is crucial and distinguishes our DNA language model from previous attempts,” says Dr. Poetsch.
“We analyzed the entire genome and looked for letter combinations that occur most frequently. We started with two letters and searched the DNA again and again to build it up to the most common multi-letter combinations. In this way, over about 600 cycles, we fragmented the DNA into ‘words’ that allow GROVER to best predict the next sequence,” explains Dr. Sanabria.
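The procedure Dr. Sanabria describes resembles byte-pair encoding, a technique that originated in data compression: count adjacent pairs of tokens, merge the most frequent pair into a new token, and repeat. The following Python sketch illustrates that idea on a toy sequence. It is not the GROVER team's implementation; the function name, the toy sequence and the stopping rule are assumptions made for illustration.

```python
from collections import Counter


def learn_bpe_vocabulary(sequence, num_merges):
    """Learn up to `num_merges` merged tokens ("words") from a DNA string.

    Hypothetical helper for illustration only. Starts from single
    nucleotides and repeatedly merges the most frequent adjacent pair.
    """
    tokens = list(sequence)  # start with one token per letter
    merges = []              # new "words", in the order they were created

    for _ in range(num_merges):
        # Count every adjacent token pair in the current tokenization.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # assumption: stop once no pair repeats

        merges.append(left + right)

        # Rewrite the token list, replacing each occurrence of the pair.
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

    return merges, tokens


if __name__ == "__main__":
    # Toy input; the article describes about 600 cycles on the whole genome.
    words, tokenized = learn_bpe_vocabulary("ATATATGC", num_merges=600)
    print(words)      # ['AT', 'ATAT']
    print(tokenized)  # ['ATAT', 'AT', 'G', 'C']
```

On a real genome the number of cycles determines the vocabulary size, so it acts as a tuning parameter: too few cycles leave mostly short fragments, while too many produce rare, very long "words".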
The promise of AI in genomics
GROVER promises to unlock the different levels of the genetic code. DNA contains important information about what makes us human, our susceptibilities to disease and our responses to treatments.
“We believe that understanding the rules of DNA through a language model will help us uncover the depths of biological meaning hidden in DNA. This should advance both genomics and personalized medicine,” says Dr. Poetsch.
Original publication
Melissa Sanabria, Jonas Hirsch, Pierre M. Joubert, and Anna R. Poetsch: DNA language model GROVER learns sequence context in the human genome. Nature Machine Intelligence (July 2024)
Link: https://doi.org/10.1038/s42256-024-00872-0 
About the Biotechnology Center (BIOTEC)
The Biotechnology Center (BIOTEC) was founded in 2000 as a central scientific institution of TU Dresden with the aim of combining cutting-edge research approaches in molecular and cell biology with the engineering sciences, which are traditionally strong in Dresden. Since 2016, BIOTEC has been one of three institutes of the central scientific institution Center for Molecular and Cellular Bioengineering (CMCB) at TU Dresden. BIOTEC occupies a central position in research and teaching in the research focus area Molecular Bioengineering and combines cell biological, biophysical and bioinformatic approaches. It thus makes a decisive contribution to raising the profile of TU Dresden in the fields of health sciences, biomedicine and bioengineering.
Scientific contact
Dr. Anna Poetsch
E-mail: anna.poetsch@tu-dresden.de
Further links
👉 www.tu-dresden.de 
👉 www.tud.de/cmcb 
👉 www.tud.de/biotec  
Photo: Magdalena Gonciarz, generated with Dall-E3