TU Dresden: Cracking the code of life – New AI model deciphers the hidden language of DNA

August 2, 2024. DNA contains the basic information for life. Understanding how this information is stored and organized has been one of the greatest scientific challenges of the last century. With GROVER, a new large language model trained on human DNA, researchers can now attempt to decode the complex information hidden in our genome. Developed by a team at the Biotechnology Center (BIOTEC) of the Technische Universität Dresden, GROVER treats human DNA like a language and learns its rules and relationships to derive functional information about DNA sequences. This new tool, published in “Nature Machine Intelligence”, has the potential to revolutionize genomics and advance personalized medicine.

Artistic representation of the large language model trained on DNA sequences. Image: Magdalena Gonciarz, generated with DALL-E 3

DNA as language

Large language models such as GPT have changed our understanding of language. Trained exclusively on text, these models have developed the ability to use language in many contexts.

“DNA is the code of life. Why shouldn’t it be treated like a language?” asks Dr. Anna Poetsch. The Poetsch team trained a large language model on a reference human genome. The resulting tool, called GROVER, or “Genome Rules Obtained via Extracted Representations”, can be used to extract biological meaning from DNA.

“GROVER has learned the rules of DNA. In terms of language, we talk about grammar, syntax and semantics. For DNA, this means learning the rules of sequences, the order of nucleotides and sequences and their meaning. Similar to how GPT models learn human languages, GROVER has basically learned to ‘speak DNA’,” explains Dr. Melissa Sanabria, the researcher behind the project.

The team showed that GROVER can not only accurately predict the DNA sequence that follows a given stretch, but can also be used to extract biologically meaningful information from context. For example, it can identify the start of genes or protein binding sites on DNA. GROVER also learns processes that are generally considered “epigenetic”, i.e. regulatory processes that take place on the DNA and have not previously been considered “coded” in the sequence itself.

“It is fascinating that by training GROVER with just the DNA sequence, without any additional functional data, we can actually extract information about biological function. For us, this shows that function, including some epigenetic information, is also encoded in the sequence,” says Dr. Sanabria.

The DNA dictionary

“DNA is similar to language. It consists of four letters that form sequences, and the sequences carry a meaning. However, unlike a language, there is no concept of words,” says Dr. Poetsch. DNA is written in four letters (A, T, G and C) and contains genes, but it has no predefined “words”: no fixed set of sequences of different lengths that combine to form genes or other meaningful sequences.

To train GROVER, the team first had to create a DNA dictionary. They used a trick from compression algorithms. “This step is crucial and distinguishes our DNA language model from previous attempts,” says Dr. Poetsch.

“We analyzed the entire genome and looked for letter combinations that occur most frequently. We started with two letters and searched the DNA again and again to build it up to the most common multi-letter combinations. In this way, over about 600 cycles, we fragmented the DNA into ‘words’ that allow GROVER to best predict the next sequence,” explains Dr. Sanabria.
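The procedure Dr. Sanabria describes resembles byte-pair encoding, a technique that originated in data compression: start from single letters, repeatedly find the most frequent adjacent pair, and merge it into a new "word". The following is a minimal Python sketch of that idea on a toy sequence, not the team's actual implementation. The function names (`most_frequent_pair`, `merge_pair`, `build_vocabulary`) and the example sequence are illustrative assumptions.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_vocabulary(sequence, cycles):
    """Run up to `cycles` merge steps (hypothetical helper, toy scale only).

    Returns the vocabulary (single letters plus merged "words") and the
    sequence split into those words.
    """
    tokens = list(sequence)            # start from single nucleotides
    vocabulary = sorted(set(tokens))
    for _ in range(cycles):
        pair = most_frequent_pair(tokens)
        if pair is None:               # sequence collapsed into one token
            break
        tokens = merge_pair(tokens, pair)
        vocabulary.append(pair[0] + pair[1])
    return vocabulary, tokens


if __name__ == "__main__":
    vocab, words = build_vocabulary("ATATATGC", cycles=2)
    print(vocab)
    print(words)
```

On the toy input, the first cycle merges the frequent pair "A" + "T" into "AT", and the second merges "AT" + "AT" into "ATAT". Applied to a whole genome over hundreds of cycles, the same loop would produce a vocabulary of frequent multi-letter "words" of varying length.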

The promise of AI in genomics

GROVER promises to unlock the different levels of the genetic code. DNA contains important information about what makes us human, our susceptibilities to disease and our responses to treatments.

“We believe that understanding the rules of DNA through a language model will help us uncover the depths of biological meaning hidden in DNA. This should advance both genomics and personalized medicine,” says Dr. Poetsch.

Original publication

Melissa Sanabria, Jonas Hirsch, Pierre M. Joubert, and Anna R. Poetsch: DNA language model GROVER learns sequence context in the human genome. Nature Machine Intelligence (July 2024)

Link: https://doi.org/10.1038/s42256-024-00872-0

About the Biotechnology Center (BIOTEC)

The Biotechnology Center (BIOTEC) was founded in 2000 as a central scientific institution of TU Dresden with the aim of combining cutting-edge research approaches in molecular and cell biology with the engineering sciences, which are traditionally strong in Dresden. Since 2016, BIOTEC has been one of three institutes of the central scientific institution Center for Molecular and Cellular Bioengineering (CMCB) at TU Dresden. BIOTEC occupies a central position in research and teaching in the research focus area Molecular Bioengineering and combines cell biological, biophysical and bioinformatic approaches. It thus makes a decisive contribution to raising the profile of TU Dresden in the fields of health sciences, biomedicine and bioengineering.