News
AI will unravel secrets of non-coding genes
From smart chatbots to apps that can write entire articles, artificial intelligence (AI) is becoming an increasingly ubiquitous part of our lives. Michael Schon, a research associate at Wageningen University & Research, is designing an AI tool that can compare non-coding RNAs across plant genomes. The tool is expected to accelerate and simplify the future development of new plant varieties with greater resistance to drought or diseases, for example. Schon has received a Veni grant to support his research.
Proteins are the building blocks for cells in organisms. The instructions for making these proteins are stored in genes and carried (coded) by RNA copied from those genes. Alongside these coding RNAs, some genes can produce non-coding RNAs: in other words, RNA that doesn’t include instructions to make a protein. This type of RNA also plays an important role in the development of organisms, says Michael Schon. “For example, they can activate genes, or do the opposite and switch them off. This will affect the appearance of a plant and the properties it has. Certain important non-coding RNAs also determine whether a plant reaches maturity at all.”
Relatives within the same family
Non-coding RNA could also potentially reveal why a plant species belongs to a particular family yet has different characteristics. In previous research, Schon identified non-coding RNAs of Arabidopsis thaliana (thale cress). This plant is used by plant scientists as a model organism. “Arabidopsis belongs to the Brassicaceae family, along with important crops like broccoli, cauliflower and kohlrabi. This family is also known as the mustard or crucifer family. However, it’s difficult to compare non-coding RNAs of Arabidopsis with those of other plants in the mustard family because previous work in these species has focused mainly on protein-coding genes.”
Limited annotation of non-coding RNA
This means that a comparison between plants requires separate gene annotation for the non-coding RNA for each crop. Through his Veni project, Schon is looking for new ways to identify non-coding RNAs by using knowledge from related species. “More than 200 genome sequences are available for plants within the mustard family. Each genome is stored as a large text file consisting of millions of letters that represent the bases of a DNA molecule (A, C, T and G). Because the non-coding bits aren’t catalogued (annotated) properly in these genomes, it’s impossible to compare all the non-coding genes scattered inside this mountain of data. We need new strategies and tools for that. I’m trying to develop those.”
A small part of each genome
The first problem is knowing where in the genome to look. One of the tools Schon is developing is something he calls GeneSketch. To find the corresponding parts of different genomes, he’s using a method called Minimizer Sketch. “The idea behind the Minimizer Sketch is that you only need to look at a small piece of DNA – a sketch – rather than the entire sequence,” says Schon. “That means you only have to pay attention to a few thousand characters per genome to perform a comparison, rather than millions. The Minimizer Sketch was previously used to build a tree of primate evolution, which includes humans and their closest relatives. It turned out that a very accurate family tree of our ancestors can be made from sketches made of less than 1% of the whole genomes. A minimizer sketch therefore is a very efficient way to estimate how similar pieces of DNA are to each other, so it should also be useful for comparing genomes within the mustard family.”
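The sketching idea can be illustrated with a few lines of Python. This is a minimal sketch of a generic window-minimizer scheme, not the GeneSketch implementation: the function names, the k-mer length `k`, the window size `w` and the use of Jaccard similarity are all assumptions made for illustration. Each sequence is cut into overlapping k-letter words, only the word with the smallest hash in each window is kept, and two sequences are compared by how many kept words they share.

```python
import hashlib


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (hypothetical helper)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_sketch(seq, k=5, w=4):
    """Return the set of window minimizers of seq.

    Assumed parameters: k is the k-mer length, w is the number of
    consecutive k-mers per window. Sequences shorter than k + w - 1
    yield an empty sketch.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    hashes = [kmer_hash(km) for km in kmers]
    sketch = set()
    for start in range(len(kmers) - w + 1):
        # Keep only the k-mer with the smallest hash in this window.
        best = min(range(start, start + w), key=lambda i: hashes[i])
        sketch.add(kmers[best])
    return sketch


def jaccard(a, b):
    """Share of sketch entries that two sketches have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy sequences, invented for this example.
seq_a = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGC"
seq_b = "ACGTTGCATGCCGATAGCTAGGCTTACGTTAGC"  # one base differs from seq_a
similarity = jaccard(minimizer_sketch(seq_a), minimizer_sketch(seq_b))
```

Because each window contributes at most one word, the sketch is never larger than the full set of k-mers, and is usually much smaller. Real tools choose far larger values of `k` and `w` for whole genomes.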
Same technology as ChatGPT
Once you know where to look, the next step is to understand what you are looking at. The technology Schon plans to use in GeneSketch is the same as that which is currently used in other AI tools, such as ChatGPT. “It’s something called ‘transformer’ technology,” says Schon. “You can ask a transformer to fill in a missing word in a sentence, for example. Initially, the transformer gives you a random word because it has never seen words before. But if you train it on millions of example sentences, it slowly learns to guess the right words by paying attention to patterns in the text. After training, a large language model like ChatGPT becomes very good at certain tasks, like answering questions or translating from one language to another. A transformer can be trained to learn not just human languages, but also the language of DNA, which has its own distinct patterns. I am working on a model to detect patterns in the DNA of many different species, and translate those patterns into a language that we as humans can understand.”
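The fill-in-the-missing-word exercise can be set up for DNA as well. The Python below is a minimal sketch of how one such training example might be prepared, assuming the sequence is split into non-overlapping three-letter "words" and one of them is hidden. It does not include a transformer, and the names `tokenize`, `make_masked_example` and `MASK` are hypothetical, not taken from GeneSketch.

```python
import random

MASK = "[MASK]"  # placeholder token (assumed name)


def tokenize(seq, k=3):
    """Split a DNA string into non-overlapping k-letter words.

    Trailing letters that do not fill a whole word are dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def make_masked_example(tokens, rng):
    """Hide one randomly chosen word.

    Returns the masked word list, the hidden position, and the
    hidden word that a model would be trained to guess.
    """
    pos = rng.randrange(len(tokens))
    masked = list(tokens)  # copy, so the input list stays unchanged
    target = masked[pos]
    masked[pos] = MASK
    return masked, pos, target


rng = random.Random(0)  # fixed seed so the example is repeatable
tokens = tokenize("ACGTACGTA")
masked, pos, target = make_masked_example(tokens, rng)
```

During training, a model would see `masked` and be scored on whether it predicts `target` at position `pos`. Repeating this over millions of sequences is what lets it pick up the patterns described above.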
Model must be trained
Schon will train the transformer for GeneSketch to pay attention to how genes change across different species, especially non-coding genes. But he expects to come up against some challenges along the way. “One important issue is reliability. The transformer is a relatively new technology, and it makes mistakes. ChatGPT, for example, was trained on many different sources of text, but if you ask it about a topic it never saw during training, it needs to make something up. You hope that it makes up something reasonable based on the patterns it has seen, but this is never a guarantee. You obviously want to avoid nonsense output. The more you train a transformer, the less nonsense it produces, but training can cost a lot of time and money. Is it better to train the model completely from scratch or build on existing models? I am trying both approaches.”
Potential of GeneSketch
Schon hopes to have a prototype of GeneSketch after the first year of the project, which started in October 2023. He plans to use it to create gene annotations for the entire mustard family. The tool could be useful not just for the research sector but also for the agricultural industry, says Schon. “It could, for example, provide seed breeders with a quick way of understanding the DNA of a crop and its wild relatives. By learning more about how crops have been able to develop unique traits over the centuries, breeders could make more informed decisions for improving traits, such as making crops more resilient to climate change. So, the potential impact could be huge.”