PhD defence
Knowledge-driven approaches to improve genomic prediction in plants
Summary
Genomic Selection (GS) is a method to speed up the breeding process by precisely selecting the best plants based on the cumulative effects of Single Nucleotide Polymorphisms (SNPs), estimated using Genomic Prediction (GP) models. Improvements in GP can therefore directly support GS and are central to this work. I propose a knowledge-driven approach, based on the idea that whatever we already know about SNPs, plants, phenotypes or the environment should be incorporated into the model to improve its predictions. To this end, I used prior biological knowledge to group and prioritise SNPs based on the functional information of genes, and leveraged the shared genetics between traits. Using conventional statistical methods on complex traits such as photosynthesis and biomass, I found many functionally related gene groups that significantly improved predictions compared with the benchmark. Moreover, a novel deep learning framework that prioritises these groups was developed. I further improved predictions of biomass by using photosynthesis as a secondary trait.
Plant breeding is the science of developing new, genetically superior cultivars with favourable characteristics to meet human requirements. During the green revolution and up to 2000, global agricultural production doubled compared with the level reached historically up to that time, but the challenges are growing as well. The current and anticipated challenges of rapidly growing human and animal populations, climate change and its consequent effects on the environment and food security cannot realistically be met at the current pace of genetic improvement achieved through conventional breeding practices. Improving plant breeding processes is therefore urgently needed for sustainable growth of the global ecosystem.
Over the past couple of decades, technological advancements in High Throughput Sequencing (HTS) of genomic DNA have paved the way to narrow the gap between the genetic potential of existing germplasm and the projected improvement. They have equipped plant breeding with high-resolution molecular information, speeding up breeding so that desired characteristics are acquired much more quickly than in conventional breeding, and with increased precision and accuracy. Genomic Selection (GS) is one such technique: genome-wide DNA polymorphisms, often single nucleotide polymorphisms (SNPs), are used to estimate their cumulative effects as breeding values, and the best germplasm in a population is selected as future parents using only these breeding values, long before the true phenotypes appear. Pivotal to the GS framework is a genomic prediction (GP) model, which capitalises on accurate SNP information to estimate breeding values and steer selection decisions. An improvement in GP therefore translates directly into an improvement in GS-based breeding, and is central to this thesis.
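As a minimal illustration of how a GP model turns SNP data into breeding values, the sketch below fits per-SNP effects by ridge regression (in the spirit of RR-BLUP) and sums them into genomic estimated breeding values for unphenotyped candidates. It is not the method used in the thesis: the simulated genotypes, the penalty `lam` and all function names are assumptions made for illustration only.

```python
import numpy as np

def fit_snp_effects(X, y, lam):
    """Ridge estimate of per-SNP effects; lam is an assumed, fixed penalty."""
    mu = y.mean()
    col_means = X.mean(axis=0)
    Xc = X - col_means                      # centre genotypes on training means
    n = Xc.shape[0]
    # Dual form of ridge regression: cheaper when SNPs outnumber plants.
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(n), y - mu)
    beta = Xc.T @ alpha
    return mu, col_means, beta

def breeding_values(X, mu, col_means, beta):
    """Breeding value = intercept + cumulative effect of all SNPs."""
    return mu + (X - col_means) @ beta

# Simulated data (illustrative only): 200 plants, 1000 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.integers(0, 3, size=(n, p)).astype(float)
true_beta = rng.normal(0, 0.05, size=p)
y = X @ true_beta + rng.normal(0, 1.0, size=n)

train, test = np.arange(150), np.arange(150, 200)
mu, means, beta = fit_snp_effects(X[train], y[train], lam=float(p))
gebv = breeding_values(X[test], mu, means, beta)

# Prediction accuracy: correlation between predicted and observed phenotypes.
accuracy = np.corrcoef(gebv, y[test])[0, 1]
# Selection step: keep the 10 candidates with the highest predicted values.
selected = test[np.argsort(gebv)[::-1][:10]]
```

In practice the penalty would be derived from estimated variance components or chosen by cross-validation rather than fixed by hand.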
Various factors, including genetic architecture, population structure, and the characteristics of the genotype and phenotype data, affect the prediction performance of GP models. As a result, several methods and numerous extensions have been proposed to account for these factors. Most commonly, parametric Linear Mixed-effect Models (LMMs), which implement GP as a whole-genome regression, and many non-parametric machine learning (ML) and deep learning methods have been applied. It is important to explore which methods suit which GP problem characteristics (Chapter 2). I found that a definite conclusion is still hard to draw, but a general guideline can be given: ensemble ML methods and Bayesian LMMs (e.g. BayesA and BayesB) are a reasonable choice for traits predominantly characterised by the presence of large effects. Complex polygenic traits governed by many small effects, on the other hand, are hard to predict with any method, and prediction performance is generally low for high-dimensional SNP data. This led me to explore possible ways to improve GP for complex traits. To this end, this thesis presents the development and application of novel, improved GP methodologies for complex traits.
One way to improve GP is to incorporate the wealth of prior biological knowledge that is freely available from public repositories and has been curated through multiple experimental and computational approaches. How to incorporate such information into GP models, however, remains an open question. One strategy is to partition the genome-wide SNPs into groups based on this information, so that the groups can be differentially prioritised in the model. I first applied this strategy using LMMs, forming SNP groups from the functional information of the genes to which the SNPs belong (Chapter 3). For this purpose, I used gene ontology (GO) and coexpressed gene clusters (COEX). With several trait-related GO and COEX groups, the approach increased the prediction accuracy of the commonly used Genomic Best Linear Unbiased Prediction (GBLUP) method when tested on growth-related traits in Arabidopsis thaliana, namely the photosynthetic light use efficiency of photosystem II and projected leaf area.
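One way to picture differential prioritisation in an LMM is a GBLUP with two genomic relationship matrices, one built from the SNPs in a functional group and one from the remaining SNPs, each with its own variance. The sketch below is written under that assumption and is not necessarily the exact model of Chapter 3. The group membership, the simulated data and the variance components `var_group`, `var_rest` and `var_e` are hypothetical; in a real analysis the variances would be estimated, for example by REML.

```python
import numpy as np

def grm(X):
    """VanRaden-style genomic relationship matrix from 0/1/2 genotypes."""
    freq = X.mean(axis=0) / 2.0
    Z = X - 2.0 * freq
    return Z @ Z.T / (2.0 * np.sum(freq * (1.0 - freq)))

def predict_two_kernel(K_group, K_rest, y, train, test,
                       var_group, var_rest, var_e):
    """BLUP of test phenotypes with separate variances for the two SNP sets."""
    K = var_group * K_group + var_rest * K_rest
    mu = y[train].mean()
    V = K[np.ix_(train, train)] + var_e * np.eye(len(train))
    return mu + K[np.ix_(test, train)] @ np.linalg.solve(V, y[train] - mu)

# Simulated data (illustrative only): the first 50 SNPs stand in for a
# trait-related GO or COEX group and carry larger effects than the rest.
rng = np.random.default_rng(0)
n, p = 120, 500
X = rng.integers(0, 3, size=(n, p)).astype(float)
in_group = np.zeros(p, dtype=bool)
in_group[:50] = True
beta = np.where(in_group, rng.normal(0, 0.3, p), rng.normal(0, 0.02, p))
y = X @ beta + rng.normal(0, 1.0, n)

K_group, K_rest = grm(X[:, in_group]), grm(X[:, ~in_group])
train, test = np.arange(90), np.arange(90, 120)
pred = predict_two_kernel(K_group, K_rest, y, train, test,
                          var_group=2.0, var_rest=0.2, var_e=1.0)
```

Setting `var_group` equal to zero reduces the sketch to a single-kernel GBLUP on the background SNPs, which makes the contribution of the group easy to compare.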
Next, the differential prioritisation approach used in LMMs was extended into a novel deep learning framework, called PRIORNET, based on the Multilayer Perceptron (MLP), a fully connected feed-forward artificial neural network architecture (Chapter 4). The SNPs were grouped into knowledge and background sets using trait-related gene lists, GO terms, pathways or similar information known a priori. Additionally, knowledge of protein-protein interactions was leveraged to make PRIORNET sparser than an MLP. PRIORNET can increase the prediction accuracy up to the theoretical maximum of a GP model when highly specific knowledge is provided, and it accommodates more practical situations in which the knowledge is partial or noisy. Tested on both simulated phenotypes and experimental traits, it outperforms its benchmark MLP.
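The general idea of a knowledge-guided, sparser-than-MLP network can be sketched with masked weight matrices. The forward pass below is only loosely inspired by the description above and should not be read as the PRIORNET architecture, whose details are not given in this summary. The two-branch layout, the SNP-to-gene membership, the interaction mask and all layer sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(a):
    return np.maximum(a, 0.0)

# Toy dimensions (assumed): 8 plants, 12 knowledge SNPs in 4 genes,
# 40 background SNPs, 5 hidden units in the background branch.
n, n_know, n_back, n_genes, n_hidden = 8, 12, 40, 4, 5
X_know = rng.integers(0, 3, size=(n, n_know)).astype(float)
X_back = rng.integers(0, 3, size=(n, n_back)).astype(float)

# Hypothetical SNP-to-gene membership: 3 consecutive knowledge SNPs per gene.
snp_gene_mask = np.zeros((n_know, n_genes))
for g in range(n_genes):
    snp_gene_mask[3 * g:3 * (g + 1), g] = 1.0

# Hypothetical protein-protein interaction mask among the 4 genes
# (self-connections plus two interacting pairs).
ppi_mask = np.eye(n_genes)
ppi_mask[0, 1] = ppi_mask[1, 0] = 1.0
ppi_mask[2, 3] = ppi_mask[3, 2] = 1.0

# Masked weights: connections without prior support are fixed at zero.
# During training the masks would also have to be applied to the updates.
W1 = rng.normal(0, 0.1, size=(n_know, n_genes)) * snp_gene_mask
W2 = rng.normal(0, 0.1, size=(n_genes, n_genes)) * ppi_mask
Wb = rng.normal(0, 0.1, size=(n_back, n_hidden))      # dense background branch
w_out = rng.normal(0, 0.1, size=(n_genes + n_hidden,))

def forward(X_know, X_back):
    gene_units = relu(X_know @ W1)        # SNPs feed only their own gene
    ppi_units = relu(gene_units @ W2)     # genes mix only along interactions
    back_units = relu(X_back @ Wb)        # background SNPs, fully connected
    return np.concatenate([ppi_units, back_units], axis=1) @ w_out

y_hat = forward(X_know, X_back)

# Free weights in the knowledge branch, masked versus fully connected.
n_sparse = int(snp_gene_mask.sum() + ppi_mask.sum())
n_dense = n_know * n_genes + n_genes * n_genes
```

Comparing `n_sparse` with `n_dense` shows how much the masks shrink the knowledge branch relative to a fully connected layer of the same shape.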
Another strategy to improve GP for complex traits is to augment the model with supplemental knowledge. I used genetically correlated component traits, referred to as secondary traits, measured alongside the target trait in multi-trait GP (MT-GP) modelling (Chapter 5). To demonstrate the efficacy of MT-GP, different photosynthesis parameters were used as secondary traits to predict biomass, measured as projected leaf area (PLA), as the target trait in Arabidopsis thaliana. Moreover, I showed that photosynthesis-related traits can improve PLA predictions up to approximately threefold compared with predicting PLA alone, provided that the incident light is also dynamic.
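The way a secondary trait can inform predictions of a target trait may be illustrated with a two-trait GBLUP in which the secondary trait is observed on all plants and the target trait only on the training plants. This is a sketch under assumed conditions rather than the model of Chapter 5: the genetic and residual covariance matrices `G0` and `R0` are taken as known (they would normally be estimated), and the data are simulated.

```python
import numpy as np

def mt_gblup_predict(K, y_sec, y_tgt, train, test, G0, R0):
    """Predict the target trait for test plants from a two-trait model.

    Trait order is (secondary, target); y_tgt is used only at train indices.
    """
    n = K.shape[0]
    G = np.kron(G0, K)                      # genetic covariance, trait-major
    V = G + np.kron(R0, np.eye(n))          # phenotypic covariance
    obs = np.concatenate([np.arange(n), n + train])   # observed records
    tgt_test = n + test                               # records to predict
    mu_tgt = y_tgt[train].mean()
    y_obs = np.concatenate([y_sec - y_sec.mean(), y_tgt[train] - mu_tgt])
    coef = np.linalg.solve(V[np.ix_(obs, obs)], y_obs)
    return mu_tgt + G[np.ix_(tgt_test, obs)] @ coef

# Simulated data (illustrative only): two traits with genetic correlation 0.8.
rng = np.random.default_rng(2)
n, p = 100, 400
X = rng.integers(0, 3, size=(n, p)).astype(float)
Z = X - X.mean(axis=0)
K = Z @ Z.T / p
G0 = np.array([[1.0, 0.8], [0.8, 1.0]])     # assumed genetic (co)variances
R0 = np.array([[0.5, 0.0], [0.0, 1.0]])     # assumed residual (co)variances

B = rng.normal(size=(p, 2)) @ np.linalg.cholesky(G0).T   # correlated SNP effects
g = Z @ B / np.sqrt(p)
e = rng.normal(size=(n, 2)) @ np.linalg.cholesky(R0).T
y_sec, y_tgt = (g + e).T

train, test = np.arange(80), np.arange(80, 100)
pred_mt = mt_gblup_predict(K, y_sec, y_tgt, train, test, G0, R0)
```

When the off-diagonal element of `G0` is set to zero, the secondary trait carries no information about the target and the prediction reduces to single-trait GBLUP.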
In conclusion, this thesis contributes to improving the genomic prediction of complex plant traits using knowledge-driven models. While significant improvement was observed in both statistical and machine learning based modelling frameworks, there is considerable room for further development, in particular using benchmarked prior knowledge to achieve general-purpose applicability.