Student information

MSc thesis topic: Interpreting training data weights in random forest predictions

Machine learning is now widely used to solve regression and classification problems. This study focuses on one machine learning regression method: random forest. A random forest prediction is the average of the predictions of many decision trees, and each tree's prediction is the average of the dependent variable values within a subset of the training data, namely the training observations that fall in the same terminal node as the prediction point. Consequently, the random forest prediction is a weighted average of the training data, with weights that depend on the covariates. These weights are typically not computed in practice, even though they contain valuable information about how much each training data point contributes to the prediction.
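The weighted-average view described above can be made concrete with a small sketch. It is not the method of any particular R package: the function name forest_weights and the leaf-id inputs are hypothetical, the numbers are made up for illustration, and the sketch assumes that every tree was fitted on the full training set (no bootstrap sampling). Under those assumptions, a training point's weight is the average over trees of 1 / (size of the leaf reached by the prediction point) if the training point sits in that leaf, and 0 otherwise.

```python
from collections import Counter


def forest_weights(train_leaves, query_leaves):
    """Return one weight per training point for a single prediction point.

    train_leaves[t][i] is the leaf id of training point i in tree t
    (hypothetical input; assumes each tree saw all training points).
    query_leaves[t] is the leaf id the prediction point reaches in tree t.
    """
    n_trees = len(train_leaves)
    n_train = len(train_leaves[0])
    weights = [0.0] * n_train
    for leaves, query_leaf in zip(train_leaves, query_leaves):
        leaf_sizes = Counter(leaves)  # number of training points per leaf
        for i, leaf in enumerate(leaves):
            if leaf == query_leaf:
                # Equal share within the leaf, averaged over all trees.
                weights[i] += 1.0 / (leaf_sizes[leaf] * n_trees)
    return weights


# Toy example: 2 trees, 4 training points (illustrative values only).
train_leaves = [[0, 0, 1, 1],
                [0, 1, 1, 1]]
query_leaves = [0, 1]
y_train = [1.0, 2.0, 3.0, 4.0]

weights = forest_weights(train_leaves, query_leaves)
prediction = sum(w * y for w, y in zip(weights, y_train))

# The same prediction obtained the usual way: average the per-tree leaf means.
tree_means = [(1.0 + 2.0) / 2, (2.0 + 3.0 + 4.0) / 3]
forest_mean = sum(tree_means) / len(tree_means)
```

In this toy case the weights sum to one, and the weighted average of the training values equals the average of the per-tree leaf means, which is the equivalence the paragraph describes.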

R packages are available for computing these weights. In this study, you will use them to compute and interpret the weights in a digital soil mapping case study. Note that the weights of the training data points differ from one prediction location to another. We expect training data points with covariate values similar to those at the prediction location to receive greater weight, but the strength of this effect remains uncertain. There is also a geographic dimension: we expect nearby training data points to carry the highest weights, but again the magnitude of this effect is unknown. These questions are of great interest to digital soil mappers, who want to understand how the prediction accuracy of a random forest model improves when a few local data points are added to an already extensive training set. For instance, SoilGrids uses hundreds of thousands of training observations, which raises the question: does including 10 to 30 local observations improve its predictions locally?

Objectives and research questions

The overall objective is to develop a tool that can efficiently calculate the weights of a random forest model and analyse which factors determine the distribution of the weights in a real-world digital soil mapping case study.

Possible research questions are:

How can the training data weights that are hidden in a random forest regression algorithm be calculated and stored efficiently?
What factors determine the weight values in a concrete digital soil mapping model and how does this depend on the size of the training data set and its spatial configuration?
Are weight values determined more by distance in feature space than by distance in geographic space?
Does it pay off to collect a relatively small local dataset to improve the local prediction performance of a random forest model that already has a very large training data set?
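
As one possible starting point for the questions on geographic distance and local data, the weights for a prediction location could be summarised by the share of total weight carried by training points within a given distance. The sketch below is only an illustration: the function name local_weight_share, the planar coordinates, the radius and all numbers are assumptions made up for this example, not results or a prescribed method.

```python
import math


def local_weight_share(weights, coords, query_xy, radius):
    """Sum the weights of training points within `radius` of the prediction location.

    weights: one weight per training point (assumed to sum to 1).
    coords: (x, y) per training point, in planar units (assumption).
    """
    share = 0.0
    for w, (x, y) in zip(weights, coords):
        if math.hypot(x - query_xy[0], y - query_xy[1]) <= radius:
            share += w
    return share


# Illustrative values only: four training points and their weights.
weights = [0.5, 0.3, 0.15, 0.05]
coords = [(0.0, 1.0), (3.0, 4.0), (10.0, 0.0), (0.5, 0.5)]

share = local_weight_share(weights, coords, query_xy=(0.0, 0.0), radius=2.0)
```

Here two of the four points lie within the radius, so the local share is the sum of their weights. Replacing the geographic distance with a distance computed on covariate values would give the corresponding summary in feature space.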

Requirements

Background in machine learning, scripting in R and digital soil mapping is advantageous but not required.

Literature and information

Theme(s): Modelling & visualisation