Publications

How AI can provide an overview of protein quality from literature

Vlek, R.J.; Heuer, H.E.J.M.; van Rooijen, L.A.; van der Sluis, A.A.; de Jong, G.A.H.; Mes, J.J.

Summary

A transition to diets with alternative, primarily plant-based, protein sources can benefit both climate and public health, but requires better insight into their protein quality. Traditional literature studies are labour-intensive and often incomplete, which is why this project explored the potential of artificial intelligence (AI) and natural language processing (NLP) to perform large-scale, automated extraction of information on protein digestibility and protein quality from scientific literature.

Information on protein and amino acid (AA) digestibility was extracted, along with contextual details such as the protein source, the study type (in vivo animal or human trial, or in vitro work), and any processing methods that could affect the observed AA digestibility. The extracted information was semantically standardised, normalised, and validated to the best of our ability. It was enriched with uniform recalculations of the amino acid score (AAS), protein digestibility-corrected amino acid score (PDCAAS) and digestible indispensable amino acid score (DIAAS) against the most recent FAO (2013) reference values, as well as with information from established knowledge graphs (e.g. NCBI Taxonomy, FoodOn). Provenance was stored alongside the extracted data, so that each value can be traced back to its origin. Since the extraction process is automated and takes only a few minutes per document, it can easily be repeated to include information from newly published work, with one update already planned. The AI extraction process was validated against the output of a manually conducted systematic literature review. Using the AI process, a preliminary dataset was generated with information from 463 scientific publications, containing 77 different protein sources and 261 lines of data on their protein quality and/or AA digestibility.
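As an illustration of what a uniform recalculation of these scores involves, the sketch below applies the standard definitions: AAS is the lowest ratio of an indispensable amino acid in the test protein to the same amino acid in the reference pattern, PDCAAS multiplies that score by a single protein digestibility value, and DIAAS applies a separate digestibility to each amino acid before taking the lowest ratio. This is not the project's own code. The function names, the three-amino-acid reference pattern and the example values are hypothetical placeholders chosen for readability; they are not the FAO (2013) reference values, which would need to be substituted in practice.

```python
from typing import Dict

# Hypothetical placeholder reference pattern (mg amino acid per g protein).
# These are NOT the FAO (2013) values; replace them with the official pattern.
REFERENCE: Dict[str, float] = {"lysine": 50.0, "threonine": 25.0, "leucine": 60.0}

# Hypothetical amino acid content of a test protein (mg per g protein).
EXAMPLE_CONTENT: Dict[str, float] = {"lysine": 40.0, "threonine": 30.0, "leucine": 66.0}

# Hypothetical digestibility coefficient per amino acid (fraction, 0 to 1).
EXAMPLE_AA_DIGESTIBILITY: Dict[str, float] = {
    "lysine": 0.85,
    "threonine": 0.90,
    "leucine": 0.90,
}


def amino_acid_score(content: Dict[str, float], reference: Dict[str, float]) -> float:
    """AAS: lowest content-to-reference ratio over the reference amino acids."""
    return min(content[aa] / reference[aa] for aa in reference)


def pdcaas(
    content: Dict[str, float],
    reference: Dict[str, float],
    protein_digestibility: float,
    truncate: bool = True,
) -> float:
    """PDCAAS: AAS multiplied by one overall protein digestibility value.

    By convention the result is often truncated at 1.0; set truncate=False
    to keep the untruncated value.
    """
    score = amino_acid_score(content, reference) * protein_digestibility
    return min(score, 1.0) if truncate else score


def diaas(
    content: Dict[str, float],
    aa_digestibility: Dict[str, float],
    reference: Dict[str, float],
) -> float:
    """DIAAS (%): lowest ratio of digestible amino acid content to reference."""
    return 100.0 * min(
        content[aa] * aa_digestibility[aa] / reference[aa] for aa in reference
    )


if __name__ == "__main__":
    print("AAS:", amino_acid_score(EXAMPLE_CONTENT, REFERENCE))
    print("PDCAAS:", pdcaas(EXAMPLE_CONTENT, REFERENCE, protein_digestibility=0.9))
    print("DIAAS:", diaas(EXAMPLE_CONTENT, EXAMPLE_AA_DIGESTIBILITY, REFERENCE))
```

In this made-up example lysine is the limiting amino acid for all three scores. Recomputing every score from the extracted amino acid contents and digestibilities with one reference pattern, rather than copying the scores reported in each paper, is what makes values from different publications comparable.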

This AI-supported dataset provides a single point of access to information on protein quality that was previously scattered, yielding insights into variation within and between protein sources, and supporting the identification of knowledge gaps for alternative protein sources.