Publications

Exploring RNA-seq data normalization methods using principal component analysis and KEGG pathway enrichment

van Lingen, H.J.; Suarez Diez, M.; Saccenti, E.

Summary

RNA-seq data is usually normalized before downstream data analysis. Normalization corrects the RNA-seq data for uncontrollable experimental conditions and thereby makes comparisons between samples fairer, so that they better approach biological truth. However, normalization methods depend on assumptions that are hard to verify, because RNA-seq reference data does not exist. Nonetheless, the implications of the various normalizations can still be explored. The aim was to investigate the implications of various normalization methods for RNA-seq data using principal component analysis (PCA).

RNA-seq data were used for 10144 genes from 6 patients, of whom 3 were diagnosed with a tumor and 3 were not (Tuch et al., 2010), and for 9854 genes from 4 HCT116 human colon cancer cell lines, of which 2 were controls and 2 were knockdowns (Park and Seo, 2022). Treatments were arranged in columns and genes in rows. The RNA-seq data normalization methods were total-count, upper-quartile, trimmed mean of M-values (TMM; M-values are log fold changes), relative log expression, centred log ratio, full-quantile, probabilistic quotient, within-lane GC-content based, within-and-between-lane GC-content based, TPM, RPKM and conditional quantile.
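As an illustration only, three of the simpler methods in this list could be sketched as below for a count matrix with genes in rows and treatments in columns. This is a minimal sketch under stated assumptions, not the implementation used in the study: the function names, the pseudocount, the rescaling to the mean scaling factor, and the exclusion of all-zero genes before taking the upper quartile are choices made here for illustration.

```python
import numpy as np

def total_count(counts):
    """Divide each column by its library size, then rescale to the mean library size."""
    lib_size = counts.sum(axis=0)
    return counts / lib_size * lib_size.mean()

def upper_quartile(counts):
    """Divide each column by the 75th percentile of its counts.

    Assumption: genes with zero counts in every column are ignored
    when the percentile is computed.
    """
    expressed = counts[counts.sum(axis=1) > 0]
    uq = np.percentile(expressed, 75, axis=0)
    return counts / uq * uq.mean()

def centred_log_ratio(counts, pseudocount=1.0):
    """Log-transform and subtract the per-column mean of the logs.

    Assumption: a pseudocount is added to avoid log(0).
    """
    log_counts = np.log(counts + pseudocount)
    return log_counts - log_counts.mean(axis=0)

# Toy example: the second column is the first one sequenced twice as deep.
toy = np.array([[10.0, 20.0],
                [30.0, 60.0],
                [60.0, 120.0]])
tc = total_count(toy)
uq = upper_quartile(toy)
clr = centred_log_ratio(toy)
```

In the toy matrix the two columns differ only in sequencing depth, so the total-count and upper-quartile versions should make them identical, while the centred log ratio leaves every column with a mean of zero.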

After each normalization, a PCA was performed on the normalized data, and the 1000 most important genes were selected by summing, over the first 3 principal components, the loadings multiplied by the variance explained. For the 1000 most important genes per normalization method, a KEGG pathway enrichment was performed. For every KEGG unit obtained across all normalization methods together, a matrix of zeros and ones indicated whether the unit was obtained for a given method. Another PCA was performed on this matrix, with the joint KEGG units as rows and the normalization methods as columns.
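The gene selection and the zero-one matrix described here could be sketched as follows. This is a hedged sketch, not the authors' code: the names `top_genes` and `presence_matrix` are hypothetical, taking absolute loadings and using the proportion of variance explained as weights are assumptions about the scoring, and the KEGG enrichment itself is not implemented (the enriched units per method are supplied as plain sets).

```python
import numpy as np

def top_genes(normalized, n_top=1000, n_pcs=3):
    """Rank genes by the sum over the first PCs of |loading| * variance explained.

    `normalized` has genes in rows and treatments in columns, so it is
    transposed to make genes the variables of the PCA.
    """
    X = normalized.T
    X = X - X.mean(axis=0)                       # centre each gene
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    explained = s**2 / np.sum(s**2)              # proportion of variance per PC
    scores = np.abs(vt[:n_pcs].T) @ explained[:n_pcs]
    return np.argsort(scores)[::-1][:n_top]      # gene indices, best first

def presence_matrix(enriched):
    """Build a zero-one matrix: KEGG units in rows, normalization methods in columns."""
    methods = list(enriched)
    units = sorted(set().union(*enriched.values()))
    matrix = np.array([[int(u in enriched[m]) for m in methods] for u in units])
    return units, methods, matrix

# Toy data: gene 7 varies far more across treatments than any other gene.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))
data[7] = [-100.0, 100.0, -100.0, 100.0, -100.0, 100.0]
best = top_genes(data, n_top=5)

# Hypothetical enrichment results for two methods (placeholder unit identifiers).
units, methods, matrix = presence_matrix({
    "TPM": {"unit_A", "unit_B"},
    "RPKM": {"unit_B"},
})
```

The second PCA could reuse the same centring and SVD steps on the zero-one matrix; which orientation is used to place the normalization methods in the resulting plot is not specified in the summary and is left open here.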

For the tumor data, a PCA was performed on the matrix that indicated whether the most influential genes per normalization method resulted in the enrichment of a given KEGG unit. In this PCA, TPM and RPKM normalization clustered together, probabilistic quotient and conditional quantile normalization appeared rather close to each other, and all other normalization methods appeared relatively close to each other as well. For the colon cancer cell lines, TPM and RPKM normalization, as well as probabilistic quotient and conditional quantile normalization, also appeared close to each other, whereas all other normalization methods separated slightly more than for the tumor data.

A specific feature of the present analysis is the use of PCA for evaluating normalization methods for RNA-seq data, followed by KEGG pathway enrichment and another PCA. Selecting the top 1000 most influential genes is relatively arbitrary, but it still revealed differences between the various normalization methods. In conclusion, the PCA investigation indicates that selecting a normalization method is not trivial, as these methods have implications for the biology inferred from the KEGG pathways that were enriched.