The Microbial Load Predictor (MLP) is a tool validated and published by Nishijima et al. in Cell, 2024.

It enables the prediction of bacterial cell counts in a metagenomic sample from the relative abundance values of individual taxa. Impressively, it also works for 16S data.

Below, I set out to validate the analysis underlying the MLP.

Validating the analysis

Load training data

First, we need to download the training data from Nishijima’s GitHub repository.

The training data were originally obtained from Vandeputte 2021.

We will extract the training data from Nishijima’s machine learning model.
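If the saved object is a caret train model, the training data can be pulled out of it directly (the file name below is a stand-in for whatever the repository calls it):

library(caret)

# Stand-in file name for Nishijima's saved caret model object
nishijima.model = readRDS("nishijima_mlp_xgb_model.rds")

# caret stores the training features plus the outcome in a column named ".outcome"
training = nishijima.model$trainingData
colnames(training)[colnames(training) == ".outcome"] = "outcome"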

Here are the first 10 rows of the training data, the first 4 features, and the measured microbial load (“outcome”):

##    g_Faecalibacterium g_Bacteroides g_Roseburia   g_Blautia  outcome
## 1          0.44300184   -0.14320039   1.1665424 -1.02560377 11.40312
## 2          0.56620951    1.19313402   1.1878420 -0.42095071 11.29226
## 3          0.03222058    0.15113804   1.0946180  0.54632125 11.33846
## 4          0.26088258   -0.16088539   0.5155671  0.52050033 11.39094
## 5          0.01802752    0.01530088   0.5729155  0.26640734 11.46240
## 6          0.21728774   -0.25554118   1.3867448 -0.09262196 11.44091
## 7          0.75881013    0.49137268   1.0571067 -0.44479970 11.41830
## 8          0.64452361    0.57314304   1.2230498 -0.38510833 11.42975
## 9          0.63061554    0.12854935   1.1881320 -1.11317291 11.03743
## 10         0.82820946    1.10830044   1.1726207  0.15101203 11.24797

Training data comments

If we glance at the training data, we see they log-transformed the data, which is an appropriate processing step for many machine learning models. Interestingly, it is not necessary for tree-based approaches, like XGBoost, which they used.

Here is the distribution of Faecalibacterium prausnitzii, a highly abundant and prevalent gut microbe:
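Something like the following reproduces this kind of plot from the training data extracted above:

library(ggplot2)

ggplot(training, aes(x = g_Faecalibacterium)) +
  geom_histogram(bins = 30) +
  labs(x = "Log-transformed relative abundance (g_Faecalibacterium)",
       y = "Number of samples")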

In order to log-transform, 0 values had to be replaced with a small pseudocount, which appears to be different for each taxon.

Here is a distribution of the minimum values of all taxa:
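Roughly, this means taking the per-taxon minimum of the log-transformed abundances (excluding the outcome column) and plotting it; with a single shared pseudocount the minima would all be identical, so the spread itself is informative:

taxa.mins = apply(training[, setdiff(colnames(training), "outcome")], 2, min)

hist(taxa.mins, breaks = 50,
     main = "Per-taxon minimum of the log-transformed abundance",
     xlab = "Minimum value")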

This will be important to note later, when we discuss how data pre-processing impacts predictions.

Load testing data

The testing data were obtained from Vandeputte 2017.

We could also download these data from Nishijima’s GitHub; however, they do not have the outcome data appended, so let’s pull from the original source instead.
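Something along these lines, with stand-in file names for the genus-level relative abundances and the measured cell counts from the 2017 study:

vandeputte.testing = read.csv("vandeputte_2017_genus_relabund.csv")
vandeputte.testing.load = read.csv("vandeputte_2017_cell_counts.csv")

head(vandeputte.testing[, 1:4], 10)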

##    g_Acetanaerobacterium g_Acetitomaculum g_Acidaminococcus g_Actinomyces
## 1                      0                0        0.04386515             0
## 2                      0                0        0.01120933             0
## 3                      0                0        0.19432838             0
## 4                      0                0        0.00000000             0
## 5                      0                0        0.10026993             0
## 6                      0                0        0.00000000             0
## 7                      0                0        0.00000000             0
## 8                      0                0        0.00000000             0
## 9                      0                0        0.00000000             0
## 10                     0                0        0.00000000             0

Test Nishijima’s model

First, let’s see how Nishijima’s model performs “out-of-the-box”.

We will compare the performance using Internal Validation (performed during model training and cross-validation) and, later, External Validation (performed on the testing data set).

Internal Validation

Here, we will extract the best model from Nishijima’s saved model object:
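With a caret train object, this amounts to grabbing the selected hyperparameters and the final fitted booster:

nishijima.model$bestTune                 # hyperparameter combination chosen by cross-validation
best.xgb = nishijima.model$finalModel    # the xgb.Booster refit on the full training data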

Nishijima uses Pearson correlation coefficients to evaluate the predictive performance of their model. This should be appropriate because we’re working on a regression problem to predict a continuous variable. The outcome is log-transformed microbial load, which is approximately normally distributed.

Let’s see what the Pearson correlation coefficient is between the true (measured) values and the predicted (model output) values:
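Assuming the model was trained with savePredictions enabled, the held-out cross-validation predictions are stored on the object and can be correlated directly:

cor(nishijima.model$pred$obs, nishijima.model$pred$pred, method = "pearson")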

Good, this appears to recapitulate Nishijima’s work.

External Validation

Now, let’s check the model’s performance on the testing data.
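A sketch of that step, assuming the test table’s genus columns line up with the training features; the pseudocount here is purely illustrative, which is exactly the issue discussed below:

test.features = log10(vandeputte.testing + 1e-4)
test.pred = predict(nishijima.model, newdata = test.features)

cor(log10(vandeputte.testing.load$Cell_count_per_gram), test.pred)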

In Nishijima’s publication, the external validation correlation was 0.60; the out-of-the-box performance here falls short of that.

I suspect this is because of small nuances in the training data, particularly the pseudocount.

Let’s rebuild their training data from the original source and process it in a way that is more amenable to external validation.

Rebuild training data

library(magrittr)  # provides the %>% pipe used below

vandeputte.training.rebuilt = read.csv("2025_03_05_vandeputte_otu_unrarefied.csv")
vandeputte.load.rebuilt = read.csv("2025_03_05_vandeputte_load.csv")[,c("X", "Cell_count_per_gram", "ID_Number", "Day_Number")]

# Note how many unique subjects there are:
unique(vandeputte.load.rebuilt$ID_Number) %>% length()
# n = 20 

# remove NA values, which can't be used for prediction
vandeputte.training.rebuilt = vandeputte.training.rebuilt[!is.na(vandeputte.load.rebuilt$Cell_count_per_gram),]

vandeputte.load.rebuilt = vandeputte.load.rebuilt[!is.na(vandeputte.load.rebuilt$Cell_count_per_gram),]

# Do feature values sum to 1? (exclude the sample name column "X")
rowSums(vandeputte.training.rebuilt[, -1])
# No, therefore we need to rarefy

# First, delete the sample name column
vandeputte.training.rebuilt$X = NULL

# Second, rarefy
set.seed(25)
vandeputte.training.rebuilt = phyloseq::rarefy_even_depth(phyloseq::otu_table(vandeputte.training.rebuilt, taxa_are_rows=F), 
                                     sample.size=10000,
                                     replace=F) %>% data.frame()
# Third, convert to %
vandeputte.training.rebuilt <- sweep(vandeputte.training.rebuilt, 1, rowSums(vandeputte.training.rebuilt), FUN = "/")

# Fourth, log-transform and save pseudocount
van.pseudo.keep = min(vandeputte.training.rebuilt[vandeputte.training.rebuilt!=0])/2
vandeputte.training.rebuilt = log(vandeputte.training.rebuilt+van.pseudo.keep, base=10)

# Last, append outcome, matching by rownames
vandeputte.training.rebuilt$outcome = log(vandeputte.load.rebuilt[rownames(vandeputte.training.rebuilt),]$Cell_count_per_gram, base=10)
##    g_Faecalibacterium g_Bacteroides g_Roseburia g_Blautia  outcome
## 1          -0.6288397    -0.7852886  -0.5599572 -1.660549 11.40384
## 3          -0.8243433    -0.5002442  -0.7960158 -1.770830 11.29222
## 5          -1.0252581    -0.8799199  -0.8034093 -1.343423 11.33932
## 7          -0.9220873    -1.0215912  -1.1229167 -1.337714 11.39156
## 8          -0.9364790    -0.8756588  -1.0023952 -1.356054 11.46240
## 9          -0.9819239    -1.1098585  -0.7168119 -1.629857 11.44048
## 10         -0.6954018    -0.7190803  -0.8039621 -1.672641 11.41769
## 11         -0.7787162    -0.7025678  -0.7453311 -1.658565 11.42962
## 13         -0.5760173    -0.7047629  -0.5760173 -1.738737 11.03878
## 15         -0.7774136    -0.5876235  -0.8675802 -1.571056 11.24818

We’ve rebuilt the training data; now we can re-train the predictor.

Re-train model
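A sketch of the re-training step, using caret’s xgbTree method on the rebuilt data; the CV settings here are illustrative rather than Nishijima’s exact configuration:

library(caret)

set.seed(25)
retrained.model = train(outcome ~ .,
                        data = vandeputte.training.rebuilt,
                        method = "xgbTree",
                        metric = "RMSE",
                        trControl = trainControl(method = "cv", number = 10,
                                                 savePredictions = "final"))

# Internal (cross-validation) Pearson correlation at the selected hyperparameters
cor(retrained.model$pred$obs, retrained.model$pred$pred)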

Our rebuilt model is comparable to the original model in terms of internal validation performance.

How does it compare on the external validation set?
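The same check as before, this time transforming the test abundances with the pseudocount saved from the rebuilt training data:

test.features.rebuilt = log10(vandeputte.testing + van.pseudo.keep)
test.pred.rebuilt = predict(retrained.model, newdata = test.features.rebuilt)

cor(log10(vandeputte.testing.load$Cell_count_per_gram), test.pred.rebuilt)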

This correlation is now comparable to the original correlation (R = 0.60).

Therefore, the non-standard pre-processing (in particular, the per-taxon pseudocount) had a substantial impact on external validation performance.

Extending the analysis

Now that we’ve validated the original results, we can ask additional questions.

With machine learning, a fair question is: would other algorithms work better?

Other algorithms

Let’s re-run this pipeline using different machine learning models.

We’ll use common models included in caret: elastic net (a penalized linear model), support vector machines, and tree-based models such as random forest and gradient boosting machine, alongside the original choice, XGBoost.

We’re also going to reduce the number of cross-validation folds from 10 to 5 to save time, but repeat each model evaluation 15 times to see how variable the CV process is.
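One way to set this up (interpreting “repeat each model evaluation 15 times” as 15 independent train() runs per method, each with 5-fold CV; the caret method codes below stand in for the model families listed above):

library(caret)

methods = c("glmnet", "svmRadial", "ranger", "gbm", "xgbTree")
ctrl = trainControl(method = "cv", number = 5)

fits = lapply(methods, function(m) {
  lapply(1:15, function(i) {
    set.seed(i)
    train(outcome ~ ., data = vandeputte.training.rebuilt,
          method = m, trControl = ctrl, metric = "RMSE")
  })
})
names(fits) = methods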

Let’s see how long these models took to build:
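caret records the fitting time on each train object, so a rough total per method is:

fit.minutes = sapply(fits, function(fit.list)
  sum(sapply(fit.list, function(f) f$times$everything["elapsed"])) / 60)

round(fit.minutes, 1)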

This took about 45 minutes to run.

Next, let’s check the internal validation RMSE, the loss function caret uses to select the optimal hyperparameters.
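Pulling the lowest cross-validation RMSE from each of the 15 fits per method:

cv.rmse = sapply(fits, function(fit.list)
  sapply(fit.list, function(f) min(f$results$RMSE, na.rm = TRUE)))

boxplot(cv.rmse, ylab = "Cross-validation RMSE", las = 2)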

From the internal validation data (based on how caret selects optimal models), it appears that gradient boosting machine and random forest are slightly superior to XGBoost, as they have lower RMSE values.

Now we will evaluate the performance by calculating the Pearson correlation coefficients for each model and each iteration.
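Reusing the rebuilt-style test features from above, this gives one correlation per model and per iteration:

test.cor = sapply(fits, function(fit.list)
  sapply(fit.list, function(f)
    cor(log10(vandeputte.testing.load$Cell_count_per_gram),
        predict(f, newdata = test.features.rebuilt))))

boxplot(test.cor, ylab = "Pearson r (external validation)", las = 2)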

Conclusions

From the above, we can conclude that XGBoost (xgbTree) performs well, but variably, on the validation data set. Other models perform comparably, like elastic net (glmnet) and random forest (ranger).

Importantly, gradient boosting machine (gbm) shows slightly superior performance.