The curatedMetagenomicData (CMD) is a large collection of microbiome datasets, re-processed using a standard pipeline to minimize batch effects.

Ning et al. (2023) perform a meta-analysis of microbiome studies using the CMD to distinguish IBD cases from non-IBD controls.

The authors apply leave-one-cohort-out (LOCO) cross-validation to evaluate whether larger, multi-cohort datasets can be used to predict IBD in a validation cohort.

Below, we attempt to validate their model and evaluate additional models that may boost the performance.

Of note, only cohorts available in the CMD will be used.

Validating the analysis

Loading data

CMD includes 3 of the 6 cohorts used in the study.

These include IBD and non-IBD stool microbiome samples from the Human Microbiome Project 2 (“HMP”), LifelinesDeep (“LLD”), and Nielsen 2014 (“NIE”) cohorts.

Note: VilaAV_2018 was combined with LifelinesDeep in the publication.
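
As a rough illustration, these cohorts could be pulled from CMD along the following lines. This is a hedged sketch assuming the curatedMetagenomicData sampleMetadata / returnSamples() workflow and the study names shown below; the exact filtering used for this post may differ.

# load stool samples for the CMD studies covering the 3 cohorts above
library(curatedMetagenomicData)
library(dplyr)

studies <- c("HMP_2019_ibdmdb", "LifeLinesDeep_2016", "VilaAV_2018", "NielsenHB_2014")

meta <- sampleMetadata %>%
  filter(study_name %in% studies,
         body_site == "stool",
         study_condition %in% c("control", "IBD"))

# relative abundance profiles, returned as a (Tree)SummarizedExperiment
se <- returnSamples(meta, dataType = "relative_abundance")

# (the post then combines VilaAV_2018 with LifeLinesDeep_2016 as "LifeLD_Vila")
table(colData(se)$study_name, colData(se)$study_condition)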

##                  study_condition
## study_name        control  IBD
##   HMP_2019_ibdmdb      27  103
##   LifeLD_Vila        1135  355
##   NielsenHB_2014      236   82

We can see some fairly large imbalances in the data sets. The authors downsampled the larger class until the imbalance was closer to “1:2 to 1:3”. It appears that they performed this balancing once, and discarded the rest of the data from the analyses thereafter.

Instead of doing this (because we can’t fully replicate their results without knowing which samples they discarded), we will evaluate the impact of balancing using an algorithm called SCUT (SMOTE and Cluster-based Undersampling Technique).

Ning et al. used Random Forest, which we will also use to evaluate the LOCO AUCs.

Helper function

First, let’s build a function (sketched after this list) that does the following:

  • optionally re-balance the dataset using SCUT

  • train a model using 5x CV

  • use the model to predict the class of the held-out dataset

  • calculate AUCs and plot ROC curves
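
Here is a minimal sketch of such a function. It assumes caret for training, pROC for the ROC curve and AUC, and the scutr package’s SCUT() interface for re-balancing; the helper name evaluate_loco and the study_condition outcome column are my own placeholders, and the feature columns are assumed to be numeric abundances.

library(caret)
library(pROC)
library(scutr)

evaluate_loco <- function(train_df, test_df, method = "ranger", rebalance = FALSE) {
  # optionally re-balance with SCUT (SMOTE oversampling + cluster-based undersampling)
  if (rebalance) {
    train_df <- SCUT(train_df, "study_condition")
  }
  # train with 5-fold CV, optimizing the ROC metric
  ctrl <- trainControl(method = "cv", number = 5,
                       classProbs = TRUE, summaryFunction = twoClassSummary)
  fit <- train(study_condition ~ ., data = train_df,
               method = method, metric = "ROC", trControl = ctrl)
  # predict the held-out cohort, then compute and plot the ROC curve
  probs <- predict(fit, newdata = test_df, type = "prob")[, "IBD"]
  roc_obj <- roc(test_df$study_condition, probs, levels = c("control", "IBD"))
  plot(roc_obj, print.auc = TRUE)
  list(model = fit, roc = roc_obj, auc = auc(roc_obj))
}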

Should we rebalance?

Let’s compare the performance of an imbalanced data set with and without hybrid up + down sampling (using SCUT).

The combined LifelinesDeep + VichVila dataset contains 1135 controls and 355 IBD samples, a ratio of roughly 3:1. Let’s combine it with the HMP data to train a model and evaluate it on the Nielsen data set.
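
With the hypothetical helper above, that comparison might look something like this (ibd_df is an assumed combined feature table with study_name and study_condition columns):

# hold out Nielsen; train on LifelinesDeep/VichVila + HMP
train_df <- subset(ibd_df, study_name != "NielsenHB_2014", select = -study_name)
test_df  <- subset(ibd_df, study_name == "NielsenHB_2014", select = -study_name)

res_raw      <- evaluate_loco(train_df, test_df, rebalance = FALSE)  # unbalanced
res_balanced <- evaluate_loco(train_df, test_df, rebalance = TRUE)   # SCUT-balanced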

Here is the ROC when using unbalanced data:

Here is the ROC when using balanced data:

Evidently, re-balancing achieves a comparable AUC, and the specificities and sensitivities are unremarkable either way, so it may not be necessary for this dataset.

Now, let’s move on and evaluate the LOCO-CV AUCs for the other 2 datasets.

Here is the ROC curve when using the Lifelines + Nielsen data sets to build a model, and the HMP data set to evaluate:

Here is the ROC curve when using the Nielsen + HMP data sets to build a model, and the Lifelines data set to evaluate:

These AUCs are less impressive than those originally published (see Figure 2G).

Performances

  • Nielsen: originally 0.780, but 0.695 here

  • HMP: originally 0.713, but 0.692 here

  • Lifelines: originally 0.843, but 0.832 here

I suspect this is because each of our training data sets is missing the 3 additional cohorts! Without these data being freely available, we cannot fully replicate the authors’ results.

Nonetheless, we can extend this analysis to compare other machine learning models.

Extending the analysis

Let’s evaluate the LOCO performances using additional machine learning algorithms.

# select the models to compare (caret method names)
ml.models <- c("gbm", "glmnet", "ranger", "svmLinear", "svmRadial", "xgbTree")
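
One way to collect the LOCO AUCs for each of these models, re-using the hypothetical evaluate_loco() helper and ibd_df table sketched above, might be:

# hold out each cohort in turn and record the held-out AUC per model
cohorts <- c("HMP_2019_ibdmdb", "LifeLD_Vila", "NielsenHB_2014")

loco.aucs <- sapply(ml.models, function(m) {
  sapply(cohorts, function(held_out) {
    train_df <- subset(ibd_df, study_name != held_out, select = -study_name)
    test_df  <- subset(ibd_df, study_name == held_out, select = -study_name)
    as.numeric(evaluate_loco(train_df, test_df, method = m, rebalance = TRUE)$auc)
  })
})

loco.aucs  # cohorts x models matrix of AUCs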

Recall: HMP was originally 0.713. Lifelines was originally 0.843. Nielsen was originally 0.780.

From these results, it appears that Random Forest performs best in 2 of the 3 cohorts. Elastic net (glmnet) approaches the published AUC (0.780) for the Nielsen data set.

However, I suspect that the AUCs in the Nielsen (NIE) cohort are being influenced by class imbalance, such that sensitivities are high but specificities are low (or vice versa).

Let’s examine this suspicion by plotting the sensitivities and specificities:
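
For instance, sensitivity and specificity at the default 0.5 probability cutoff could be pulled from each ROC object with pROC’s coords() (again using the hypothetical objects from the sketches above):

# sensitivity/specificity at a 0.5 probability cutoff for one held-out cohort
coords(res_balanced$roc, x = 0.5, input = "threshold",
       ret = c("sensitivity", "specificity"))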

Here, we see that where models achieved decent AUCs, they did so by trading one metric for the other: higher specificity came with lower sensitivity, and vice versa.

This illustrates the limitations of using imbalanced datasets and how a single metric (e.g. AUC) can hide key performance attributes (e.g. sensitivity). The study’s authors did not evaluate these metrics, relying instead on the composite AUC score alone to draw conclusions.

Conclusions

To conclude, I could not validate the performance across all 6 cohorts included in the original study, and my re-analysis of the 3 cohorts available in CMD yielded lower performance than originally reported.

Importantly, AUCs in these data sets were sensitive to class imbalances. Despite using hybrid sampling, sensitivities were quite variable.