The Huttenhower Lab conducted one of the largest and deepest characterizations of the IBD microbiome, incorporating metagenomics, metatranscriptomics, and metabolomics. The study we revisit here, Franzosa et al., was published in Nature Microbiology in 2019.
Their metabolomics data set has been reprocessed and uploaded by the Borenstein lab, available here. These data were incorporated into a meta-analysis of gut microbiome-metabolome associations.
Toward the end of their study, Franzosa et al. used random forest to classify IBD and non-IBD microbiomes from metabolite profiles, achieving an AUC of 0.92 with 5-fold cross-validation. Here, we attempt to validate these results.
Let’s load the metabolite feature table and check the dimensions:
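A minimal sketch of the loading step, assuming the Borenstein lab export is a tab-delimited table with samples in rows and metabolite features in columns (the file name `mtb.tsv` is a placeholder):

```r
# Load the metabolite feature table; "mtb.tsv" is a placeholder for the
# Borenstein lab export (samples in rows, features in columns).
mtb <- read.delim("mtb.tsv", row.names = 1, check.names = FALSE)
sprintf("%d samples by %d features", nrow(mtb), ncol(mtb))
```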
## [1] "220 samples by 8848 features"
Since the metabolomics were untargeted, most of the features may not be annotated.
How many of the 8848 features were annotated?
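One way to count them, assuming a companion feature-metadata table whose annotation column is `NA` for unannotated features (the file and column names are placeholders):

```r
# Count features with a non-missing annotation; "mtb.map.tsv" and
# "Compound.Name" are placeholders for the feature metadata.
mtb_map <- read.delim("mtb.map.tsv", row.names = 1, check.names = FALSE)
n_annot <- sum(!is.na(mtb_map$Compound.Name))
sprintf("%d features are annotated (%.0f%%)", n_annot, 100 * n_annot / nrow(mtb_map))
```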
## [1] "466 features are annotated (5%)"
Let's keep just these 466 features: they form a reasonably sized feature space for machine learning, and they yield a more interpretable conclusion when we look for important features.
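Subsetting the feature table accordingly (again using the placeholder names above):

```r
# Keep only the annotated features.
annotated <- rownames(mtb_map)[!is.na(mtb_map$Compound.Name)]
mtb <- mtb[, annotated]
```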
Now let’s check the sample data.
First, how many samples are in the original data set, and how many in the external validation sets?
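Assuming a sample metadata table with `dataset` and `disease` columns (names chosen to match the printed output), a quick cross-tabulation answers this:

```r
library(dplyr)

# Tabulate samples per dataset/disease combination.
count(metadata, dataset, disease, name = "samples")
```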
##      dataset disease samples
## 1   Training      CD      68
## 2   Training Control      34
## 3   Training      UC      53
## 4 Validation      CD      20
## 5 Validation Control      22
## 6 Validation      UC      23
We'll be working with 34 control, 53 UC, and 68 CD samples to build the model, an imbalance worth correcting for, and one made worse by pooling UC and CD together (121 IBD vs. 34 control). The validation set was originally well balanced at n = 20-23 per group, but it too becomes imbalanced after pooling (43 IBD vs. 22 control).
Lastly, let’s check for pseudoreplicates:
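A quick check, assuming the metadata carries a subject identifier (the column name is a placeholder):

```r
# TRUE would indicate at least one subject was sampled more than once.
any(duplicated(metadata$subject))
```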
Great! There are no repeat samplings of any subject.
Now we can check the feature distributions.
Based on the x-axis scale, I suspect these data are profoundly right-skewed and are asking to be log-transformed. Let's see what happens. We'll also need to add a pseudocount because of the zeros.
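A sketch of the transformation; the pseudocount of 1 is an assumption (half the smallest non-zero value is another common choice):

```r
# Log10-transform with a pseudocount to handle zeros, then re-inspect
# the distribution of feature intensities.
mtb_log <- log10(mtb + 1)
hist(as.matrix(mtb_log), breaks = 100,
     xlab = "log10(intensity + 1)", main = "Feature distributions")
```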
These look way better. Let’s move on to building our models and validating them using internal CV.
Franzosa et al. employed 5-fold cross-validated random forest to predict IBD, achieving an AUC of 0.92. The authors specify the following:
> labels were randomly balanced before training and 100 trees were considered
So, we’ll employ downsampling and specify 100 trees per random forest.
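A sketch of this setup using caret, with down-sampling applied inside each CV fold and 100 trees; it assumes the metadata rows align with the feature table, and that UC and CD are pooled into a single IBD label as discussed above:

```r
library(caret)
library(pROC)

# Pool UC and CD into a single IBD class.
y <- factor(ifelse(metadata$disease == "Control", "Control", "IBD"))
train_idx <- metadata$dataset == "Training"

# 5-fold CV with per-fold down-sampling, mirroring the original study's
# "randomly balanced" labels; ntree = 100 as specified.
ctrl <- trainControl(method = "cv", number = 5, sampling = "down",
                     classProbs = TRUE, summaryFunction = twoClassSummary)
fit <- train(x = mtb_log[train_idx, ], y = y[train_idx],
             method = "rf", ntree = 100, metric = "ROC", trControl = ctrl)

# Internal CV AUC...
max(fit$results$ROC)

# ...and external AUC on the held-out validation cohort.
probs <- predict(fit, newdata = mtb_log[!train_idx, ], type = "prob")
auc(roc(y[!train_idx], probs$IBD))
```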
These values (0.903 and 0.910) are close to those reported in the original study (0.92 and 0.89, respectively).
Now we can extend the analysis and apply additional models, under the hypothesis that random forest was suboptimal for this data set.
We'll conduct 15 iterations of 5-fold CV and report AUCs for both the internal and external data sets. We'll continue to apply downsampling.
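A sketch using caret's repeated CV; the particular alternative models here (regularized logistic regression, RBF SVM, gradient boosting) are assumptions for illustration, not the exact set tested:

```r
# 15 repeats of 5-fold CV, still down-sampling within each fold.
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 15,
                     sampling = "down", classProbs = TRUE,
                     summaryFunction = twoClassSummary)
models <- c(rf = "rf", lasso = "glmnet", svm = "svmRadial", gbm = "xgbTree")
fits <- lapply(models, function(m)
  train(x = mtb_log[train_idx, ], y = y[train_idx],
        method = m, metric = "ROC", trControl = ctrl))

# Internal CV AUC per model.
sapply(fits, function(f) max(f$results$ROC))
```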
All models are significantly superior to a model trained on null data.
Random forest is slightly suboptimal on internal validation but optimal on external validation, which suggests we've avoided overfitting.
Despite testing additional models, we haven’t learned anything new from the data.
Instead, we can assess feature importances to see which metabolites drive differences between IBD and controls.
Let’s go back to the originally trained model and extract feature importances.
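With caret, `varImp()` pulls the Gini-based importances from the underlying random forest (using the `fit` object from the sketch above):

```r
# Scaled variable importances (0-100) from the final random forest;
# plot the top metabolites.
imp <- varImp(fit)
plot(imp, top = 20)
```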
Interestingly, the most important feature for discriminating IBD from controls was urobilin. Urobilin was found to be significantly decreased in IBD in another cohort in 2023, and this finding was validated again by a separate research group. Notably, reduced microbial bilirubin metabolism is a marker of IBD and of previous antibiotic use (source).
To conclude, we were able to validate Franzosa et al.'s metabolome-based IBD classifier, seeing no significant improvement from other machine learning models. By extracting feature importances, we found urobilin to be the top predictor of disease, which is highly consistent with the literature.