The Borenstein lab published a study seeking to predict metabolites in stool microbiomes using genus-level microbe abundances. This is not the first attempt at a research group trying to tackle this problem, (for example), but the Borenstein group did great work collating several datasets made readily available for others to use.
At a simple level, Muller’s study built machine learning (random forest) regression algorithms, using relative abundances of genera to predict the abundances of metabolites shared across datasets. They did additional analyses (e.g. random-effects models) to assess the generalizability of predictions across datasets. I am interested in simply seeing if bacterial abundances can predict metabolites, and whether additional models can outperform random forest.
Below, I focus on only 2 of the datasets Muller studied: FRANZOSA and IHMP. Both of these datasets employ whole genome shotgun sequencing and untargeted metabolomics on IBD microbiomes. Below, I will:
Build models using FRANZOSA’s n=164 samples
Evaluate the predictions on the IHMP’s n=79 samples
Test additional models that improve the predictive power.
Extract feature importances and identify bacteria associated with metabolites
First, we’ll select the samples we want to work with. We already know there are pseudoreplicates in the raw datasets, so let’s select only unique IBD patients to work with:
## dataset disease samples
## 1 FRANZOSA_IBD_2019 IBD 164
## 2 iHMP_IBDMDB_2019 IBD 79
There are 164 unique subjects’ samples in the Franzosa dataset, and 79 in the IHMP dataset.
Now, we’ll load both metabolomic datasets and restrict to annotated metabolites shared in each dataset.
## [1] "Franzosa contains 164 samples by 8,848 metabolites (467 annotated)"
## [1] "IHMP contains 79 samples by 81,867 metabolites (597 annotated)"
## [1] "238 intersecting metabolites"
We’ll check the distributions of the datasets.
Log-transforming the data with an added pseudocount had the effect we expected.
Now we can load the metagenomic data, and as with the metabolites, we’ll take the intersecting species.
We’re going to need to apply some stringent prevalence and abundance filtering.
Let’s see how many taxa remain across various thresholds.
As we can see, we can preserve a little over 300 taxa if we apply a filter that keeps taxa “detected at 0.01% abundance in 10%+ samples”
Let’s apply that filter and keep the intersecting (shared) taxa between datasets.
## [1] "355 taxa remain that are shared with IHMP"
Excellent. 355 taxa is a reasonable number to perform this analysis on.
Now let’s assess the distributions of taxa and log-transform if necessary, applying a single pseudocount to both datasets.
Log-transforming improved the distributions of taxa in both datasets.
Our objective is to predict the abundance of each metabolite using the taxa abundances.
I’m going to write a function that performs the following:
Selects one metabolite (from the Franzosa dataset)
Builds a model that predicts the metabolite’s abundance using taxa abundances (using the Franzosa dataset). We’re going to use random forest OOB predictions, which are as robust as multi-fold cross-validation, but blazingly faster.
Builds a model that predicts scrambled (null) metabolite abundances
Records the Spearman correlation of predictions from internal validation samples
Applies the model to the IHMP dataset, and records the Spearman correlation coefficient
And repeat this process across all metabolites
## [1] "41 metabolites with a Spearman rho > 0.3"
Here, I’ve plotted the metabolites with a Spearman correlation coefficient above 0.3 using the external evaluation (IHMP) dataset. Results from the internal data are in red, while external data are in blue. Null performances (built and evaluated using internal and external data) are in grey. All models were built and evaluated 15 times.
The authors used a threshold of 0.3 as well, but added additional statistical validation for their definition of a metabolite that can be robustly predicted.
Note: the original study used only healthy microbiomes to build predictors, whereas I am focusing on IBD microbiomes. Despite this key difference, I’ve nonetheless validated that urobilin can be well predicted. Interestingly, this metabolite is also useful for discriminating healthy from IBD microbiomes, as I showed here.
Now, we’ll ask whether other machine learning models are superior to random forest. The authors also evaluated lasso and elastic net. To save on computation time, I’m going to limit this benchmark to metabolites with external validation Spearman correlations > 0.3. We’re also going to drop the null tests, since our goal is to evaluate whether these other models outperform random forest.
It doesn’t appear like anything is superior to random forest. Let’s test this statistically using a mixed effects model.
# set Random Forest (ranger) as the reference group
muller_loop_output$model = factor(muller_loop_output$model, levels=c("ranger", "gbm", "glmnet", "svmLinear", "svmRadial", "xgbTree"))
# build a mixed effects regression model
lmerTest::lmer(external_real_spearman ~ model + (1|metabolite), muller_loop_output) %>%
summary() %>% coef()
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.378583693 0.011151587 44.02096 33.948862 3.284385e-33
## modelgbm -0.058653410 0.003736449 3644.00000 -15.697636 8.741438e-54
## modelglmnet -0.020449873 0.003736449 3644.00000 -5.473078 4.721136e-08
## modelsvmLinear -0.136274861 0.003736449 3644.00000 -36.471761 1.495995e-248
## modelsvmRadial -0.006595381 0.003736449 3644.00000 -1.765147 7.762279e-02
## modelxgbTree -0.097983469 0.003736449 3644.00000 -26.223690 5.341538e-139
The mixed effects model shows random forest is significantly superior (across all metabolites), except for svmRadial.
Now, we sort of biased these results by selecting the top metabolites (ρ > 0.3) identified by random forest. It’s entirely possible that other models could have identified different metabolites better than random forest, which could have bumped up their overall performance. However, these 41 metabolites alone took 9 hours to compute.
Since random forest was superior, let’s go back to the first models we built using the ranger package. We’ll extract out the feature importances for metabolites and correlate them with the taxa to identify their direction of association.
Let’s plot a heatmap with features (taxa) and their associations with the top 41 metabolites.
We’ll select only the top 20 taxa based on their mean importance value.
There are a few metabolites with an feature importances worth noting.
First, urobilin, which I previously found to be associated with IBD microbiomes here. Urobilin has a few features with outsized importance values towards the top right of the heatmap.
Second, chenodeoxycholate (CDCA), a primary bile acid, has a band of importance features. Primary bile acids, including CDCA, have been consistently found to be enriched in IBD, including in Franzosa’s original study.
Let’s explore the features most strongly predictive of these two metabolites
For urobilin, we see Holdemania is most strongly associated and predictive feature.
For CDCA, several taxa are negatively associated, including Angelakisella, Pullichristensenella, and an unclassified species annotated as UBA866. I’m unfamiliar with these taxa names.
Angelakisella is a genus in the Oscillospiraceae family. UBA866 is annotated as a Ruminococcaceae. Pullichristensenella is a candidatus (proposed, but not yet accepted) annotation for a genus in the Clostridia incertae sedis family.
To conclude, we were able to predict some metabolite abundances using genus-level metagenomic data in a cohort of IBD patients. We employed a separate cohort to validate these predictions.
We found that random forest was superior (or at least non-inferior) to the other models.
We identified specific taxa that are predictive of certain metabolites, like urobilin and CDCA, which are relevant to IBD.