The Segata Lab published a study in 2016 employing machine learning on microbiomes to predict various diseases, including liver cirrhosis, colorectal cancer, IBD, obesity, and type 2 diabetes.

The stool microbiome dataset is available here.

Below, we validate their findings and extend the analysis to employ multiclass classification with downsampling.

Validating the analysis

Loading data

Let’s load the data and see what types of samples are included:
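The loading code isn’t shown, so here is a minimal sketch of how this tally could be produced; the file name abundance.txt and its layout (feature names in the first column, one column per sample, with "subjectID", "studyID", and "disease" metadata rows) are assumptions about the download.

```r
library(dplyr)

raw <- read.delim("abundance.txt", header = TRUE, stringsAsFactors = FALSE)

# Pull the per-sample metadata rows into a data frame
# (row labels "subjectID", "studyID", "disease" are assumed)
metadata <- data.frame(
  sampleID  = colnames(raw)[-1],
  subjectID = unlist(raw[raw[[1]] == "subjectID", -1]),
  studyID   = unlist(raw[raw[[1]] == "studyID", -1]),
  disease   = unlist(raw[raw[[1]] == "disease", -1]),
  stringsAsFactors = FALSE
)

# Tally samples per disease label
metadata %>% count(disease, name = "sample count")
```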

##                       disease sample count
## 1                           -           20
## 2                           -            7
## 3                      cancer           48
## 4                   cirrhosis          118
## 5           ibd_crohn_disease           25
## 6      ibd_ulcerative_colitis          148
## 7  impaired_glucose_tolerance           49
## 8               large_adenoma           13
## 9                     leaness           89
## 10                          n          944
## 11                 n_relative           47
## 12                      obese            5
## 13                    obesity          164
## 14                 overweight           10
## 15              small_adenoma           26
## 16             stec2-positive           43
## 17                        t2d          223
## 18                underweight            1

This is a little messy, so let’s clean up the samples into the categories included in the study.

Metadata processing
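Something like the following dplyr recoding would collapse the raw labels into the study’s categories; the exact mapping is inferred from the tallies above and below.

```r
library(dplyr)

metadata_clean <- metadata %>%
  mutate(disease = case_when(
    disease %in% c("ibd_crohn_disease", "ibd_ulcerative_colitis") ~ "IBD",
    disease == "n"                                                ~ "none",
    disease %in% c("cancer", "cirrhosis", "obesity", "t2d")       ~ disease,
    TRUE                                                          ~ NA_character_  # categories not used in the study
  )) %>%
  filter(!is.na(disease))

metadata_clean %>% count(disease, name = "sample count")
```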

##     disease sample count
## 1    cancer           48
## 2 cirrhosis          118
## 3       IBD          173
## 4      none          944
## 5   obesity          164
## 6       t2d          223

A small note: it’s unclear which categories of samples contributed to the 981 control samples listed in the original study. I only see 944.

Let’s also check if there are sample duplicates:
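A sketch of the duplicate check, assuming the metadata carry a subjectID field identifying the donor:

```r
library(dplyr)

# Count subjects that contributed more than one sample
metadata_clean %>%
  count(subjectID) %>%
  filter(n > 1) %>%
  nrow()
```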

Uh oh. That’s a notable number of pseudoreplicates. Which conditions do they belong to?

##     disease >1 replicate
## 1    cancer            0
## 2 cirrhosis            0
## 3       IBD           66
## 4      none          153
## 5   obesity            0
## 6       t2d            0

Many of the controls were replicates, which we can afford to drop. However, 66 of the 173 IBD samples were replicates, which we would rather keep. Still, we need to drop them if we want to apply machine learning without blatantly overfitting, so we’ll take only the first entry per subject.
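A sketch of the deduplication, keeping the first entry per subject (this assumes rows are ordered so that the first occurrence is the one to keep):

```r
library(dplyr)

metadata_dedup <- metadata_clean %>%
  distinct(subjectID, .keep_all = TRUE)

metadata_dedup %>% count(disease, name = "samples")
```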

Here are the final tallies of samples after removing pseudoreplicates.

##     disease samples
## 1    cancer      48
## 2 cirrhosis     118
## 3       IBD     107
## 4      none     851
## 5   obesity     164
## 6       t2d     223

Alright. Now let’s check the ratio of controls to cases within each study.
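A simple cross-tabulation produces the table below; studyID is assumed to be a field in the metadata:

```r
# Cross-tabulate study against disease label
with(metadata_dedup, table(studyID, disease))
```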

##            disease
## studyID     cancer cirrhosis IBD none obesity t2d
##   22699609       0         0   0  115       0   0
##   25807110       0         0   0   35       0   0
##   25981789       0         0   0   38       0   0
##   cancer        48         0   0   47       0   0
##   cirrhosis      0       118   0  114       0   0
##   IBD            0         0 107  260       0   0
##   obesity        0         0   0   25     164   0
##   T2D            0         0   0  174       0 170
##   WT2D           0         0   0   43       0  53

Every study of disease also included approximately as many, or more, control samples as case samples.

Now let’s process the sequencing data.

Data processing

Let’s extract the species-level data and examine their distributions.
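Here is a sketch of the species extraction, assuming `abundance` is a MetaPhlAn-style numeric matrix from the same file (clade names as row names, one column per sample): keep clades annotated to the species level (s__) but not the strain level (t__).

```r
# Species-level clades reach s__ but not t__
species_rows <- grepl("s__", rownames(abundance)) & !grepl("t__", rownames(abundance))
species <- abundance[species_rows, ]

print(paste(nrow(species), "species remain."))
```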

## [1] "826 species remain."

We should convert to relative abundance, add a pseudocount, and log transform.
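A sketch of that transformation; the pseudocount value (1e-6) is an assumption:

```r
# Scale each sample (column) to relative abundance, add a small pseudocount,
# then log10-transform
rel_abund <- sweep(species, 2, colSums(species), "/")
log_abund <- log10(rel_abund + 1e-6)
```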

Log transforming brought the distributions closer to normal, though they remain right skewed. Nonetheless, they’re in acceptable shape to continue on to building models.

Model building

The authors approach their task in a number of ways:

Our multi-level validation strategy includes the assessment of microbiome models on single cohorts, across stages of the same study, across different studies, and across target outcomes and conditions

I’m interested in validating Fig 2, where they calculate a variety of metrics, including AUC, for each dataset separately. Of note, they evaluate both Random Forest and SVM models. For species-based models, the SVM used a radial basis kernel.

Since they also mention evaluating elastic net, let’s go ahead and include it, along with a few more models.

They also compared their models against null datasets, where the class labels were randomly shuffled. This is a helpful way to show, statistically, that their models outperform an empirical null distribution.

So, we’ll do the following (a sketch of the loop follows the list):

  • For each of the 6 diseases/datasets:

      • evaluate 5 machine learning algorithms,

      • build models 15 times,

      • on both real and null (label-shuffled) data sets,

      • and extract the internal validation AUCs.
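Here is a sketch of that loop using caret. The list of five algorithms, the 10-fold cross-validation setup, and the `disease_datasets` object (a named list of per-disease data frames holding the log-transformed species features plus a two-level `class` factor of disease vs. control) are assumptions for illustration, not the authors’ exact configuration.

```r
library(caret)

algorithms <- c("ranger", "svmRadial", "glmnet", "rf", "pls")  # illustrative set of 5
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

results <- list()
for (d in names(disease_datasets)) {
  for (alg in algorithms) {
    for (i in seq_len(15)) {
      dat <- disease_datasets[[d]]
      for (shuffle in c(FALSE, TRUE)) {        # real data, then a label-shuffled null
        if (shuffle) dat$class <- sample(dat$class)
        fit <- train(class ~ ., data = dat, method = alg,
                     metric = "ROC", trControl = ctrl)
        results[[length(results) + 1]] <- data.frame(
          disease = d, algorithm = alg, repeat_id = i, null = shuffle,
          AUC = max(fit$results$ROC)           # internal cross-validation AUC
        )
      }
    }
  }
}
auc_results <- do.call(rbind, results)
```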

Here are the results:

This plot shows the distributions of 15 repeats of training each model on real (colored) data and null (grey) data. I ran t-tests on each comparison followed by a Bonferroni correction to control the family-wise error rate across comparisons. Significant values are indicated by a *.
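The testing step could look roughly like this, assuming the loop above produced an `auc_results` data frame:

```r
library(dplyr)

# Compare real vs. null AUCs for each disease/algorithm pair
pvals <- auc_results %>%
  group_by(disease, algorithm) %>%
  summarise(p = t.test(AUC[!null], AUC[null])$p.value, .groups = "drop") %>%
  mutate(p_adj = p.adjust(p, method = "bonferroni"),
         significant = p_adj < 0.05)
```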

Using the internal validation metrics, we can see that random forest (ranger) is generally superior to SVM across datasets. Elastic net (glmnet) is quite competitive with ranger.

We can also get a sense of how some diseases (e.g. obesity) are harder to predict than others (e.g. IBD).

Importantly, most of these values are either as good or better than those reported in the original study.

Let’s consider these results validated and move on to multiclass machine learning.

Extending the analysis

Above, we constructed one predictor for each disease. Depending on the research question, it might be preferable to build a single model that reports the probabilities that a microbiome belongs to each of the diseases simultaneously. This is a task for a multiclass classifier.

Multiclass Classifier

First, we’ll build a multiclass random forest that handles all samples simultaneously:
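A sketch of the multiclass model with caret and ranger; `full_data` (all samples with the 6-level disease label plus the log-transformed species features), the 80/20 train/test split, and the tuning defaults are assumptions:

```r
library(caret)

full_data$disease <- factor(full_data$disease)

set.seed(42)
in_train  <- createDataPartition(full_data$disease, p = 0.8, list = FALSE)
train_set <- full_data[in_train, ]
test_set  <- full_data[-in_train, ]

fit_multi <- train(disease ~ ., data = train_set, method = "ranger",
                   trControl = trainControl(method = "cv", number = 5,
                                            classProbs = TRUE))

preds <- predict(fit_multi, test_set)
confusionMatrix(preds, test_set$disease)
```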

This confusion matrix shows generally low accuracy in predicting each disease. That is, the values in parentheses along the diagonal tiles (from bottom left to top right) are quite low for most diseases. The model predicts that most microbiomes belong to healthy individuals.

I suspect the large class imbalance (e.g. n=665 “none” vs n=48 “cancer”) is causing the model to bias its prediction towards the overrepresented class.

What do the ROC curves look like?
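One way to draw per-class curves is one-vs-rest ROC from the predicted class probabilities, e.g. with pROC; this is a sketch built on the objects assumed above:

```r
library(pROC)

probs <- predict(fit_multi, test_set, type = "prob")
roc_list <- lapply(levels(test_set$disease), function(cls) {
  roc(response = test_set$disease == cls, predictor = probs[[cls]])
})
names(roc_list) <- levels(test_set$disease)
sapply(roc_list, auc)   # one AUC per class
```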

These ROC curves seem to suggest all diseases are being predicted reasonably well. However, “none” is predicted the poorest, likely because of the high number of false positives observed in the confusion matrix.

Downsample

To eliminate the class imbalance, let’s downsample these data to the size of the smallest class (n=48). I suspect this will sacrifice some overall accuracy while improving the false positive rate.
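caret’s downSample() subsamples every class to the size of the smallest one; here is a sketch using the assumed training set from above:

```r
library(caret)

set.seed(42)
balanced <- downSample(x = subset(train_set, select = -disease),
                       y = train_set$disease,
                       yname = "disease")
table(balanced$disease)   # all classes now the same size

fit_down <- train(disease ~ ., data = balanced, method = "ranger",
                  trControl = trainControl(method = "cv", number = 5,
                                           classProbs = TRUE))
```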

This looks considerably better. The values in parentheses along the diagonal indicate that each disease is being predicted much better.

Let’s see the ROC curves:

Once again, these ROC curves seem to suggest all diseases are being predicted at least reasonably well. We see that the AUC has shrunk by ~5%. We also see that the model performs most poorly with “none” and “t2d”, in agreement with the confusion matrix.

Conclusions

To conclude, we found that microbiome-based classifiers could predict disease quite well, particularly when control samples were confined to each study.

When we built a model that tries to predict all diseases at once, we found that classification specificities and sensitivities suffered from large class imbalances. This illustrates the trade-offs of balancing datasets when considering class-specific accuracies.