The Segata Lab published a study in 2016 employing machine learning on microbiomes to predict various diseases, including liver cirrhosis, colorectal cancer, IBD, obesity, and type 2 diabetes.
The stool microbiome dataset is available here.
Below, we validate their findings and extend the analysis to employ multiclass classification with downsampling.
Let’s load the data and see what types of samples are included:
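Here's a rough sketch of the loading step; the file name (`abundance.txt`) and the sample-per-row layout with a `disease` column are assumptions about how the download is organized:

```r
# Load the per-sample metadata + abundance table (assumed to be a
# tab-delimited file with one row per sample and a "disease" column)
library(readr)
library(dplyr)

abund <- read_tsv("abundance.txt")

# Tally samples per disease label
abund %>%
  count(disease, name = "sample_count")
```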
## disease sample count
## 1 - 20
## 2 - 7
## 3 cancer 48
## 4 cirrhosis 118
## 5 ibd_crohn_disease 25
## 6 ibd_ulcerative_colitis 148
## 7 impaired_glucose_tolerance 49
## 8 large_adenoma 13
## 9 leaness 89
## 10 n 944
## 11 n_relative 47
## 12 obese 5
## 13 obesity 164
## 14 overweight 10
## 15 small_adenoma 26
## 16 stec2-positive 43
## 17 t2d 223
## 18 underweight 1
This is a little messy, so let’s clean up the samples into the categories included in the study.
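Here's a rough recoding based on the label names (the exact mapping of the leftover categories is a judgment call):

```r
# Collapse raw disease labels into the study's categories and drop the rest
abund <- abund %>%
  mutate(disease = case_when(
    disease %in% c("ibd_crohn_disease", "ibd_ulcerative_colitis") ~ "IBD",
    disease == "n"                                           ~ "none",
    disease %in% c("cancer", "cirrhosis", "obesity", "t2d")  ~ disease,
    TRUE                                                     ~ NA_character_
  )) %>%
  filter(!is.na(disease))

abund %>%
  count(disease, name = "sample_count")
```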
## disease sample count
## 1 cancer 48
## 2 cirrhosis 118
## 3 IBD 173
## 4 none 944
## 5 obesity 164
## 6 t2d 223
A small note: it’s unclear which categories of samples contributed to the 981 control samples listed in the original study. I only see 944.
Let’s also check if there are sample duplicates:
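A quick check, assuming a `subjectID` column identifies the individual each sample came from:

```r
# Count samples that repeat a subject already seen earlier in the table
sum(duplicated(abund$subjectID))
```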
Uh oh. That’s a notable number of pseudoreplicates. Which conditions do they belong to?
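We can tally those extra samples per disease, again treating `subjectID` as the subject identifier:

```r
# For each disease, count samples beyond the first per subject
dups <- duplicated(abund$subjectID)
table(abund$disease[dups])
```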
## disease >1 replicate
## 1 cancer 0
## 2 cirrhosis 0
## 3 IBD 66
## 4 none 153
## 5 obesity 0
## 6 t2d 0
Many of the controls were replicates, which we can spare. However, 66 of the 173 IBD samples were replicates, which we’d obviously rather not spare. But we need to drop them if we want to apply machine learning without blatantly overfitting: repeated samples from the same subject would leak information across training and test splits. So, we’ll take only the first entry per subject.
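In code, this is a one-liner, assuming `subjectID` marks the individual and the table is ordered so that the first row per subject is the one we want to keep:

```r
# Drop pseudoreplicates: keep only the first sample per subject
abund <- abund[!duplicated(abund$subjectID), ]
```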
Here are the final tallies of samples after removing pseudoreplicates.
## disease samples
## 1 cancer 48
## 2 cirrhosis 118
## 3 IBD 107
## 4 none 851
## 5 obesity 164
## 6 t2d 223
Alright. Now let’s check the ratio of controls to cases within each disease study.
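A simple cross-tabulation does the trick, assuming the study identifier lives in a `dataset_name` column:

```r
# Controls vs cases per study
with(abund, table(studyID = dataset_name, disease))
```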
## disease
## studyID cancer cirrhosis IBD none obesity t2d
## 22699609 0 0 0 115 0 0
## 25807110 0 0 0 35 0 0
## 25981789 0 0 0 38 0 0
## cancer 48 0 0 47 0 0
## cirrhosis 0 118 0 114 0 0
## IBD 0 0 107 260 0 0
## obesity 0 0 0 25 164 0
## T2D 0 0 0 174 0 170
## WT2D 0 0 0 43 0 53
With the exception of the obesity study (25 controls vs 164 cases), every disease study also included approximately as many, or more, control samples as case samples.
Now let’s process the sequencing data.
Let’s extract the species-level data and check their distributions.
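One way to pull out the species-level features, assuming MetaPhlAn-style taxonomy strings in the column names (species entries contain `s__` but no strain-level `t__`):

```r
# Identify species-level columns from the taxonomy strings
species_cols <- grep("s__", names(abund), value = TRUE)
species_cols <- species_cols[!grepl("t__", species_cols)]
species_mat  <- as.matrix(abund[, species_cols])

paste(ncol(species_mat), "species remain.")
```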
## [1] "826 species remain."
We should convert to relative abundance, add a pseudocount, and log transform.
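A sketch of that step, using `species_mat` from above and an arbitrary pseudocount of 1e-6:

```r
# Relative abundance per sample, then pseudocount, then log10 transform
species_rel <- sweep(species_mat, 1, rowSums(species_mat), FUN = "/")
species_log <- log10(species_rel + 1e-6)
```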
Log transforming brought the distributions closer to normal, though they remain right skewed. Nonetheless, they’re in acceptable shape to continue on to building models.
The authors approach their task in a number of ways:
Our multi-level validation strategy includes the assessment of microbiome models on single cohorts, across stages of the same study, across different studies, and across target outcomes and conditions
I’m interested in validating Fig 2, where they calculate a variety of metrics, including AUC, for each dataset separately. Of note, they evaluate both Random Forest and SVM models. For species-based models, the SVM used a radial basis kernel.
Since they also mention evaluating elastic net, let’s go ahead and include it, along with a few more models.
They also compared their models against null datasets, where the class labels were randomly shuffled. This is helpful for showing, statistically, that a model outperforms an empirical null distribution.
So, we’ll do the following:

For each of the 6 diseases/datasets:

- Evaluate 5 machine learning algorithms,
- Build models 15 times,
- For both real and null data sets,
- And extract the internal validation AUCs.
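Here's the general shape of that loop with caret. The three models named so far (random forest via ranger, glmnet, and a radial-kernel SVM) map to caret methods directly; the other two methods, the cross-validation settings, and the `datasets` object (a named list of six case/control data frames, each with a `disease` column) are placeholder choices in this sketch:

```r
library(caret)

# Assumed caret method names: ranger, glmnet, and svmRadial come from the
# text; the remaining two are placeholders to round out 5 algorithms.
methods <- c("ranger", "glmnet", "svmRadial", "xgbTree", "kknn")

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

results <- list()
for (ds_name in names(datasets)) {      # datasets: assumed named list of six
  ds <- datasets[[ds_name]]             # case/control data frames
  x  <- ds[, setdiff(names(ds), "disease")]
  for (m in methods) {
    for (i in seq_len(15)) {            # 15 repeated fits per model
      for (null_run in c(FALSE, TRUE)) {
        y <- factor(ds$disease)
        if (null_run) y <- sample(y)    # null data: permute the class labels
        fit <- train(x = x, y = y, method = m, metric = "ROC",
                     trControl = ctrl)
        results[[length(results) + 1]] <- data.frame(
          dataset = ds_name, method = m, rep = i, null = null_run,
          AUC = max(fit$results$ROC, na.rm = TRUE))
      }
    }
  }
}
auc_df <- do.call(rbind, results)
```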
Here are the results:
This plot shows the distributions of 15 repeats of training each model on real (colored) and null (grey) data. I ran a t-test on each comparison followed by a Bonferroni adjustment to control the family-wise error rate. Significant comparisons are indicated by a *.
Using the internal validation metrics, we can see that random forest (ranger) is generally superior to SVM across datasets. Elastic net (glmnet) is quite competitive with ranger.
We can also get a sense of how some diseases (e.g. obesity) are harder to predict than others (e.g. IBD).
Importantly, most of these values are as good as or better than those reported in the original study.
Let’s consider these results validated and move on to multiclass machine learning.
Above, we constructed one predictor for each disease. Depending on the research question, it might be preferable to build ONE model that can report the probability that a microbiome belongs to each of the diseases simultaneously. This is a task for a multiclass classifier.
First, we’ll build a multiclass random forest that handles all samples simultaneously:
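Here's a sketch of that model, assuming `train_df` and `test_df` are train/test splits containing the log-transformed species features plus the six-level `disease` factor:

```r
library(caret)

# Multiclass random forest on all samples at once
ctrl_mc <- trainControl(method = "cv", number = 5, classProbs = TRUE)
fit_mc  <- train(disease ~ ., data = train_df, method = "ranger",
                 trControl = ctrl_mc)

# Confusion matrix on the held-out samples
pred_mc <- predict(fit_mc, newdata = test_df)
confusionMatrix(pred_mc, test_df$disease)
```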
This confusion matrix illustrates generally low accuracy in predicting each disease. That is, the values in parentheses along the diagonal tiles (from bottom left to top right) are quite low for most diseases. The model predicts that most microbiomes belong to healthy individuals.
I suspect the large class imbalance (e.g. n=665 “none” vs n=48 “cancer”) is causing the model to bias its prediction towards the overrepresented class.
What do the ROC curves look like?
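We can build one-vs-rest curves from the predicted class probabilities; here's a rough sketch using pROC:

```r
library(pROC)

# Class probabilities on the held-out samples
probs <- predict(fit_mc, newdata = test_df, type = "prob")

# One ROC curve (and AUC) per disease, one-vs-rest
roc_list <- lapply(levels(test_df$disease), function(cls) {
  roc(response = as.integer(test_df$disease == cls), predictor = probs[[cls]])
})
names(roc_list) <- levels(test_df$disease)
sapply(roc_list, auc)
```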
These ROC curves seem to suggest all diseases are being predicted reasonably well. However, “none” is predicted most poorly, likely because of the high number of false positives observed in the confusion matrix.
To eliminate the class imbalance, let’s downsample every class to the size of the smallest (n=48). I suspect this will sacrifice overall accuracy while improving the false positive rate.
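caret’s `downSample()` is one way to do this; below is a sketch reusing the objects defined above:

```r
# Downsample every class to the size of the smallest (cancer, n = 48), refit,
# and re-evaluate on the original (un-balanced) test set
set.seed(42)
balanced <- downSample(x = train_df[, setdiff(names(train_df), "disease")],
                       y = train_df$disease, yname = "disease")
table(balanced$disease)

fit_down <- train(disease ~ ., data = balanced, method = "ranger",
                  trControl = ctrl_mc)
confusionMatrix(predict(fit_down, newdata = test_df), test_df$disease)
```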
This looks considerably better. The values in parentheses along the diagonal indicate that each disease is being predicted much better.
Let’s see the ROC curves:
Once again, these ROC curves seem to suggest all diseases are being predicted at least reasonably well. We see that the AUC has shrunk by ~5%. We also see that the model performs most poorly with “none” and “t2d”, in agreement with the confusion matrix.
To conclude, we found that microbiome-based classifiers could predict disease quite well, particularly when each model’s control samples came from the same study as its cases.
When we built a model that tries to predict all diseases at once, we found that classification specificities and sensitivities suffered from large class imbalances. This illustrates the trade-offs of balancing datasets when considering class-specific accuracies.