Gevers et al.’s 2014 Cell Host & Microbe study is one of the first and most highly cited studies of the microbiome in IBD.
This study characterizes the microbiome obtained from several intestinal sites and sample types (e.g., ileum, rectum, and stool), and applies sparse (L1-regularized) logistic regression to identify bacterial signatures that discriminate CD from controls.
Below, we attempt to validate their model and evaluate additional models that may boost the performance.
We will download data from the Dan Knights lab’s Machine Learning Repository (MLRepo).
This includes a metadata file and a feature table (16S read counts).
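A minimal loading sketch; the file names below are placeholders for whatever the MLRepo download provides, and the object names (`gevers_meta`, `gevers_counts`) are my own:

# read the MLRepo metadata and 16S count table (file names are placeholders)
gevers_meta   <- read.delim("gevers_task.txt", row.names = 1)       # sample metadata, including CD status
gevers_counts <- read.delim("gevers_otutable.txt", row.names = 1)   # read counts: samples x taxa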
Already we can see that there are very few reads per sample. This study was published in 2014, before deeper sequencing technologies were available and affordable. Additionally, microbiomes were harvested from biopsies, which have low bacterial biomass compared to human DNA.
There are 446 samples, with a minimum read count of:
## [1] 102
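For reference, the depth summary above can be reproduced along these lines (using the placeholder count table `gevers_counts`, with samples as rows):

# per-sample sequencing depth
read_depth <- rowSums(gevers_counts)
length(read_depth)   # 446 samples
min(read_depth)      # minimum read count per sample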
Ordinarily, I prefer to rarefy data to account for uneven sequencing effort across samples (not without caveats). However, I think rarefying to this number of reads would harm the integrity of the data. Therefore, we will simply scale read counts to relative abundances (i.e., to 100%) instead.
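Scaling to relative abundance (total-sum scaling) is a one-liner:

# divide each sample's counts by its total so that each row sums to 1
gevers_relabund <- sweep(gevers_counts, 1, rowSums(gevers_counts), "/")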
The authors mention that they use genus-level taxonomy to build their classifier, so let’s do that, too.
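How the genus-level table is built depends on how taxonomy is encoded in the download; a sketch assuming a character vector `genus` that maps each feature (column) to a genus label:

# sum relative abundances within each genus; each sample still sums to 1
relabund_mat   <- as.matrix(gevers_relabund)
genus_relabund <- t(rowsum(t(relabund_mat), group = genus))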
Here are the first 5 columns and rows of the genus-level relative abundance data:
## .Eubacterium. .Prevotella. .Ruminococcus. X02d06 X1.68
## SKBTI.1286.1247055 0.000000000 0.000000000 0.096202532 0 0
## SKBTI.1147.1247101 0.003508772 0.000000000 0.005263158 0 0
## SKBTI.0149.1246577 0.000000000 0.000000000 0.000000000 0 0
## SKBTI.1229.1246952 0.008695652 0.000000000 0.010869565 0 0
## SKBTI.0257.1246558 0.000000000 0.005249344 0.018372703 0 0
We are going to try using lasso regression to predict CD. First, we need to process the data (e.g. log transform with a pseudocount). Interestingly, the authors do not state that they log-transformed their data.
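A minimal version of that preprocessing, assuming the tables have already been restricted to the ileal CD vs. control samples; the pseudocount value and the metadata column name are placeholders:

# log10-transform with a small pseudocount to handle zeros
pseudo <- 1e-6
gevers_16S_cd <- data.frame(
  CD = factor(gevers_meta$diagnosis, levels = c("CD", "no")),  # placeholder column name
  log10(genus_relabund + pseudo)
)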
We should also evaluate whether our data set is imbalanced.
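A quick tally of the outcome classes:

# count samples per class
table(gevers_16S_cd$CD)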
##
## CD no
## 78 62
There is a slight class imbalance in the dataset, which can bias the model toward predicting that a sample belongs to the larger class.
Class imbalance can be handled in a few ways, such as downsampling the majority class, upsampling with synthetic data, a hybrid of the two, or simply ignoring it.
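For reference, caret can handle this inside cross-validation via the `sampling` argument of `trainControl` (e.g., down-sampling the majority class). We don't use it here, since the original analysis did not balance classes, but a sketch looks like this:

# down-sample the majority class within each resampling iteration (not used below)
ctrl_down <- caret::trainControl(method = "cv", number = 5,
                                 classProbs = TRUE,
                                 summaryFunction = caret::twoClassSummary,
                                 sampling = "down")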
Let’s follow the authors’ methods and build a lasso without balancing groups during CV:
# first, we will construct 5 folds (createFolds stratifies them by class)
set.seed(25)
cv_folds <- caret::createFolds(gevers_16S_cd$CD, k = 5, returnTrain = TRUE)

# fit the lasso with glmnet via caret, tuning lambda over a wide grid
gevers_16S_cd_lasso <- caret::train(
  CD ~ .,
  data = gevers_16S_cd,
  trControl = caret::trainControl(method = "cv",
                                  number = 5,
                                  index = cv_folds,
                                  savePredictions = TRUE,
                                  classProbs = TRUE,
                                  summaryFunction = caret::twoClassSummary),
  tuneGrid = expand.grid(alpha = 1, # alpha = 1 is a pure lasso (no ridge penalty)
                         lambda = 10^seq(5, -5, length.out = 100)),
  metric = "ROC",
  method = "glmnet",
  preProc = c("center", "scale")
)
Now we can plot a ROC curve using each fold’s predictions.
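One way to do this from caret’s saved predictions (restricted to the best lambda) is with pROC; a sketch:

library(pROC)

# keep only the held-out predictions made at the selected lambda
best_pred <- subset(gevers_16S_cd_lasso$pred,
                    lambda == gevers_16S_cd_lasso$bestTune$lambda)

# ROC across all folds; "CD" is the class-probability column for the positive class
roc_cv <- pROC::roc(response = best_pred$obs, predictor = best_pred$CD,
                    levels = c("no", "CD"), direction = "<")
plot(roc_cv)
pROC::auc(roc_cv)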
This 5-fold CV AUC (0.758) suggests good performance in discriminating CD from non-IBD controls using ileal biopsy microbiome samples. However, it is below the AUC reported in the original paper (0.85).
I noted above that the authors did not state that they log-transformed their data. Because regularized logistic regression can be sensitive to the heavy right skew typical of relative abundance data, I want to check whether the transformation impacts the results.
Let’s re-run our analysis without log-transforming:
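One way to do this without rebuilding the table is to invert the transform and refit the same caret call; a sketch, reusing the folds and pseudocount from above:

# undo the log10(x + pseudo) transform on the feature columns (column 1 is the outcome)
gevers_16S_cd_raw <- gevers_16S_cd
gevers_16S_cd_raw[-1] <- 10^gevers_16S_cd_raw[-1] - pseudo

set.seed(25)
gevers_16S_cd_nolog <- caret::train(
  CD ~ ., data = gevers_16S_cd_raw,
  method = "glmnet", metric = "ROC",
  trControl = caret::trainControl(method = "cv", number = 5, index = cv_folds,
                                  savePredictions = TRUE, classProbs = TRUE,
                                  summaryFunction = caret::twoClassSummary),
  tuneGrid = expand.grid(alpha = 1, lambda = 10^seq(5, -5, length.out = 100)),
  preProc = c("center", "scale")
)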
Clearly, and not surprisingly, log-transforming the data is not what caused the performance to fall short of the published AUC.
Apart from other potential differences in data pre-processing (which I largely circumvented by downloading the processed data from MLRepo), there is no obvious source of this reduced performance.
At the time this paper was published, machine learning was not widely adopted in the microbiome literature.
Lasso is a reasonable choice for building a predictive model, but other algorithms (e.g., random forest) tend to perform better. Let’s see whether that assumption holds with this data set by repeating the analysis above with different algorithms and comparing their internal validation AUCs.
# select models
ml.models <- c("gbm", "glmnet", "ranger", "svmLinear", "svmRadial", "xgbTree")
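A sketch of the comparison loop, reusing the folds and summary settings from above (each caret method string requires its corresponding modeling package to be installed):

# reuse the same resampling setup so AUCs are comparable across models
ctrl <- caret::trainControl(method = "cv", number = 5, index = cv_folds,
                            savePredictions = TRUE, classProbs = TRUE,
                            summaryFunction = caret::twoClassSummary)

set.seed(25)
model_fits <- lapply(ml.models, function(m) {
  caret::train(CD ~ ., data = gevers_16S_cd, method = m,
               metric = "ROC", trControl = ctrl,
               preProc = c("center", "scale"))
})
names(model_fits) <- ml.models

# best cross-validated AUC for each model
sapply(model_fits, function(fit) max(fit$results$ROC, na.rm = TRUE))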
The assumption holds: the median cross-validated performance of random forest is superior to that of all other models, including lasso.
Here, we replicated Gevers et al.’s analysis and, based on internal validation performance, found that random forest outperformed lasso and the other common machine learning models. However, I could not replicate their published AUC (0.85).
Of note, their 16S samples were sequenced to a very low depth, meaning rarer taxa likely could not contribute to the cross-validated performance. Additionally, no external data sets were used to validate the model.