Decision Tree

Recursively splits observations (classes or values) based on binary rules. The best rule (split) according to a splitting criterion is placed at the root of the tree, followed by recursive rules down the branches. Tree building stops according to a stopping criterion, such as a minimum number of observations in a terminal (leaf) node. Predictions are made by applying the ruleset and taking the majority vote (classification) or average value (regression) within the resulting leaf.
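
A minimal sketch of building and inspecting such a tree, assuming scikit-learn and its bundled iris dataset; the stopping criterion shown (min_samples_leaf) and its value are illustrative:

```python
# Minimal decision tree classifier sketch using scikit-learn (assumed available).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# min_samples_leaf is one example of a stopping criterion (minimum observations per leaf).
tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree))                        # the learned ruleset
print("accuracy:", tree.score(X_test, y_test))  # majority-vote predictions on held-out data
```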

Strengths: Rulesets are typically simple and interpretable, and capture higher-order relationships between features.

Limitations: Predictions are piecewise constant, so continuous outcomes cannot be predicted precisely. A single decision tree tends to perform poorly (underfit) because it settles on an overly simplistic solution.

Random Forest

Constructs a forest (ensemble) of decision trees. Each tree is constructed using a random subset of samples (drawn with replacement, i.e. bootstrapping) and a random subset of features (drawn without replacement). Class votes (classification) or values (regression) are aggregated across trees to make the final prediction. With bootstrap aggregation (“bagging”), the samples left out of a given tree’s bootstrap sample can be used to test that tree, enabling an internal (“out-of-bag”) error rate to be calculated that approximates cross-validation accuracy.
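
A minimal sketch of the out-of-bag idea, assuming scikit-learn and its bundled breast cancer dataset; hyperparameter values are illustrative:

```python
# Minimal random forest sketch with out-of-bag (OOB) error, using scikit-learn (assumed available).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each tree is fit on a bootstrap sample of observations, with a random subset of
# features considered at each split (max_features); oob_score=True uses the left-out
# samples to estimate generalization accuracy without a separate cross-validation loop.
forest = RandomForestClassifier(
    n_estimators=500, max_features="sqrt", oob_score=True, random_state=0
)
forest.fit(X, y)

print("OOB accuracy:", forest.oob_score_)
print("feature importances:", forest.feature_importances_[:5])
```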

Strengths: Tends to perform near optimally compared to other models. Relatively fast model construction. Out-of-bag error approximates cross-validation performance, which greatly improves validation speed. Other features, such as proximity matrices and interaction scores, can be used to examine sample and feature relationships, respectively.

Limitations: Overfits when the number of features greatly exceeds the number of samples.

Gradient Boosting Machine

Constructs a series of decision trees, beginning with a base decision tree. Each subsequent tree predicts the error (residuals) of the current ensemble; its contribution is multiplied by a weight (i.e. it undergoes “shrinkage”) and added to the previous prediction, and all trees are “ensembled” to make the final predictor. Each iteration seeks to minimize the loss using gradient descent. The sequential addition of improved learners is called “boosting”.
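
A minimal sketch, assuming scikit-learn and its bundled breast cancer dataset; the number of trees, learning rate (shrinkage), and tree depth shown are illustrative:

```python
# Minimal gradient boosting sketch using scikit-learn (assumed available).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learning_rate is the shrinkage applied to each successive tree's contribution;
# trees are added sequentially, each fit to the gradient of the loss.
gbm = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=0
)
gbm.fit(X_train, y_train)
print("accuracy:", gbm.score(X_test, y_test))
```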

Strengths: Tends to perform near optimally compared to other models.

Limitations: Trees cannot be trained in parallel because of the model’s sequential structure.

K Nearest Neighbours

Computes distances between observations (e.g. a distance matrix), then assigns the majority class (classification) or average value (regression) of an observation’s k nearest neighbours.
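
A minimal sketch, assuming scikit-learn and its bundled iris dataset; the scaling step and the choice of k = 5 are illustrative:

```python
# Minimal k-nearest neighbours sketch using scikit-learn (assumed available).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Features are scaled so no single feature dominates the distance calculation;
# predictions are the majority class among the k = 5 nearest training observations.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5, metric="euclidean"))
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```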

Strengths: Conceptually simple to employ.

Limitations: Tends to underfit.

Linear Regression

Fits a straight line to predict an outcome, using a linear combination of predictor features weighted by coefficients (β). The objective is to minimize the residuals (squared differences) between the observed and fitted values of Y.
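
A minimal sketch of fitting by ordinary least squares, assuming scikit-learn and simulated data with made-up coefficients:

```python
# Minimal ordinary least squares sketch using scikit-learn (assumed available).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                  # three predictor features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print("coefficients (beta):", model.coef_)
print("intercept:", model.intercept_)

# The fit minimizes the residual sum of squares between observed and fitted Y.
residuals = y - model.predict(X)
print("RSS:", np.sum(residuals ** 2))
```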

Note: Linear regression is generally not used for machine learning tasks on its own. Additional dimension reduction or regularization is required for competitive performance.

PLS and PCR

Partial least squares regression transforms predictors into latent variables (components) that explain variance in both the predictors and the response, followed by linear regression using the components as predictors.

Principal component regression derives principal components that explain variance in the predictors only, then performs linear regression using the components as predictors.
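
A minimal sketch contrasting the two approaches, assuming scikit-learn and its bundled diabetes dataset; the choice of three components is illustrative:

```python
# Minimal PLS vs. PCR sketch using scikit-learn (assumed available).
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# PLS: components chosen to explain variance in both predictors and response.
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=3))

# PCR: principal components of the predictors only, then linear regression on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())

print("PLS R^2:", cross_val_score(pls, X, y, cv=5).mean())
print("PCR R^2:", cross_val_score(pcr, X, y, cv=5).mean())
```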

Strengths: Both PLS and PCR typically outperform a simple linear regression, particularly when predictors are numerous or collinear.

Limitations: Latent factors and principal components are not directly interpretable, but can be considered a signature score (linear combination of feature weights).

Regularized Regression

L1 (lasso) and L2 (ridge) penalties shrink the coefficients of low-importance and collinear features towards 0 (ridge) or to exactly 0 (lasso); when the two penalties are mixed (elastic net), the balance between them is controlled by the alpha value.
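
A minimal sketch, assuming scikit-learn and its bundled diabetes dataset; note that scikit-learn’s l1_ratio plays the role of the alpha mixing parameter in the glmnet convention:

```python
# Minimal regularized regression sketch using scikit-learn (assumed available).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Lasso (L1) can set coefficients to exactly 0; ridge (L2) only shrinks them towards 0.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)
ridge = make_pipeline(StandardScaler(), RidgeCV()).fit(X, y)

# Elastic net mixes the two penalties; l1_ratio controls the balance between them.
enet = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio=0.5, cv=5)).fit(X, y)

print("lasso coefficients:", lasso[-1].coef_)   # note the exact zeros
print("ridge coefficients:", ridge[-1].coef_)
print("elastic net coefficients:", enet[-1].coef_)
```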

Strengths: Tends to perform near optimally compared to other models. Regularization reduces the feature set down to the most important features, which is useful for creating parsimonious, interpretable models.

Limitations: Overfits when the number of features greatly exceeds the number of samples.

Neural Network

Inspired by biological neural circuitry, the fundamental architecture of a neural network is a series of layers of neurons (i.e. nodes). Nodes are the fundamental unit of a neural network: each receives an input or a weighted sum of inputs, adds a bias, and “activates” according to an activation function (e.g. softmax, sigmoid, rectified linear unit). Nodes are arranged in layers: an input layer, an output layer, and one or more hidden layers. Neural networks are trained via forward propagation (inputs flow through the nodes towards the output layer, which makes a prediction) and backpropagation (adjusting node weights and biases using gradient descent). The interconnectivity of nodes and the nonlinear activation functions allow the network to capture complex patterns underlying the predictive signals in a dataset. A multi-layer perceptron with one hidden layer is an example of a simple feedforward neural network.
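
A minimal sketch of a multi-layer perceptron with one hidden layer, assuming scikit-learn and its bundled breast cancer dataset; the layer size and activation shown are illustrative:

```python
# Minimal multi-layer perceptron sketch using scikit-learn (assumed available).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 32 ReLU nodes; weights and biases are learned by
# backpropagation with a gradient-based optimizer (Adam by default).
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), activation="relu", max_iter=1000, random_state=0),
)
mlp.fit(X_train, y_train)
print("accuracy:", mlp.score(X_test, y_test))
```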

Strengths: Can be engineered to handle complex datatypes.

Limitations: The risk of overfitting scales with the number of hyperparameters being adjusted.

Support Vector Machine

Represents observations as points in n-dimensional space, and draws a boundary (classification) or fits a function (regression), i.e. a “hyperplane”. Observations closest to the hyperplane are called “support vectors”. SVM seeks to maximize the distance between support vectors of different classes (the “margin”). A non-linear boundary can be drawn by applying the “kernel trick”, which implicitly transforms the datapoints into a higher-dimensional space.
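
A minimal sketch, assuming scikit-learn and its bundled breast cancer dataset; the radial basis function kernel and regularization strength C are illustrative choices:

```python
# Minimal support vector machine sketch using scikit-learn (assumed available).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A radial basis function kernel implicitly maps observations into a
# higher-dimensional space (the "kernel trick"); C trades off margin width
# against misclassified training points.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("accuracy:", svm.score(X_test, y_test))
```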

Strengths: Often performs competitively with other models, if the kernel is selected in accordance with the data structure.

Limitations: SVMs are often outperformed by other models, such as random forest.