
Advanced Machine Learning

Recall that in ordinary multiple linear regression, we have a set of p predictor variables used to model some response variable Y, fitting a model of the form:

$$ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon $$

Here beta_n represents the average effect on Y of a one-unit increase in Xn (the nth predictor variable), and epsilon is the error term. The values of the beta coefficients are chosen using the least squares method, which minimizes the residual sum of squares (RSS), i.e. the sum of squared differences between the observed and predicted outcome values.

$$ RSS = \sum_{i=1}^{n} (y_i - \hat y_i)^2 $$
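
As a minimal sketch in R (using the built-in mtcars data purely for illustration), a least squares fit and its RSS can be computed as:

# Least squares fit of mpg on two predictors from the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)                    # the estimated beta coefficients
rss <- sum(residuals(fit)^2) # residual sum of squares
rss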

Least Absolute Shrinkage and Selection Operator (LASSO)

When predictor variables are highly correlated, the coefficient estimates can have large variances, leading to poor predictive accuracy.

Lasso regression is a regularization technique for linear regression models. Regularization is a statistical method for reducing errors caused by overfitting to the training data. Instead of minimizing RSS alone, lasso minimizes:

Total Cost = Measure of Fit [RSS] + Measure of magnitude of coefficients

$$ Cost = RSS + \lambda \sum_{n=1}^{p} | \beta_n |, \quad \lambda \ge 0 $$

Lambda is the 'tuning' parameter, or shrinkage penalty, and controls the balance between fit and sparsity. When lambda is 0 the penalty has no effect (the fit is ordinary least squares), and as lambda grows toward infinity the shrinkage penalty becomes more influential.

  • Start with full model (all possible features)
  • "Shrink" some coefficients to 0 (exactly)
  • Non-zero coefficients indicate "selected" features

The idea is to accept a small amount of bias in exchange for a larger reduction in variance, leading to a smaller mean squared error (MSE).
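
As a quick illustration of the shrinkage behaviour, a sketch using glmnet on the built-in swiss data (the same data used in the Code section below):

library(glmnet)
x <- model.matrix(Fertility ~ ., swiss)[, -1]
y <- swiss$Fertility
fit <- glmnet(x, y, alpha = 1)   # alpha = 1 selects the lasso penalty
plot(fit, xvar = "lambda")       # coefficient paths: each line reaches exactly 0 as lambda grows
coef(fit, s = 1)                 # coefficients at one particular lambda value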

Note: This is very similar to ridge regression, except ridge uses a squared (L2) penalty, so coefficients are shrunk toward 0 but never set exactly to 0.

Lasso tends to perform better when only a small number of predictor variables are significant, and ridge when all coefficients have roughly equal importance. To determine which model is better, use k-fold cross-validation.
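
A minimal sketch of such a comparison with cv.glmnet, again on the swiss data used in the Code section below (alpha = 1 is lasso, alpha = 0 is ridge; the fold count and seed are arbitrary choices):

library(glmnet)
x <- model.matrix(Fertility ~ ., swiss)[, -1]
y <- swiss$Fertility
set.seed(42)
cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 5)  # lasso
cv_ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 5)  # ridge
min(cv_lasso$cvm)   # lowest mean cross-validated error, lasso
min(cv_ridge$cvm)   # lowest mean cross-validated error, ridge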

Process
  1. Calculate the correlation matrix and variance inflation factor (VIF) values for all predictor variables (see the sketch after this list)
  2. Fit the lasso regression model and choose a value for lambda
  3. Compare to another regression model by way of k-fold cross-validation
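
A sketch of step 1, assuming the car package is available for vif(); the Code section below then covers steps 2 and 3:

library(car)                          # assumed package providing vif()
data(swiss)
round(cor(swiss[, -1]), 2)            # correlation matrix of the predictors
vif(lm(Fertility ~ ., data = swiss))  # variance inflation factors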
Code
# https://www.r-bloggers.com/2020/05/quick-tutorial-on-lasso-regression-with-example/

library(glmnet)
# Loading the data
data(swiss)
x_vars <- model.matrix(Fertility~. , swiss)[,-1]
y_var <- swiss$Fertility
lambda_seq <- 10^seq(2, -2, by = -.1)
# Splitting the data into test and train
set.seed(86)
train = sample(1:nrow(x_vars), nrow(x_vars)/2)
x_test = (-train)
y_test = y_var[x_test]
cv_output <- cv.glmnet(x_vars[train,], y_var[train],
                       alpha = 1, lambda = lambda_seq, 
                       nfolds = 5)
# identifying best lambda
best_lam <- cv_output$lambda.min
best_lam


# Output
# [1] 0.3981072


# Rebuilding the model with the best lambda value identified
lasso_best <- glmnet(x_vars[train,], y_var[train], alpha = 1, lambda = best_lam)
pred <- predict(lasso_best, s = best_lam, newx = x_vars[x_test,])
# Combine predicted and actual values
final <- cbind(y_var[x_test], pred)
# Checking the first six obs
head(final)


# Output
#            Actual    Pred
#Courtelary   80.2 66.54744
#Delemont     83.1 76.92662
#Franches-Mnt 92.5 81.01839
#Moutier      85.8 72.23535
#Neuveville   76.9 61.02462
#Broye        83.8 79.25439


# R-squared on the test set
actual <- y_test
preds <- as.vector(pred)
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
rsq


# Inspecting beta coefficients
coef(lasso_best)


# Output
#6 x 1 sparse Matrix of class "dgCMatrix"
#                         s0
#(Intercept)      66.5365304
#Agriculture      -0.0489183
#Examination       .        
#Education        -0.9523625
#Catholic          0.1188127
#Infant.Mortality  0.4994369

XGBoost

XGBoost is a software library that is installed independently of whatever language is used to interact with it. Most popular languages have an interface to it, and there is also a CLI version.

The library is focused on computational speed and model performance, and there are a number of advanced features. As the name suggests it uses a model optimization technique called gradient boosting, or gradient tree boosting. Let's quickly review.

Gradient Boosting

The original paper: Greedy Function Approximation: A Gradient Boosting Machine [PDF], 1999

The objective is to minimize the loss of the model by adding 'weak learners' (usually decision trees) using a gradient-descent-like procedure. It is considered a stage-wise additive model, and it was a landmark in allowing arbitrary differentiable loss functions, which meant the technique could be applied beyond binary classification to regression, multi-class classification, and more.

How it works:

1. A loss function to be optimized

This depends on the type of problem being solved. In the case of regression one may use squared error, and a classification problem might use logarithmic loss.

A benefit of gradient boosting is that a new boosting algorithm does not need to be derived for each loss function; the framework is generic enough that any differentiable loss function can be used.
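
For example, with squared error loss the negative gradient that each new tree is fit to is simply the ordinary residual:

$$ L(y, F(x)) = \tfrac{1}{2} (y - F(x))^2 \quad \Rightarrow \quad - \frac{\partial L}{\partial F(x)} = y - F(x) $$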

2. A decision tree to make predictions

Specifically, regression trees that output real values. This allows the outputs of subsequent trees to be added together to 'correct' the residuals of the predictions made so far.

Trees are constructed in a greedy manner, choosing the best split points based on purity scores (e.g. the Gini index) or on how much they reduce the loss. It is common to constrain the trees in specific ways, such as a maximum number of nodes, layers, splits, or leaf nodes.

3. An additive model to add weak learners to minimize the loss function

A gradient descent procedure is used to minimize the loss when adding trees.

Traditionally, gradient descent is used to minimize a loss function with respect to a set of parameters, such as the weights in a neural network: the error or loss is calculated and the weights are updated to reduce that error.

Instead of parameters we have decision tree regression models. After calculating the loss of the current model, we add the tree that reduces the loss (i.e. we follow the gradient). This is done by parameterizing the tree and then modifying those parameters to move in the right direction, reducing the residual loss. The output of the new tree is then added to the existing sequence of trees to correct the model's final output.

A fixed number of trees is added, or training stops once the loss reaches an acceptable level or no longer improves on a validation set.
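
A minimal sketch of this training loop for squared error loss, using shallow rpart regression trees as the weak learners (the toy data, tree depth, learning rate, and number of rounds are arbitrary choices for illustration):

library(rpart)

# Toy regression data
set.seed(1)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
y <- sin(2 * pi * d$x1) + 0.5 * d$x2 + rnorm(n, sd = 0.1)

n_trees <- 50               # number of boosting rounds
nu      <- 0.1              # learning rate (shrinkage)
F_hat   <- rep(mean(y), n)  # start from a constant prediction
trees   <- vector("list", n_trees)

for (m in 1:n_trees) {
  d$r <- y - F_hat          # pseudo-residuals = negative gradient of squared error
  trees[[m]] <- rpart(r ~ x1 + x2, data = d,
                      control = rpart.control(maxdepth = 2, cp = 0))
  F_hat <- F_hat + nu * predict(trees[[m]], d)  # additive, shrunken update
}

mean((y - F_hat)^2)         # training MSE after boosting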

Common Enhancements to Gradient Boosting
  1. Tree Constraints - As mentioned above, constraints on number of leaves, nodes, etc.
  2. Shrinkage - Often called weighted updates or the learning rate; the predictions of the trees are added together sequentially, and the contribution of each new tree can be down-weighted to 'slow down' the learning algorithm.
  3. Random sampling - Instead of using the full dataset, at each iteration a sub-sample of the training data is used (without replacement) to fit the base learner.
  4. Penalized learning - Regularization terms can be added to the leaf weights of the trees, which smooths the final learned weights and helps avoid over-fitting. By design this tends to select a simpler, more predictive model.
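
In the xgboost R package these four enhancements correspond to tuning parameters. A minimal sketch (the parameter values are arbitrary illustrations, not recommendations):

library(xgboost)

data(swiss)
x <- as.matrix(swiss[, -1])     # predictors
y <- swiss$Fertility            # response

dtrain <- xgb.DMatrix(data = x, label = y)
params <- list(objective = "reg:squarederror",
               max_depth = 3,   # 1. tree constraints
               eta       = 0.1, # 2. shrinkage / learning rate
               subsample = 0.8, # 3. random sampling of rows each iteration
               lambda    = 1)   # 4. penalized learning: L2 penalty on leaf weights
fit <- xgb.train(params = params, data = dtrain, nrounds = 100)
head(predict(fit, dtrain))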

 

SVM