Advanced Machine Learning

Recall that in ordinary multiple linear regression, we have a set of p predictor variables used to predict some response variable Y by fitting a model of the form:

$$ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon $$

where $\beta_n$ represents the average effect on Y of a one-unit increase in the nth predictor $X_n$, and $\epsilon$ is the error term. The beta coefficients are chosen by the least squares method, which minimizes the residual sum of squares (RSS): the sum of squared differences between observed and predicted outcome values.

$$ RSS = \sum_{i=1}^{n} (y_i - \hat y_i)^2 $$
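As a concrete illustration, base R's lm() fits such a model by least squares; a minimal sketch on the swiss dataset, which the lasso example below also uses:

# Ordinary least squares fit on the swiss data
data(swiss)
ols <- lm(Fertility ~ ., data = swiss)
coef(ols)              # the fitted beta coefficients
sum(residuals(ols)^2)  # the RSS that least squares minimizes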

Least Absolute Shrinkage and Selection Operator (LASSO)

When predictor variables are highly correlated, coefficient estimates can have large variances, leading to poor predictive accuracy.

Lasso regression is a regularization technique for linear regression models. Regularization is a statistical method for reducing errors caused by overfitting the training data. Instead of minimizing the RSS alone, lasso minimizes:

Total Cost = Measure of Fit [RSS] + Measure of magnitude of coefficients

$$ Cost = RSS + \lambda \sum_{n=1}^{p} |\beta_n|, \quad \lambda \ge 0 $$

Lambda is the tuning parameter, or shrinkage penalty, and controls the balance between fit and sparsity. When λ = 0 the penalty has no effect (the fit is ordinary least squares), and as λ approaches infinity the shrinkage penalty becomes increasingly dominant. This is what makes lasso perform feature selection:

  • Start with full model (all possible features)
  • "Shrink" some coefficients to 0 (exactly)
  • Non-zero coefficients indicate "selected" features (see the sketch after this list)
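A minimal sketch of this behaviour with glmnet (the same package used in the worked example below); the two lambda values are arbitrary choices for illustration:

# Lasso fits at a small and a large penalty on the swiss data
library(glmnet)
data(swiss)
x <- model.matrix(Fertility ~ ., swiss)[, -1]
y <- swiss$Fertility
fit <- glmnet(x, y, alpha = 1)  # alpha = 1 selects the lasso penalty
coef(fit, s = 0.01)  # small lambda: most coefficients remain non-zero
coef(fit, s = 10)    # large lambda: several coefficients are exactly 0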

The idea is to accept a small amount of bias in exchange for a larger reduction in variance, leading to a smaller overall mean squared error (MSE).

Note: This is very similar to ridge regression, except that ridge uses a squared penalty (λ Σ β_n²), which shrinks coefficients toward 0 but never sets them exactly to 0.
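In glmnet, ridge is selected with alpha = 0; a minimal sketch reusing x and y from above (the lambda value is again an arbitrary choice):

# Ridge fit: coefficients shrink, but none are exactly 0
ridge_fit <- glmnet(x, y, alpha = 0, lambda = 10)
coef(ridge_fit)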

Lasso tends to perform better when only a small number of predictor variables are significant, and ridge when all coefficients have roughly equal importance. To determine which model is better, use k-fold cross-validation.
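A minimal sketch of such a comparison with cv.glmnet, again reusing x and y from above (5 folds is an arbitrary choice):

# Compare lasso (alpha = 1) and ridge (alpha = 0) by cross-validated error
cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 5)
cv_ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 5)
min(cv_lasso$cvm)  # lowest mean CV error achieved by lasso
min(cv_ridge$cvm)  # lowest mean CV error achieved by ridge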

Process
  1. Calculate the correlation matrix and variance inflation factor (VIF) values for all predictor variables (see the sketch after this list)
  2. Fit the lasso regression model and choose a value for lambda
  3. Compare to another regression model by way of k-fold cross-validation
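Step 1 is not shown in the worked example below; a minimal sketch, assuming the car package for its vif() function:

# Step 1: check multicollinearity among the predictors
library(car)
data(swiss)
cor(swiss)                            # pairwise correlation matrix
vif(lm(Fertility ~ ., data = swiss))  # VIF above roughly 5-10 suggests problematic collinearity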
Code
# https://www.r-bloggers.com/2020/05/quick-tutorial-on-lasso-regression-with-example/

library(glmnet)
# Loading the data
data(swiss)
x_vars <- model.matrix(Fertility~. , swiss)[,-1]
y_var <- swiss$Fertility
lambda_seq <- 10^seq(2, -2, by = -.1)
# Splitting the data into test and train
set.seed(86)
train = sample(1:nrow(x_vars), nrow(x_vars)/2)
x_test = (-train)  # negative indices select the held-out rows
y_test = y_var[x_test]
cv_output <- cv.glmnet(x_vars[train,], y_var[train],
                       alpha = 1, lambda = lambda_seq, 
                       nfolds = 5)
# identifying the best lambda
best_lam <- cv_output$lambda.min
best_lam


# Output
# [1] 0.3981072


# Rebuilding the model with the best lambda value identified
lasso_best <- glmnet(x_vars[train,], y_var[train], alpha = 1, lambda = best_lam)
pred <- predict(lasso_best, s = best_lam, newx = x_vars[x_test,])
# Combine predicted and actual values
final <- cbind(Actual = y_var[x_test], Pred = pred)
# Checking the first six obs
head(final)


# Output
#            Actual    Pred
#Courtelary   80.2 66.54744
#Delemont     83.1 76.92662
#Franches-Mnt 92.5 81.01839
#Moutier      85.8 72.23535
#Neuveville   76.9 61.02462
#Broye        83.8 79.25439


# R-squared
actual <- y_var[x_test]
preds <- as.vector(pred)
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
rsq


# Inspecting beta coefficients: note Examination is exactly 0, i.e. dropped by the lasso
coef(lasso_best)


# Output
#6 x 1 sparse Matrix of class "dgCMatrix"
#                         s0
#(Intercept)      66.5365304
#Agriculture      -0.0489183
#Examination       .        
#Education        -0.9523625
#Catholic          0.1188127
#Infant.Mortality  0.4994369

XGBoost

XGBoost is a software library installed independently of whatever language is used to interact with it. Most popular languages have an interface, and there is also a CLI version.
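As an illustration, a minimal sketch using the classic interface of the xgboost R package (the interface differs across package versions, and the swiss data and all parameter values here are arbitrary choices, not recommendations):

# Gradient-boosted regression trees on the swiss data
library(xgboost)
data(swiss)
x <- as.matrix(swiss[, -1])  # predictors (Fertility is column 1)
y <- swiss$Fertility         # response
model <- xgboost(data = x, label = y, nrounds = 20,
                 objective = "reg:squarederror", verbose = 0)
head(predict(model, x))      # fitted values for the first rows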

SVM