Advanced Machine Learning

Recall that in an ordinary multiple linear regression, we have a set of p predictor variables (X1, ..., Xp) used to predict a response variable (Y) by fitting a model of the form:

$$ Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon $$

Where each coefficient beta_n represents the average effect on Y of a one-unit increase in X_n (the nth predictor variable), holding the other predictors fixed, and epsilon is the error term. The values of these beta coefficients are chosen using the least squares method, which minimizes the residual sum of squares (RSS), the sum of the squared differences between observed and fitted response values.

$$ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
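
As a concrete illustration, here is a minimal sketch of fitting such a model by least squares with NumPy. The data are synthetic and the coefficient values are invented purely for demonstration.

```python
# A minimal sketch of ordinary least squares with NumPy; data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])          # invented "true" coefficients
y = 1.0 + X @ beta_true + rng.normal(scale=0.3, size=n)

# Add an intercept column and solve for the coefficients that minimize RSS.
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ beta_hat
rss = np.sum(residuals ** 2)
print("estimated coefficients (intercept first):", beta_hat)
print("RSS:", rss)
```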

Least Absolute Shrinkage and Selection Operator (LASSO)

When predictor variables are highly correlated, the least-squares coefficient estimates can have large variances, leading to poor predictive accuracy.

Lasso regression is a regularization technique for linear regression models. Regularization is a statistical method for reducing errors caused by overfitting the training data. Instead of minimizing RSS alone, lasso minimizes:

Total Cost = Measure of Fit [RSS] + Measure of magnitude of coefficients

$$ Cost = RSS + \lambda \sum_{n=1}^{p} |\beta_n|, \quad \lambda \ge 0 $$

Lambda is the 'tuning' parameter, or shrinkage penalty, and controls the balance between fit and sparsity. When lambda is 0 the penalty has no effect and lasso reduces to ordinary least squares; as lambda approaches infinity the shrinkage penalty dominates and more coefficients are driven to exactly 0 (see the sketch after the list below).

The process:

  • Start with full model (all possible features)
  • "Shrink" some coefficients to 0 (exactly)
  • Non-zero coefficients indicate "selected" features
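
The following sketch shows this shrink-to-zero behavior, assuming scikit-learn is available; note that scikit-learn's Lasso calls the penalty weight alpha rather than lambda, and the data here are synthetic.

```python
# A sketch of lasso's shrink-to-zero selection behavior with scikit-learn.
# scikit-learn's "alpha" plays the role of lambda above; data is synthetic.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Only the first two predictors actually matter.
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    selected = np.flatnonzero(model.coef_)
    print(f"alpha={alpha}: {len(selected)} non-zero coefficients, indices {selected}")
# As alpha (lambda) grows, more coefficients are set exactly to 0,
# leaving only the "selected" features.
```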

The idea is to accept a small increase in bias in exchange for a larger reduction in variance, leading to a smaller mean squared error (MSE) overall.

Note: This is very similar to ridge regression, which penalizes the sum of squared coefficients instead; ridge shrinks coefficients toward 0 but never sets them exactly to 0, so it does not perform feature selection.

Lasso tends to perform better when only a small number of predictor variables have substantial effects, and ridge when all predictors contribute roughly equally.

To determine which model is better for a given dataset, use k-fold cross-validation to compare their estimated test errors.
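
A minimal sketch of such a comparison, again assuming scikit-learn and synthetic data; the alpha values are placeholders and would normally be tuned as well (e.g., with LassoCV/RidgeCV).

```python
# Comparing lasso and ridge with 5-fold cross-validation (scikit-learn);
# lower mean squared error is better. Data and alpha values are illustrative.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta_true = np.zeros(10)
beta_true[:2] = [3.0, -2.0]   # sparse truth favors lasso here
y = X @ beta_true + rng.normal(scale=0.5, size=200)

for name, model in [("lasso", Lasso(alpha=0.1)), ("ridge", Ridge(alpha=0.1))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: mean CV MSE = {-scores.mean():.4f}")
```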

xgboost

SVM