Tree-Based Methods

Classification and regression trees can be generated from multivariable data sets using recursive partitioning.

We can use cross-validation to assess the predictive accuracy of tree-based models.
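A minimal sketch of such a cross-validation in Python, assuming scikit-learn is available and using synthetic data in place of a real data set:

    # Minimal sketch: 5-fold cross-validation of a regression tree.
    # Assumes scikit-learn; the data here are synthetic, for illustration only.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 3))          # three explanatory variables
    y = np.sin(X[:, 0]) + 0.5 * (X[:, 1] > 5) + rng.normal(0, 0.2, 200)

    tree = DecisionTreeRegressor(max_depth=3)
    scores = cross_val_score(tree, X, y, cv=5, scoring="neg_mean_squared_error")
    print("CV estimate of test MSE:", -scores.mean())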

Non-Linear Regression

Polynomial regression adds extra predictors by raising the original predictors to a power, providing a simple way to obtain a non-linear fit to the data.
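A minimal sketch in Python (numpy only, synthetic data), fitting a degree-3 polynomial by least squares:

    # Minimal sketch: polynomial regression by adding powers of x as extra predictors.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-3, 3, 100)
    y = 1 + 2 * x - 0.5 * x**2 + rng.normal(0, 0.5, 100)

    # Design matrix with columns 1, x, x^2, x^3 (degree-3 polynomial fit).
    X = np.vander(x, N=4, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("fitted coefficients:", beta)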

Step functions (broken-stick regression) cut the range of a variable into K distinct regions and fit a piecewise constant function or regression line within each region.
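A minimal sketch of a piecewise constant (step-function) fit with K = 4 regions, in Python with numpy and synthetic data:

    # Minimal sketch: step-function fit; each region gets its own constant (the mean of y there).
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, 200)
    y = np.sin(x) + rng.normal(0, 0.3, 200)

    cuts = np.quantile(x, [0.25, 0.5, 0.75])   # 3 cut points -> 4 regions
    region = np.digitize(x, cuts)              # region index 0..3 for each observation
    fitted = np.array([y[region == k].mean() for k in range(4)])[region]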

Regression splines and smoothing splines extend the two approaches above. The range of X is divided into K distinct regions, and within each region a polynomial is fit to the data, constrained so that the pieces join smoothly at the region boundaries (knots). With enough divisions this can produce an extremely flexible fit.
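A minimal sketch of a regression spline fit, assuming scikit-learn (version 1.0 or later for SplineTransformer) and synthetic data:

    # Minimal sketch: cubic regression spline fit via a B-spline basis plus linear regression.
    import numpy as np
    from sklearn.preprocessing import SplineTransformer
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(x[:, 0]) + rng.normal(0, 0.3, 200)

    # 5 knots, degree-3 pieces joined smoothly at the knots.
    spline_fit = make_pipeline(SplineTransformer(n_knots=5, degree=3), LinearRegression())
    spline_fit.fit(x, y)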

Classification and Regression Trees

Classification and Regression Trees (CART) are a computer intensive alternative to fitting classical regression models. CART works very well for highly non-linear problems.

A regression tree is basically just a multiple linear regression model with indicator variables that represent the branches of the tree.

Regression Tree

[Figure: example of a regression tree]

Note: Number of leaves = complexity

  1. We divide the predictor space into J distinct and non-overlapping regions R_1, R_2, ..., R_J.
  2. The fitted value in a region R_j is simply the mean of the response values in this region:

    \hat{y}_{R_j} = \frac{1}{n_j} \sum_{i : x_i \in R_j} y_i

    where n_j is the number of training observations falling in R_j.

Goal: The regions are identified to minimize

    \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \bar{y}_{R_j})^2

where R_j is the rectangle corresponding to the j-th terminal node and \bar{y}_{R_j} is the mean of the training observations in R_j.
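A minimal sketch of this objective in Python (numpy only), for an arbitrary assignment of observations to regions:

    # Minimal sketch: the quantity being minimized, computed for a given
    # assignment of observations to regions (synthetic data).
    import numpy as np

    rng = np.random.default_rng(4)
    y = rng.normal(0, 1, 100)
    region = rng.integers(0, 3, 100)     # pretend each point falls in one of J = 3 regions

    rss = sum(((y[region == j] - y[region == j].mean()) ** 2).sum()
              for j in np.unique(region))
    print("sum of within-region squared deviations:", rss)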

A complete search over all possible partitions is not feasible, so we take a top-down, greedy approach known as recursive binary splitting.
Top-down: start from the top of the tree.
Greedy: take the best split at each node of the tree (a local rather than global optimum).

Recursive Binary Splitting

Input: "explanatory" variables and outcome Y

For each variable X_g:

Without splitting on X_g, the outcome Y has a certain variability,

    RSS = \sum_i (y_i - \bar{y})^2

Splitting the data by X_g < s, the new RSS is

    RSS(g, s) = \sum_{i : x_{ig} < s} (y_i - \bar{y}_L)^2 + \sum_{i : x_{ig} \ge s} (y_i - \bar{y}_R)^2

where \bar{y}_L and \bar{y}_R are the means of Y in the two resulting groups. The split (variable and cut point) giving the largest reduction in RSS is chosen, and the procedure is then repeated within each of the two resulting regions.
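A minimal sketch of this split search for a single variable, in Python with numpy:

    # Minimal sketch of the split search for one variable X_g:
    # scan candidate cut points s and keep the one with the smallest new RSS.
    import numpy as np

    def best_split_rss(xg, y):
        """Return (best cut point, RSS after the best split) for one variable."""
        best_s, best_rss = None, np.inf
        for s in np.unique(xg)[1:]:                  # candidate cut points
            left, right = y[xg < s], y[xg >= s]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_s, best_rss = s, rss
        return best_s, best_rss

    # Recursive binary splitting applies this to every variable at every node,
    # picks the variable/cut with the largest RSS reduction, and recurses.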

Residual Mean Deviance

The residual mean deviance is the residual sum of squares divided by the residual degrees of freedom:

    \frac{1}{n - |T_0|} \sum_{j} \sum_{i \in R_j} (y_i - \bar{y}_{R_j})^2

T_0 = tree; |T_0| = number of leaves; \bar{y}_{R_j} = mean in region R_j
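A minimal sketch in Python (numpy only), taking as input the leaf that each training observation falls into:

    # Minimal sketch: residual mean deviance of a regression tree.
    import numpy as np

    def residual_mean_deviance(y, leaf):
        """leaf[i] identifies the terminal node containing observation i."""
        leaves = np.unique(leaf)
        deviance = sum(((y[leaf == j] - y[leaf == j].mean()) ** 2).sum() for j in leaves)
        return deviance / (len(y) - len(leaves))     # deviance / (n - |T0|)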

Pruning

Build a large tree and fold terminal branches back. This can be based on a maximum number of leaves, or use cost-complexity pruning.

Typically the full tree overfits the data (too many splits). A smaller tree with fewer splits might lead to better interpretation at the cost of higher residual deviance.

A possible alternative: build the tree only so long as the decrease in the RSS due to each split exceeds some threshold. This strategy may terminate too early.
Cost-complexity pruning instead minimizes a penalized RSS. For a given alpha, search for the subtree T nested in T_0 that minimizes

    \sum_{j=1}^{|T|} \sum_{i \in R_j} (y_i - \bar{y}_{R_j})^2 + \alpha |T|

where alpha is the cost per leaf and |T| is the number of leaves of T. Use a heuristic search (typically cross-validation) to choose alpha.
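A minimal sketch of cost-complexity pruning, assuming scikit-learn, which exposes alpha through the ccp_alpha parameter of its tree estimators:

    # Minimal sketch: cost-complexity pruning path and pruned trees for a few alphas.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(5)
    X = rng.uniform(0, 10, size=(300, 2))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 300)

    full_tree = DecisionTreeRegressor()
    path = full_tree.cost_complexity_pruning_path(X, y)   # candidate alphas for nested subtrees
    for alpha in path.ccp_alphas[::10]:
        pruned = DecisionTreeRegressor(ccp_alpha=alpha).fit(X, y)
        print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")

    # In practice alpha would be chosen by cross-validation over these candidates.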

CART for Prediction

To assess how accurately a regression tree predicts Y in new patients, evaluate its predictive accuracy on data not used for fitting, for example by splitting the data into a training and a testing set.
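A minimal sketch of a training/testing split, assuming scikit-learn and using synthetic data in place of patient data:

    # Minimal sketch: fit on a training set, evaluate mean squared error on a test set.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(6)
    X = rng.uniform(0, 10, size=(400, 3))
    y = np.sin(X[:, 0]) + X[:, 1] + rng.normal(0, 0.3, 400)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    tree = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)
    print("test MSE:", mean_squared_error(y_test, tree.predict(X_test)))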

Classification Tree

[Figure: example of a classification tree]

Input: Explanatory variables and a class label for different samples

The objective is to choose the tree which maximizes the classification accuracy

The procedure is the same as for a regression tree, with two differences:

  1. The fitted value in a region is the most common class among the training observations in that region, rather than their mean.
  2. RSS is replaced by a measure of node impurity, such as the Shannon entropy or Gini's impurity index defined below.

Shannon Entropy

Shannon entropy is the most common measure of the uncertainty (lack of information content) in a binary variable:

    H(p) = -p \log_2(p) - (1 - p) \log_2(1 - p)

where p is the proportion of observations in one of the two classes.
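A minimal sketch in Python:

    # Minimal sketch: Shannon entropy of a binary variable with class proportion p.
    import numpy as np

    def entropy(p):
        if p in (0.0, 1.0):          # a pure node carries no uncertainty
            return 0.0
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    print(entropy(0.5))   # 1.0, maximal uncertainty
    print(entropy(0.9))   # about 0.47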

Gini's Impurity Index

Gini's impurity index is an alternative measure of the impurity of a binary variable:

    G(p) = 2p(1 - p) = 1 - p^2 - (1 - p)^2
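A minimal sketch in Python:

    # Minimal sketch: Gini's impurity index for a binary variable.
    def gini(p):
        return 2 * p * (1 - p)      # equivalently 1 - p**2 - (1 - p)**2

    print(gini(0.5))   # 0.5, maximal impurity
    print(gini(0.9))   # 0.18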


Best Split

For each variable, scan over the candidate split points: for each candidate cut, compute the impurity of the two resulting child nodes and the corresponding gain of information (the decrease in impurity relative to the parent node).

Do this linear search for each variable and identify the optimal split and the maximum gain of information.

Split the node by using the variable that gives the largest information gain with a binary split.
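A minimal sketch of this search for a single variable, in Python with numpy and 0/1 class labels:

    # Minimal sketch: best split for classification, scanning candidate cuts and keeping
    # the one with the largest information gain (parent entropy minus weighted child entropy).
    import numpy as np

    def entropy_of(labels):
        p = labels.mean()            # proportion of the positive class (labels are 0/1)
        if p in (0.0, 1.0):
            return 0.0
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    def best_split_gain(xg, labels):
        parent = entropy_of(labels)
        best_s, best_gain = None, 0.0
        for s in np.unique(xg)[1:]:
            left, right = labels[xg < s], labels[xg >= s]
            child = (len(left) * entropy_of(left) + len(right) * entropy_of(right)) / len(labels)
            if parent - child > best_gain:
                best_s, best_gain = s, parent - child
        return best_s, best_gain

    # Repeat for every variable and split the node using the variable with the largest gain.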

Typically the Gini index is used for growing a tree, and the misclassification error is a good metric for pruning it.

Pros and Cons of Trees

Trees are easy to explain and to display graphically, handle non-linear relationships, and mix continuous and categorical predictors naturally. On the other hand, a single tree generally has lower predictive accuracy than other regression and classification approaches and can be unstable: small changes in the data may produce a very different tree.

Bagging, Boosting, and Random Forest

Bagging and Random Forest improve prediction at the cost of interpretability.

One can compute an importance value for each covariate as the average improvement in RSS (or entropy, or Gini index) due to splits on that covariate across the trees, and rank the covariates by this importance value.

The larger the average change in deviance, the more important the variable.
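A minimal sketch, assuming scikit-learn, whose feature_importances_ attribute averages the impurity improvement attributable to each covariate across the trees:

    # Minimal sketch: impurity-based variable importance from a random forest (synthetic data).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(7)
    X = rng.normal(size=(300, 4))
    y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 300) > 0).astype(int)

    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    ranking = np.argsort(forest.feature_importances_)[::-1]   # most important first
    for idx in ranking:
        print(f"X{idx}: importance {forest.feature_importances_[idx]:.3f}")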

Classification Rules

The data is split into training and testing sets. We use the training set to build the classification rule, and the test set to evaluate how the classification rule labels new cases whose true labels are known.

Accuracy is the rate of correctly classified labels in the test set:

    Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Sensitivity measures true positive detection and shows how well the rule identifies positive outcomes:

    Sensitivity = \frac{TP}{TP + FN}

Specificity measures true negative detection and shows how well the rule identifies negative outcomes:

    Specificity = \frac{TN}{TN + FP}
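A minimal sketch computing all three metrics from 0/1 labels in Python with numpy:

    # Minimal sketch: accuracy, sensitivity and specificity from true/false positive and negative counts.
    import numpy as np

    def classification_metrics(y_true, y_pred):
        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        return {
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        }

    print(classification_metrics(np.array([1, 1, 0, 0, 1]), np.array([1, 0, 0, 1, 1])))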

