class: center, middle, inverse, title-slide

# Lecture 6 - Tree-based models
## Decision tree and Random forest
### Issac Lee
### 2021-04-22

---
class: center, middle

# Sungkyunkwan University

![](https://upload.wikimedia.org/wikipedia/en/thumb/4/40/Sungkyunkwan_University_seal.svg/225px-Sungkyunkwan_University_seal.svg.png)

## Actuarial Science

---
# Introduction to CART

Classification And Regression Tree

* One of the oldest and simplest ML algorithms.
* Idea: split the feature space into **exhaustive and mutually exclusive** segments according to some rules.

---
# Advantages vs. Disadvantages

.pull-left[
Advantages

* Easy to explain to people.
* Some believe that decision trees mirror human decision making more closely than regression.
* Handles categorical data without dummy coding.
* Insensitive to monotone transformations of the inputs.
* Robust to outliers.
]

.pull-right[
Disadvantages

* Predictive accuracy is generally lower than that of other models.
* Trees can be very non-robust: small changes in the data can cause a large change in the final output.

from `ISLR`
]

---
# Simple tree solution

.pull-left[
Can we divide these points into two groups by lines?
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-1-1.png" width="80%" />
]

---
# Simple tree solution

.pull-left[
Can we divide these points into two groups by lines? `Yes`

The following two lines

* `\(x = 0.5\)`
* `\(y = 0.5\)`
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-2-1.png" width="80%" />
]

---
# Simple tree regression

.pull-left[
Can we predict this function by calculating the mean over some sections?
]

.pull-right[
![](lec6_files/figure-html/unnamed-chunk-3-1.png)
]

---
# Simple tree regression

.pull-left[
Can we predict this function by calculating the mean over some sections?

How about these four sections?

* `\(R_1: \{x| 0< x < 25\}\)`
* `\(R_2: \{x| 25 < x < 50\}\)`
* `\(R_3: \{x| 50 < x < 75\}\)`
* `\(R_4: \{x| 75 < x < 100\}\)`
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-4-1.png" width="80%" />
]

---
# Prediction via stratification of the feature space

* Divide the predictor space into `\(J\)` distinct and non-overlapping regions, `\(R_1, ..., R_J\)`.
* Goal: find `\(R_1, ..., R_J\)` that minimize the RSS

`$$\sum_{j=1}^{J}\sum_{i\in R_{j}}\left(y_{i}-\hat{y}_{R_{j}}\right)^{2}$$`

---
# Let's be `greedy`!

There are infinitely many possible partitions.

* top-down, greedy approach; a.k.a. recursive binary splitting.
* For the predictor `\(X_j\)`, define
  * `\(R_1(j, s) = \{X|X_j \leq s\}\)` and `\(R_2(j, s) = \{X|X_j > s\}\)`
* Find the values of `\(j\)` and `\(s\)` that minimize

`$$\sum_{i:x_{i}\in R_{1}\left(j,s\right)}\left(y_{i}-\hat{y}_{R_{1}}\right)^{2}+\sum_{i:x_{i}\in R_{2}\left(j,s\right)}\left(y_{i}-\hat{y}_{R_{2}}\right)^{2}$$`

---
class: middle, center

# Overfitting Problem

<img src="./overfit.png" width="30%" style="display: block; margin: auto;" />

---
# When there is noise...

.pull-left[
We have the data model

`$$y_i = x_i^2 + e_i$$`

where `\(e_i \sim \mathcal{N}(0, 10^2)\)`

Do we want this?

* No. These lines follow the individual points too closely.
* This model is `overfitted` to the data.
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-5-1.png" width="80%" />
]

---
# We need something like this.

.pull-left[
It captures the general trend of the data.

* Simpler than before.
* It also generalizes to `future data`.

A small code sketch follows on the next slide.
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-6-1.png" width="80%" />
]
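---
# Overfitting in code

A minimal sketch, not from the original slides, of how a fully grown tree chases the noise in `\(y_i = x_i^2 + e_i\)` while the default `rpart` settings give a small tree that captures the general trend; the sample size, the range of `x`, and all object names are assumptions.

```r
library(rpart)

# Simulated data following the model on the previous slides
set.seed(2021)
x <- runif(200, 0, 10)
y <- x^2 + rnorm(200, sd = 10)
dat <- data.frame(x = x, y = y)

# Fully grown tree: one observation per leaf is possible -> overfits
over_tree <- rpart(y ~ x, data = dat, method = "anova",
                   control = rpart.control(minbucket = 1, cp = 0))

# Default settings (minsplit = 20, cp = 0.01) grow a much smaller tree
small_tree <- rpart(y ~ x, data = dat, method = "anova")

# Compare the two fits on a grid of new x values
grid <- data.frame(x = seq(0, 10, length.out = 400))
plot(x, y, col = "grey50")
lines(grid$x, predict(over_tree,  newdata = grid), col = "red")
lines(grid$x, predict(small_tree, newdata = grid), col = "blue", lwd = 2)
```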
---
class: middle, center

# How to avoid overfitting?

## `Pruning`

<img src="./cut.svg" width="15%" style="display: block; margin: auto;" />

.footnote[
<div>Icons made by <a href="https://www.freepik.com" title="Freepik">Freepik</a> from <a href="https://www.flaticon.com/" title="Flaticon">www.flaticon.com</a></div>
]

---
# Build a tree with a threshold

1. Grow the tree only as long as the decrease in RSS from each split exceeds some threshold
  * results in smaller trees
  * but is short-sighted: a seemingly worthless split early on may be followed by a very good split later
2. Grow a very large tree `\(T_0\)` first, then cut (`prune`) branches back to obtain a subtree `\(T\)`
  * how to `prune`, though?

---
# Cost complexity pruning

For each value of the tuning parameter `\(\alpha\)`, there is a subtree `\(T \subset T_0\)` that makes

`$$\sum_{m=1}^{\left|T\right|}\sum_{i:x_{i}\in R_{m}}\left(y_{i}-\hat{y}_{R_{m}}\right)^{2}+\alpha\left|T\right|$$`

as small as possible.

* `\(|T|\)`: number of terminal nodes of the tree `\(T\)`
* `\(R_m\)`: rectangle corresponding to the `\(m\)`th terminal node
* `\(\hat{y}_{R_m}\)`: predicted response associated with `\(R_m\)`
* `\(\alpha\)` can be chosen using CV.

---
# Classification Trees

A classification tree predicts a qualitative response.

* predict the most commonly occurring class in each region
* Given a region with
  * Male: 50
  * Female: 100
* predict the label for that region as `Female`

---
# How to build?

* We don't have the concept of `\(RSS\)` here.
* A natural alternative to `\(RSS\)` is the `classification error rate`

`$$\frac{\#\text{ of samples not in the most common class}}{\#\text{ of total samples in the given region } R_m}$$`

more mathematically,

`$$E=1-\underset{k}{\max}(\hat{p}_{mk})$$`

But this is not sufficiently sensitive for tree-growing.

---
# Gini impurity

Measure of `total variance` across the `\(K\)` classes.

.pull-left[
`$$\begin{align*} G & =\sum_{k=1}^{K}\hat{p}_{k}\left(1-\hat{p}_{k}\right)\\ & =1-\sum_{k=1}^{K}\hat{p}_{k}^{2} \end{align*}$$`

* Plot shows the Gini impurity w.r.t. `p` when `\(K=2\)`.
* We can easily see that `\(G\)` takes on a small value if all `\(\hat{p}_k\)` are close to 0 or 1.
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-7-1.png" width="80%" />
]

---
# Example

* Gini impurity for these? (a worked R sketch follows a few slides later)

`$$G =1-\sum_{k=1}^{K}\hat{p}_{k}^{2}$$`

* 1, 1, 1, 0, 0, 0
* 1, 1, 1, 1, 1, 1
* 1, 1, 1, 1, 0, 0

---
# Gini index

Given `\(R(j, s)\)`, the Gini index of a split is the weighted sum of the Gini impurities of the resulting subsets. Which split is better?

<img src="./gini_index.png" width="100%" style="display: block; margin: auto;" />

---
# Entropy

Measure of `impurity` of each node in a DT.

.pull-left[
`$$E=-\sum_{k=1}^{K}\hat{p}_{k}\log\hat{p}_{k}$$`

* Plot shows the entropy w.r.t. `p` when `\(K=2\)`.
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-8-1.png" width="80%" />
]

---
# Example

* Entropy for these?

`$$E=-\sum_{k=1}^{K}\hat{p}_{k}\log\hat{p}_{k}$$`

* 1, 1, 1, 0, 0, 0
* 1, 1, 1, 1, 1, 1
* 1, 1, 1, 1, 0, 0

---
# Information Gain

Given `\(R(j, s)\)`, the information gain is the entropy of the parent node minus the weighted average of the entropies of the resulting subsets. Which split is better?

<img src="./gini_index.png" width="100%" style="display: block; margin: auto;" />

---
# Build a tree in classification

Find `\(R(j, s)\)` which

* `minimizes` the Gini index, or
* `maximizes` the information gain.

Repeat this process until

* the `max depth` is reached or a leaf has `100% purity`.
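---
# Example, worked in R

A minimal sketch, not from the original slides, that evaluates the two `Example` slides numerically; `gini()` and `entropy()` are hypothetical helper names, and the natural logarithm is assumed for the entropy.

```r
# Gini impurity and entropy for a vector of class labels
gini <- function(labels) {
  p <- prop.table(table(labels))   # class proportions p_k
  1 - sum(p^2)
}
entropy <- function(labels) {
  p <- prop.table(table(labels))   # empty classes are dropped, so log(0) never occurs
  -sum(p * log(p))
}

gini(c(1, 1, 1, 0, 0, 0))     # 0.5    (most impure case for K = 2)
gini(c(1, 1, 1, 1, 1, 1))     # 0      (pure node)
gini(c(1, 1, 1, 1, 0, 0))     # 0.444

entropy(c(1, 1, 1, 0, 0, 0))  # 0.693  (= log 2)
entropy(c(1, 1, 1, 1, 1, 1))  # 0
entropy(c(1, 1, 1, 1, 0, 0))  # 0.637
```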
---
# Control parameters

`minsplit`: minimum number of observations that must exist in a node for a split to be attempted

`minbucket`: minimum number of observations in any **terminal** node

`cp`: complexity parameter; a split that does not improve the overall fit by at least a factor of `cp` is not attempted

`maxdepth`: maximum depth of any node of the final tree

```r
library(ISLR)
library(tidyverse)
library(MASS)       # Boston data set (note: masks dplyr::select)
library(rpart)
library(rpart.plot)

Boston %>% head()

# Grow a deliberately large tree (cp = 0, minbucket = 1)
boston_tree <- rpart(medv ~ ., data = Boston, method = "anova",
                     control = rpart.control(minbucket = 1,
                                             cp = 0,
                                             maxdepth = 10))
rpart.plot(boston_tree)
printcp(boston_tree)

# Prune back using the cp value with the smallest cross-validated error
bestcp <- boston_tree$cptable[which.min(boston_tree$cptable[, "xerror"]), "CP"]
best_boston_tree <- prune(boston_tree, cp = bestcp)
rpart.plot(best_boston_tree)

# The same hyperparameters via the tidymodels (parsnip) interface;
# this only builds a model specification, fit() is still needed
library(parsnip)
decision_tree(
  mode            = "unknown",
  cost_complexity = NULL,
  tree_depth      = NULL,
  min_n           = NULL
)
```

---
# Bagging

* Bootstrapping the models!

![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Ensemble_Bagging.svg/660px-Ensemble_Bagging.svg.png)

Image from [wiki](https://en.wikipedia.org/wiki/Bootstrap_aggregating)

---
# Advantages

Reduces the variance of the prediction (the bias stays roughly the same).

1. Ensemble methods generally work best when the individual models are as independent from one another as possible.
  * Bootstrapping increases the independence of the models by sampling with replacement.
2. Training can be done in parallel -> gained a lot of popularity.

A hand-rolled sketch is given in the appendix slide at the end.

---
# Stacking

* Stacked generalization (1992)

![](https://miro.medium.com/max/703/0*GXMZ7SIXHyVzGCE_.)

---
class: middle, center, inverse

# Thanks!
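---
# Appendix: bagging by hand

A minimal backup-slide sketch, not from the original lecture, of bagging regression trees on the `Boston` data: fit one `rpart` tree per bootstrap sample and average the predictions. The number of bootstrap samples `B` and all object names are illustrative; in practice `randomForest` or `parsnip::rand_forest()` would be used instead.

```r
library(rpart)
library(MASS)   # Boston data

set.seed(2021)
B <- 100
n <- nrow(Boston)

# One tree per bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)
  rpart(medv ~ ., data = Boston[idx, ], method = "anova")
})

# Bagged prediction = average of the B individual tree predictions
preds <- sapply(trees, predict, newdata = Boston)   # n x B matrix
bagged_pred <- rowMeans(preds)
head(bagged_pred)
```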