class: center, middle, inverse, title-slide

# Lecture 6 - Tree-based models
## Decision tree and Random forest
### Issac Lee
### 2021-04-22

---
class: center, middle

# Sungkyunkwan University

![](https://upload.wikimedia.org/wikipedia/en/thumb/4/40/Sungkyunkwan_University_seal.svg/225px-Sungkyunkwan_University_seal.svg.png)

## Actuarial Science

---
# Introduction to CART

Classification And Regression Tree

* One of the oldest and simplest ML algorithms.
* Idea: split the feature space into **exhaustive and mutually exclusive** segments according to some rules.

---
# Advantages vs. Disadvantages

.pull-left[
Advantages

* Easy to explain to people.
* Some believe that decision trees mirror human decision making more closely than regression.
* Handles categorical data without dummy coding.
* Insensitive to monotone transformations of the inputs.
* Robust to outliers.
]

.pull-right[
Disadvantages

* Predictive accuracy is generally lower than that of other models.
* Trees can be very non-robust: small changes in the data can cause a large change in the final output.

from `ISLR`
]

---
# Simple tree solution

.pull-left[
Can we divide these points into two groups by lines?
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-1-1.png" width="80%" />
]

---
# Simple tree solution

.pull-left[
Can we divide these points into two groups by lines? `Yes`

The following two lines

* `\(x = 0.5\)`
* `\(y = 0.5\)`
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-2-1.png" width="80%" />
]

---
# Simple tree regression

.pull-left[
Can we predict this function by calculating the mean over some sections?
]

.pull-right[
![](lec6_files/figure-html/unnamed-chunk-3-1.png)
]

---
# Simple tree regression

.pull-left[
Can we predict this function by calculating the mean over some sections?

How about these four sections?

* `\(R_1: \{x| 0< x < 25\}\)`
* `\(R_2: \{x| 25 < x < 50\}\)`
* `\(R_3: \{x| 50 < x < 75\}\)`
* `\(R_4: \{x| 75 < x < 100\}\)`
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-4-1.png" width="80%" />
]

---
# Prediction via stratification of the feature space

* Divide the predictor space into `\(J\)` distinct and non-overlapping regions, `\(R_1, ..., R_J\)`.
* Goal: find `\(R_1, ..., R_J\)` that minimize the RSS

`$$\sum_{j=1}^{J}\sum_{i\in R_{j}}\left(y_{i}-\hat{y}_{R_{j}}\right)^{2}$$`

---
# Let's be `greedy`!

There are infinitely many possible partitions.

* top-down, greedy approach; a.k.a. recursive binary splitting.
* For the predictor `\(X_j\)`, define
  * `\(R_1(j, s) = \{X|X_j \leq s\}\)` and `\(R_2(j, s) = \{X|X_j > s\}\)`
* Find the values of `\(j\)` and `\(s\)` that minimize

`$$\sum_{i:x_{i}\in R_{1}\left(j,s\right)}\left(y_{i}-\hat{y}_{R_{1}}\right)^{2}+\sum_{i:x_{i}\in R_{2}\left(j,s\right)}\left(y_{i}-\hat{y}_{R_{2}}\right)^{2}$$`

---
class: middle, center

# Overfitting Problem

<img src="./overfit.png" width="30%" style="display: block; margin: auto;" />

---
# When there is noise...

.pull-left[
We have the data model

`$$y_i = x_i^2 + e_i$$`

where `\(e_i \sim \mathcal{N}(0, 10^2)\)`

Do we want this?

* No. These lines follow the individual points too closely.
* This model is `overfitted` to the data.
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-5-1.png" width="80%" />
]

---
# We need something like this.

.pull-left[
It captures the general trend of the data.

* Simpler than before.
* It also generalizes to `future data`.

A small code sketch follows on the next slide.
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-6-1.png" width="80%" />
]
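---
# Overfitting in code

A minimal sketch, not from the original slides, of how a fully grown tree chases the noise in `\(y_i = x_i^2 + e_i\)` while the default `rpart` settings give a small tree that captures the general trend; the sample size, the range of `x`, and all object names are assumptions.

```r
library(rpart)

# Simulated data following the model on the previous slides
set.seed(2021)
x <- runif(200, 0, 10)
y <- x^2 + rnorm(200, sd = 10)
dat <- data.frame(x = x, y = y)

# Fully grown tree: one observation per leaf is possible -> overfits
over_tree <- rpart(y ~ x, data = dat, method = "anova",
                   control = rpart.control(minbucket = 1, cp = 0))

# Default settings (minsplit = 20, cp = 0.01) grow a much smaller tree
small_tree <- rpart(y ~ x, data = dat, method = "anova")

# Compare the two fits on a grid of new x values
grid <- data.frame(x = seq(0, 10, length.out = 400))
plot(x, y, col = "grey50")
lines(grid$x, predict(over_tree,  newdata = grid), col = "red")
lines(grid$x, predict(small_tree, newdata = grid), col = "blue", lwd = 2)
```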
---
class: middle, center

# How to avoid overfitting?

## `Pruning`

<img src="./cut.svg" width="15%" style="display: block; margin: auto;" />

.footnote[
<div>Icons made by <a href="https://www.freepik.com" title="Freepik">Freepik</a> from <a href="https://www.flaticon.com/" title="Flaticon">www.flaticon.com</a></div>
]

---
# Build a tree with a threshold

1. Grow the tree only as long as the decrease in RSS from each split exceeds some threshold
  * results in smaller trees
  * but is short-sighted: a seemingly worthless split early on may be followed by a very good split later
2. Grow a very large tree `\(T_0\)` first, then cut (`prune`) branches back to obtain a subtree `\(T\)`
  * how to `prune`, though?

---
# Cost complexity pruning

For each value of the tuning parameter `\(\alpha\)`, there is a subtree `\(T \subset T_0\)` that makes

`$$\sum_{m=1}^{\left|T\right|}\sum_{i:x_{i}\in R_{m}}\left(y_{i}-\hat{y}_{R_{m}}\right)^{2}+\alpha\left|T\right|$$`

as small as possible.

* `\(|T|\)`: number of terminal nodes of the tree `\(T\)`
* `\(R_m\)`: rectangle corresponding to the `\(m\)`th terminal node
* `\(\hat{y}_{R_m}\)`: predicted response associated with `\(R_m\)`
* `\(\alpha\)` can be chosen using CV.

---
# Classification Trees

A classification tree predicts a qualitative response.

* predict the most commonly occurring class in each region
* Given a region with
  * Male: 50
  * Female: 100
* predict the label for that region as `Female`

---
# How to build?

* We don't have the concept of `\(RSS\)` here.
* A natural alternative to `\(RSS\)` is the `classification error rate`

`$$\frac{\#\text{ of samples not in the most common class}}{\#\text{ of total samples in the given region } R_m}$$`

more mathematically,

`$$E=1-\underset{k}{\max}(\hat{p}_{mk})$$`

But this is not sufficiently sensitive for tree-growing.

---
# Gini impurity

Measure of `total variance` across the `\(K\)` classes.

.pull-left[
`$$\begin{align*} G & =\sum_{k=1}^{K}\hat{p}_{k}\left(1-\hat{p}_{k}\right)\\ & =1-\sum_{k=1}^{K}\hat{p}_{k}^{2} \end{align*}$$`

* Plot shows the Gini impurity w.r.t. `p` when `\(K=2\)`.
* We can easily see that `\(G\)` takes on a small value if all `\(\hat{p}_k\)` are close to 0 or 1.
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-7-1.png" width="80%" />
]

---
# Example

* Gini impurity for these? (a worked R sketch follows a few slides later)

`$$G =1-\sum_{k=1}^{K}\hat{p}_{k}^{2}$$`

* 1, 1, 1, 0, 0, 0
* 1, 1, 1, 1, 1, 1
* 1, 1, 1, 1, 0, 0

---
# Gini index

Given `\(R(j, s)\)`, the Gini index of a split is the weighted sum of the Gini impurities of the resulting subsets. Which split is better?

<img src="./gini_index.png" width="100%" style="display: block; margin: auto;" />

---
# Entropy

Measure of `impurity` of each node in a DT.

.pull-left[
`$$E=-\sum_{k=1}^{K}\hat{p}_{k}\log\hat{p}_{k}$$`

* Plot shows the entropy w.r.t. `p` when `\(K=2\)`.
]

.pull-right[
<img src="lec6_files/figure-html/unnamed-chunk-8-1.png" width="80%" />
]

---
# Example

* Entropy for these?

`$$E=-\sum_{k=1}^{K}\hat{p}_{k}\log\hat{p}_{k}$$`

* 1, 1, 1, 0, 0, 0
* 1, 1, 1, 1, 1, 1
* 1, 1, 1, 1, 0, 0

---
# Information Gain

Given `\(R(j, s)\)`, the information gain is the entropy of the parent node minus the weighted average of the entropies of the resulting subsets. Which split is better?

<img src="./gini_index.png" width="100%" style="display: block; margin: auto;" />

---
# Build a tree in classification

Find `\(R(j, s)\)` which

* `minimizes` the Gini index, or
* `maximizes` the information gain.

Repeat this process until

* the `max depth` is reached or a leaf has `100% purity`.
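---
# Example, worked in R

A minimal sketch, not from the original slides, that evaluates the two `Example` slides numerically; `gini()` and `entropy()` are hypothetical helper names, and the natural logarithm is assumed for the entropy.

```r
# Gini impurity and entropy for a vector of class labels
gini <- function(labels) {
  p <- prop.table(table(labels))   # class proportions p_k
  1 - sum(p^2)
}
entropy <- function(labels) {
  p <- prop.table(table(labels))   # empty classes are dropped, so log(0) never occurs
  -sum(p * log(p))
}

gini(c(1, 1, 1, 0, 0, 0))     # 0.5    (most impure case for K = 2)
gini(c(1, 1, 1, 1, 1, 1))     # 0      (pure node)
gini(c(1, 1, 1, 1, 0, 0))     # 0.444

entropy(c(1, 1, 1, 0, 0, 0))  # 0.693  (= log 2)
entropy(c(1, 1, 1, 1, 1, 1))  # 0
entropy(c(1, 1, 1, 1, 0, 0))  # 0.637
```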
---
# Control parameters

`minsplit`: minimum number of observations that must exist in a node for a split to be attempted

`minbucket`: minimum number of observations in any **terminal** node

`cp`: complexity parameter; a split that does not improve the overall fit by at least a factor of `cp` is not attempted

`maxdepth`: maximum depth of any node of the final tree

```r
library(ISLR)
library(tidyverse)
library(MASS)       # Boston data set (note: masks dplyr::select)
library(rpart)
library(rpart.plot)

Boston %>% head()

# Grow a deliberately large tree (cp = 0, minbucket = 1)
boston_tree <- rpart(medv ~ ., data = Boston, method = "anova",
                     control = rpart.control(minbucket = 1,
                                             cp = 0,
                                             maxdepth = 10))
rpart.plot(boston_tree)
printcp(boston_tree)

# Prune back using the cp value with the smallest cross-validated error
bestcp <- boston_tree$cptable[which.min(boston_tree$cptable[, "xerror"]), "CP"]
best_boston_tree <- prune(boston_tree, cp = bestcp)
rpart.plot(best_boston_tree)

# The same hyperparameters via the tidymodels (parsnip) interface;
# this only builds a model specification, fit() is still needed
library(parsnip)
decision_tree(
  mode            = "unknown",
  cost_complexity = NULL,
  tree_depth      = NULL,
  min_n           = NULL
)
```

---
# Bagging

* Bootstrapping the models!

![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Ensemble_Bagging.svg/660px-Ensemble_Bagging.svg.png)

Image from [wiki](https://en.wikipedia.org/wiki/Bootstrap_aggregating)

---
# Advantages

Reduces the variance of the prediction (the bias stays roughly the same).

1. Ensemble methods generally work best when the individual models are as independent from one another as possible.
  * Bootstrapping increases the independence of the models by sampling with replacement.
2. Training can be done in parallel -> gained a lot of popularity.

A hand-rolled sketch is given in the appendix slide at the end.

---
# Stacking

* Stacked generalization (1992)

![](https://miro.medium.com/max/703/0*GXMZ7SIXHyVzGCE_.)

---
class: middle, center, inverse

# Thanks!
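---
# Appendix: bagging by hand

A minimal backup-slide sketch, not from the original lecture, of bagging regression trees on the `Boston` data: fit one `rpart` tree per bootstrap sample and average the predictions. The number of bootstrap samples `B` and all object names are illustrative; in practice `randomForest` or `parsnip::rand_forest()` would be used instead.

```r
library(rpart)
library(MASS)   # Boston data

set.seed(2021)
B <- 100
n <- nrow(Boston)

# One tree per bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)
  rpart(medv ~ ., data = Boston[idx, ], method = "anova")
})

# Bagged prediction = average of the B individual tree predictions
preds <- sapply(trees, predict, newdata = Boston)   # n x B matrix
bagged_pred <- rowMeans(preds)
head(bagged_pred)
```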