class: center, middle, inverse, title-slide

# Lecture 4 - Linear models
## Linear regression
### Issac Lee
### 2021-03-31

---
class: center, middle

# Sungkyunkwan University

![](https://upload.wikimedia.org/wikipedia/en/thumb/4/40/Sungkyunkwan_University_seal.svg/225px-Sungkyunkwan_University_seal.svg.png)

## Actuarial Science

---
class: center, middle, inverse

# Linear models

---
class: center, middle

# Matrix theory

## Definitions and results

---
# Matrix

`\(\mathbf{A}_{n\times m} = [a_{ij}]\)` is a rectangular array of elements.

* Dimension of `\(\mathbf{A}\)`: `\(n\)` (rows) by `\(m\)` (columns)
* `\(\mathbf{A}\)` is a square matrix if `\(n = m\)`.
* A vector `\(\mathbf{a}_{n\times1} = [a_i]\)` is a matrix consisting of one `column`.
* Our interest is in real matrices, whose elements are real numbers.

---
# Transpose

If `\(\mathbf{A}_{n\times m} = [a_{ij}]\)` is `\(n \times m\)`, the transpose of `\(\mathbf{A}\)`, denoted `\(\mathbf{A}^T\)`, is the `\(m \times n\)` matrix `\([a_{ji}]\)`.

* `\(\mathbf{A}\)` is symmetric if `\(\mathbf{A} = \mathbf{A}^T\)`.

**Proposition 1** If `\(\mathbf{A}\)` is `\(n \times m\)` and `\(\mathbf{B}\)` is `\(m \times n\)`, then `\((\mathbf{A}\mathbf{B})^T=\mathbf{B}^T\mathbf{A}^T\)`.

T.B.D.

---
# Simple linear regression

* The response variable `\(y_i\)` is linearly related to an independent variable `\(x_i\)`:

`$$y_{i}=\beta_{1}+\beta_{2}x_{i}+e_{i}, \quad i=1,\ldots,n$$`

where `\(e_{1},\ldots,e_{n}\)` are typically assumed to be uncorrelated random variables with mean zero and constant variance `\(\sigma^{2}\)`.

`$$\mathbf{y}=\left(\begin{array}{c} y_{1}\\ y_{2}\\ \vdots\\ y_{n} \end{array}\right),\quad\boldsymbol{X}\beta=\left(\begin{array}{cc} 1 & x_{1}\\ 1 & x_{2}\\ \vdots & \vdots\\ 1 & x_{n-1}\\ 1 & x_{n} \end{array}\right)\left(\begin{array}{c} \beta_{1}\\ \beta_{2} \end{array}\right),\quad\boldsymbol{e}=\left(\begin{array}{c} e_{1}\\ e_{2}\\ \vdots\\ e_{n} \end{array}\right)$$`

---
# Multiple linear regression

The response variable `\(y_i\)` is linearly related to `\(p\)` independent variables `\(x_{i1},\ldots,x_{ip}\)`:

`$$y_{i}=\beta_{1}x_{i1}+\beta_{2}x_{i2}+\cdots+\beta_{p}x_{ip}+e_{i}, \quad i=1,\ldots,n$$`

which is the same as

`$$y_{i}=\mathbf{x}_{i}^{T}\boldsymbol{\beta}+e_{i},\quad i=1,\ldots,n$$`

where

`$$\begin{array}{c} \mathbf{x}_{1}^{T}=\left(x_{11},\ldots,x_{1p}\right),\\ \vdots\\ \mathbf{x}_{n}^{T}=\left(x_{n1},\ldots,x_{np}\right), \end{array} \quad \boldsymbol{\beta}=\left(\begin{array}{c} \beta_{1}\\ \vdots\\ \beta_{p} \end{array}\right)$$`

---
# Multiple linear regression

We assume

`$$\mathbb{E}\left(\boldsymbol{e}\right)=\boldsymbol{0},\quad Var\left(\boldsymbol{e}\right)=\sigma^{2}I_{n}$$`

where `\(I_n\)` is the `\(n \times n\)` identity matrix.
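---
# In R: the model in matrix form

A minimal sketch of the setup above, using simulated data. The seed, the true coefficients, and the noise level below are illustrative choices, not values from the lecture.

```r
# Simulate y = X beta + e with an intercept column in the design matrix
set.seed(1)                        # assumed seed, for reproducibility only
n <- 20
x <- runif(n)
X <- cbind(1, x)                   # n x 2 design matrix with rows (1, x_i)
beta <- c(1, 2)                    # (beta_1, beta_2): intercept and slope (illustrative)
e <- rnorm(n, mean = 0, sd = 0.5)  # mean-zero errors with constant variance
y <- drop(X %*% beta + e)

# lm() fits the same model; its coefficients estimate beta
coef(lm(y ~ x))
```

The least-squares estimate that `lm()` returns is derived on the following slides.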
---
# Regression problem

The linear model problem can be viewed as finding the best approximation `\(\mathbf{X}\beta\)` to the observed `\(\mathbf{y}\)`.

* If we measure closeness (distance) in the Euclidean sense, the problem becomes finding the value of the vector `\(\beta\)` that minimizes `\(L(\beta)\)`:

`$$\begin{align*} L\left(\beta\right) & =\left(\mathbf{y}-\boldsymbol{X}\beta\right)^{T}\left(\mathbf{y}-\boldsymbol{X}\beta\right)\\ & =\left\Vert \mathbf{y}-\boldsymbol{X}\beta\right\Vert ^{2} \end{align*}$$`

* Solution: find the gradient vector of `\(L(\beta)\)` and set it equal to zero.

`$$\frac{\partial L}{\partial\beta}=\left(\begin{array}{c} \frac{\partial L}{\partial\beta_{1}}\\ \vdots\\ \frac{\partial L}{\partial\beta_{p}} \end{array}\right)$$`

---
# Practice

Find `\(\frac{\partial f}{\partial \beta}\)`:

`$$f\left(\beta\right)=\beta_{1}x_{1}+\beta_{2}x_{2}$$`

Find `\(\frac{\partial g}{\partial \beta}\)`:

`$$g\left(\beta\right)=\beta_{1}^{2}+4\beta_{1}\beta_{2}+3\beta_{2}^{2}$$`

---
# Derivative rules

Let `\(\mathbf{a}\)` and `\(\mathbf{b}\)` be `\(p\times 1\)` vectors and `\(\mathbf{A}\)` be a `\(p \times p\)` matrix of constants. Then,

* `\(\frac{\partial \mathbf{a}^T\mathbf{b}}{\partial \mathbf{b}} = \mathbf{a}\)`
* `\(\frac{\partial \mathbf{b}^T\mathbf{A}\mathbf{b}}{\partial \mathbf{b}} = (\mathbf{A} + \mathbf{A}^T) \mathbf{b}\)`

What is `\(\frac{\partial L}{\partial \beta}\)`?

`$$\begin{align*} L\left(\beta\right) & =\left(\mathbf{y}-\boldsymbol{X}\beta\right)^{T}\left(\mathbf{y}-\boldsymbol{X}\beta\right)\\ & =\left\Vert \mathbf{y}-\boldsymbol{X}\beta\right\Vert ^{2} \end{align*}$$`

---
# Normal equation

Setting the gradient to zero, we obtain the normal equation:

`$$\boldsymbol{X}^{T}\boldsymbol{X}\beta=\boldsymbol{X}^{T}\mathbf{y}$$`

Provided `\(\boldsymbol{X}^{T}\boldsymbol{X}\)` is invertible, the solution of this equation is

`$$\hat{\beta}=\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y}$$`

---
# Food for thought

Are we always happy with this? What is the problem?

---
# Gradient descent

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Gradient_descent.svg/525px-Gradient_descent.svg.png" width="30%" style="display: block; margin: auto;" />

`$$\beta_{n+1}=\beta_{n}-\gamma \nabla L(\beta_{n})$$`

---
# Linear basis function models

`$$f\left(x\right)=\sum_{j=0}^{M-1}\beta_{j}\phi_{j}\left(x\right)=\Phi\left(x\right)\beta$$`

where the `\(\phi_j(x)\)` are known as **basis functions**. Typically `\(\phi_{0}(x) = 1\)`, so that `\(\beta_0\)` acts as a bias term.

---
# Examples of basis functions

.pull-left[
* Polynomial basis functions (global)

`$$\phi_j(x)=x^j$$`

* Gaussian basis functions (local)

`$$\phi_j(x)=\exp\left(-\frac{(x-\mu_j)^2}{2\sigma^2}\right)$$`
]

.pull-right[
* Sigmoidal basis functions (local)

`$$\phi_j(x)=\sigma\left(\frac{x-\mu_j}{s}\right)$$`

where

`$$\sigma(x) = \frac{1}{1+\exp(-x)}$$`
]

---
# Easy example

.pull-left[
Polynomial curve fitting

`$$y = \sin(2 \pi x) + \epsilon$$`

```
## # A tibble: 10 x 2
##        x      y
##    <dbl>  <dbl>
##  1  0.3   1.00
##  2  0.25  1.18
##  3  0.65 -0.806
##  4  1     0.346
##  5  0.55 -0.525
##  6  0.15  0.754
##  7  0.95 -0.273
##  8  0.7  -0.649
##  9  0.5   0.321
## 10  0.9  -0.956
```
]

.pull-right[
![](lec4_files/figure-html/unnamed-chunk-2-1.png)
]

---
# 0th order polynomial

.pull-left[
`\(f(x) = \beta_0\)`
]

.pull-right[
![](lec4_files/figure-html/unnamed-chunk-3-1.png)
]

---
# 1st order polynomial

.pull-left[
`\(f(x) = \beta_0 + \beta_1 x\)`
]

.pull-right[
![](lec4_files/figure-html/unnamed-chunk-4-1.png)
]

---
# 3rd order polynomial

.pull-left[
`\(f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)`
]

.pull-right[
![](lec4_files/figure-html/unnamed-chunk-5-1.png)
]

---
# Feel so good~! Let's do 9th!!

.pull-left[
Does this look okay? Why?
]

.pull-right[
![](lec4_files/figure-html/unnamed-chunk-6-1.png)
]
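---
# In R: polynomial fits of increasing order

A minimal sketch of the fits shown above. The data are regenerated here with an assumed seed and noise level, since the lecture's simulation settings are not shown; `poly()` with `raw = TRUE` builds the polynomial basis.

```r
# Toy data: y = sin(2*pi*x) + noise, 10 points (assumed seed and noise level)
set.seed(2021)
x <- runif(10)
y <- sin(2 * pi * x) + rnorm(10, sd = 0.2)

# Polynomial fits of order 1, 3, and 9
fit1 <- lm(y ~ poly(x, 1, raw = TRUE))
fit3 <- lm(y ~ poly(x, 3, raw = TRUE))
fit9 <- lm(y ~ poly(x, 9, raw = TRUE))   # 10 coefficients for 10 points

# Training RSS drops to (numerically) zero for the 9th-order fit,
# even though that curve generalizes poorly between the data points.
sapply(list(fit1, fit3, fit9), function(f) sum(resid(f)^2))
```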
---
# Avoid Overfitting: Regularization

Previously we looked at linear models. Let's extend our set of candidates!

`$$RSS(f) = \left(\mathbf{y}-f(X)\right)^T\left(\mathbf{y}-f(X)\right)$$`

To avoid overfitting, we consider the following penalized RSS (PRSS):

`$$PRSS(f;\lambda) = RSS\left(f\right)+\lambda J\left(f\right)$$`

where the functional `\(J(f)\)` represents a regularization term.

---
# Bias-Variance trade-off

We observe a quantitative response `\(Y\)` and `\(p\)` different predictors, `\(X_1, ..., X_p\)`:

`$$Y = f(X) + \epsilon$$`

where `\(X = (X_1, ..., X_p)\)` and `\(\epsilon\)` is a random error term, which is independent of `\(X\)` and has mean zero.

We can predict `\(Y\)` using

`$$\hat{Y}=\hat{f}(X),$$`

where `\(\hat{f}\)` represents our estimate for `\(f\)`, and `\(\hat{Y}\)` represents the resulting prediction for `\(Y\)`.

---
# Accuracy of `\(\hat{Y}\)`

The accuracy of `\(\hat{Y}\)` as a prediction for `\(Y\)` depends on two quantities:

* Reducible error
* Irreducible error

`$$\begin{align*} \mathbb{E}\left(Y-\hat{Y}\right)^{2} & =\mathbb{E}\left[\left(f\left(X\right)+\epsilon-\hat{f}\left(X\right)\right)^{2}\right]\\ & =\underbrace{\left[f\left(X\right)-\hat{f}\left(X\right)\right]^{2}}_{\text{reducible}}+\underbrace{Var\left(\epsilon\right)}_{\text{irreducible}} \end{align*}$$`

---
# Expected test error

The expected test error can be decomposed into the following three terms:

* `\(Variance\)`, `\(Noise\)`, `\(Bias^2\)`

`$$\begin{align*} \mathbb{E}_{D,X,y}\left[\left(\hat{f}_{D}\left(X\right)-y\right)^{2}\right] = & \underbrace{\mathbb{E}_{X,D}\left[\left(\hat{f}_{D}\left(X\right)-\bar{f}\left(X\right)\right)^{2}\right]}_{Variance}\\ & +\underbrace{\mathbb{E}_{X,y}\left[\left(f\left(X\right)-y\right)^{2}\right]}_{Noise}\\ & +\underbrace{\mathbb{E}_{X}\left[\left(\bar{f}\left(X\right)-f\left(X\right)\right)^{2}\right]}_{Bias^{2}} \end{align*}$$`

where `\(\bar{f}(X)=\mathbb{E}_{D}[\hat{f}_{D}(X)]\)` is the predictor averaged over training sets `\(D\)`.

---
# Ridge regression

Ridge regression uses the (squared) `\(L_2\)` norm as the penalty:

`$$\underset{\beta}{min}\left(y-X\beta\right)^{T}\left(y-X\beta\right)+\frac{\lambda}{2}\left\Vert \beta\right\Vert _{2}^{2}$$`

H.W. What is the optimal `\(\beta_{\star}\)`?

---
# Lasso regression

Lasso regression uses the `\(L_1\)` norm:

`$$\underset{\beta}{min}\left(y-X\beta\right)^{T}\left(y-X\beta\right)+\frac{\lambda}{2}\left\Vert \beta\right\Vert _{1}$$`

---
# Elastic Net

Why don't we use both of them?

`$$\hat{\beta}\equiv\underset{\beta}{\operatorname{argmin}}\left(\left\Vert y-X\beta\right\Vert ^{2}+\lambda_{2}\left\Vert \beta\right\Vert ^{2}+\lambda_{1}\left\Vert \beta\right\Vert _{1}\right)$$`

The loss function can be parameterized with two parameters, `\(\lambda\)` and `\(\alpha\)`:

* `\(\lambda\)` controls the overall magnitude of the penalty
* `\(\alpha\)` controls the weights of the two penalty terms

`$$\underset{\beta}{min}\left(y-X\beta\right)^{T}\left(y-X\beta\right)+\frac{\lambda}{2}\left(\alpha\left\Vert \beta\right\Vert _{1}+\left(1-\alpha\right)\left\Vert \beta\right\Vert _{2}^{2}\right)$$`

---
# Problem

So we have two models, `Lasso` and `Ridge` regression, and an extended model called the `Elastic net`. These models have tuning parameters. How do we determine them?

* We can't use the test dataset. (That's cheating, and in Kaggle we don't even know its dependent variable.)

---
# Validation set

Make our own validation set from the `train data set`.

* Assumption: the train and test data sets follow the same distribution.

<img src="./validationset.png" width="65%" style="display: block; margin: auto;" />

---
# Hyperparameter Tuning

If our model performs well on the validation set, it should also work well on the test data!

* Tune the hyperparameters using the validation set.

---
class: middle, center, inverse

# Thanks!
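---
# Appendix: elastic net in R (sketch)

A minimal sketch of tuning `\(\lambda\)` for the elastic net, assuming the `glmnet` package (the lecture does not prescribe a specific implementation, and `glmnet`'s penalty scaling differs slightly from the formula on the Elastic Net slide). `cv.glmnet()` chooses `\(\lambda\)` by cross-validation on the training data, in the spirit of the validation-set idea.

```r
library(glmnet)

# Simulated training data with a few truly nonzero coefficients (illustrative)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, rep(0, p - 2))
y <- drop(X %*% beta + rnorm(n))

# alpha mixes the penalties: alpha = 1 is the lasso, alpha = 0 is ridge
cv_fit <- cv.glmnet(X, y, alpha = 0.5)

coef(cv_fit, s = "lambda.min")   # coefficients at the lambda chosen by CV
```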