class: center, middle, inverse, title-slide

# Lecture 4 - Linear models
## Linear regression
### Issac Lee
### 2021-03-31

---
class: center, middle

# Sungkyunkwan University

![](https://upload.wikimedia.org/wikipedia/en/thumb/4/40/Sungkyunkwan_University_seal.svg/225px-Sungkyunkwan_University_seal.svg.png)

## Actuarial Science

---
class: center, middle, inverse

# Linear models

---
class: center, middle

# Matrix theory

## Definitions and results

---
# Matrix

`\(\mathbf{A}_{n\times m} = [a_{ij}]\)` is a rectangular array of elements.

* Dimension of `\(\mathbf{A}\)`: `\(n\)` (rows) by `\(m\)` (columns)
* `\(\mathbf{A}\)` is a square matrix if `\(n = m\)`.
* A vector `\(\mathbf{a}_{n\times1} = [a_i]\)` is a matrix consisting of one `column`.
* Our interest is in real matrices, whose elements are real numbers.

---
# Transpose

If `\(\mathbf{A}_{n\times m} = [a_{ij}]\)` is `\(n \times m\)`, the transpose of `\(\mathbf{A}\)`, denoted `\(\mathbf{A}^T\)`, is the `\(m \times n\)` matrix `\([a_{ji}]\)`.

* `\(\mathbf{A}\)` is symmetric if `\(\mathbf{A} = \mathbf{A}^T\)`.

**Proposition 1** If `\(\mathbf{A}\)` is `\(n \times m\)` and `\(\mathbf{B}\)` is `\(m \times n\)`, then `\((\mathbf{A}\mathbf{B})^T=\mathbf{B}^T\mathbf{A}^T\)`.

T.B.D.

---
# Simple linear regression

* The response variable `\(y_i\)` is linearly related to an independent variable `\(x_i\)`:

`$$y_{i}=\beta_{1}+\beta_{2}x_{i}+e_{i}, \quad i=1,\ldots,n$$`

where `\(e_{1},\ldots,e_{n}\)` are typically assumed to be uncorrelated random variables with mean zero and constant variance `\(\sigma^{2}\)`.

`$$\mathbf{y}=\left(\begin{array}{c} y_{1}\\ y_{2}\\ \vdots\\ y_{n} \end{array}\right),\quad\boldsymbol{X}\beta=\left(\begin{array}{cc} 1 & x_{1}\\ 1 & x_{2}\\ \vdots & \vdots\\ 1 & x_{n-1}\\ 1 & x_{n} \end{array}\right)\left(\begin{array}{c} \beta_{1}\\ \beta_{2} \end{array}\right),\quad\boldsymbol{e}=\left(\begin{array}{c} e_{1}\\ e_{2}\\ \vdots\\ e_{n} \end{array}\right)$$`

---
# Multiple linear regression

The response variable `\(y_i\)` is linearly related to `\(p\)` independent variables `\(x_{i1},\ldots,x_{ip}\)`:

`$$y_{i}=\beta_{1}x_{i1}+\beta_{2}x_{i2}+\cdots+\beta_{p}x_{ip}+e_{i}, \quad i=1,\ldots,n$$`

which is the same as

`$$y_{i}=\mathbf{x}_{i}^{T}\boldsymbol{\beta}+e_{i},\quad i=1,\ldots,n$$`

where

`$$\begin{array}{c} \mathbf{x}_{1}^{T}=\left(x_{11},\ldots,x_{1p}\right),\\ \vdots\\ \mathbf{x}_{n}^{T}=\left(x_{n1},\ldots,x_{np}\right), \end{array} \quad \boldsymbol{\beta}=\left(\begin{array}{c} \beta_{1}\\ \vdots\\ \beta_{p} \end{array}\right)$$`

---
# Multiple linear regression

We assume

`$$\mathbb{E}\left(\boldsymbol{e}\right)=\boldsymbol{0},\quad Var\left(\boldsymbol{e}\right)=\sigma^{2}I_{n}$$`

where `\(I_n\)` is the `\(n \times n\)` identity matrix.
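---
# In R: the model in matrix form

A minimal sketch of the setup above, using simulated data. The seed, the true coefficients, and the noise level below are illustrative choices, not values from the lecture.

```r
# Simulate y = X beta + e with an intercept column in the design matrix
set.seed(1)                        # assumed seed, for reproducibility only
n <- 20
x <- runif(n)
X <- cbind(1, x)                   # n x 2 design matrix with rows (1, x_i)
beta <- c(1, 2)                    # (beta_1, beta_2): intercept and slope (illustrative)
e <- rnorm(n, mean = 0, sd = 0.5)  # mean-zero errors with constant variance
y <- drop(X %*% beta + e)

# lm() fits the same model; its coefficients estimate beta
coef(lm(y ~ x))
```

The least-squares estimate that `lm()` returns is derived on the following slides.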
---
# Regression problem

The linear model problem can be viewed as finding the best approximation `\(\mathbf{X}\beta\)` to the observed `\(\mathbf{y}\)`.

* If we measure closeness (distance) in the Euclidean sense, the problem becomes finding the value of the vector `\(\beta\)` that minimizes `\(L(\beta)\)`:

`$$\begin{align*} L\left(\beta\right) & =\left(\mathbf{y}-\boldsymbol{X}\beta\right)^{T}\left(\mathbf{y}-\boldsymbol{X}\beta\right)\\ & =\left\Vert \mathbf{y}-\boldsymbol{X}\beta\right\Vert ^{2} \end{align*}$$`

* Solution: find the gradient vector of `\(L(\beta)\)` and set it equal to zero.

`$$\frac{\partial L}{\partial\beta}=\left(\begin{array}{c} \frac{\partial L}{\partial\beta_{1}}\\ \vdots\\ \frac{\partial L}{\partial\beta_{p}} \end{array}\right)$$`

---
# Practice

Find `\(\frac{\partial f}{\partial \beta}\)`:

`$$f\left(\beta\right)=\beta_{1}x_{1}+\beta_{2}x_{2}$$`

Find `\(\frac{\partial g}{\partial \beta}\)`:

`$$g\left(\beta\right)=\beta_{1}^{2}+4\beta_{1}\beta_{2}+3\beta_{2}^{2}$$`

---
# Derivative rules

Let `\(\mathbf{a}\)` and `\(\mathbf{b}\)` be `\(p\times 1\)` vectors and `\(\mathbf{A}\)` be a `\(p \times p\)` matrix of constants. Then,

* `\(\frac{\partial \mathbf{a}^T\mathbf{b}}{\partial \mathbf{b}} = \mathbf{a}\)`
* `\(\frac{\partial \mathbf{b}^T\mathbf{A}\mathbf{b}}{\partial \mathbf{b}} = (\mathbf{A} + \mathbf{A}^T) \mathbf{b}\)`

What is `\(\frac{\partial L}{\partial \beta}\)`?

`$$\begin{align*} L\left(\beta\right) & =\left(\mathbf{y}-\boldsymbol{X}\beta\right)^{T}\left(\mathbf{y}-\boldsymbol{X}\beta\right)\\ & =\left\Vert \mathbf{y}-\boldsymbol{X}\beta\right\Vert ^{2} \end{align*}$$`

---
# Normal equation

Setting the gradient to zero, we obtain the normal equation:

`$$\boldsymbol{X}^{T}\boldsymbol{X}\beta=\boldsymbol{X}^{T}\mathbf{y}$$`

Provided `\(\boldsymbol{X}^{T}\boldsymbol{X}\)` is invertible, the solution of this equation is

`$$\hat{\beta}=\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y}$$`

---
# Food for thought

Are we always happy with this? What is the problem?

---
# Gradient descent

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Gradient_descent.svg/525px-Gradient_descent.svg.png" width="30%" style="display: block; margin: auto;" />

`$$\beta_{n+1}=\beta_{n}-\gamma \nabla L(\beta_{n})$$`

---
# Linear basis function models

`$$f\left(x\right)=\sum_{j=0}^{M-1}\beta_{j}\phi_{j}\left(x\right)=\Phi\left(x\right)\beta$$`

where the `\(\phi_j(x)\)` are known as **basis functions**. Typically `\(\phi_{0}(x) = 1\)`, so that `\(\beta_0\)` acts as a bias term.

---
# Examples of basis functions

.pull-left[
* Polynomial basis functions (global)

`$$\phi_j(x)=x^j$$`

* Gaussian basis functions (local)

`$$\phi_j(x)=\exp\left(-\frac{(x-\mu_j)^2}{2\sigma^2}\right)$$`
]

.pull-right[
* Sigmoidal basis functions (local)

`$$\phi_j(x)=\sigma\left(\frac{x-\mu_j}{s}\right)$$`

where

`$$\sigma(x) = \frac{1}{1+\exp(-x)}$$`
]

---
# Easy example

.pull-left[
Polynomial curve fitting

`$$y = \sin(2 \pi x) + \epsilon$$`

```
## # A tibble: 10 x 2
##        x      y
##    <dbl>  <dbl>
##  1  0.3   1.00
##  2  0.25  1.18
##  3  0.65 -0.806
##  4  1     0.346
##  5  0.55 -0.525
##  6  0.15  0.754
##  7  0.95 -0.273
##  8  0.7  -0.649
##  9  0.5   0.321
## 10  0.9  -0.956
```
]

.pull-right[
![](lec4_files/figure-html/unnamed-chunk-2-1.png)
]

---
# 0th order polynomial

.pull-left[
`\(f(x) = \beta_0\)`
]

.pull-right[
![](lec4_files/figure-html/unnamed-chunk-3-1.png)
]

---
# 1st order polynomial

.pull-left[
`\(f(x) = \beta_0 + \beta_1 x\)`
]

.pull-right[
![](lec4_files/figure-html/unnamed-chunk-4-1.png)
]

---
# 3rd order polynomial

.pull-left[
`\(f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)`
]

.pull-right[
![](lec4_files/figure-html/unnamed-chunk-5-1.png)
]

---
# Feel so good~! Let's do 9th!!

.pull-left[
Does this look okay? Why?
]

.pull-right[
![](lec4_files/figure-html/unnamed-chunk-6-1.png)
]
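---
# In R: polynomial fits of increasing order

A minimal sketch of the fits shown above. The data are regenerated here with an assumed seed and noise level, since the lecture's simulation settings are not shown; `poly()` with `raw = TRUE` builds the polynomial basis.

```r
# Toy data: y = sin(2*pi*x) + noise, 10 points (assumed seed and noise level)
set.seed(2021)
x <- runif(10)
y <- sin(2 * pi * x) + rnorm(10, sd = 0.2)

# Polynomial fits of order 1, 3, and 9
fit1 <- lm(y ~ poly(x, 1, raw = TRUE))
fit3 <- lm(y ~ poly(x, 3, raw = TRUE))
fit9 <- lm(y ~ poly(x, 9, raw = TRUE))   # 10 coefficients for 10 points

# Training RSS drops to (numerically) zero for the 9th-order fit,
# even though that curve generalizes poorly between the data points.
sapply(list(fit1, fit3, fit9), function(f) sum(resid(f)^2))
```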
---
# Avoid Overfitting: Regularization

Previously we looked at linear models. Let's extend our set of candidates!

`$$RSS(f) = \left(\mathbf{y}-f(X)\right)^T\left(\mathbf{y}-f(X)\right)$$`

To avoid overfitting, we consider the following penalized RSS (PRSS):

`$$PRSS(f;\lambda) = RSS\left(f\right)+\lambda J\left(f\right)$$`

where the functional `\(J(f)\)` represents a regularization term.

---
# Bias-Variance trade-off

We observe a quantitative response `\(Y\)` and `\(p\)` different predictors, `\(X_1, ..., X_p\)`:

`$$Y = f(X) + \epsilon$$`

where `\(X = (X_1, ..., X_p)\)` and `\(\epsilon\)` is a random error term, which is independent of `\(X\)` and has mean zero.

We can predict `\(Y\)` using

`$$\hat{Y}=\hat{f}(X),$$`

where `\(\hat{f}\)` represents our estimate for `\(f\)`, and `\(\hat{Y}\)` represents the resulting prediction for `\(Y\)`.

---
# Accuracy of `\(\hat{Y}\)`

The accuracy of `\(\hat{Y}\)` as a prediction for `\(Y\)` depends on two quantities:

* Reducible error
* Irreducible error

`$$\begin{align*} \mathbb{E}\left(Y-\hat{Y}\right)^{2} & =\mathbb{E}\left[\left(f\left(X\right)+\epsilon-\hat{f}\left(X\right)\right)^{2}\right]\\ & =\underbrace{\left[f\left(X\right)-\hat{f}\left(X\right)\right]^{2}}_{\text{reducible}}+\underbrace{Var\left(\epsilon\right)}_{\text{irreducible}} \end{align*}$$`

---
# Expected test error

The expected test error can be decomposed into the following three terms:

* `\(Variance\)`, `\(Noise\)`, `\(Bias^2\)`

`$$\begin{align*} \mathbb{E}_{D,X,y}\left[\left(\hat{f}_{D}\left(X\right)-y\right)^{2}\right] = & \underbrace{\mathbb{E}_{X,D}\left[\left(\hat{f}_{D}\left(X\right)-\bar{f}\left(X\right)\right)^{2}\right]}_{Variance}\\ & +\underbrace{\mathbb{E}_{X,y}\left[\left(f\left(X\right)-y\right)^{2}\right]}_{Noise}\\ & +\underbrace{\mathbb{E}_{X}\left[\left(\bar{f}\left(X\right)-f\left(X\right)\right)^{2}\right]}_{Bias^{2}} \end{align*}$$`

where `\(\bar{f}(X)=\mathbb{E}_{D}[\hat{f}_{D}(X)]\)` is the predictor averaged over training sets `\(D\)`.

---
# Ridge regression

Ridge regression uses the (squared) `\(L_2\)` norm as the penalty:

`$$\underset{\beta}{min}\left(y-X\beta\right)^{T}\left(y-X\beta\right)+\frac{\lambda}{2}\left\Vert \beta\right\Vert _{2}^{2}$$`

H.W. What is the optimal `\(\beta_{\star}\)`?

---
# Lasso regression

Lasso regression uses the `\(L_1\)` norm:

`$$\underset{\beta}{min}\left(y-X\beta\right)^{T}\left(y-X\beta\right)+\frac{\lambda}{2}\left\Vert \beta\right\Vert _{1}$$`

---
# Elastic Net

Why don't we use both of them?

`$$\hat{\beta}\equiv\underset{\beta}{\operatorname{argmin}}\left(\left\Vert y-X\beta\right\Vert ^{2}+\lambda_{2}\left\Vert \beta\right\Vert ^{2}+\lambda_{1}\left\Vert \beta\right\Vert _{1}\right)$$`

The loss function can be parameterized with two parameters, `\(\lambda\)` and `\(\alpha\)`:

* `\(\lambda\)` controls the overall magnitude of the penalty
* `\(\alpha\)` controls the weights of the two penalty terms

`$$\underset{\beta}{min}\left(y-X\beta\right)^{T}\left(y-X\beta\right)+\frac{\lambda}{2}\left(\alpha\left\Vert \beta\right\Vert _{1}+\left(1-\alpha\right)\left\Vert \beta\right\Vert _{2}^{2}\right)$$`

---
# Problem

So we have two models, `Lasso` and `Ridge` regression, and an extended model called the `Elastic net`. These models have tuning parameters. How do we determine them?

* We can't use the test dataset. (That's cheating, and in Kaggle we don't even know its dependent variable.)

---
# Validation set

Make our own validation set from the `train data set`.

* Assumption: the train and test data sets follow the same distribution.

<img src="./validationset.png" width="65%" style="display: block; margin: auto;" />

---
# Hyperparameter Tuning

If our model performs well on the validation set, it should also work well on the test data!

* Tune the hyperparameters using the validation set.

---
class: middle, center, inverse

# Thanks!
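---
# Appendix: elastic net in R (sketch)

A minimal sketch of tuning `\(\lambda\)` for the elastic net, assuming the `glmnet` package (the lecture does not prescribe a specific implementation, and `glmnet`'s penalty scaling differs slightly from the formula on the Elastic Net slide). `cv.glmnet()` chooses `\(\lambda\)` by cross-validation on the training data, in the spirit of the validation-set idea.

```r
library(glmnet)

# Simulated training data with a few truly nonzero coefficients (illustrative)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, rep(0, p - 2))
y <- drop(X %*% beta + rnorm(n))

# alpha mixes the penalties: alpha = 1 is the lasso, alpha = 0 is ridge
cv_fit <- cv.glmnet(X, y, alpha = 0.5)

coef(cv_fit, s = "lambda.min")   # coefficients at the lambda chosen by CV
```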