# Matrix theory
## Definitions and results

---

# Matrix

`\(\mathbf{A}_{n\times m} = [a_{ij}]\)` is a rectangular array of elements.

* Dimension of `\(\mathbf{A}\)`: `\(n\)` (rows) by `\(m\)` (columns)
* Square matrix if `\(n = m\)`.
* A vector `\(\mathbf{a}_{n\times1} = [a_i]\)` is a matrix consisting of one `column`.
* Our interest is in real matrices, whose elements are real numbers.

---

# Transpose

If `\(\mathbf{A}_{n\times m} = [a_{ij}]\)` is `\(n \times m\)`, the transpose of `\(\mathbf{A}\)`, denoted `\(\mathbf{A}^T\)`, is the `\(m \times n\)` matrix `\([a_{ji}]\)`.

* Symmetric if `\(\mathbf{A} = \mathbf{A}^T\)`

**Proposition 1**

If `\(\mathbf{A}\)` is `\(n \times m\)` and `\(\mathbf{B}\)` is `\(m \times n\)`, then `\((\mathbf{A}\mathbf{B})^T=\mathbf{B}^T\mathbf{A}^T\)`.

T.B.D
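
One way the proof could go is entry by entry. As a sketch, write `\((\mathbf{M})_{ij}\)` for the `\((i,j)\)` element of a matrix `\(\mathbf{M}\)` (notation introduced here only for this argument):

`$$\begin{align*}
\left(\left(\mathbf{A}\mathbf{B}\right)^{T}\right)_{ij} & =\left(\mathbf{A}\mathbf{B}\right)_{ji}=\sum_{k=1}^{m}a_{jk}b_{ki}\\
 & =\sum_{k=1}^{m}\left(\mathbf{B}^{T}\right)_{ik}\left(\mathbf{A}^{T}\right)_{kj}=\left(\mathbf{B}^{T}\mathbf{A}^{T}\right)_{ij}
\end{align*}$$`

This holds for all `\(i,j=1,...,n\)`, so the two `\(n \times n\)` matrices agree element by element.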

---

# Simple linear regression

* Response variable `\(y_i\)` is linearly related to an independent variable `\(x_i\)`, given by

`$$y_{i}=\beta_{1}+\beta_{2}x_{i}+e_{i}, \quad i=1,...,n$$`

where `\(e_{1},...,e_{n}\)` are typically assumed to be uncorrelated random variables with mean zero and constant variance `\(\sigma^{2}\)`.

`$$\mathbf{y}=\left(\begin{array}{c}
y_{1}\\
y_{2}\\
...\\
y_{n}
\end{array}\right),\boldsymbol{X}\beta=\left(\begin{array}{cc}
1 & x_{1}\\
1 & x_{2}\\
... & ...\\
1 & x_{n-1}\\
1 & x_{n}
\end{array}\right)\left(\begin{array}{c}
\beta_{1}\\
\beta_{2}
\end{array}\right),\boldsymbol{e}=\left(\begin{array}{c}
e_{1}\\
e_{2}\\
...\\
e_{n}
\end{array}\right)$$`
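
The matrix form above can be checked against the scalar form on a small example. The following is a minimal sketch in plain Python; the data values, the error terms, and the helper `matvec` are hypothetical choices made for illustration, not part of the lecture.

```python
# Hypothetical toy values: beta[0] plays the role of beta_1 (intercept)
# and beta[1] the role of beta_2 (slope).
x = [0.0, 1.0, 2.0, 3.0]
beta = [1.0, 2.0]
e = [0.1, -0.2, 0.0, 0.3]  # stand-in error terms, fixed for reproducibility

# Design matrix X: a column of ones next to the column of x_i values.
X = [[1.0, xi] for xi in x]


def matvec(M, v):
    """Multiply a matrix M (given as a list of rows) by a vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


# Matrix form: y = X beta + e.
y = [fit + ei for fit, ei in zip(matvec(X, beta), e)]

# Scalar form: y_i = beta_1 + beta_2 * x_i + e_i.
y_scalar = [beta[0] + beta[1] * xi + ei for xi, ei in zip(x, e)]
```

Both `y` and `y_scalar` hold the same responses, which is all the matrix notation claims: each row of `X` times `beta` reproduces one scalar equation.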

---

# Multiple linear regression

Response variable `\(y_i\)` is linearly related to `\(p\)` independent variables `\(x_{i1},...,x_{ip}\)`, given by

`$$y_{i}=\beta_{1}x_{i1}+\beta_{2}x_{i2}+...+\beta_{p}x_{ip}+e_{i}, \quad i=1,...,n$$`

which is the same as

`$$y_{i}=\mathbf{x}_{i}^{T}\boldsymbol{\beta}+e_{i},\quad i=1,...,n$$`

where

`$$\begin{array}{c}
\mathbf{x}_{1}^{T}=\left(x_{11},...,x_{1p}\right),\\
...\\
\mathbf{x}_{n}^{T}=\left(x_{n1},...,x_{np}\right),
\end{array}
\quad
\boldsymbol{\beta}=\left(\begin{array}{c}
\beta_{1}\\
...\\
\beta_{p}
\end{array}\right)$$`
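
Stacking the `\(n\)` equations gives the matrix form used on the following slides. As a sketch, take `\(\boldsymbol{X}\)` to be the `\(n\times p\)` matrix whose `\(i\)`-th row is `\(\mathbf{x}_{i}^{T}\)`, and `\(\mathbf{y}\)`, `\(\boldsymbol{e}\)` as in the simple linear regression slide:

`$$\mathbf{y}=\left(\begin{array}{c}
y_{1}\\
...\\
y_{n}
\end{array}\right)=\left(\begin{array}{c}
\mathbf{x}_{1}^{T}\\
...\\
\mathbf{x}_{n}^{T}
\end{array}\right)\boldsymbol{\beta}+\left(\begin{array}{c}
e_{1}\\
...\\
e_{n}
\end{array}\right)=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{e}$$`

Setting `\(x_{i1}=1\)` for every `\(i\)` makes `\(\beta_{1}\)` an intercept, which recovers simple linear regression when `\(p=2\)`.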

---

# Multiple linear regression

We assume

`$$\mathbb{E}\left(\boldsymbol{e}\right)=\boldsymbol{0},Var\left(\boldsymbol{e}\right)=\sigma^{2}I_{n}$$`

where `\(I_n\)` is the identity matrix of size `\(n\)`.

---

# Regression problem

The linear model problem can be viewed as finding the best approximation `\(\mathbf{X}\beta\)` to the observed `\(\mathbf{y}\)`.

* If we measure closeness by Euclidean distance, the problem becomes finding the vector `\(\beta\)` that minimizes `\(L(\beta)\)`, defined as follows:

`$$\begin{align*}
L\left(\beta\right) & =\left(\mathbf{y}-\boldsymbol{X}\beta\right)^{T}\left(\mathbf{y}-\boldsymbol{X}\beta\right)\\
& =\left\Vert \mathbf{y}-\boldsymbol{X}\beta\right\Vert ^{2}
\end{align*}$$`

* Solution: Find the gradient vector of `\(L(\beta)\)` and set it equal to zero.

`$$\frac{\partial L}{\partial\beta}=\left(\begin{array}{c}
\frac{\partial L}{\partial\beta_{1}}\\
...\\
\frac{\partial L}{\partial\beta_{p}}
\end{array}\right)$$`

---

# Practice

Find `\(\frac{\partial f}{\partial \beta}\)`

`$$f\left(\beta\right)=\beta_{1}x_{1}+\beta_{2}x_{2}$$`

Find `\(\frac{\partial g}{\partial \beta}\)`

`$$g\left(\beta\right)=\beta_{1}^{2}+4\beta_{1}\beta_{2}+3\beta_{2}^{2}$$`
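
A possible worked answer, differentiating component by component with `\(\beta=(\beta_{1},\beta_{2})^{T}\)`:

`$$\frac{\partial f}{\partial\beta}=\left(\begin{array}{c}
x_{1}\\
x_{2}
\end{array}\right),\quad\frac{\partial g}{\partial\beta}=\left(\begin{array}{c}
2\beta_{1}+4\beta_{2}\\
4\beta_{1}+6\beta_{2}
\end{array}\right)$$`

As a check, `\(g\)` can be written as `\(\beta^{T}\mathbf{A}\beta\)` with the symmetric matrix `\(\mathbf{A}=\left(\begin{array}{cc}1 & 2\\2 & 3\end{array}\right)\)` (a matrix introduced here for illustration), and the second answer equals `\(2\mathbf{A}\beta\)`.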

---

# Derivative rules

Let `\(\mathbf{a}\)` and `\(\mathbf{b}\)` be `\(p\times 1\)` vectors and `\(\mathbf{A}\)` be a `\(p \times p\)` matrix of constants. Then,

* `\(\frac{\partial \mathbf{a}^T\mathbf{b}}{\partial \mathbf{b}} = \mathbf{a}\)`
* `\(\frac{\partial \mathbf{b}^T\mathbf{A}\mathbf{b}}{\partial \mathbf{b}} = (\mathbf{A} + \mathbf{A}^T) \mathbf{b}\)`

What is `\(\frac{\partial L}{\partial \beta}\)`?

`$$\begin{align*}
L\left(\beta\right) & =\left(\mathbf{y}-\boldsymbol{X}\beta\right)^{T}\left(\mathbf{y}-\boldsymbol{X}\beta\right)\\
 & =\left\Vert \mathbf{y}-\boldsymbol{X}\beta\right\Vert ^{2}
\end{align*}$$`

---

# Normal equation

Setting the gradient to zero, we obtain the normal equation:

`$$\boldsymbol{X}^{T}\boldsymbol{X}\beta=\boldsymbol{X}^{T}\mathbf{y}$$`

The solution of this equation is as follows:

`$$\hat{\beta}=\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y}$$`

---

# Food for thought

Are we always happy with this solution? What is the problem?

---

# Gradient descent

<img src="data:image/png;base64,#" width="30%" style="display: block; margin: auto;" />

`$${\displaystyle \mathbf{\beta} _{n+1}=\mathbf {\beta} _{n}-\gamma \nabla L(\mathbf {\beta} _{n})}$$`

where `\(\gamma > 0\)` is the step size and `\(n\)` here indexes the iteration, not the sample size.

---

# Linear Basis function models

`$$f\left(x\right)=\sum_{j=0}^{M-1}\beta_{j}\phi_{j}\left(x\right)=\Phi\left(x\right)\beta$$`

where `\(\phi_j(x)\)` are known as **basis functions**. Typically, `\(\phi_{0}(x) = 1\)` so that `\(\beta_0\)` becomes a bias.

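
To tie the normal equation, the gradient descent update, and basis functions together, here is a minimal sketch in plain Python. It fits the basis `\((1, x)\)` to noise-free toy data in two ways and compares the results. All function names, the step size `gamma=0.01`, the number of steps, and the data are assumptions made for illustration; the gradient used is `\(-2\Phi^{T}(\mathbf{y}-\Phi\beta)\)`.

```python
import math


def design_matrix(xs, M):
    """Rows (x^0, x^1, ..., x^(M-1)): the polynomial basis phi_j(x) = x^j."""
    return [[x ** j for j in range(M)] for x in xs]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def normal_equation(Phi, y):
    """Solve (Phi^T Phi) beta = Phi^T y."""
    Pt = transpose(Phi)
    return solve(matmul(Pt, Phi), matvec(Pt, y))


def gradient(Phi, y, beta):
    """Gradient of L(beta) = ||y - Phi beta||^2."""
    resid = [yi - fi for yi, fi in zip(y, matvec(Phi, beta))]
    return [-2.0 * g for g in matvec(transpose(Phi), resid)]


def gradient_descent(Phi, y, gamma=0.01, steps=5000):
    """Iterate beta <- beta - gamma * gradient, starting from zero."""
    beta = [0.0] * len(Phi[0])
    for _ in range(steps):
        grad = gradient(Phi, y, beta)
        beta = [b - gamma * g for b, g in zip(beta, grad)]
    return beta


# Noise-free toy data on an evenly spaced grid in [0, 1].
xs = [i / 10 for i in range(11)]
ys = [math.sin(2 * math.pi * x) for x in xs]

Phi = design_matrix(xs, M=2)          # basis (1, x)
beta_ne = normal_equation(Phi, ys)    # closed-form solution
beta_gd = gradient_descent(Phi, ys)   # iterative solution
```

On this small, well-conditioned problem the two estimates agree closely. The step size was chosen by hand for this data; a value that is too large makes the iteration diverge, and higher-order polynomial bases generally need more steps or a different step size.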

---

# Example of basis functions

.pull-left[

* Polynomial basis functions (global)

`$$\phi_j(x)=x^j$$`

* Gaussian basis functions (local)

`$$\phi_j(x)=\exp\left(-\frac{(x-\mu_j)^2}{2\sigma^2}\right)$$`

]

.pull-right[

* Sigmoidal basis functions (local)

`$$\phi_j(x)=\sigma\left(\frac{x-\mu_j}{s}\right)$$`

where

`$$\sigma(x) = \frac{1}{1+\exp(-x)}$$`

]

---

# Easy Example

.pull-left[

Polynomial Curve Fitting

`$$y = \sin(2 \pi x) + \epsilon$$`

```
## # A tibble: 10 x 2
##        x      y
##    <dbl>  <dbl>
##  1  0.3   1.00
##  2  0.25  1.18
##  3  0.65 -0.806
##  4  1     0.346
##  5  0.55 -0.525
##  6  0.15  0.754
##  7  0.95 -0.273
##  8  0.7  -0.649
##  9  0.5   0.321
## 10  0.9  -0.956
```

]

.pull-right[

![](data:image/png;base64,#lec4_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

]

---

# 0th order polynomial

.pull-left[

`\(f(x) = \beta_0\)`

]

.pull-right[

![](data:image/png;base64,#lec4_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

]

---

# 1st order polynomial

.pull-left[

`\(f(x) = \beta_0 + \beta_1 x\)`

]

.pull-right[

![](data:image/png;base64,#lec4_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

]

---

# 3rd order polynomial

.pull-left[

`\(f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)`

]

.pull-right[

![](data:image/png;base64,#lec4_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

]

---

# Feel so good~! Let's do 9th!!

.pull-left[

Does this look okay? Why?

]

.pull-right[

![](data:image/png;base64,#lec4_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

]

---

# Avoid overfitting: Regularization

Previously we looked at linear models. Let's extend our candidates!

`$$RSS(f) = \left(\mathbf{y}-f(X)\right)^T\left(\mathbf{y}-f(X)\right)$$`

To avoid overfitting, we will consider the following penalized RSS (PRSS):

`$$PRSS(f;\lambda) = RSS\left(f\right)+\lambda J\left(f\right)$$`

where the functional `\(J(f)\)` represents a regularization term.

---

# Bias-variance trade-off

We observe a quantitative response `\(Y\)` and `\(p\)` different predictors, `\(X_1, ..., X_p\)`.

`$$Y = f(X) + \epsilon$$`

where `\(X = (X_1, ..., X_p)\)` and `\(\epsilon\)` is a random error term, which is independent of `\(X\)` and has mean zero.

We can predict `\(Y\)` using

`$$\hat{Y}=\hat{f}(X),$$`

where `\(\hat{f}\)` represents our estimate for `\(f\)`, and `\(\hat{Y}\)` represents the resulting prediction for `\(Y\)`.

---

# Accuracy of `\(\hat{Y}\)`

The accuracy of `\(\hat{Y}\)` as a prediction for `\(Y\)` depends on two quantities:

* Reducible error
* Irreducible error

For a fixed `\(\hat{f}\)` and `\(X\)`,

`$$\begin{align*}
\mathbb{E}\left(Y-\hat{Y}\right)^{2} & =\mathbb{E}\left[\left(f\left(X\right)+\epsilon-\hat{f}\left(X\right)\right)^{2}\right]\\
 & =\left[f\left(X\right)-\hat{f}\left(X\right)\right]^{2}+Var\left(\epsilon\right)
\end{align*}$$`

where the first term is the reducible error and `\(Var(\epsilon)\)` is the irreducible error.

---

# Expected test error

Expected test error can be decomposed into the following three terms:

* `\(Variance\)`, `\(Noise\)`, `\(Bias^2\)`

`$$\begin{align*}
 & \mathbb{E}_{D,X,y}\left[\left(\hat{f}_{D}\left(X\right)-y\right)^{2}\right]\\
= & \mathbb{E}_{X,D}\left[\left(\hat{f}_{D}\left(X\right)-\bar{f}\left(X\right)\right)^{2}\right]+\\
 & \mathbb{E}_{X,y}\left[\left(f\left(X\right)-y\right)^{2}\right]+\\
 & \mathbb{E}_{X}\left[\left(\bar{f}\left(X\right)-f\left(X\right)\right)^{2}\right]
\end{align*}$$`

where `\(\bar{f}(X)=\mathbb{E}_{D}[\hat{f}_{D}(X)]\)` is the prediction averaged over training sets `\(D\)`.

---

# Ridge regression

Ridge regression uses the squared `\(L_2\)` norm as the penalty

`$$\underset{\beta}{\min}\left(y-X\beta\right)^{T}\left(y-X\beta\right)+\frac{\lambda}{2}\left\Vert \beta\right\Vert _{2}^{2}$$`

H.W. What is the optimal `\(\beta_{\star}\)`?

# Lasso regression

Lasso regression uses the `\(L_1\)` norm as the penalty

`$$\underset{\beta}{\min}\left(y-X\beta\right)^{T}\left(y-X\beta\right)+\frac{\lambda}{2}\left\Vert \beta\right\Vert _{1}$$`

---

# Elastic Net

Why don't we use both penalties?

`$${\displaystyle {\hat {\beta }}\equiv {\underset {\beta }{\operatorname {argmin} }}(\|y-X\beta \|^{2}+\lambda _{2}\|\beta \|^{2}+\lambda _{1}\|\beta \|_{1}).}$$`

The loss function can be parameterized with two parameters, `\(\lambda\)` and `\(\alpha\)`:

* `\(\lambda\)` controls the overall magnitude of the penalty
* `\(\alpha\)` controls the relative weights of the two penalty functions

`$$\underset{\beta}{\min}\left(y-X\beta\right)^{T}\left(y-X\beta\right)+\frac{\lambda}{2}\left(\alpha\left\Vert \beta\right\Vert _{1}+\left(1-\alpha\right)\left\Vert \beta\right\Vert _{2}^{2}\right)$$`

---

# Problem

So we have two models, `Lasso` and `Ridge` regression, and a more general model called `Elastic net`. These models have tuning parameters.

How do we determine these parameters?

* We can't use the test data set. (That would be cheating, and in Kaggle we don't know the dependent variable of the test data.)

---

# Validation set

Make our own validation set from the `train data set`.

* Assumption: the train and test data sets have the same data distribution.

<img src="data_image/png;base64,#./validationset.png" width="65%" style="display: block; margin: auto;" />

---

# Hyperparameter Tuning

If our model performs well on the validation set, we expect it to work well on the test data!

* Tune the hyperparameters using the validation set.
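
The tuning loop can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a prescribed procedure: the data are synthetic, the 30/10 split and the grid of `lam` values are arbitrary, and `fit_ridge_1d` is a hypothetical helper that handles only the one-coefficient, no-intercept special case of the ridge objective with penalty `\(\frac{\lambda}{2}\beta^{2}\)`.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """One-coefficient ridge fit without intercept.

    Minimizes sum((y - b * x)^2) + (lam / 2) * b^2; setting the
    derivative with respect to b to zero gives the ratio below.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam / 2.0)


def mse(xs, ys, b):
    """Mean squared prediction error of the fitted line y = b * x."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic "train data set": y = 2x + noise (an assumption for the demo).
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

# Split the train data into a smaller training part and a validation part.
idx = list(range(len(xs)))
rng.shuffle(idx)
train_idx, valid_idx = idx[:30], idx[30:]
x_train = [xs[i] for i in train_idx]
y_train = [ys[i] for i in train_idx]
x_valid = [xs[i] for i in valid_idx]
y_valid = [ys[i] for i in valid_idx]

# Fit on the training part for each candidate, score on the validation part.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
valid_errors = {}
for lam in grid:
    b = fit_ridge_1d(x_train, y_train, lam)
    valid_errors[lam] = mse(x_valid, y_valid, b)

# Keep the hyperparameter value with the smallest validation error.
best_lam = min(valid_errors, key=valid_errors.get)
```

The test data never enters the loop: every candidate is fit on the training part and compared only on the held-out validation part.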