Credits: This post records what I learned in UIUC CS446 (Spring 2024).

Chinese version

Linear Regression

Simple Case

Linear regression is a simple method for solving regression problems, and it has a closed-form solution. Let me first illustrate how to derive this closed-form solution. Suppose we have a dataset \(\{(x^1,y^1),\ldots,(x^n,y^n)\}\), where \(x^i\in\mathbb{R}^d\) and \(y^i\in\mathbb{R}\). We want to find a linear function \(f(x)=w^Tx+b\) to fit the data.

To keep things simple, we first discuss the case \(d=1\), i.e., \(x^i\) is a real number. In this case, we can write the linear function as \(f(x)=w_1x+w_2\). What we want to find is the optimal \(w_1\) and \(w_2\) solving \(\underset{w_1,w_2}{\operatorname{argmin}} \frac{1}{2}\sum_{i=1}^n(y^i-w_1x^i-w_2)^2\). Translating the problem into matrix form, we have

\[\underset{w_1,w_2}{\operatorname{argmin}} \frac{1}{2} \left\| \begin{bmatrix} y^{1} \\ \vdots \\ y^{n} \end{bmatrix} - \begin{bmatrix} x^{1} & 1 \\ \vdots & \vdots \\ x^{n} & 1 \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \right\|^2_2\]

Then, we denote the three matrices above as \(X^T\), \(Y\), and \(w\) respectively, with \(X^T\in \mathbb{R}^{n\times 2}\), \(Y\in \mathbb{R}^{n}\), \(w\in \mathbb{R}^{2}\). The problem becomes \(\underset{w}{\operatorname{argmin}} \frac{1}{2}\|Y-X^Tw\|^2_2\). To solve it, we take the derivative of the objective with respect to \(w\) and set it to 0. With a bit of matrix calculus, we have \(L = \frac{1}{2}(Y-X^Tw)^T(Y-X^Tw)= \frac{1}{2}(Y^TY-Y^TX^Tw-w^TXY +w^TXX^Tw)\)

\[\frac{\partial L}{\partial w} = \frac{1}{2}(-XY-XY+2XX^Tw) = 0\] \[w = (XX^T)^{-1}XY\]
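As a sanity check, here is a minimal NumPy sketch of this closed-form solution for the \(d=1\) case. The data arrays `x` and `y` below are made up purely for illustration.

```python
import numpy as np

# Toy 1-D data (made up for illustration): y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2 * x + 1 + 0.1 * rng.standard_normal(50)

# Design matrix X^T in R^{n x 2}: each row is [x^i, 1]
XT = np.stack([x, np.ones_like(x)], axis=1)
X = XT.T

# Closed-form solution w = (X X^T)^{-1} X Y
w = np.linalg.solve(X @ XT, X @ y)
print(w)  # approximately [2.0, 1.0], i.e., [w_1, w_2]
```

Using `np.linalg.solve` on the normal equations is numerically preferable to forming the inverse explicitly, but it computes exactly the same \(w\) as the formula above.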

General Case

Let's go back to the general case. We can also handle higher-order polynomial regression, where the input \(x^i\) is mapped to a feature vector rather than used as a single real number. Suppose we have a dataset \(\{(x^1,y^1),\ldots,(x^n,y^n)\}\) with \(x^i, y^i\in\mathbb{R}\), and we want to fit a polynomial \(f(x)=w_0+w_1x+w_2x^2+\cdots+w_dx^d\). The problem becomes

\[\underset{w_0, w_1, \ldots, w_d}{\operatorname{argmin}} \frac{1}{2} \left\| \begin{bmatrix} y^{1} \\ \vdots \\ y^{n} \end{bmatrix} - \begin{bmatrix} (x^{1})^d & \cdots & x^{1} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ (x^{n})^d & \cdots & x^{n} & 1 \end{bmatrix} \cdot \begin{bmatrix} w_d \\ \vdots \\ w_1 \\ w_0 \end{bmatrix} \right\|^2_2\]

Then, \(X^T \in \mathbb{R}^{n\times (d+1)}\), \(Y \in \mathbb{R}^{n}\), \(w \in \mathbb{R}^{d+1}\). The problem is again \(\underset{w}{\operatorname{argmin}} \frac{1}{2}\|Y-X^Tw\|^2_2\). Similarly, we take the derivative of the objective with respect to \(w\), set it to 0, and get the same closed-form solution as above: \(w = (XX^T)^{-1}XY\)
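A minimal sketch of the same recipe for degree-\(d\) polynomial regression, again with made-up data. The design matrix simply stacks the rows \([(x^i)^d, \ldots, x^i, 1]\).

```python
import numpy as np

def fit_polynomial(x, y, d):
    """Closed-form least-squares fit of a degree-d polynomial.

    Returns weights ordered as [w_d, ..., w_1, w_0], matching a
    design matrix whose rows are [(x^i)^d, ..., x^i, 1].
    """
    # X^T in R^{n x (d+1)}: column k holds (x^i)^(d-k)
    XT = np.stack([x ** p for p in range(d, -1, -1)], axis=1)
    X = XT.T
    # w = (X X^T)^{-1} X Y
    return np.linalg.solve(X @ XT, X @ y)

# Made-up example: recover the cubic y = x^3 - 2x + 3
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = x ** 3 - 2 * x + 3 + 0.05 * rng.standard_normal(100)
print(fit_polynomial(x, y, d=3))  # roughly [1, 0, -2, 3]
```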

Regularization

In practice, we may encounter the case \(n<d+1\). Then the \((d+1)\times(d+1)\) matrix \(XX^T\) has rank at most \(n\) and is not invertible. To fix this, we can add a regularization term to the objective, which becomes \(\underset{w}{\operatorname{argmin}} \frac{1}{2}\|Y-X^Tw\|^2_2+\frac{\lambda}{2}\|w\|^2_2\). The closed-form solution becomes \(w = (XX^T+\lambda I)^{-1}XY\). Regularization also shrinks the parameters, which helps avoid overfitting.
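A minimal sketch of the regularized (ridge) solution, reusing the polynomial design matrix from the sketch above; the value of \(\lambda\) here is made up for illustration.

```python
import numpy as np

def fit_ridge(XT, y, lam):
    """Closed-form ridge regression: w = (X X^T + lam * I)^{-1} X Y.

    XT is the n x (d+1) design matrix (rows are feature vectors),
    so X X^T + lam * I is (d+1) x (d+1) and invertible for lam > 0.
    """
    X = XT.T
    return np.linalg.solve(X @ XT + lam * np.eye(XT.shape[1]), X @ y)

# Example with fewer samples (n = 5) than parameters (d + 1 = 10),
# where the unregularized normal equations would be singular.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5)
XT = np.stack([x ** p for p in range(9, -1, -1)], axis=1)
y = 2 * x + 1 + 0.01 * rng.standard_normal(5)
print(fit_ridge(XT, y, lam=0.1))
```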