**Ordinary least squares** (OLS) is a method of estimating the parameters in a linear regression model.

We have a dataset of $n$ observations $(y_i, \mathbf x_i)$, where $\mathbf x_i$ are the column vectors of length $p$ containing the parameters and $y_i$ are the scalar responses. We set up the linear function for the $i$'th case as

$$ y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i,$$

$$ y_i = \mathbf x^T_i\beta + \epsilon_i .$$

Here $\beta$ is an unknown $p\times 1$ vector and $\epsilon_i$ is the error of the model in the $i$'th case. So for all $i$ cases, we write the total equation on matrix form as

$$ \mathbf y = \mathbf X\mathbf \beta + \mathbf \epsilon ,$$

with $\mathbf X$ being an $n\times p$ matrix, often called the *design matrix* or *matrix of regressors*. It's worth noting that - contrary to what one might think - the $i$'th **row** of $\mathbf X$ equals $\mathbf x_i^T$. Also, the constant term is included in our equation by setting $x_{i1} = 1\ \forall\ i=1,\dots, n$.

Now because of the errors, $\mathbf y = \mathbf X\mathbf \beta$ does not have a unique solution. However, we can find the best fit by finding the vector $\hat \beta$ that minimizes some sort of cost function,

$$ \hat \beta = \underset{\beta}{\arg \min} S(\beta) .$$

$\hat \beta$ will give us an estimate of $\mathbf y$, namely $\hat{\mathbf y} = \mathbf X \hat \beta$. In the case of OLS, the cost function we use is the

$$ \text{RSS}(\beta) = \sum_{i=1}^N \left(y_i - \hat y_i\right)^2 .$$

Changing into matrix notation again, we get

$$ \text{RSS}(\beta) = (\mathbf y - \hat{\mathbf y})^T(\mathbf y - \hat{\mathbf y}) = (\mathbf y - \mathbf X\beta)^T(\mathbf y - \mathbf X\beta) ,$$

which we can differentiate with respect to $\beta$ to find the minimum.

$$ \frac{\partial \text{RSS}}{\partial \beta} = -2\mathbf X^T (\mathbf y - \mathbf X\beta) .$$

Assuming full column rank for $\mathbf X$, $(\mathbf X^T \mathbf X)$ is thus positive definite (and importantly, invertible). Setting the first derivative to $0$, we get^{1}:

$$ \mathbf X^T(\mathbf y - \mathbf X\beta) = 0$$

$$\Rightarrow \hat \beta = (\mathbf X^T\mathbf X)^{-1}\mathbf X^T \mathbf y $$

^{1}

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed). Springer.