Introduction to Gaussian Processes

Introduction to Gaussian Processes

In this post, I will cover the very basics of Gaussian Processes following the presentations in Murphy (2012) and Rasmussen and Williams (2006) — though hopefully easier to digest. I will approach the explanation in two ways: (1) derived from Bayesian Ordinary Linear Regression and (2) derived from the definition of Gaussian Processes in functional space (it is easier than it sounds). I am not sure why the mentioned books do not use “^” to denote estimation (if you do, please leave a comment), but I will stick to their notation assuming you may want to look into them despite running into the possibility of an statistical sin. Lastly, I would like to thank Jared Smith for reviewing this post and providing highly statistically significant insights.

If you are not familiar with Bayesian regression (see my previous post), skip the section below and begin reading from “Succinct derivation of Gaussian Processes from functional space.”

From Bayesian Ordinary Linear Regression to Gaussian Processes

Recapitulation of Bayesian Ordinary Regression

In my previous blog post, about Bayesian Ordinary Linear Regression (OLR), I showed that the predictive distribution for a point \boldsymbol{x}_* for which we are trying to estimate y_* with Bayesian OLR is

\begin{aligned}p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2) &= \int_{\boldsymbol{\beta}}\underbrace{p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2,\boldsymbol{\beta})}_\text{Likelihood}\underbrace{p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2)}_\text{\parbox{1.6cm}{Parameter Posterior}}d\boldsymbol{\beta}\\  &=\mathcal{N}(y_*|\boldsymbol{\beta}_N^T\boldsymbol{x}_*, \sigma_N^2(\boldsymbol{x}_*))\end{aligned}

where we are trying to regress a model over the data set \mathcal{D}=\{\boldsymbol{X},\boldsymbol{y}\}, \boldsymbol{X} being the independent variables and \boldsymbol{y} the dependent variable, \boldsymbol{\beta} is a vector of model parameters and \sigma_{\epsilon}^2 is the model error variance. The unusual notation for the normal distribution \mathcal{N} means “a normal distribution of y_* with (or given) the regression mean \boldsymbol{\beta}_N^T\boldsymbol{x}_* and variance \sigma_N^2(\boldsymbol{x}_*),” where N is the number of points in the data set. The estimated variance of \boldsymbol{y}_*, \sigma_N^2(\boldsymbol{x}_*) and the estimated regression parameters \boldsymbol{\beta}_N can be calculated as

\boldsymbol{V}_N = \sigma_{\epsilon}^2(\sigma_{\epsilon}^2\boldsymbol{V}_0^{-1}+\boldsymbol{X}^T\boldsymbol{X})^{-1}
\boldsymbol{\beta}_N=\boldsymbol{V}_N\boldsymbol{V}_0^{-1}\boldsymbol{\beta}_0 + \frac{1}{\sigma_{\epsilon}^2}\boldsymbol{V}_N\boldsymbol{X}^T\boldsymbol{y}
\hat{\sigma}_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{x}_*^{T}\boldsymbol{V}_N\boldsymbol{x}_*

where \boldsymbol{V}_0 and \boldsymbol{\beta}_0 are the mean and the covariance of the parameter prior distribution. All the above can be used to estimate y_* as

y_* = f(\boldsymbol{x}_*) = \boldsymbol{\beta}_N^T\boldsymbol{x}_* with an error variance of \sigma_N^2(\boldsymbol{x}_*)

If we assume a prior mean of 0 (\boldsymbol{\beta}_0=0), replace \boldsymbol{V}_N and \boldsymbol{\beta}_N by their definitions, and apply a function \phi(\boldsymbol{x}), e.g. \phi(\boldsymbol{x})=(1, x_1, x_2, x_1x_2, x_1^2 ...)^T, to add features to \boldsymbol{X} so that we can use a linear regression approach to fit a non-linear model over our data set \mathcal{D}, we would instead have that

\boldsymbol{\beta}_N = \boldsymbol{\phi}_*^T( \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_X\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{\phi}_*^{T}( \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_*

where \boldsymbol{\phi}_* = \phi(\boldsymbol{x}_*) and \boldsymbol{\phi}_X = \phi(\boldsymbol{X}). One problem with this approach is that as we add more terms (also called features) to function \phi(\cdot) to better capture non-linearities, the the dimension of the matrix we will have to invert increases. It is also not always obvious what features we should add to our data. Both problems can be handled with Gaussian Processes.

Now to Gaussian Processes

This is where we transition from Bayesian OLR to Gaussian Processes. What we want to accomplish in the rest of this derivation is to find a way of: (1) adding as many features to the data as we need to calculate \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) without increasing the size of the matrix \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T we have to invert (remember the dimensions of such matrix equals the number of features in \phi(\cdot)), and of (2) finding a way of implicitly adding features to \boldsymbol{X} and \boldsymbol{x}_* without having to do so manually — manually adding features to the data may take a long time, specially if we decide to add (literally) an infinite number of them to have an interpolator.

The first step is to make the dimensions of the matrix we want to invert equal to the number of data points instead of the number of features in \phi(\cdot). For this, we can re-arrange the equations for \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) as

\boldsymbol{\beta}_N =\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_*( \sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_X)^{-1}\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = \boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_*-\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_X(\sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{\phi}_X)^{-1}\boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_*

We now have our feature-expanded data and points for which we want point estimates always appearing in the form of \phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x'})\boldsymbol{x}' is just another point \boldsymbol{x}_i either in the data set or for which we want to estimate y_*. From now on, \kappa(x, x') = \phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x'}) will be called a covariance or kernel function. Since the prior’s covariance \boldsymbol{V}_0 is positive semi-definite (as any covariance matrix), we can write

\kappa(\boldsymbol{x}, \boldsymbol{x}') = \phi(\boldsymbol{x})^T (\boldsymbol{V}_0^{1/2})^T(\boldsymbol{V}_0^{1/2})\phi(\boldsymbol{x}')=\psi(\boldsymbol{x})^T\psi(\boldsymbol{x}')

where \psi(\boldsymbol{x})=\boldsymbol{V}_0^{1/2}\phi(\boldsymbol{x'}). Using \kappa(\boldsymbol{x}, \boldsymbol{x}') and assuming a prior of \boldsymbol{I}, \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) then take a new shape in the predictor posterior of a Gaussian Process:

\boldsymbol{\beta}_N =K_*( \sigma_{\epsilon}^2\boldsymbol{I} + K)^{-1}\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = k_{**}-\boldsymbol{k}_*(\sigma_{\epsilon}^2\boldsymbol{I}+K)^{-1}\boldsymbol{k}_*^T

where K = \kappa(\boldsymbol{\phi}_X,\boldsymbol{\phi}_X), \boldsymbol{k}_* = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_X) and k_{**} = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_*). Now the features of the data are absorbed within the inner-products between observed \boldsymbol{x}‘s and or for \boldsymbol{x}_*, so we can add as many features as we want without impacting the dimensions of the matrix we will have to invert. Also, after this transformation, instead of \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) representing the mean and covariance between model parameters, they represent the mean and covariance of and among data points and/or points for which we want to get point estimates. The parameter posterior from Bayesian OLR would now be used to sample the values of y_* directly instead of model parameters. And now we have a Gaussian Process with which to estimate y_*. For plots of samples of \boldsymbol{y}_* from the prior and from the predictive posterior, of the mean plus or minus the error variances, and of models with different kernel parameters, see figures at the end of the post.

Kernel (or Covariance) matrices

The convenience of having \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) written in terms of \kappa(\cdot, \cdot), is that \kappa does not have to represent simply a matrix-matrix or vector-matrix multiplication of \boldsymbol{X} and \boldsymbol{x}_*. In fact, function \kappa can be any function that corresponds to an inner-product after some transformation from \boldsymbol{x}\rightarrow\phi(\boldsymbol{x}), which will necessarily return a positive semi-definite matrix and therefore be a valid covariance matrix. There are several kernels (or kernel functions or kernel shapes) available, each accounting in a different way for the nearness or similarity (at least in Gaussian Processes) between values of x_i:

  • the linear kernel: \kappa(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T\boldsymbol{x}'. This kernel is used when trying to fit a line going through the origin.
  • the polynomial kernel: \kappa(\boldsymbol{x}, \boldsymbol{x}') = (1 + \boldsymbol{x}^T\boldsymbol{x}')^d, where d is the dimension of the polynomial function. This kernel is used when trying to fit a polynomial function (such as the fourth order polynomial in the blog post about Bayesian OLR), and
  • the Radial Basis Function, or RBF (aka Squared Exponential, or SE), \kappa(\boldsymbol{x}, \boldsymbol{x}') = e^{-\frac{||\boldsymbol{x}-\boldsymbol{x}'||^2}{2\sigma_{RBF}^2}}, where \sigma_{RBF} is a kernel parameter denoting the characteristic length scale of the function to be approximated. Using the RBF kernel is equivalent to adding an infinite (!) number of features the data, meaning \phi(\boldsymbol{x})=(1, x_1, ..., x_d, x_1x_2, x_1x_3,..., x_{d-1}x_d, x_1x_2x_3, ...)^T.

But this is all quite algebraic. Luckily, there is a more straight-forward derivation shown next.

Succinct derivation of Gaussian Processes from functional space.

Most regression we study in school is a way of estimating the parameters \boldsymbol{\beta} of a model. In Bayesian OLR, for example, we have a distribution (the parameter posterior) from which to sample model parameters. What if we instead derived a distribution from which to sample functions themselves? The first (and second and third) time I heard about the idea of sampling functions I thought right away of opening a box and from it drawing exponential and sinusoid functional forms. However, what is meant here is a function of an arbitrary shape without functional form. How is this possible? The answer is: instead of sampling a functional form itself, we sample the values of y_*=f(\boldsymbol{x}_*) at discrete points, with a collection of \boldsymbol{y}_i for all \boldsymbol{x}_i in our domain called a function \boldsymbol{f}. After all, don’t we want a function more often than not solely for the purpose of getting point estimates and associated errors for given values of \boldsymbol{x}_*?

In Gaussian Process regression, we sample functions \boldsymbol{f} from a distribution by considering each value of x_i in a discretized range along the x-axis as a random variable. That means that if we discretize our x-axis range into 100 equally spaced points, we will have 100 random variables x_0,..,x_{100}. The mean of all possible functions at \boldsymbol{x}_i therefore represents the mean value of y_i in \boldsymbol{x}_i, and each term in the covariance matrix (kernel matrix, see “Kernel (or Covariance) matrices” section above) represents how similar the values of y_i and y_j will be for each pair \boldsymbol{x}_i and \boldsymbol{x}_j based on the value of a distance metric between  \boldsymbol{x}_i and \boldsymbol{x}_j: if \boldsymbol{x}_i and \boldsymbol{x}_j are close to each other, \boldsymbol{y}_i and \boldsymbol{y}_j tend to be similar and vice-versa. We can therefore turn the sampled functions y = f(\boldsymbol{x}) into whatever functional form we want by choosing the appropriate covariance matrix (kernel) — e.g. a linear kernel will return a linear model, a polynomial kernel will return a polynomial function, and an RBF kernel will return a function with an all linear and an infinite number of non-linear terms. A Gaussian Process can then be written as

f(\boldsymbol{x}) \sim \mathcal{GP}(\mu(\boldsymbol{x}), \kappa(\boldsymbol{x}, \boldsymbol{x}'))

where \mu(\boldsymbol{x}) is the mean function (e.g. a horizontal line along the x-axis at y = 0 would mean that \mu(\boldsymbol{x}) = [0,..,0]^T) and \kappa(\boldsymbol{x}, \boldsymbol{x}') is the covariance matrix, which here expresses how similar the values of y_i will be based on the distance between two values of x_i.

To illustrate this point, assume we want to sample functions over 200 discretizations \boldsymbol{X}_* = x_{*0}, ..., x_{*199} along the x-axis from x=-5 to x=5 (\Delta x = 0.05) using a Gaussian Process. Assume the parameters of our Gaussian Process from which we will sample our functions are mean 0 and that it uses an RBF kernel for its covariance matrix with parameter \sigma_{RBF}=2 (not be confused with standard deviation), meaning that k_{i,i}=1.0 ,\:\forall i \in [0, 199] and k_{i,j}=e^{-\frac{||x_i-x_j||^2}{2\cdot 2^2}},\: \forall i,j \in [0, 199], or

\begin{aligned} p(\boldsymbol{f}_*|\boldsymbol{X_*})&=\mathcal{N}(\boldsymbol{f}|\boldsymbol{\mu},\boldsymbol{K_{**}})\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[\mu_{*0},..,\mu_{*199}], \begin{bmatrix} k_{*0,0} & k_{*0,1} & \cdots & k_{*0,199} \\ k_{*1,0} & k_{*1, 1} & \cdots & k_{*1,199}\\ \vdots & \vdots & \ddots & \vdots \\ k_{*199,0} & k_{*199,1} & \cdots & k_{*199,199} \end{bmatrix} \right)\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[0,..,0], \begin{bmatrix} 1 & 0.9997 & \cdots & 4.2\cdot 10^{-6} \\ 0.9997 & 1 & \cdots & 4.8\cdot 10^{-6}\\ \vdots & \vdots & \ddots & \vdots \\ 4.2\cdot 10^{-6} & 4.8\cdot 10^{-6} & \cdots & 1 \end{bmatrix} \right) \end{aligned}

Each sample from the normal distribution above will return 200 values: \boldsymbol{f}_* = [y_0, ..., y_{199}]. A draw of 3 such sample functions would look generally like the following:


The functions above look incredibly smooth because of the RBF kernel, which adds an infinite number of non-linear features to the data. The shaded region represents the prior distribution of functions. The grey line in the middle represents the mean of an infinite number of function samples, while the upper and lower bounds represent the corresponding prior mean plus or minus standard deviations.

These functions looks quite nice but pretty useless. We are actually interested in regressing our data assuming a prior, namely on a posterior distribution, not on the prior by itself. Another way of formulating this is to say that we are interested only in the functions that go through or close to certain known values of \boldsymbol{x} and y. All we have to do is to add random variables corresponding to our data and condition the resulting distribution on them, which means accounting only for samples of functions that exactly go through (or given) our data points. The resulting distribution would then look like p(\boldsymbol{f}_*|\mathcal{D}, \boldsymbol{X}_*). After adding our data \boldsymbol{X}, \boldsymbol{f}, or \boldsymbol{X}, \boldsymbol{y}, our distribution would look like

\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}K(\boldsymbol{X},\boldsymbol{X}) & K(\boldsymbol{X},\boldsymbol{X}_*) \\ K(\boldsymbol{X}_*,\boldsymbol{X}) & K(\boldsymbol{X}_*,\boldsymbol{X}_*)\end{bmatrix}\right)

In this distribution we have random variables representing the \boldsymbol{X} from our data set \mathcal{D} and the \boldsymbol{X}_* from which we want to get point estimates and corresponding error standard deviations. The covariance matrix above accounts for correlations of all observations with all other observations, and correlations of all observations with all points to be predicted. We now just have to condition the multivariate normal distribution above over the points in our data set, so that the distribution accounts for the infinite number of functions that go through our \boldsymbol{y} at the corresponding \boldsymbol{X}. This yields

p(\boldsymbol{f}_*|\boldsymbol{X}_*,\boldsymbol{X},\boldsymbol{y})=\mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}})


\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T\boldsymbol{K}^{-1}\boldsymbol{K}_*

\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)

where \boldsymbol{K} = K(\boldsymbol{X},\boldsymbol{X})\boldsymbol{K}_* = K(\boldsymbol{X},\boldsymbol{X}_*), \boldsymbol{K}_{**} = K(\boldsymbol{X_*},\boldsymbol{X_*}), and diag\left(\bar{\boldsymbol{\Sigma}}\right) is a vector containing the elements in the diagonal of the covariance matrix \bar{\boldsymbol{\Sigma}} — the variances \bar{\sigma}^2_i of each random variable x_i. If you do not want to use a prior with \boldsymbol{\mu} = \boldsymbol{\mu}_* = [0, ... , 0]^T, the mean of the Gaussian Process would be \bar{\boldsymbol{\mu}}=\boldsymbol{\mu}_* + \boldsymbol{K}_*^T\boldsymbol{K}^{-1}(\boldsymbol{f} - \boldsymbol{\mu}). A plot of \boldsymbol{f}_* = \bar{\boldsymbol{\mu}} estimated for 200 values of x_i — so that the resulting curve looks smooth — plus or minus associated uncertainty (\pm\bar{\boldsymbol{\sigma}}^2), regressed on 5 random data points \boldsymbol{X}, \boldsymbol{y}, should look generally like



A plot of three sampled functions (colored lines below, the grey line is the mean) \boldsymbol{f}_* \sim \mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}}) should look generally like



Several kernel have parameters that can strongly influence the regressed model and that can be estimated — see Murphy (2012) for a succinct introduction to kernel parameters estimation. One example is parameter \sigma_{RBF}^2 of the RBF kernel, which determines the correlation length scale of the fitted model and therefore how fast with increasing separation distance between data points the model will default to the prior (here, with \boldsymbol{\mu}=0). The figure below exemplifies this effect


All figures have the same data points but the resulting models look very different. Reasonable values for \sigma_{RBF}^2 depend on the spacing between the data points and are therefore data-specific. Also, if when you saw this plot you thought of fitting RBF functions as interpolators, that was not a coincidence: the equations solved when fitting RBFs to data as interpolators are the same solved when using Gaussian Processes with RBF kernels.

Lastly, in case of noisy observation of y in \mathcal{D} the Gaussian Process distribution and the expressions for the mean, covariance, and standard deviation become

\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I} & \boldsymbol{K}_* \\ \boldsymbol{K}_*^T & \boldsymbol{K}_{**}\end{bmatrix}\right)

\bar{\boldsymbol{\mu}}=\boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{y}

\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{K}_*

\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)

where \sigma_{\epsilon}^2 is the error variance and \boldsymbol{I} is an identity matrix. The mean and error predictions would look like the following


Concluding Remarks

What I presented is the very basics of Gaussian Processes, leaving several important topics which were not covered here for the interested reader to look into. Examples are numerically more stable formulations, computationally more efficient formulations, kernel parameter estimation, more kernels, using a linear instead of zero-mean prior (semi-parametric Gaussian Processes), Gaussian Processes for classification and Poisson regression, the connection between Gaussian Processes and other methods, like krieging and kernel methods for nonstationary stochastic processes. Some of these can be found in great detail in Rasmussen and Williams (2006), and Murphy (2012) has a succinct presentation of all of the listed topics.

I was really confused about Gaussian Processes the first time I studied it and tried to make this blog post as accessible as possible. Please leave a comment if you think the post would benefit from any particular clarification.



Murphy, Kevin P., 2012. Machine Learning: A Probabilistic Perspective. The MIT Press, ISBN 978-0262018029

Rasmussen, Carl Edward and Williams, Christopher K. I., 2006. Gaussian Processes for Machine Learning. The MIT Press, ISBN 978-0262182539.

From Frequentist to Bayesian Ordinary Linear Regression

In this post, I will briefly review the basic theory about Ordinary Linear Regression (OLR) using the frequentist approach and from it introduce the Bayesian approach. This will lay the terrain for a later post about Gaussian Processes. The reader may ask himself why “Ordinary Linear Regression” instead of “Ordinary Least Squares.” The answer is that least squares refers to the objective function to be minimized, the sum of the square errors, which is not used the Bayesian approach presented here. I want to thank Jared Smith and Jon Lamontagne for reviewing and improving this post. This post closely follows the presentation in Murphy (2012), which for some reason does not use “^” to denote estimated quantities (if you know why, please leave a comment).

Ordinary Linear Regression

OLR is used to fit a model that is linear in the parameters to a data set \mathcal{D}=\{\boldsymbol{X},\boldsymbol{y}\}, where \boldsymbol{X} represents the independent variables and \boldsymbol{y} the dependent variable, and has the form

p(y|\boldsymbol{x}, \boldsymbol{\beta}) = \mathcal{N}(y|\boldsymbol{\beta}^T\boldsymbol{x}, \sigma_{\epsilon}^2)

where \boldsymbol{\beta} is a vector of model parameters and \sigma_{\epsilon}^2 is the model error variance. The unusual notation for the normal distribution \mathcal{N} should be read as “a normal distribution of \boldsymbol{y} with (or given) mean \boldsymbol{\beta}^T\boldsymbol{x} and variance \sigma_{\epsilon}^2.” One frequentist method of estimating the parameters of a linear regression model is the method of maximum likelihood estimation (MLE). MLE provides point estimates \boldsymbol{\hat{\beta}} for each of the regression model parameters \boldsymbol{\beta} by maximizing the likelihood function with respect to \boldsymbol{\beta} and \sigma_{\epsilon}^2 by solving the maximization problem

\boldsymbol{\hat{\beta}} = arg\,\underset{\boldsymbol{\beta}}{max}\;\mathcal{L}(\boldsymbol{\beta})

where the likelihood function \mathcal{L}(\boldsymbol{\beta}) is defined as

\mathcal{L}(\boldsymbol{\beta})=p(\mathcal{D}|\boldsymbol{\beta})={\displaystyle \prod_{i=1}^{N} p(y_i|\boldsymbol{x}_i, \boldsymbol{\beta}, \mu, \sigma_{\epsilon}^2)} = \mathcal{N}(\boldsymbol{y}|\mu + \boldsymbol{X}\boldsymbol{\beta},\sigma_{\epsilon}^2\boldsymbol{I})

where \mu is the mean of the model error (normally \mu=0), N is the number of points in the data set, \boldsymbol{I} is an identity matrix, and \boldsymbol{x} is a vector of values of the independent variables for an individual data point of matrix \boldsymbol{X}. OLR assumes independence and the same model error variance among observations, which is expressed by the covariance matrix \sigma_{\epsilon}^2\boldsymbol{I}. Digression: in Generalized Linear Regression (GLR), observations may not be independent and may have different variances (heteroscedasticity), so the covariance matrix may have off-diagonal terms and the diagonal terms may not be equal.

If a non-linear function shape is sought, linear regression can be used to fit a model linear in the parameters over a function \phi(\cdot) of the data — I am not sure why the machine learning community chose \phi for this function, but do not confuse it with a normal distribution. This procedure is called basis function expansion and has a likelihood function of the form

\mathcal{L}(\boldsymbol{\beta}) = \mathcal{N}(\boldsymbol{y}|\mu + \phi(\boldsymbol{X})\boldsymbol{\beta},\sigma_{\epsilon}^2\boldsymbol{I})

where \phi(\boldsymbol{x}) can have, for example, the form

\phi(\boldsymbol{x})=\begin{pmatrix}1\\ x_1\\ x_1^2\end{pmatrix}

for fitting a parabola to a one-dimensional \boldsymbol{X} over \boldsymbol{y}. When minimizing the squared residuals we get to the famous Ordinary Least Squares regression. Linear regression can be further developed, for example, into Ridge Regression to better handle multicollinearity by introducing bias to the parameter estimates, and into Kernel Ridge regression to implicitly add non-linear terms to the model. These formulations are beyond the scope of this post.

What is important to notice is that the standard approaches for linear regression described here, although able to fit linear and non-linear functions, do not provide much insight into the model errors. That is when Bayesian methods come to play.

Bayesian Ordinary Linear Regression

In Bayesian Linear regression (and in any Bayesian approach), the parameters \boldsymbol{\beta} are treated themselves as random variables. This allows for the consideration of model uncertainty, given that now instead of having the best point estimates for \boldsymbol{\beta} we have their full distributions from which to sample models — a sample of  \boldsymbol{\beta} corresponds to a model. The distribution of parameters \boldsymbol{\beta}, P(\boldsymbol{\beta}|\mathcal{D}), called the parameter posterior distribution, is calculated by multiplying the likelihood function used in MLE by a prior distribution for the parameters. A prior distribution is assumed from knowledge prior to analyzing new data. In Bayesian Linear Regression, a Gaussian distribution is commonly assumed for the parameters. For example, the prior on \boldsymbol{\beta} can be \mathcal{N}(\boldsymbol{\beta_0},\boldsymbol{V}_0) for algebraic simplicity. The parameter posterior distribution assuming a known \sigma_{\epsilon}^2 then has the form

p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2) =\frac{p(\mathcal{D}|\boldsymbol{\beta}, \sigma_{\epsilon}^2)p(\boldsymbol{\beta})}{p(\mathcal{D})} \propto p(\mathcal{D}|\boldsymbol{\beta}, \sigma_{\epsilon}^2)p(\boldsymbol{\beta})  ,        given p(\mathcal{D}) is a constant depending only on the data.

We now have an expression from which to derive our parameter posterior for our linear model from which to sample \boldsymbol{\beta}. If the likelihood and the prior are Gaussian, the parameter posterior will also be a Gaussian, given by

p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2) \propto \mathcal{N}(\boldsymbol{\beta}|\boldsymbol{\beta}_0, \boldsymbol{V}_0)\mathcal{N}(\boldsymbol{y}|\boldsymbol{X\beta},\sigma_{\epsilon}^2\boldsymbol{I})=\mathcal{N}(\boldsymbol{\beta}|\boldsymbol{\beta}_N,\boldsymbol{V}_N)
\boldsymbol{V}_N = \sigma_{\epsilon}^2(\sigma_{\epsilon}^2\boldsymbol{V}_0^{-1}+\boldsymbol{X}\boldsymbol{X}^T)^{-1}
\boldsymbol{\beta}_N=\boldsymbol{V}_N\boldsymbol{V}_0^{-1}\boldsymbol{\beta}_0 + \frac{1}{\sigma_{\epsilon}^2}\boldsymbol{V}_N\boldsymbol{X}\boldsymbol{y}

If we calculate the parameter posterior (distribution over parameters \boldsymbol{\beta}) for a simple linear model f(x) = \beta_0 + \beta_1x for a data set \mathcal{D} in which \boldsymbol{X} and \boldsymbol{y} are approximately linearly related given some noise \sigma_{\epsilon}^2, we can use it to sample values for \boldsymbol{\beta}. This is equivalent to sampling linear models f(x) for data set \mathcal{D}. As the number of data points increase, the variability of the sampled models should decrease, as in the figure below


This is all interesting but the parameter posterior per se is not of much use. We can, however, use the parameter posterior to find the posterior predictive distribution, which can be used to both get point estimates of y and the associated error. This is done by multiplying the likelihood by the parameter posterior and marginalizing the result over \boldsymbol{\beta}. This is equivalent to performing infinite sampling of blue lines in the example before to form density functions around a point estimate of \mu_{y_*}=f(\boldsymbol{x}_*), with (\boldsymbol{x}_*,y_*) denoting a new point that is not in \mathcal{D}. If the likelihood and the parameter posterior are Gaussian, the posterior predictive then takes the form below and will also be Gaussian (in Bayesian parlance, this means that the Gaussian distribution is conjugate to itself)!

\begin{aligned}p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2) &= \int_{\boldsymbol{\beta}}\underbrace{p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2,\boldsymbol{\beta})}_\text{Likelihood}\underbrace{p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2)}_\text{\parbox{1.6cm}{Parameter Posterior}}d\boldsymbol{\beta}\\  &=\mathcal{N}(y_*|\boldsymbol{\beta}_N^T\boldsymbol{x}_*, \sigma_N^2(\boldsymbol{x}_*))\end{aligned}


\sigma_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{x}_*^{T}\boldsymbol{V}_N\boldsymbol{x}_*

or, to put it simply,

y_* = f(\boldsymbol{x}_*) = \boldsymbol{\beta}_N^T\boldsymbol{x}_* \pm \sigma_N^2(\boldsymbol{x}_*)

The posterior predictive, meaning final linear model and associated errors (the parameter uncertainty equivalent of frequentist statistics confidence intervals) are shown below


If, instead of having a data set \mathcal{D} in which \boldsymbol{X} and \boldsymbol{y} are related approximately according to a 4th order polynomial, we use a \phi(\cdot) function to artificially create more random variables (or features, in machine learning parlance) corresponding to the non-linear terms of the polynomial function. Our function \phi(\cdot) would be \phi(x) = [1, x, x^2, x^3, x^4]^T and the resulting X would therefore have five columns instead of two ([1, x]^T), so now the task is to find \boldsymbol{\beta}=[\beta_0, \beta_1, \beta_2, \beta_3, \beta_4]^T. Following the same logic as before for X'=\phi(X) and x_*'=\phi(x_*), where prime denotes the new set random variable x from function \phi(\cdot), we get the following plots





This looks great, but there is a problem: what if we do not know the functional form of the model we are supposed to fit (e.g. a simple linear function or a 4th order polynomial)? This is often the case, such as when modeling the reliability of a water reservoir system contingent on stored volumes, inflows and evaporation rates, or when modeling topography based on samples surveyed points (we do not have detailed elevation information about the terrain, e.g. a DEM file). Gaussian Processes (GPs) provide a way of going around this difficulty.


Murphy, Kevin P., 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.