# Introduction to Gaussian Processes

In this post, I will cover the very basics of Gaussian Processes following the presentations in Murphy (2012) and Rasmussen and Williams (2006) — though hopefully easier to digest. I will approach the explanation in two ways: (1) derived from Bayesian Ordinary Linear Regression and (2) derived from the definition of Gaussian Processes in functional space (it is easier than it sounds). I am not sure why the mentioned books do not use “^” to denote estimation (if you do, please leave a comment), but I will stick to their notation assuming you may want to look into them despite running into the possibility of an statistical sin. Lastly, I would like to thank Jared Smith for reviewing this post and providing highly statistically significant insights.

If you are not familiar with Bayesian regression (see my previous post), skip the section below and begin reading from “Succinct derivation of Gaussian Processes from functional space.”

## From Bayesian Ordinary Linear Regression to Gaussian Processes

### Recapitulation of Bayesian Ordinary Regression

In my previous blog post, about Bayesian Ordinary Linear Regression (OLR), I showed that the predictive distribution for a point $\boldsymbol{x}_*$ for which we are trying to estimate $y_*$ with Bayesian OLR is

\begin{aligned}p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2) &= \int_{\boldsymbol{\beta}}\underbrace{p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2,\boldsymbol{\beta})}_\text{Likelihood}\underbrace{p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2)}_\text{\parbox{1.6cm}{Parameter Posterior}}d\boldsymbol{\beta}\\ &=\mathcal{N}(y_*|\boldsymbol{\beta}_N^T\boldsymbol{x}_*, \sigma_N^2(\boldsymbol{x}_*))\end{aligned}

where we are trying to regress a model over the data set $\mathcal{D}=\{\boldsymbol{X},\boldsymbol{y}\}$, $\boldsymbol{X}$ being the independent variables and $\boldsymbol{y}$ the dependent variable, $\boldsymbol{\beta}$ is a vector of model parameters and $\sigma_{\epsilon}^2$ is the model error variance. The unusual notation for the normal distribution $\mathcal{N}$ means “a normal distribution of $y_*$ with (or given) the regression mean $\boldsymbol{\beta}_N^T\boldsymbol{x}_*$ and variance $\sigma_N^2(\boldsymbol{x}_*)$,” where $N$ is the number of points in the data set. The estimated variance of $\boldsymbol{y}_*$, $\sigma_N^2(\boldsymbol{x}_*)$ and the estimated regression parameters $\boldsymbol{\beta}_N$ can be calculated as

$\boldsymbol{V}_N = \sigma_{\epsilon}^2(\sigma_{\epsilon}^2\boldsymbol{V}_0^{-1}+\boldsymbol{X}^T\boldsymbol{X})^{-1}$
$\boldsymbol{\beta}_N=\boldsymbol{V}_N\boldsymbol{V}_0^{-1}\boldsymbol{\beta}_0 + \frac{1}{\sigma_{\epsilon}^2}\boldsymbol{V}_N\boldsymbol{X}^T\boldsymbol{y}$
$\hat{\sigma}_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{x}_*^{T}\boldsymbol{V}_N\boldsymbol{x}_*$

where $\boldsymbol{V}_0$ and $\boldsymbol{\beta}_0$ are the mean and the covariance of the parameter prior distribution. All the above can be used to estimate $y_*$ as

$y_* = f(\boldsymbol{x}_*) = \boldsymbol{\beta}_N^T\boldsymbol{x}_*$ with an error variance of $\sigma_N^2(\boldsymbol{x}_*)$

If we assume a prior mean of 0 ($\boldsymbol{\beta}_0=0$), replace $\boldsymbol{V}_N$ and $\boldsymbol{\beta}_N$ by their definitions, and apply a function $\phi(\boldsymbol{x})$, e.g. $\phi(\boldsymbol{x})=(1, x_1, x_2, x_1x_2, x_1^2 ...)^T$, to add features to $\boldsymbol{X}$ so that we can use a linear regression approach to fit a non-linear model over our data set $\mathcal{D}$, we would instead have that

$\boldsymbol{\beta}_N = \boldsymbol{\phi}_*^T( \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_X\boldsymbol{y}$
$\sigma_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{\phi}_*^{T}( \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_*$

where $\boldsymbol{\phi}_* = \phi(\boldsymbol{x}_*)$ and $\boldsymbol{\phi}_X = \phi(\boldsymbol{X})$. One problem with this approach is that as we add more terms (also called features) to function $\phi(\cdot)$ to better capture non-linearities, the the dimension of the matrix we will have to invert increases. It is also not always obvious what features we should add to our data. Both problems can be handled with Gaussian Processes.

### Now to Gaussian Processes

This is where we transition from Bayesian OLR to Gaussian Processes. What we want to accomplish in the rest of this derivation is to find a way of: (1) adding as many features to the data as we need to calculate $\boldsymbol{\beta}_N$ and $\sigma_N^2(\boldsymbol{x}_*)$ without increasing the size of the matrix $\sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T$ we have to invert (remember the dimensions of such matrix equals the number of features in $\phi(\cdot)$), and of (2) finding a way of implicitly adding features to $\boldsymbol{X}$ and $\boldsymbol{x}_*$ without having to do so manually — manually adding features to the data may take a long time, specially if we decide to add (literally) an infinite number of them to have an interpolator.

The first step is to make the dimensions of the matrix we want to invert equal to the number of data points instead of the number of features in $\phi(\cdot)$. For this, we can re-arrange the equations for $\boldsymbol{\beta}_N$ and $\sigma_N^2(\boldsymbol{x}_*)$ as

$\boldsymbol{\beta}_N =\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_*( \sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_X)^{-1}\boldsymbol{y}$
$\sigma_N^2(\boldsymbol{x}_*) = \boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_*-\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_X(\sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{\phi}_X)^{-1}\boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_*$

We now have our feature-expanded data and points for which we want point estimates always appearing in the form of $\phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x'})$$\boldsymbol{x}'$ is just another point $\boldsymbol{x}_i$ either in the data set or for which we want to estimate $y_*$. From now on, $\kappa(x, x') = \phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x'})$ will be called a covariance or kernel function. Since the prior’s covariance $\boldsymbol{V}_0$ is positive semi-definite (as any covariance matrix), we can write

$\kappa(\boldsymbol{x}, \boldsymbol{x}') = \phi(\boldsymbol{x})^T (\boldsymbol{V}_0^{1/2})^T(\boldsymbol{V}_0^{1/2})\phi(\boldsymbol{x}')=\psi(\boldsymbol{x})^T\psi(\boldsymbol{x}')$

where $\psi(\boldsymbol{x})=\boldsymbol{V}_0^{1/2}\phi(\boldsymbol{x'})$. Using $\kappa(\boldsymbol{x}, \boldsymbol{x}')$ and assuming a prior of $\boldsymbol{I}$, $\boldsymbol{\beta}_N$ and $\sigma_N^2(\boldsymbol{x}_*)$ then take a new shape in the predictor posterior of a Gaussian Process:

$\boldsymbol{\beta}_N =K_*( \sigma_{\epsilon}^2\boldsymbol{I} + K)^{-1}\boldsymbol{y}$
$\sigma_N^2(\boldsymbol{x}_*) = k_{**}-\boldsymbol{k}_*(\sigma_{\epsilon}^2\boldsymbol{I}+K)^{-1}\boldsymbol{k}_*^T$

where $K = \kappa(\boldsymbol{\phi}_X,\boldsymbol{\phi}_X)$, $\boldsymbol{k}_* = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_X)$ and $k_{**} = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_*)$. Now the features of the data are absorbed within the inner-products between observed $\boldsymbol{x}$‘s and or for $\boldsymbol{x}_*$, so we can add as many features as we want without impacting the dimensions of the matrix we will have to invert. Also, after this transformation, instead of $\boldsymbol{\beta}_N$ and $\sigma_N^2(\boldsymbol{x}_*)$ representing the mean and covariance between model parameters, they represent the mean and covariance of and among data points and/or points for which we want to get point estimates. The parameter posterior from Bayesian OLR would now be used to sample the values of $y_*$ directly instead of model parameters. And now we have a Gaussian Process with which to estimate $y_*$. For plots of samples of $\boldsymbol{y}_*$ from the prior and from the predictive posterior, of the mean plus or minus the error variances, and of models with different kernel parameters, see figures at the end of the post.

## Kernel (or Covariance) matrices

The convenience of having $\boldsymbol{\beta}_N$ and $\sigma_N^2(\boldsymbol{x}_*)$ written in terms of $\kappa(\cdot, \cdot)$, is that $\kappa$ does not have to represent simply a matrix-matrix or vector-matrix multiplication of $\boldsymbol{X}$ and $\boldsymbol{x}_*$. In fact, function $\kappa$ can be any function that corresponds to an inner-product after some transformation from $\boldsymbol{x}\rightarrow\phi(\boldsymbol{x})$, which will necessarily return a positive semi-definite matrix and therefore be a valid covariance matrix. There are several kernels (or kernel functions or kernel shapes) available, each accounting in a different way for the nearness or similarity (at least in Gaussian Processes) between values of $x_i$:

• the linear kernel: $\kappa(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T\boldsymbol{x}'$. This kernel is used when trying to fit a line going through the origin.
• the polynomial kernel: $\kappa(\boldsymbol{x}, \boldsymbol{x}') = (1 + \boldsymbol{x}^T\boldsymbol{x}')^d$, where d is the dimension of the polynomial function. This kernel is used when trying to fit a polynomial function (such as the fourth order polynomial in the blog post about Bayesian OLR), and
• the Radial Basis Function, or RBF (aka Squared Exponential, or SE), $\kappa(\boldsymbol{x}, \boldsymbol{x}') = e^{-\frac{||\boldsymbol{x}-\boldsymbol{x}'||^2}{2\sigma_{RBF}^2}}$, where $\sigma_{RBF}$ is a kernel parameter denoting the characteristic length scale of the function to be approximated. Using the RBF kernel is equivalent to adding an infinite (!) number of features the data, meaning $\phi(\boldsymbol{x})=(1, x_1, ..., x_d, x_1x_2, x_1x_3,..., x_{d-1}x_d, x_1x_2x_3, ...)^T$.

But this is all quite algebraic. Luckily, there is a more straight-forward derivation shown next.

## Succinct derivation of Gaussian Processes from functional space.

Most regression we study in school is a way of estimating the parameters $\boldsymbol{\beta}$ of a model. In Bayesian OLR, for example, we have a distribution (the parameter posterior) from which to sample model parameters. What if we instead derived a distribution from which to sample functions themselves? The first (and second and third) time I heard about the idea of sampling functions I thought right away of opening a box and from it drawing exponential and sinusoid functional forms. However, what is meant here is a function of an arbitrary shape without functional form. How is this possible? The answer is: instead of sampling a functional form itself, we sample the values of $y_*=f(\boldsymbol{x}_*)$ at discrete points, with a collection of $\boldsymbol{y}_i$ for all $\boldsymbol{x}_i$ in our domain called a function $\boldsymbol{f}$. After all, don’t we want a function more often than not solely for the purpose of getting point estimates and associated errors for given values of $\boldsymbol{x}_*$?

In Gaussian Process regression, we sample functions $\boldsymbol{f}$ from a distribution by considering each value of $x_i$ in a discretized range along the x-axis as a random variable. That means that if we discretize our x-axis range into 100 equally spaced points, we will have 100 random variables $x_0,..,x_{100}$. The mean of all possible functions at $\boldsymbol{x}_i$ therefore represents the mean value of $y_i$ in $\boldsymbol{x}_i$, and each term in the covariance matrix (kernel matrix, see “Kernel (or Covariance) matrices” section above) represents how similar the values of $y_i$ and $y_j$ will be for each pair $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ based on the value of a distance metric between  $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$: if $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ are close to each other, $\boldsymbol{y}_i$ and $\boldsymbol{y}_j$ tend to be similar and vice-versa. We can therefore turn the sampled functions $y = f(\boldsymbol{x})$ into whatever functional form we want by choosing the appropriate covariance matrix (kernel) — e.g. a linear kernel will return a linear model, a polynomial kernel will return a polynomial function, and an RBF kernel will return a function with an all linear and an infinite number of non-linear terms. A Gaussian Process can then be written as

$f(\boldsymbol{x}) \sim \mathcal{GP}(\mu(\boldsymbol{x}), \kappa(\boldsymbol{x}, \boldsymbol{x}'))$

where $\mu(\boldsymbol{x})$ is the mean function (e.g. a horizontal line along the x-axis at y = 0 would mean that $\mu(\boldsymbol{x}) = [0,..,0]^T$) and $\kappa(\boldsymbol{x}, \boldsymbol{x}')$ is the covariance matrix, which here expresses how similar the values of $y_i$ will be based on the distance between two values of $x_i$.

To illustrate this point, assume we want to sample functions over 200 discretizations $\boldsymbol{X}_* = x_{*0}, ..., x_{*199}$ along the x-axis from x=-5 to x=5 ($\Delta x = 0.05$) using a Gaussian Process. Assume the parameters of our Gaussian Process from which we will sample our functions are mean 0 and that it uses an RBF kernel for its covariance matrix with parameter $\sigma_{RBF}=2$ (not be confused with standard deviation), meaning that $k_{i,i}=1.0 ,\:\forall i \in [0, 199]$ and $k_{i,j}=e^{-\frac{||x_i-x_j||^2}{2\cdot 2^2}},\: \forall i,j \in [0, 199]$, or

\begin{aligned} p(\boldsymbol{f}_*|\boldsymbol{X_*})&=\mathcal{N}(\boldsymbol{f}|\boldsymbol{\mu},\boldsymbol{K_{**}})\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[\mu_{*0},..,\mu_{*199}], \begin{bmatrix} k_{*0,0} & k_{*0,1} & \cdots & k_{*0,199} \\ k_{*1,0} & k_{*1, 1} & \cdots & k_{*1,199}\\ \vdots & \vdots & \ddots & \vdots \\ k_{*199,0} & k_{*199,1} & \cdots & k_{*199,199} \end{bmatrix} \right)\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[0,..,0], \begin{bmatrix} 1 & 0.9997 & \cdots & 4.2\cdot 10^{-6} \\ 0.9997 & 1 & \cdots & 4.8\cdot 10^{-6}\\ \vdots & \vdots & \ddots & \vdots \\ 4.2\cdot 10^{-6} & 4.8\cdot 10^{-6} & \cdots & 1 \end{bmatrix} \right) \end{aligned}

Each sample from the normal distribution above will return 200 values: $\boldsymbol{f}_* = [y_0, ..., y_{199}]$. A draw of 3 such sample functions would look generally like the following:

The functions above look incredibly smooth because of the RBF kernel, which adds an infinite number of non-linear features to the data. The shaded region represents the prior distribution of functions. The grey line in the middle represents the mean of an infinite number of function samples, while the upper and lower bounds represent the corresponding prior mean plus or minus standard deviations.

These functions looks quite nice but pretty useless. We are actually interested in regressing our data assuming a prior, namely on a posterior distribution, not on the prior by itself. Another way of formulating this is to say that we are interested only in the functions that go through or close to certain known values of $\boldsymbol{x}$ and $y$. All we have to do is to add random variables corresponding to our data and condition the resulting distribution on them, which means accounting only for samples of functions that exactly go through (or given) our data points. The resulting distribution would then look like $p(\boldsymbol{f}_*|\mathcal{D}, \boldsymbol{X}_*)$. After adding our data $\boldsymbol{X}, \boldsymbol{f}$, or $\boldsymbol{X}, \boldsymbol{y}$, our distribution would look like

$\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}K(\boldsymbol{X},\boldsymbol{X}) & K(\boldsymbol{X},\boldsymbol{X}_*) \\ K(\boldsymbol{X}_*,\boldsymbol{X}) & K(\boldsymbol{X}_*,\boldsymbol{X}_*)\end{bmatrix}\right)$

In this distribution we have random variables representing the $\boldsymbol{X}$ from our data set $\mathcal{D}$ and the $\boldsymbol{X}_*$ from which we want to get point estimates and corresponding error standard deviations. The covariance matrix above accounts for correlations of all observations with all other observations, and correlations of all observations with all points to be predicted. We now just have to condition the multivariate normal distribution above over the points in our data set, so that the distribution accounts for the infinite number of functions that go through our $\boldsymbol{y}$ at the corresponding $\boldsymbol{X}$. This yields

$p(\boldsymbol{f}_*|\boldsymbol{X}_*,\boldsymbol{X},\boldsymbol{y})=\mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}})$

$\bar{\boldsymbol{\mu}}=\boldsymbol{K}_*^T\boldsymbol{K}^{-1}\boldsymbol{y}$

$\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T\boldsymbol{K}^{-1}\boldsymbol{K}_*$

$\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)$

where $\boldsymbol{K} = K(\boldsymbol{X},\boldsymbol{X})$$\boldsymbol{K}_* = K(\boldsymbol{X},\boldsymbol{X}_*)$, $\boldsymbol{K}_{**} = K(\boldsymbol{X_*},\boldsymbol{X_*})$, and $diag\left(\bar{\boldsymbol{\Sigma}}\right)$ is a vector containing the elements in the diagonal of the covariance matrix $\bar{\boldsymbol{\Sigma}}$ — the variances $\bar{\sigma}^2_i$ of each random variable $x_i$. If you do not want to use a prior with $\boldsymbol{\mu} = \boldsymbol{\mu}_* = [0, ... , 0]^T$, the mean of the Gaussian Process would be $\bar{\boldsymbol{\mu}}=\boldsymbol{\mu}_* + \boldsymbol{K}_*^T\boldsymbol{K}^{-1}(\boldsymbol{f} - \boldsymbol{\mu})$. A plot of $\boldsymbol{f}_* = \bar{\boldsymbol{\mu}}$ estimated for 200 values of $x_i$ — so that the resulting curve looks smooth — plus or minus associated uncertainty ($\pm\bar{\boldsymbol{\sigma}}^2$), regressed on 5 random data points $\boldsymbol{X}$, $\boldsymbol{y}$, should look generally like

A plot of three sampled functions (colored lines below, the grey line is the mean) $\boldsymbol{f}_* \sim \mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}})$ should look generally like

Several kernel have parameters that can strongly influence the regressed model and that can be estimated — see Murphy (2012) for a succinct introduction to kernel parameters estimation. One example is parameter $\sigma_{RBF}^2$ of the RBF kernel, which determines the correlation length scale of the fitted model and therefore how fast with increasing separation distance between data points the model will default to the prior (here, with $\boldsymbol{\mu}=0$). The figure below exemplifies this effect

All figures have the same data points but the resulting models look very different. Reasonable values for $\sigma_{RBF}^2$ depend on the spacing between the data points and are therefore data-specific. Also, if when you saw this plot you thought of fitting RBF functions as interpolators, that was not a coincidence: the equations solved when fitting RBFs to data as interpolators are the same solved when using Gaussian Processes with RBF kernels.

Lastly, in case of noisy observation of $y$ in $\mathcal{D}$ the Gaussian Process distribution and the expressions for the mean, covariance, and standard deviation become

$\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I} & \boldsymbol{K}_* \\ \boldsymbol{K}_*^T & \boldsymbol{K}_{**}\end{bmatrix}\right)$

$\bar{\boldsymbol{\mu}}=\boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{y}$

$\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{K}_*$

$\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)$

where $\sigma_{\epsilon}^2$ is the error variance and $\boldsymbol{I}$ is an identity matrix. The mean and error predictions would look like the following

## Concluding Remarks

What I presented is the very basics of Gaussian Processes, leaving several important topics which were not covered here for the interested reader to look into. Examples are numerically more stable formulations, computationally more efficient formulations, kernel parameter estimation, more kernels, using a linear instead of zero-mean prior (semi-parametric Gaussian Processes), Gaussian Processes for classification and Poisson regression, the connection between Gaussian Processes and other methods, like krieging and kernel methods for nonstationary stochastic processes. Some of these can be found in great detail in Rasmussen and Williams (2006), and Murphy (2012) has a succinct presentation of all of the listed topics.

I was really confused about Gaussian Processes the first time I studied it and tried to make this blog post as accessible as possible. Please leave a comment if you think the post would benefit from any particular clarification.

## References

Murphy, Kevin P., 2012. Machine Learning: A Probabilistic Perspective. The MIT Press, ISBN 978-0262018029

Rasmussen, Carl Edward and Williams, Christopher K. I., 2006. Gaussian Processes for Machine Learning. The MIT Press, ISBN 978-0262182539.