# Google Earth Pro: the cool, the trade-offs and the limits

This post is an account of my recent experience with Google Earth Pro (GPE), a free online tool meant to make visualization of geographical information intuitive and accessible. This account contains tidbits that I thought others might find useful, but it is not meant to be an unbiased or comprehensive resource.

My goal in using GPE was to set up a short video tour of the reservoirs on the main stem of the Upper Snake River, which has its upper reaches around the Teton Range (WY, USA) before going through much of southern Idaho. Here is the video, produced with instructions from this post from the blog:

## The cool

GPE is indeed an intuitive way for people that have only a limited experience of geographic information systems (GIS) to put together nice-looking videos in a limited amount of time. It relies on the increasingly large amounts of geographic data available online. For instance, polygons made of points with geographical coordinates, used to delineate political boundaries or catchments among others, can be found increasingly easily, be it under formats specific to GPE (KML, KMZ), or traditional GIS format such as SHP (shapefile format, which can be imported to GPE with… “File => Import”, I told you it was intuitive). This capability to find the right data could be improved by the recent launch of a dataset search engine by Google.

This video, after several tests, superposed three layers on the satellite images of the land surface:

1. Pins of the dams’ locations. Please refer to this post to learn all you need to know about setting and customizing such new placemarks.
2. The network of all streams with an average flow of more than 10 m3/s in the Columbia River basin. This is mainly for easy visualization of where the Snake River and its affluents are. That data was obtained from this useful USGS resource, with all the major basins in the US having data in GPE-ready KML format (basin boundaries, tributaries with their average flows, landuse, etc.).
3. An invisible layer contains the shapes of the reservoirs’ lakes, in order to zoom in on the reservoir and not the dam itself. I got the shapes of waterbodies in the Upper Snake River basin from an amazing resource for water resources practitioners (see “The limits” below to see how to access this resource, but also to understand why it must be handled with care). Hint: since the list of waterbodies is dauntingly large and includes every little puddle in the area, I had to zoom in to the desired reservoir so as to only select its shape, by doing “File => Import”, then when prompted, by choosing to only import the features in my line of sight.

The trade-offs when doing this kind of short intro video to your study area, are between video quality and required memory. This is especially true if the video is of a rather large area, as the GPE satellite images embed a level of detail that is amazing at the scale of that basin. Therefore, when creating my video, I had to resort to several tries before getting  a correct quality, and this video eats up 500Mo of disk space for 43 seconds (!!!!). Any video with lower resolution just looked downright disgusting, whereas any larger video may have serious problem running on your laptop. Think of it as a Goldilocks zone that you have to find if you want to have a video you can embed in an oral presentation.

(NB: to embed this video in a presentation in a fullproof kind of way, the easiest is to embed a link from the PPT file to the video in the same folder).

## The limits

The limits of Google Earth are with its plain refusal to display features it deems too big. To understand this, let us look at this USGS webpage where a sizable of hydrological information can be downloaded. In particular it is possible to download data exactly for the upper Snake area (Subregion 1704 on the picture below). This data can include waterbodies as discussed above (in “the cool”) and that is contained in the NHD data ticket in the picture. By ticking “Watershed Boundary Dataset (WBD)”, one can also download basin and subbasin boundaries.

Why then did I not represent Upper Snake boundaries in my video? Well, the Upper Snake polygon, under SHP format at least, is too big so GPE just… refused to display it. I tried to represent the HU-6 and HU-8 subbasins (smaller than the HU-4-sized Upper Snake under this classification system, and:

• The HU-6 subbasin just divides the basin into “headwaters” (smaller part upstream of Palisades) and “the rest”; the former feature was smaller and GPE plotted it, but it did not plot the latter.
• The HU-8 subbasins all are individually smaller features that GPE accepts to plot. So I could have plotted the watershed boundaries by plotting ALL of the HU-8 subbasins, but spoiler alert: this looked horrendous.

Takeaway: GPE only displays features that individually have limited size. So maybe using a more powerful and recent desktop computer than the one I had would have done the trick… but be aware that there will always be a limit. Also note that assuming that GPE is just taking its time loading the data, and that staring blankly at the screen in the meantime is not going to be too painful, is not a good idea. Take a nap instead, or do something else, then admit that the damn thing just did not display.

A solution for this, of course, is to load the larger features on GIS software and making them lighter by eliminating points in the polygon without altering the shape… but that kinda beats the purpose of GPE, which is to avoid having to become a GIS geek, doesn’t it?

# A Beginner’s Guide to Visualizing Trade Offs across Multiple Objectives using the Python Bokeh Library

Python libraries can be very helpful in making visualizations. One particular library I like using is the Bokeh library. In this post, I will share how this library can be used for visualizing trade offs across multiple objectives.

First import the libraries we need for our plots:


import pandas as pd
from bokeh.io import output_file, show #for outputting and showing plots
from bokeh.plotting import figure #for creating figures, which are the plot area in this case
from bokeh.models.tools import HoverTool #for importing the hover tool
from bokeh.models import ColumnDataSource #which we will ultimately use for linking plots
from bokeh.models.widgets import Panel, Tabs #for making tabs



## Basic Plotting

Let’s start by creating a basic plot for a three objective DTLZ2, where the first and second objectives are plotted on the x and y axes respectively, and the third objective is the size of the plotted points.


# create a new plot using the figure function
p = figure(title="DTLZ2 with 3 objectives",plot_width=400, plot_height=400)

# add axis titles
p.xaxis.axis_label = "Objective 1"
p.yaxis.axis_label = "Objective 2"

# assign data to arrays
obj1=paretoFront['Obj1']
obj2=paretoFront['Obj2']
obj3=paretoFront['Obj3']

# plots points
paretoPlot = p.circle(obj1, obj2, size=obj3*10, # objective 3 is the shown by the size of the points; I multiplied by 10 to make them clearer
fill_color="grey", # for more color options, check out https://bokeh.pydata.org/en/latest/docs/reference/colors.html
fill_alpha=0.6, # alpha is the transparency of the points
line_color=None) # line_color is the outline of the points

# show the results
show(p)



The figure below shows what your plot will look like. Notice the default tools that are outlined on the right of the plot.

## More useful tools

Although Bokeh plots come with a default set of useful tools, there are other tools which you could add to make your plots more interactive.

To demonstrate some of these useful tools, I will create a figure with three tabs. Each tab will contain a view of the Pareto front for two of the three objectives. In addition, these tabs will be linked, so that if you make a selection on the Pareto front in one of the tabs, the same points will be selected in the other views. I will also demonstrate the use of the vertical, horizontal, and mouse over hover tools, and add data labels for the mouse over hover tool.



obj1=paretoFront['Obj1']
obj2=paretoFront['Obj2']
obj3=paretoFront['Obj3']

# create a column data source for the plots; this will allow us to link the plots we create
source = ColumnDataSource(data=dict(x=obj1, y=obj2, z=obj3))

# define the tools you want to use; if you are defining tools, you need to list all the tools you want, including the default Bokeh tools
TOOLS = "box_select,lasso_select,help,pan,wheel_zoom,box_zoom,reset,save"

# create a new plot for each view/tab using figure
p1 = figure(tools=TOOLS,title="DTLZ2 (Objective 1, Objective 2)",plot_width=400, plot_height=400)
p2 = figure(tools=TOOLS,title="DTLZ2 (Objective 1, Objective 3)",plot_width=400, plot_height=400)
p3 = figure(tools=TOOLS,title="DTLZ2 (Objective 2, Objective 3)",plot_width=400, plot_height=400)

# define functions for hovering
def verticalhover(p,paretoPlot):

def horizontalhover(p,paretoPlot):

def pointhover(p, paretoPlot, obja, objb):
p.add_tools(HoverTool(tooltips=[(obja, "$x" ), (objb, "$y")],mode="mouse", point_policy="follow_mouse", renderers=[paretoPlot]))
# note that we added 'tooltips' here, this gives the values (like excel data labels) of the points when you hover over them

# plot 1
paretoPlot1 = p1.circle('x', 'y',
fill_color='grey',
fill_alpha=0.7,
hover_fill_color="crimson", hover_alpha=0.8, # this changes the color and transparecy of the points when we hover over them
line_color=None, hover_line_color="white", source=source) # we will use 'source' in all our plots, because we want to link them by having them share the same sources data

p1.xaxis.axis_label = "Objective 1"
p1.yaxis.axis_label = "Objective 2"

# add the hover tools
verticalhover(p1,paretoPlot1)
horizontalhover(p1,paretoPlot1)
pointhover(p1, paretoPlot1, "objective 1", "objective 2")

# add the panel for the first tab
tab1 = Panel(child=p1, title="obj1, obj2")

# repeat the same steps for tabs 2 and 3 (i.e. objective 1 vs. objective 3 and objective 2 vs. objective 3)
# plot 2

paretoPlot2 = p2.circle('x','z',
fill_color='grey', hover_fill_color="crimson",
fill_alpha=0.7, hover_alpha=0.8,
line_color=None, hover_line_color="white", source=source)

p2.xaxis.axis_label = "Objective 1"
p2.yaxis.axis_label = "Objective 3"

verticalhover(p2,paretoPlot2)
horizontalhover(p2,paretoPlot2)
pointhover(p2, paretoPlot2, "objective 1", "objective 3")

tab2 = Panel(child=p2, title="obj1, obj3")

# plot 3

paretoPlot3 = p3.circle('y', 'z',
fill_color='grey', hover_fill_color="crimson",
fill_alpha=0.7, hover_alpha=0.8,
line_color=None, hover_line_color="white", source=source)

p3.xaxis.axis_label = "Objective 2"
p3.yaxis.axis_label = "Objective 3"

verticalhover(p3,paretoPlot3)
horizontalhover(p3,paretoPlot3)
pointhover(p3, paretoPlot3, "objective 2", "objective 3")

tab3 = Panel(child=p3, title="obj2, obj3")

# finally create the tabs and show your plot!
tabs = Tabs(tabs=[ tab1, tab2, tab3 ])
show(tabs)



The figure below shows the plot you will obtain:

Now, let’s explore the tools that we added:

• Selection across multiple tabs: use lasso or box select to select some points in your first tab. When you move to the second and third tabs, the same points will be selected in different views.
• Vertical and Horizontal Hover: select either the vertical or horizontal hover tool from the right toolbar. Make sure you deselect all other tools. The vertical hover will color in all the points that lie in the vertical line that crosses the point you are hovering over, and the horizontal hover does the same thing horizontally.
• Point Hover and Data Labels: Select the point hover tool, deselect all the other tools, and hover over the points. When you hover over a point, you will get the values for the objectives for that point.

# Introduction to Gaussian Processes

In this post, I will cover the very basics of Gaussian Processes following the presentations in Murphy (2012) and Rasmussen and Williams (2006) — though hopefully easier to digest. I will approach the explanation in two ways: (1) derived from Bayesian Ordinary Linear Regression and (2) derived from the definition of Gaussian Processes in functional space (it is easier than it sounds). I am not sure why the mentioned books do not use “^” to denote estimation (if you do, please leave a comment), but I will stick to their notation assuming you may want to look into them despite running into the possibility of an statistical sin. Lastly, I would like to thank Jared Smith for reviewing this post and providing highly statistically significant insights.

If you are not familiar with Bayesian regression (see my previous post), skip the section below and begin reading from “Succinct derivation of Gaussian Processes from functional space.”

## From Bayesian Ordinary Linear Regression to Gaussian Processes

### Recapitulation of Bayesian Ordinary Regression

In my previous blog post, about Bayesian Ordinary Linear Regression (OLR), I showed that the predictive distribution for a point $\boldsymbol{x}_*$ for which we are trying to estimate $y_*$ with Bayesian OLR is

\begin{aligned}p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2) &= \int_{\boldsymbol{\beta}}\underbrace{p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2,\boldsymbol{\beta})}_\text{Likelihood}\underbrace{p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2)}_\text{\parbox{1.6cm}{Parameter Posterior}}d\boldsymbol{\beta}\\ &=\mathcal{N}(y_*|\boldsymbol{\beta}_N^T\boldsymbol{x}_*, \sigma_N^2(\boldsymbol{x}_*))\end{aligned}

where we are trying to regress a model over the data set $\mathcal{D}=\{\boldsymbol{X},\boldsymbol{y}\}$, $\boldsymbol{X}$ being the independent variables and $\boldsymbol{y}$ the dependent variable, $\boldsymbol{\beta}$ is a vector of model parameters and $\sigma_{\epsilon}^2$ is the model error variance. The unusual notation for the normal distribution $\mathcal{N}$ means “a normal distribution of $y_*$ with (or given) the regression mean $\boldsymbol{\beta}_N^T\boldsymbol{x}_*$ and variance $\sigma_N^2(\boldsymbol{x}_*)$,” where $N$ is the number of points in the data set. The estimated variance of $\boldsymbol{y}_*$, $\sigma_N^2(\boldsymbol{x}_*)$ and the estimated regression parameters $\boldsymbol{\beta}_N$ can be calculated as

$\boldsymbol{V}_N = \sigma_{\epsilon}^2(\sigma_{\epsilon}^2\boldsymbol{V}_0^{-1}+\boldsymbol{X}^T\boldsymbol{X})^{-1}$
$\boldsymbol{\beta}_N=\boldsymbol{V}_N\boldsymbol{V}_0^{-1}\boldsymbol{\beta}_0 + \frac{1}{\sigma_{\epsilon}^2}\boldsymbol{V}_N\boldsymbol{X}^T\boldsymbol{y}$
$\hat{\sigma}_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{x}_*^{T}\boldsymbol{V}_N\boldsymbol{x}_*$

where $\boldsymbol{V}_0$ and $\boldsymbol{\beta}_0$ are the mean and the covariance of the parameter prior distribution. All the above can be used to estimate $y_*$ as

$y_* = f(\boldsymbol{x}_*) = \boldsymbol{\beta}_N^T\boldsymbol{x}_*$ with an error variance of $\sigma_N^2(\boldsymbol{x}_*)$

If we assume a prior mean of 0 ($\boldsymbol{\beta}_0=0$), replace $\boldsymbol{V}_N$ and $\boldsymbol{\beta}_N$ by their definitions, and apply a function $\phi(\boldsymbol{x})$, e.g. $\phi(\boldsymbol{x})=(1, x_1, x_2, x_1x_2, x_1^2 ...)^T$, to add features to $\boldsymbol{X}$ so that we can use a linear regression approach to fit a non-linear model over our data set $\mathcal{D}$, we would instead have that

$\boldsymbol{\beta}_N = \boldsymbol{\phi}_*^T( \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_X\boldsymbol{y}$
$\sigma_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{\phi}_*^{T}( \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_*$

where $\boldsymbol{\phi}_* = \phi(\boldsymbol{x}_*)$ and $\boldsymbol{\phi}_X = \phi(\boldsymbol{X})$. One problem with this approach is that as we add more terms (also called features) to function $\phi(\cdot)$ to better capture non-linearities, the the dimension of the matrix we will have to invert increases. It is also not always obvious what features we should add to our data. Both problems can be handled with Gaussian Processes.

### Now to Gaussian Processes

This is where we transition from Bayesian OLR to Gaussian Processes. What we want to accomplish in the rest of this derivation is to find a way of: (1) adding as many features to the data as we need to calculate $\boldsymbol{\beta}_N$ and $\sigma_N^2(\boldsymbol{x}_*)$ without increasing the size of the matrix $\sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T$ we have to invert (remember the dimensions of such matrix equals the number of features in $\phi(\cdot)$), and of (2) finding a way of implicitly adding features to $\boldsymbol{X}$ and $\boldsymbol{x}_*$ without having to do so manually — manually adding features to the data may take a long time, specially if we decide to add (literally) an infinite number of them to have an interpolator.

The first step is to make the dimensions of the matrix we want to invert equal to the number of data points instead of the number of features in $\phi(\cdot)$. For this, we can re-arrange the equations for $\boldsymbol{\beta}_N$ and $\sigma_N^2(\boldsymbol{x}_*)$ as

$\boldsymbol{\beta}_N =\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_*( \sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_X)^{-1}\boldsymbol{y}$
$\sigma_N^2(\boldsymbol{x}_*) = \boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_*-\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_X(\sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{\phi}_X)^{-1}\boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_*$

We now have our feature-expanded data and points for which we want point estimates always appearing in the form of $\phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x'})$$\boldsymbol{x}'$ is just another point $\boldsymbol{x}_i$ either in the data set or for which we want to estimate $y_*$. From now on, $\kappa(x, x') = \phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x'})$ will be called a covariance or kernel function. Since the prior’s covariance $\boldsymbol{V}_0$ is positive semi-definite (as any covariance matrix), we can write

$\kappa(\boldsymbol{x}, \boldsymbol{x}') = \phi(\boldsymbol{x})^T (\boldsymbol{V}_0^{1/2})^T(\boldsymbol{V}_0^{1/2})\phi(\boldsymbol{x}')=\psi(\boldsymbol{x})^T\psi(\boldsymbol{x}')$

where $\psi(\boldsymbol{x})=\boldsymbol{V}_0^{1/2}\phi(\boldsymbol{x'})$. Using $\kappa(\boldsymbol{x}, \boldsymbol{x}')$ and assuming a prior of $\boldsymbol{I}$, $\boldsymbol{\beta}_N$ and $\sigma_N^2(\boldsymbol{x}_*)$ then take a new shape in the predictor posterior of a Gaussian Process:

$\boldsymbol{\beta}_N =K_*( \sigma_{\epsilon}^2\boldsymbol{I} + K)^{-1}\boldsymbol{y}$
$\sigma_N^2(\boldsymbol{x}_*) = k_{**}-\boldsymbol{k}_*(\sigma_{\epsilon}^2\boldsymbol{I}+K)^{-1}\boldsymbol{k}_*^T$

where $K = \kappa(\boldsymbol{\phi}_X,\boldsymbol{\phi}_X)$, $\boldsymbol{k}_* = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_X)$ and $k_{**} = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_*)$. Now the features of the data are absorbed within the inner-products between observed $\boldsymbol{x}$‘s and or for $\boldsymbol{x}_*$, so we can add as many features as we want without impacting the dimensions of the matrix we will have to invert. Also, after this transformation, instead of $\boldsymbol{\beta}_N$ and $\sigma_N^2(\boldsymbol{x}_*)$ representing the mean and covariance between model parameters, they represent the mean and covariance of and among data points and/or points for which we want to get point estimates. The parameter posterior from Bayesian OLR would now be used to sample the values of $y_*$ directly instead of model parameters. And now we have a Gaussian Process with which to estimate $y_*$. For plots of samples of $\boldsymbol{y}_*$ from the prior and from the predictive posterior, of the mean plus or minus the error variances, and of models with different kernel parameters, see figures at the end of the post.

## Kernel (or Covariance) matrices

The convenience of having $\boldsymbol{\beta}_N$ and $\sigma_N^2(\boldsymbol{x}_*)$ written in terms of $\kappa(\cdot, \cdot)$, is that $\kappa$ does not have to represent simply a matrix-matrix or vector-matrix multiplication of $\boldsymbol{X}$ and $\boldsymbol{x}_*$. In fact, function $\kappa$ can be any function that corresponds to an inner-product after some transformation from $\boldsymbol{x}\rightarrow\phi(\boldsymbol{x})$, which will necessarily return a positive semi-definite matrix and therefore be a valid covariance matrix. There are several kernels (or kernel functions or kernel shapes) available, each accounting in a different way for the nearness or similarity (at least in Gaussian Processes) between values of $x_i$:

• the linear kernel: $\kappa(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T\boldsymbol{x}'$. This kernel is used when trying to fit a line going through the origin.
• the polynomial kernel: $\kappa(\boldsymbol{x}, \boldsymbol{x}') = (1 + \boldsymbol{x}^T\boldsymbol{x}')^d$, where d is the dimension of the polynomial function. This kernel is used when trying to fit a polynomial function (such as the fourth order polynomial in the blog post about Bayesian OLR), and
• the Radial Basis Function, or RBF (aka Squared Exponential, or SE), $\kappa(\boldsymbol{x}, \boldsymbol{x}') = e^{-\frac{||\boldsymbol{x}-\boldsymbol{x}'||^2}{2\sigma_{RBF}^2}}$, where $\sigma_{RBF}$ is a kernel parameter denoting the characteristic length scale of the function to be approximated. Using the RBF kernel is equivalent to adding an infinite (!) number of features the data, meaning $\phi(\boldsymbol{x})=(1, x_1, ..., x_d, x_1x_2, x_1x_3,..., x_{d-1}x_d, x_1x_2x_3, ...)^T$.

But this is all quite algebraic. Luckily, there is a more straight-forward derivation shown next.

## Succinct derivation of Gaussian Processes from functional space.

Most regression we study in school is a way of estimating the parameters $\boldsymbol{\beta}$ of a model. In Bayesian OLR, for example, we have a distribution (the parameter posterior) from which to sample model parameters. What if we instead derived a distribution from which to sample functions themselves? The first (and second and third) time I heard about the idea of sampling functions I thought right away of opening a box and from it drawing exponential and sinusoid functional forms. However, what is meant here is a function of an arbitrary shape without functional form. How is this possible? The answer is: instead of sampling a functional form itself, we sample the values of $y_*=f(\boldsymbol{x}_*)$ at discrete points, with a collection of $\boldsymbol{y}_i$ for all $\boldsymbol{x}_i$ in our domain called a function $\boldsymbol{f}$. After all, don’t we want a function more often than not solely for the purpose of getting point estimates and associated errors for given values of $\boldsymbol{x}_*$?

In Gaussian Process regression, we sample functions $\boldsymbol{f}$ from a distribution by considering each value of $x_i$ in a discretized range along the x-axis as a random variable. That means that if we discretize our x-axis range into 100 equally spaced points, we will have 100 random variables $x_0,..,x_{100}$. The mean of all possible functions at $\boldsymbol{x}_i$ therefore represents the mean value of $y_i$ in $\boldsymbol{x}_i$, and each term in the covariance matrix (kernel matrix, see “Kernel (or Covariance) matrices” section above) represents how similar the values of $y_i$ and $y_j$ will be for each pair $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ based on the value of a distance metric between  $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$: if $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ are close to each other, $\boldsymbol{y}_i$ and $\boldsymbol{y}_j$ tend to be similar and vice-versa. We can therefore turn the sampled functions $y = f(\boldsymbol{x})$ into whatever functional form we want by choosing the appropriate covariance matrix (kernel) — e.g. a linear kernel will return a linear model, a polynomial kernel will return a polynomial function, and an RBF kernel will return a function with an all linear and an infinite number of non-linear terms. A Gaussian Process can then be written as

$f(\boldsymbol{x}) \sim \mathcal{GP}(\mu(\boldsymbol{x}), \kappa(\boldsymbol{x}, \boldsymbol{x}'))$

where $\mu(\boldsymbol{x})$ is the mean function (e.g. a horizontal line along the x-axis at y = 0 would mean that $\mu(\boldsymbol{x}) = [0,..,0]^T$) and $\kappa(\boldsymbol{x}, \boldsymbol{x}')$ is the covariance matrix, which here expresses how similar the values of $y_i$ will be based on the distance between two values of $x_i$.

To illustrate this point, assume we want to sample functions over 200 discretizations $\boldsymbol{X}_* = x_{*0}, ..., x_{*199}$ along the x-axis from x=-5 to x=5 ($\Delta x = 0.05$) using a Gaussian Process. Assume the parameters of our Gaussian Process from which we will sample our functions are mean 0 and that it uses an RBF kernel for its covariance matrix with parameter $\sigma_{RBF}=2$ (not be confused with standard deviation), meaning that $k_{i,i}=1.0 ,\:\forall i \in [0, 199]$ and $k_{i,j}=e^{-\frac{||x_i-x_j||^2}{2\cdot 2^2}},\: \forall i,j \in [0, 199]$, or

\begin{aligned} p(\boldsymbol{f}_*|\boldsymbol{X_*})&=\mathcal{N}(\boldsymbol{f}|\boldsymbol{\mu},\boldsymbol{K_{**}})\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[\mu_{*0},..,\mu_{*199}], \begin{bmatrix} k_{*0,0} & k_{*0,1} & \cdots & k_{*0,199} \\ k_{*1,0} & k_{*1, 1} & \cdots & k_{*1,199}\\ \vdots & \vdots & \ddots & \vdots \\ k_{*199,0} & k_{*199,1} & \cdots & k_{*199,199} \end{bmatrix} \right)\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[0,..,0], \begin{bmatrix} 1 & 0.9997 & \cdots & 4.2\cdot 10^{-6} \\ 0.9997 & 1 & \cdots & 4.8\cdot 10^{-6}\\ \vdots & \vdots & \ddots & \vdots \\ 4.2\cdot 10^{-6} & 4.8\cdot 10^{-6} & \cdots & 1 \end{bmatrix} \right) \end{aligned}

Each sample from the normal distribution above will return 200 values: $\boldsymbol{f}_* = [y_0, ..., y_{199}]$. A draw of 3 such sample functions would look generally like the following:

The functions above look incredibly smooth because of the RBF kernel, which adds an infinite number of non-linear features to the data. The shaded region represents the prior distribution of functions. The grey line in the middle represents the mean of an infinite number of function samples, while the upper and lower bounds represent the corresponding prior mean plus or minus standard deviations.

These functions looks quite nice but pretty useless. We are actually interested in regressing our data assuming a prior, namely on a posterior distribution, not on the prior by itself. Another way of formulating this is to say that we are interested only in the functions that go through or close to certain known values of $\boldsymbol{x}$ and $y$. All we have to do is to add random variables corresponding to our data and condition the resulting distribution on them, which means accounting only for samples of functions that exactly go through (or given) our data points. The resulting distribution would then look like $p(\boldsymbol{f}_*|\mathcal{D}, \boldsymbol{X}_*)$. After adding our data $\boldsymbol{X}, \boldsymbol{f}$, or $\boldsymbol{X}, \boldsymbol{y}$, our distribution would look like

$\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}K(\boldsymbol{X},\boldsymbol{X}) & K(\boldsymbol{X},\boldsymbol{X}_*) \\ K(\boldsymbol{X}_*,\boldsymbol{X}) & K(\boldsymbol{X}_*,\boldsymbol{X}_*)\end{bmatrix}\right)$

In this distribution we have random variables representing the $\boldsymbol{X}$ from our data set $\mathcal{D}$ and the $\boldsymbol{X}_*$ from which we want to get point estimates and corresponding error standard deviations. The covariance matrix above accounts for correlations of all observations with all other observations, and correlations of all observations with all points to be predicted. We now just have to condition the multivariate normal distribution above over the points in our data set, so that the distribution accounts for the infinite number of functions that go through our $\boldsymbol{y}$ at the corresponding $\boldsymbol{X}$. This yields

$p(\boldsymbol{f}_*|\boldsymbol{X}_*,\boldsymbol{X},\boldsymbol{y})=\mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}})$

$\bar{\boldsymbol{\mu}}=\boldsymbol{K}_*^T\boldsymbol{K}^{-1}\boldsymbol{y}$

$\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T\boldsymbol{K}^{-1}\boldsymbol{K}_*$

$\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)$

where $\boldsymbol{K} = K(\boldsymbol{X},\boldsymbol{X})$$\boldsymbol{K}_* = K(\boldsymbol{X},\boldsymbol{X}_*)$, $\boldsymbol{K}_{**} = K(\boldsymbol{X_*},\boldsymbol{X_*})$, and $diag\left(\bar{\boldsymbol{\Sigma}}\right)$ is a vector containing the elements in the diagonal of the covariance matrix $\bar{\boldsymbol{\Sigma}}$ — the variances $\bar{\sigma}^2_i$ of each random variable $x_i$. If you do not want to use a prior with $\boldsymbol{\mu} = \boldsymbol{\mu}_* = [0, ... , 0]^T$, the mean of the Gaussian Process would be $\bar{\boldsymbol{\mu}}=\boldsymbol{\mu}_* + \boldsymbol{K}_*^T\boldsymbol{K}^{-1}(\boldsymbol{f} - \boldsymbol{\mu})$. A plot of $\boldsymbol{f}_* = \bar{\boldsymbol{\mu}}$ estimated for 200 values of $x_i$ — so that the resulting curve looks smooth — plus or minus associated uncertainty ($\pm\bar{\boldsymbol{\sigma}}^2$), regressed on 5 random data points $\boldsymbol{X}$, $\boldsymbol{y}$, should look generally like

A plot of three sampled functions (colored lines below, the grey line is the mean) $\boldsymbol{f}_* \sim \mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}})$ should look generally like

Several kernel have parameters that can strongly influence the regressed model and that can be estimated — see Murphy (2012) for a succinct introduction to kernel parameters estimation. One example is parameter $\sigma_{RBF}^2$ of the RBF kernel, which determines the correlation length scale of the fitted model and therefore how fast with increasing separation distance between data points the model will default to the prior (here, with $\boldsymbol{\mu}=0$). The figure below exemplifies this effect

All figures have the same data points but the resulting models look very different. Reasonable values for $\sigma_{RBF}^2$ depend on the spacing between the data points and are therefore data-specific. Also, if when you saw this plot you thought of fitting RBF functions as interpolators, that was not a coincidence: the equations solved when fitting RBFs to data as interpolators are the same solved when using Gaussian Processes with RBF kernels.

Lastly, in case of noisy observation of $y$ in $\mathcal{D}$ the Gaussian Process distribution and the expressions for the mean, covariance, and standard deviation become

$\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I} & \boldsymbol{K}_* \\ \boldsymbol{K}_*^T & \boldsymbol{K}_{**}\end{bmatrix}\right)$

$\bar{\boldsymbol{\mu}}=\boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{y}$

$\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{K}_*$

$\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)$

where $\sigma_{\epsilon}^2$ is the error variance and $\boldsymbol{I}$ is an identity matrix. The mean and error predictions would look like the following

## Concluding Remarks

What I presented is the very basics of Gaussian Processes, leaving several important topics which were not covered here for the interested reader to look into. Examples are numerically more stable formulations, computationally more efficient formulations, kernel parameter estimation, more kernels, using a linear instead of zero-mean prior (semi-parametric Gaussian Processes), Gaussian Processes for classification and Poisson regression, the connection between Gaussian Processes and other methods, like krieging and kernel methods for nonstationary stochastic processes. Some of these can be found in great detail in Rasmussen and Williams (2006), and Murphy (2012) has a succinct presentation of all of the listed topics.

I was really confused about Gaussian Processes the first time I studied it and tried to make this blog post as accessible as possible. Please leave a comment if you think the post would benefit from any particular clarification.

## References

Murphy, Kevin P., 2012. Machine Learning: A Probabilistic Perspective. The MIT Press, ISBN 978-0262018029

Rasmussen, Carl Edward and Williams, Christopher K. I., 2006. Gaussian Processes for Machine Learning. The MIT Press, ISBN 978-0262182539.

# Time series forecasting in Python for beginners

This semester I am teaching Engineering Management Methods here at Cornell University. The course is aimed at introducing engineering students to systems thinking and a variety of tools and analyses they can use to analyze data. The first chapter has been on time series forecasting, where we discussed some of the simpler models one can use and apply for forecasting purposes, including Simple and Weighted Moving Average, Single and Double Exponential Smoothing, Additive and Multiplicative Seasonal Models, and Holt Winter’s Method.

The class applications as well as the homework are primarily performed in Excel, but I have been trying, with limited success, to encourage the use of programming languages for the assignments. One comment I’ve received by a student has been that it takes significantly more time to perform the calculations by coding; they feel that it’s a waste of time. I initially attributed the comment to the fact that the student was new to coding and it takes time in the beginning, but on later reflection I realized that, in fact, the student was probably simply manually repeating the same Excel operations by using code: take a set of 30 observations, create an array to store forecasts, loop through every value and calculate forecast using model formula, calculate error metrics, print results, repeat steps for next set of data. It occurred to me that of course they think it’s a waste of time, because doing it that way completely negates what programming is all about: designing and building an executable program or function to accomplish a specific computing task. In this instance, the task is to forecast using each of the models we learn in class and the advantage of coding comes with the development of some sort of program or function that performs these operations for us, given a set of data as input. Simply going through the steps of performing a set of calculations for a problem using code is not much different than doing so manually or in Excel. What is different (and beneficial) is designing a code so that it can then be effortlessly applied to all similar problems without having to re-perform all calculations. I realize this is obvious to the coding virtuosos frequenting this blog, but it’s not immediately obvious to the uninitiated who are rather confused on why Dr. Hadjimichael is asking them to waste so much time for a meager bonus on the homework.

So this blog post, is aimed at demonstrating to coding beginners how one can transition from one way of thinking to the other, and providing a small time-series-forecasting toolkit for users that simply want to apply the models to their data.

The code and data for this example can be found on my GitHub page and I will discuss it below. I will be using a wine sales dataset that lists Australian wine sales (in kiloliters) from January 1980 to October 1991. The data looks like this:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
Date Sales 1/1/80 464 2/1/80 675 3/1/80 703 4/1/80 887 5/1/80 1139 6/1/80 1077 7/1/80 1318 8/1/80 1260 9/1/80 1120 10/1/80 963 11/1/80 996 12/1/80 960
view raw wine_sales.csv hosted with ❤ by GitHub

And this is what the time series looks like:

We first need to import the packages we’ll be using and load the data. I will be using Pandas in this example (but there’s other ways). I’m also defining the number of seasonal periods in a cycle, in this case 12.

import numpy as np #Package we'll use for numerical calculations
import matplotlib.pyplot as plt #From matplotlib package we import pyplot for plots
import pandas #Package to data manipulation
import scipy.optimize #Package we'll use to optimize
plt.style.use('seaborn-colorblind') #This is a pyplot style (optional)

'''Load the data into a pandas series with the name wine_sales'''
time_series = pandas.Series.from_csv("wine_sales.csv", header=0)

P=12 #number of seasonal periods in a cycle

In class, I’ve always mentioned that one should use a training and a validation set for model development, primarily to avoid overfitting our model to the specific training set. In this example, the functions are written as they apply to the training set. Should you choose to apply the functions listed here, you should apply the functions for the training set, extract forecasts and then use those to initialize your validation period. To divide the observations, you would do something like this:
training = time_series[0:108] # Up to December '88
validation = time_series[108:] # From January '89 until end


Now, if say, we wanted to apply the Naive model of the next steps forecast being equal to the current observation, i.e., $\hat{y}_{t+1}=y_t$, we’d do something like:


y_hat=pandas.Series().reindex_like(time_series) # Create an array to store forecasts
y_hat[0]= time_series[0] # Initialize forecasting array with first observation
''' Loop through every month using the model to forecast y_hat'''
for t in range(len(y_hat)-1): # Set a range for the index to loop through
y_hat[t+1]= time_series[t] # Apply model to forecast time i+1


Now if we’d like to use this for any time series, so we don’t have to perform our calculations every time, we need to reformat this a bit so it’s a function:


def naive(time_series):
y_hat=pandas.Series().reindex_like(time_series)
y_hat[0]= time_series[0] # Initialize forecasting array with first observation
''' Loop through every month using the model to forecast y'''
#This sets a range for the index to loop through
for t in range(len(y_hat)-1):
y_hat[t+1]= time_series[t] # Apply model to forecast time i+1
return y_hat


Now we can just call define this function at the top of our code and just call it with any time series as an input. The function as I’ve defined it returns a pandas.Series with all our forecasts. We can then do the same for all the other modeling methods (below). Some things to note:

• The data we read in the top, outside the functions, as well as any parameters defined (P in this case) are global variables and do not need to be defined as an input to the function. The functions below only need a list of parameter values as inputs.
• For the models with seasonality and/or trend we need to create separate series to store those estimates for E, S, and T.
• Each model has its own initialization formulas and if we wanted to apply them to the validation set that follows our training set, we’d need to initialize with the last values of our training.
'''SIMPLE MOVING AVERAGE
Using this model, y_hat(t+1)=(y(t)+y(t-1)...+y(t-k+1))/k (i.e., the predicted
next value is equal to the average of the last k observed values).'''
def SMA(params):
k=int(np.array(params))
y_hat=pandas.Series().reindex_like(time_series)
y_hat[0:k]=time_series[0:k]
''' Loop through every month using the model to forecast y.
Be careful with Python indexing!'''
for t in range(k-1,len(y_hat)-1): #This sets a range for the index to loop through
y_hat[t+1]= np.sum(time_series[t-k+1:t+1])/k # Apply model to forecast time i+1
return y_hat

'''WEIGHTED MOVING AVERAGE
Using this model, y_hat(t+1)=w(1)*y(t)+w(2)*y(t-1)...+w(k)*y(t-k+1) (i.e., the
predicted next value is equal to the weighted average of the last k observed
values).'''
def WMA(params):
weights = np.array(params)
k=len(weights)
y_hat=pandas.Series().reindex_like(time_series)
y_hat[0:k]=time_series[0:k] # Initialize values
''' Loop through every month using the model to forecast y.
Be careful with Python indexing!'''
for t in range(k-1,len(y_hat)-1): #This sets a range for the index to loop through
y_hat[t+1]= np.sum(time_series[t-k+1:t+1].multiply(weights)) # Apply model to forecast time i+1
return y_hat
'''This model includes the constraint that all our weights should sum to one.
To include this in our optimization later, we need to define it as a function of our
weights.'''
def WMAcon(params):
weights = np.array(params)
return np.sum(weights)-1

'''SINGLE EXPONENTIAL SMOOTHING
Using this model, y_hat(t+1)=y_hat(t)+a*(y(t)-y_hat(t))(i.e., the
predicted next value is equal to the weighted average of the last forecasted value and its
difference from the observed).'''
def SES(params):
a = np.array(params)
y_hat=pandas.Series().reindex_like(time_series)
y_hat[0]=time_series[0] # Initialize values
''' Loop through every month using the model to forecast y.
Be careful with Python indexing!'''
for t in range(len(y_hat)-1): #This sets a range for the index to loop through
y_hat[t+1]=  y_hat[t]+a*(time_series[t]-y_hat[t])# Apply model to forecast time i+1
return y_hat

'''DOUBLE EXPONENTIAL SMOOTHING (Holts Method)
Using this model, y_hat(t+1)=E(t)+T(t) (i.e., the
predicted next value is equal to the expected level of the time series plus the
trend).'''
def DES(params):
a,b = np.array(params)
y_hat=pandas.Series().reindex_like(time_series)
'''We need to create series to store our E and T values.'''
E = pandas.Series().reindex_like(time_series)
T = pandas.Series().reindex_like(time_series)
y_hat[0]=E[0]=time_series[0] # Initialize values
T[0]=0
''' Loop through every month using the model to forecast y.
Be careful with Python indexing!'''
for t in range(len(y_hat)-1): #This sets a range for the index to loop through
E[t+1] = a*time_series[t]+(1-a)*(E[t]+T[t])
T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
y_hat[t+1] = E[t] + T[t] # Apply model to forecast time i+1
return y_hat

Using this model, y_hat(t+1)=E(t)+S(t-p) (i.e., the
predicted next value is equal to the expected level of the time series plus the
appropriate seasonal factor). We first need to create an array to store our
forecast values.'''
def ASM(params):
a,b = np.array(params)
p = P
y_hat=pandas.Series().reindex_like(time_series)
'''We need to create series to store our E and S values.'''
E = pandas.Series().reindex_like(time_series)
S = pandas.Series().reindex_like(time_series)
y_hat[:p]=time_series[0] # Initialize values
'''We need to initialize the first p number of E and S values'''
E[:p] = np.sum(time_series[:p])/p
S[:p] = time_series[:p]-E[:p]
''' Loop through every month using the model to forecast y.
Be careful with Python indexing!'''
for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
E[t+1] = a*(time_series[t]-S[t+1-p])+(1-a)*E[t]
S[t+1] = b*(time_series[t]-E[t])+(1-b)*S[t+1-p]
y_hat[t+1] = E[t] + S[t+1-p] # Apply model to forecast time i+1
return y_hat

'''MULTIPLICATIVE SEASONAL
Using this model, y_hat(t+1)=E(t)*S(t-p) (i.e., the
predicted next value is equal to the expected level of the time series times
the appropriate seasonal factor). We first need to create an array to store our
forecast values.'''
def MSM(params):
a,b = np.array(params)
p = P
y_hat=pandas.Series().reindex_like(time_series)
'''We need to create series to store our E and S values.'''
E = pandas.Series().reindex_like(time_series)
S = pandas.Series().reindex_like(time_series)
y_hat[:p]=time_series[0] # Initialize values
'''We need to initialize the first p number of E and S values'''
E[:p] = np.sum(time_series[:p])/p
S[:p] = time_series[:p]/E[:p]
''' Loop through every month using the model to forecast y.
Be careful with Python indexing!'''
for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
E[t+1] = a*(time_series[t]/S[t+1-p])+(1-a)*E[t]
S[t+1] = b*(time_series[t]/E[t])+(1-b)*S[t+1-p]
y_hat[t+1] = E[t]*S[t+1-p] # Apply model to forecast time i+1
return y_hat

Using this model, y_hat(t+1)=(E(t)+T(t))*S(t-p) (i.e., the
predicted next value is equal to the expected level of the time series plus the
trend, times the appropriate seasonal factor). We first need to create an array
to store our forecast values.'''
def AHW(params):
a, b, g = np.array(params)
p = P
y_hat=pandas.Series().reindex_like(time_series)
'''We need to create series to store our E and S values.'''
E = pandas.Series().reindex_like(time_series)
S = pandas.Series().reindex_like(time_series)
T = pandas.Series().reindex_like(time_series)
y_hat[:p]=time_series[0] # Initialize values
'''We need to initialize the first p number of E and S values'''
E[:p] = np.sum(time_series[:p])/p
S[:p] = time_series[:p]-E[:p]
T[:p] = 0
''' Loop through every month using the model to forecast y.
Be careful with Python indexing!'''
for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
E[t+1] = a*(time_series[t]-S[t+1-p])+(1-a)*(E[t]+T[t])
T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
S[t+1] = g*(time_series[t]-E[t])+(1-g)*S[t+1-p]
y_hat[t+1] = E[t]+T[t]+S[t+1-p] # Apply model to forecast time i+1
return y_hat

'''MUTLIPLICATIVE HOLT-WINTERS METHOD
Using this model, y_hat(t+1)=(E(t)+T(t))*S(t-p) (i.e., the
predicted next value is equal to the expected level of the time series plus the
trend, times the appropriate seasonal factor). We first need to create an array
to store our forecast values.'''
def MHW(params):
a, b, g = np.array(params)
p = P
y_hat=pandas.Series().reindex_like(time_series)
'''We need to create series to store our E and S values.'''
E = pandas.Series().reindex_like(time_series)
S = pandas.Series().reindex_like(time_series)
T = pandas.Series().reindex_like(time_series)
y_hat[:p]=time_series[0] # Initialize values
'''We need to initialize the first p number of E and S values'''
S[:p] = time_series[:p]/(np.sum(time_series[:p])/p)
E[:p] = time_series[:p]/S[:p]
T[:p] = 0
''' Loop through every month using the model to forecast y.
Be careful with Python indexing!'''
for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
E[t+1] = a*(time_series[t]/S[t+1-p])+(1-a)*(E[t]+T[t])
T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
S[t+1] = g*(time_series[t]/E[t])+(1-g)*S[t+1-p]
y_hat[t+1] = (E[t]+T[t])*S[t+1-p] # Apply model to forecast time i+1
return y_hat


Having defined this, I can then, for example, call the Multiplicative Holt Winters method by simply typing:

MHW([0.5,0.5,0.5])


This will produce a forecast using the Multiplicative Holt Winters method with those default parameters, but we would like to calibrate them to get the “best” forecasts from our model. To do so, we need to define what we mean by “best”, and in this example I’m choosing to use Mean Square Error as my performance metric. I define it below as a function that receives the parameters and some additional arguments as inputs. I only need to set it up this way because my optimization function is trying to minimize the MSE function by use of those parameters. I’m using the “args” array to simply tell the function which model it’s using to forecast.

def MSE(params, args):
model, = args
t_error = np.zeros(len(time_series))
forecast = model(params)
for t in range(len(time_series)):
t_error[t] = time_series[t]-forecast[t]
MSE = np.mean(np.square(t_error))
return MSE


To perform the optimization in Excel, we’d use Solver, but in Python we have other options. SciPy is a Python package that allows us, among many other things, to optimize such single-objective problems. What I’m doing here is that I define a list of all the models I want to optimize, their default parameters, and the parameters’ bounds. I then use a loop to go through my list of models and run the optimization. To store the minimized MSE values as well as the parameter values that produce them, we can create an array to store the MSEs and a list to store the parameter values for each model. The optimization function produces a “dictionary” item that contains the minimized MSE value (under ‘fun’), the parameters that produce it (under ‘x’) and other information.

''' List of all the models we will be optimizing'''
models = [SES, DES, ASM, MSM, AHW, MHW]
''' This is a list of all the default parameters for the models we will be
optimizing. '''
#SES,  DES,     ASM
default_parameters = [[0.5],[0.5,0.5],[0.5,0.5],
#MSM,        AHW,            MHW
[0.5,0.5],[0.5,0.5,0.5],[0.5,0.5,0.5]]
''' This is a list of all the bounds for the default parameters we will be
optimizing. All the a,b,g's are weights between 0 and 1. '''
bounds = [[(0,1)],[(0,1)]*2, [(0,1)]*2,
[(0,1)]*2,[(0,1)]*3,[(0,1)]*3]
min_MSEs = np.zeros(len(models)) # Array to store minimized MSEs
opt_params = [None]*len(models) # Empty list to store optim. parameters
for i in range(len(models)):
res = scipy.optimize.minimize(MSE, # Function we're minimizing (MSE in this case)
default_parameters[i], # Default parameters to use
# Additional arguments that the optimizer
# won't be changing (model in this case)
args=[models[i]],
method='L-BFGS-B', # Optimization method to use
bounds=bounds[i]) # Parameter bounds
min_MSEs[i] = res['fun'] #Store minimized MSE value
opt_params[i] = res['x'] #Store parameter values identified by optimizer


Note: For the WMA model, the weights should sum to 1 and this should be input to our optimization as a constraint. To do so, we need to define the constraint function as a dictionary and include the following in our minimization call: constraints=[{‘type’:’eq’,’fun’: WMAcon}]. The number of periods to consider cannot be optimized by this type of optimizer.

Finally, we’d like to present our results. I’ll do so by plotting the observations and all my models as well as their minimized MSE values:

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1) # Create figure
ax.set_title("Australian wine sales (kilolitres)") # Set figure title
l1 = ax.plot(time_series, color='black', linewidth=3.0, label='Observations') # Plot observations
for i in range(len(models)):
ax.plot(time_series.index,models[i](opt_params[i]), label = models[i].__name__)
ax.legend() # Activate figure legend
plt.show()
print('The estimated MSEs for all the models are:')
for i in range(len(models)):
print(models[i].__name__ +': '+str(min_MSEs[i]))


This snippet of code should produce this figure of all our forecasts, as well as a report of all MSEs:

The estimated MSEs for all the models are:
SES: 133348.78
DES: 245436.67
ASM: 80684.00
MSM: 64084.48
AHW: 72422.34
MHW: 64031.19

The Multiplicative Holt Winters method appears to give the smallest MSE when applied to these data.

# From Frequentist to Bayesian Ordinary Linear Regression

In this post, I will briefly review the basic theory about Ordinary Linear Regression (OLR) using the frequentist approach and from it introduce the Bayesian approach. This will lay the terrain for a later post about Gaussian Processes. The reader may ask himself why “Ordinary Linear Regression” instead of “Ordinary Least Squares.” The answer is that least squares refers to the objective function to be minimized, the sum of the square errors, which is not used the Bayesian approach presented here. I want to thank Jared Smith and Jon Lamontagne for reviewing and improving this post. This post closely follows the presentation in Murphy (2012), which for some reason does not use “^” to denote estimated quantities (if you know why, please leave a comment).

# Ordinary Linear Regression

OLR is used to fit a model that is linear in the parameters to a data set $\mathcal{D}=\{\boldsymbol{X},\boldsymbol{y}\}$, where $\boldsymbol{X}$ represents the independent variables and $\boldsymbol{y}$ the dependent variable, and has the form

$p(y|\boldsymbol{x}, \boldsymbol{\beta}) = \mathcal{N}(y|\boldsymbol{\beta}^T\boldsymbol{x}, \sigma_{\epsilon}^2)$

where $\boldsymbol{\beta}$ is a vector of model parameters and $\sigma_{\epsilon}^2$ is the model error variance. The unusual notation for the normal distribution $\mathcal{N}$ should be read as “a normal distribution of $\boldsymbol{y}$ with (or given) mean $\boldsymbol{\beta}^T\boldsymbol{x}$ and variance $\sigma_{\epsilon}^2$.” One frequentist method of estimating the parameters of a linear regression model is the method of maximum likelihood estimation (MLE). MLE provides point estimates $\boldsymbol{\hat{\beta}}$ for each of the regression model parameters $\boldsymbol{\beta}$ by maximizing the likelihood function with respect to $\boldsymbol{\beta}$ and $\sigma_{\epsilon}^2$ by solving the maximization problem

$\boldsymbol{\hat{\beta}} = arg\,\underset{\boldsymbol{\beta}}{max}\;\mathcal{L}(\boldsymbol{\beta})$

where the likelihood function $\mathcal{L}(\boldsymbol{\beta})$ is defined as

$\mathcal{L}(\boldsymbol{\beta})=p(\mathcal{D}|\boldsymbol{\beta})={\displaystyle \prod_{i=1}^{N} p(y_i|\boldsymbol{x}_i, \boldsymbol{\beta}, \mu, \sigma_{\epsilon}^2)} = \mathcal{N}(\boldsymbol{y}|\mu + \boldsymbol{X}\boldsymbol{\beta},\sigma_{\epsilon}^2\boldsymbol{I})$

where $\mu$ is the mean of the model error (normally $\mu=0$), $N$ is the number of points in the data set, $\boldsymbol{I}$ is an identity matrix, and $\boldsymbol{x}$ is a vector of values of the independent variables for an individual data point of matrix $\boldsymbol{X}$. OLR assumes independence and the same model error variance among observations, which is expressed by the covariance matrix $\sigma_{\epsilon}^2\boldsymbol{I}$. Digression: in Generalized Linear Regression (GLR), observations may not be independent and may have different variances (heteroscedasticity), so the covariance matrix may have off-diagonal terms and the diagonal terms may not be equal.

If a non-linear function shape is sought, linear regression can be used to fit a model linear in the parameters over a function $\phi(\cdot)$ of the data — I am not sure why the machine learning community chose $\phi$ for this function, but do not confuse it with a normal distribution. This procedure is called basis function expansion and has a likelihood function of the form

$\mathcal{L}(\boldsymbol{\beta}) = \mathcal{N}(\boldsymbol{y}|\mu + \phi(\boldsymbol{X})\boldsymbol{\beta},\sigma_{\epsilon}^2\boldsymbol{I})$

where $\phi(\boldsymbol{x})$ can have, for example, the form

$\phi(\boldsymbol{x})=\begin{pmatrix}1\\ x_1\\ x_1^2\end{pmatrix}$

for fitting a parabola to a one-dimensional $\boldsymbol{X}$ over $\boldsymbol{y}$. When minimizing the squared residuals we get to the famous Ordinary Least Squares regression. Linear regression can be further developed, for example, into Ridge Regression to better handle multicollinearity by introducing bias to the parameter estimates, and into Kernel Ridge regression to implicitly add non-linear terms to the model. These formulations are beyond the scope of this post.

What is important to notice is that the standard approaches for linear regression described here, although able to fit linear and non-linear functions, do not provide much insight into the model errors. That is when Bayesian methods come to play.

# Bayesian Ordinary Linear Regression

In Bayesian Linear regression (and in any Bayesian approach), the parameters $\boldsymbol{\beta}$ are treated themselves as random variables. This allows for the consideration of model uncertainty, given that now instead of having the best point estimates for $\boldsymbol{\beta}$ we have their full distributions from which to sample models — a sample of  $\boldsymbol{\beta}$ corresponds to a model. The distribution of parameters $\boldsymbol{\beta}$, $P(\boldsymbol{\beta}|\mathcal{D})$, called the parameter posterior distribution, is calculated by multiplying the likelihood function used in MLE by a prior distribution for the parameters. A prior distribution is assumed from knowledge prior to analyzing new data. In Bayesian Linear Regression, a Gaussian distribution is commonly assumed for the parameters. For example, the prior on $\boldsymbol{\beta}$ can be $\mathcal{N}(\boldsymbol{\beta_0},\boldsymbol{V}_0)$ for algebraic simplicity. The parameter posterior distribution assuming a known $\sigma_{\epsilon}^2$ then has the form

$p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2) =\frac{p(\mathcal{D}|\boldsymbol{\beta}, \sigma_{\epsilon}^2)p(\boldsymbol{\beta})}{p(\mathcal{D})} \propto p(\mathcal{D}|\boldsymbol{\beta}, \sigma_{\epsilon}^2)p(\boldsymbol{\beta})$  ,        given $p(\mathcal{D})$ is a constant depending only on the data.

We now have an expression from which to derive our parameter posterior for our linear model from which to sample $\boldsymbol{\beta}$. If the likelihood and the prior are Gaussian, the parameter posterior will also be a Gaussian, given by

$p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2) \propto \mathcal{N}(\boldsymbol{\beta}|\boldsymbol{\beta}_0, \boldsymbol{V}_0)\mathcal{N}(\boldsymbol{y}|\boldsymbol{X\beta},\sigma_{\epsilon}^2\boldsymbol{I})=\mathcal{N}(\boldsymbol{\beta}|\boldsymbol{\beta}_N,\boldsymbol{V}_N)$
where
$\boldsymbol{V}_N = \sigma_{\epsilon}^2(\sigma_{\epsilon}^2\boldsymbol{V}_0^{-1}+\boldsymbol{X}\boldsymbol{X}^T)^{-1}$
$\boldsymbol{\beta}_N=\boldsymbol{V}_N\boldsymbol{V}_0^{-1}\boldsymbol{\beta}_0 + \frac{1}{\sigma_{\epsilon}^2}\boldsymbol{V}_N\boldsymbol{X}\boldsymbol{y}$

If we calculate the parameter posterior (distribution over parameters $\boldsymbol{\beta}$) for a simple linear model $f(x) = \beta_0 + \beta_1x$ for a data set $\mathcal{D}$ in which $\boldsymbol{X}$ and $\boldsymbol{y}$ are approximately linearly related given some noise $\sigma_{\epsilon}^2$, we can use it to sample values for $\boldsymbol{\beta}$. This is equivalent to sampling linear models $f(x)$ for data set $\mathcal{D}$. As the number of data points increase, the variability of the sampled models should decrease, as in the figure below

This is all interesting but the parameter posterior per se is not of much use. We can, however, use the parameter posterior to find the posterior predictive distribution, which can be used to both get point estimates of $y$ and the associated error. This is done by multiplying the likelihood by the parameter posterior and marginalizing the result over $\boldsymbol{\beta}$. This is equivalent to performing infinite sampling of blue lines in the example before to form density functions around a point estimate of $\mu_{y_*}=f(\boldsymbol{x}_*)$, with $(\boldsymbol{x}_*,y_*)$ denoting a new point that is not in $\mathcal{D}$. If the likelihood and the parameter posterior are Gaussian, the posterior predictive then takes the form below and will also be Gaussian (in Bayesian parlance, this means that the Gaussian distribution is conjugate to itself)!

\begin{aligned}p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2) &= \int_{\boldsymbol{\beta}}\underbrace{p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2,\boldsymbol{\beta})}_\text{Likelihood}\underbrace{p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2)}_\text{\parbox{1.6cm}{Parameter Posterior}}d\boldsymbol{\beta}\\ &=\mathcal{N}(y_*|\boldsymbol{\beta}_N^T\boldsymbol{x}_*, \sigma_N^2(\boldsymbol{x}_*))\end{aligned}

where

$\sigma_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{x}_*^{T}\boldsymbol{V}_N\boldsymbol{x}_*$

or, to put it simply,

$y_* = f(\boldsymbol{x}_*) = \boldsymbol{\beta}_N^T\boldsymbol{x}_* \pm \sigma_N^2(\boldsymbol{x}_*)$

The posterior predictive, meaning final linear model and associated errors (the parameter uncertainty equivalent of frequentist statistics confidence intervals) are shown below

If, instead of having a data set $\mathcal{D}$ in which $\boldsymbol{X}$ and $\boldsymbol{y}$ are related approximately according to a 4th order polynomial, we use a $\phi(\cdot)$ function to artificially create more random variables (or features, in machine learning parlance) corresponding to the non-linear terms of the polynomial function. Our function $\phi(\cdot)$ would be $\phi(x) = [1, x, x^2, x^3, x^4]^T$ and the resulting $X$ would therefore have five columns instead of two ($[1, x]^T$), so now the task is to find $\boldsymbol{\beta}=[\beta_0, \beta_1, \beta_2, \beta_3, \beta_4]^T$. Following the same logic as before for $X'=\phi(X)$ and $x_*'=\phi(x_*)$, where prime denotes the new set random variable $x$ from function $\phi(\cdot)$, we get the following plots

This looks great, but there is a problem: what if we do not know the functional form of the model we are supposed to fit (e.g. a simple linear function or a 4th order polynomial)? This is often the case, such as when modeling the reliability of a water reservoir system contingent on stored volumes, inflows and evaporation rates, or when modeling topography based on samples surveyed points (we do not have detailed elevation information about the terrain, e.g. a DEM file). Gaussian Processes (GPs) provide a way of going around this difficulty.

# References

Murphy, Kevin P., 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.