Welcome to our blog!

Welcome to Water Programming! This blog is a collaborative effort by Pat Reed’s group at Cornell, Joe Kasprzyk’s group at CU Boulder, Jon Herman’s group at UC Davis, and others who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software:  Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here. To gain access please email joseph.kasprzyk “at” colorado.edu.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email joseph.kasprzyk “at” colorado.edu. You’ll need a WordPress.com account.


Introduction to DiscoveryDV

In this blog post, I will outline some of the capabilities of DiscoveryDV, a software created by Josh Kollat, a former PhD student of Dr. Reed and the CEO and founder of DecisionVis. DiscoveryDV allows users to visualize and explore tradeoffs in high dimensional data using pareto sort techniques, a variety of plotting tools, and animations. I will demonstrate these functionalities using a problem that Josh first explored when he was developing VIDEO (Visually Interactive Decision-Making and Design Using Evolutionary Multi-Objective Optimization), a precursor to DiscoveryDV. DiscoveryDV is not available for direct download, but you can request access here.


The above linked paper uses a long-term groundwater monitoring case study to demonstrate how visualization tools can be helpful in exploring large multi-objective solution sets to aid in decision making. The case study premise is as follows: A PCE contamination plume from an underground storage tank is spreading through an aquifer. PCE data samples are available at 25 pre-determined well sampling locations and there are 1-3 sampling points along the vertical axis of the well that can also be sampled. The issue is addressed through a multi-objective optimization problem with four conflicting objectives.

  1. Minimize Cost: Cost is equivalent to the number of wells chosen to be sampled.
  2. Minimize Concentration Error: Concentration estimation error is calculated between a Kriged map of the plume utilizing all available sampling locations and a Kriged map of the plume which utilizes a subset of the sampling locations.
  3. Minimize Uncertainty: The sum of the estimated variances attained for each estimation location
  4. Minimize Mass: Reflects the error between the total mass of PCE estimated by Kriging the domain based on all available well sampling locations, and the estimated mass of PCE obtained by Kriging the domain based on a subset of well sampling locations.

There are 33 million possible solutions to this optimization problem, so let’s use DiscoveryDV to dig in!

Starting a Project in DiscoveryDV

Figure 1 is a screenshot of the DiscoveryDV interface which is very simple and easy to use. All data sets and plots can be accessed from the storyboard on the right.


Figure 1: DiscoveryDV Interface

There are many pre-loaded data sets in DiscoveryDV and we will be using the LTM example.

  1. From the Examples tab in the toolbar, choose LTM. Notice that a new page has been created in your storyboard.
  2. If you click the > by LTM, you will see the attributes for the project: data files, plots, brushing rules, snapshots, etc.

By clicking the example, DiscoveryDV will load in a CSV file with the pareto optimal results. There are 2569 pareto optimal solutions, so what solution do you choose? This is a very real dilemma for stakeholders. The first step is to visualize the pareto front.

Creating a 3D Plot of the Pareto Surface

Since the problem formulation has 4 objectives, you should expect to see a 3D surface. We can plot the results of the 4th objective as the color of our surface.

In order to make this plot, you can use the N-Dimensional Plot tool. Right click on Plots in the storyboard and choose Create plot -> N-Dimensional Plot. Plot Cost, Error, and Risk on the X,Y, and Z axes respectively and Mass as color. You can rotate the graph and zoom in and out. You can show the utopia point by choosing it in the Glyphs menu. You can also save this image into a PowerPoint by choosing Share>Send To PowerPoint from the plot menu. Finally, you can create an animation, as shown in Figure 2.


Figure 2: Pareto Surface GIF

It is interesting to note that by solving for the high-order Pareto-optimal surface for four objectives, all the sub-problems based on three objectives, two objectives, or even a single objective are implicitly solved at the same time. It can be unwieldy to look at a whole pareto surface, so sometimes it is advantageous to just look at two-objective pareto fronts to gain a better understanding of the behavior of the objectives.

Creating a 2D Plot to Showcase Two-Objective Tradeoffs

A quick way to visualize the 2D tradeoffs is by using the 2D scatter matrix plot (Figure 3). Along the diagonals are histograms that show the breakdown of solutions in bins. The rest of the plots show the tradeoffs. By hovering over a point, you can see what bin that point lies in along with its location on the other relevant 2D plots.

Figure 3: Two-Objective Tradeoffs

However, we are not necessarily interested in all of the solutions portrayed in the 2D scatter. Rather we would like to see the pareto fronts of the 2-objective formulations to better dissect what is happening in the 3D surface. The 2D scatter matrix doesn’t have pareto sorting functionality, so let’s look at one of these tradeoffs more closely using the XY Plot tool.

  1. Right click on Plots in the storyboard and choose Create plot -> XY Plot.
  2. Plot Cost and Error on the X and Y axis respectively
  3. Perform a pareto sort. The dominated data points should turn transparent.

Figure 4: Two-objective Pareto front

From Figure 4, it is clear that increasing cost will decrease error which is somewhat intuitive, but maybe wasn’t clear from the 3D plot. This is the advantage of looking at two-objective tradeoffs. It is up to the decision maker to determine what criteria are important to them, and 2D plots can help them elicit these preferences. Is it acceptable to be cheap but have 35% error? When cost is low, increasing the cost marginally will decrease the error a lot, but when does it become too costly to minimize error? Pareto sorting on a 2D plot gives a clear idea of the optimal solutions for different combinations of objectives and an idea of the tradeoffs associated with those decisions. Furthermore, in DiscoveryDV, a pareto sort can be performed with a single click.

Brushing to Elicit Preferences

At this point, the stakeholder should have a better understanding of the tradeoffs that exist, but we need some way to help narrow down the set of possible options. One way to do this is to brush solutions based on stakeholder preference. The stakeholder can decide on an acceptable range for each objective from the full ranges shown below.



Minimize Cost 7-46 sampled locations
Minimize Concentration Error 0-34.5 (Kg PCE per Cubic Meter of Aquifer)
Minimize Uncertainty 1284-1563 (Kg PCE per Cubic Meter Aquifer)2
Minimize Mass


0-48.64 (Kg PCE)

In order to implement the ranges, expand the brushing tab and add brushing rules for each objective. Adjust to your pre-specified ranges. On the “hiding” axis of the 3D plot, pick the “brushing” option. Only the solutions that meet the range requirements will remain (Figure 5).

Figure 5: Example of a Brushed Pareto Front

Parallel Axis Plots to View Tradeoffs

Now let’s create a set from our brushed data for further inspection. Click the analysis menu option and then click Create Set From > Brushed on Data. On your storyboard, you will see a new member, A, under your set membership. Now we will use a new type of plot, a parallel axis plot, to view the tradeoffs among these solutions.

  1. Right click on Plots in the storyboard and choose Create plot -> Parallel Coordinate Plot.
  2. Place your objectives of interest on the plot: Cost, Error, Risk, and Mass. Perhaps the stakeholder is particularly interested in cost. Let’s place cost on the color axis for extra visibility.

Each line in the parallel axis plot represents one of our solutions and where the lines cross between two axes indicates a tradeoff. Now, on the hiding axis, specify “brushing”. This will narrow down your parallel axis plot to only your brushed solutions.

Figure 6: Full and Brushed Parallel Axis Plots

You can also record an animation of your parallel axis plot and save it as a GIF (Figure 7)! In this GIF, I cycled through the solutions from lowest to highest cost.


Figure 7: Parallel Axis Plot GIF

From here, it is completely up to the decision maker to decide which solution is best. You can double click on solutions of interest in your parallel axis plot to mark them. They will show up as M1 and M2 the marking tab of your storyboard. Then you can then mark them in your 3D plot using the marking axis and add captions for visualization purposes (Figure 8).


Figure 8: Brushed 3D Plot with Caption

A user would typically take on the order of hours to create these visualizations from scratch, but in DiscoveryDV, most graphs can be created with a few clicks of the mouse. This is especially usefully when you are trying to quickly visualize large amounts of data. There is quite a bit more functionality that DiscoveryDV offers that wasn’t explored in this blog post, but once the user understands the interface, more complicated graphs can be made such as Figure 9, which is the culminating figure from the linked paper.

Figure 9: Negotiated Design Selection using 2D Tradeoffs


Google Earth Pro: the cool, the trade-offs and the limits

This post is an account of my recent experience with Google Earth Pro (GPE), a free online tool meant to make visualization of geographical information intuitive and accessible. This account contains tidbits that I thought others might find useful, but it is not meant to be an unbiased or comprehensive resource.

My goal in using GPE was to set up a short video tour of the reservoirs on the main stem of the Upper Snake River, which has its upper reaches around the Teton Range (WY, USA) before going through much of southern Idaho. Here is the video, produced with instructions from this post from the blog:


The cool

GPE is indeed an intuitive way for people that have only a limited experience of geographic information systems (GIS) to put together nice-looking videos in a limited amount of time. It relies on the increasingly large amounts of geographic data available online. For instance, polygons made of points with geographical coordinates, used to delineate political boundaries or catchments among others, can be found increasingly easily, be it under formats specific to GPE (KML, KMZ), or traditional GIS format such as SHP (shapefile format, which can be imported to GPE with… “File => Import”, I told you it was intuitive). This capability to find the right data could be improved by the recent launch of a dataset search engine by Google.

This video, after several tests, superposed three layers on the satellite images of the land surface:

  1. Pins of the dams’ locations. Please refer to this post to learn all you need to know about setting and customizing such new placemarks.
  2. The network of all streams with an average flow of more than 10 m3/s in the Columbia River basin. This is mainly for easy visualization of where the Snake River and its affluents are. That data was obtained from this useful USGS resource, with all the major basins in the US having data in GPE-ready KML format (basin boundaries, tributaries with their average flows, landuse, etc.).
  3. An invisible layer contains the shapes of the reservoirs’ lakes, in order to zoom in on the reservoir and not the dam itself. I got the shapes of waterbodies in the Upper Snake River basin from an amazing resource for water resources practitioners (see “The limits” below to see how to access this resource, but also to understand why it must be handled with care). Hint: since the list of waterbodies is dauntingly large and includes every little puddle in the area, I had to zoom in to the desired reservoir so as to only select its shape, by doing “File => Import”, then when prompted, by choosing to only import the features in my line of sight.

The trade-offs

The trade-offs when doing this kind of short intro video to your study area, are between video quality and required memory. This is especially true if the video is of a rather large area, as the GPE satellite images embed a level of detail that is amazing at the scale of that basin. Therefore, when creating my video, I had to resort to several tries before getting  a correct quality, and this video eats up 500Mo of disk space for 43 seconds (!!!!). Any video with lower resolution just looked downright disgusting, whereas any larger video may have serious problem running on your laptop. Think of it as a Goldilocks zone that you have to find if you want to have a video you can embed in an oral presentation.

(NB: to embed this video in a presentation in a fullproof kind of way, the easiest is to embed a link from the PPT file to the video in the same folder).

The limits

The limits of Google Earth are with its plain refusal to display features it deems too big. To understand this, let us look at this USGS webpage where a sizable of hydrological information can be downloaded. In particular it is possible to download data exactly for the upper Snake area (Subregion 1704 on the picture below). This data can include waterbodies as discussed above (in “the cool”) and that is contained in the NHD data ticket in the picture. By ticking “Watershed Boundary Dataset (WBD)”, one can also download basin and subbasin boundaries.


Why then did I not represent Upper Snake boundaries in my video? Well, the Upper Snake polygon, under SHP format at least, is too big so GPE just… refused to display it. I tried to represent the HU-6 and HU-8 subbasins (smaller than the HU-4-sized Upper Snake under this classification system, and:

  • The HU-6 subbasin just divides the basin into “headwaters” (smaller part upstream of Palisades) and “the rest”; the former feature was smaller and GPE plotted it, but it did not plot the latter.
  • The HU-8 subbasins all are individually smaller features that GPE accepts to plot. So I could have plotted the watershed boundaries by plotting ALL of the HU-8 subbasins, but spoiler alert: this looked horrendous.

Takeaway: GPE only displays features that individually have limited size. So maybe using a more powerful and recent desktop computer than the one I had would have done the trick… but be aware that there will always be a limit. Also note that assuming that GPE is just taking its time loading the data, and that staring blankly at the screen in the meantime is not going to be too painful, is not a good idea. Take a nap instead, or do something else, then admit that the damn thing just did not display.

A solution for this, of course, is to load the larger features on GIS software and making them lighter by eliminating points in the polygon without altering the shape… but that kinda beats the purpose of GPE, which is to avoid having to become a GIS geek, doesn’t it?

A Beginner’s Guide to Visualizing Trade Offs across Multiple Objectives using the Python Bokeh Library

Python libraries can be very helpful in making visualizations. One particular library I like using is the Bokeh library. In this post, I will share how this library can be used for visualizing trade offs across multiple objectives.

First import the libraries we need for our plots:

import pandas as pd
from bokeh.io import output_file, show #for outputting and showing plots
from bokeh.plotting import figure #for creating figures, which are the plot area in this case
from bokeh.models.tools import HoverTool #for importing the hover tool
from bokeh.models import ColumnDataSource #which we will ultimately use for linking plots
from bokeh.models.widgets import Panel, Tabs #for making tabs

Basic Plotting

Let’s start by creating a basic plot for a three objective DTLZ2, where the first and second objectives are plotted on the x and y axes respectively, and the third objective is the size of the plotted points.

# create a new plot using the figure function
p = figure(title="DTLZ2 with 3 objectives",plot_width=400, plot_height=400)

# add axis titles
p.xaxis.axis_label = "Objective 1"
p.yaxis.axis_label = "Objective 2"

# upload your data
paretoFront = pd.read_csv("C:/Users/Sarah/Desktop/dmh309-serial-borg-moea-2c7702638d42/dtlz2obj.csv") 

# assign data to arrays

# plots points
paretoPlot = p.circle(obj1, obj2, size=obj3*10, # objective 3 is the shown by the size of the points; I multiplied by 10 to make them clearer
       fill_color="grey", # for more color options, check out https://bokeh.pydata.org/en/latest/docs/reference/colors.html
       fill_alpha=0.6, # alpha is the transparency of the points
       line_color=None) # line_color is the outline of the points

# show the results

The figure below shows what your plot will look like. Notice the default tools that are outlined on the right of the plot.

figure 1

More useful tools

Although Bokeh plots come with a default set of useful tools, there are other tools which you could add to make your plots more interactive.

To demonstrate some of these useful tools, I will create a figure with three tabs. Each tab will contain a view of the Pareto front for two of the three objectives. In addition, these tabs will be linked, so that if you make a selection on the Pareto front in one of the tabs, the same points will be selected in the other views. I will also demonstrate the use of the vertical, horizontal, and mouse over hover tools, and add data labels for the mouse over hover tool.

paretoFront = pd.read_csv("C:/Users/Sarah/Desktop/dmh309-serial-borg-moea-2c7702638d42/dtlz2obj_reduced.csv") 


# create a column data source for the plots; this will allow us to link the plots we create
source = ColumnDataSource(data=dict(x=obj1, y=obj2, z=obj3))

# define the tools you want to use; if you are defining tools, you need to list all the tools you want, including the default Bokeh tools
TOOLS = "box_select,lasso_select,help,pan,wheel_zoom,box_zoom,reset,save"

# create a new plot for each view/tab using figure
p1 = figure(tools=TOOLS,title="DTLZ2 (Objective 1, Objective 2)",plot_width=400, plot_height=400)
p2 = figure(tools=TOOLS,title="DTLZ2 (Objective 1, Objective 3)",plot_width=400, plot_height=400)
p3 = figure(tools=TOOLS,title="DTLZ2 (Objective 2, Objective 3)",plot_width=400, plot_height=400)

# define functions for hovering
def verticalhover(p,paretoPlot):
    p.add_tools(HoverTool(tooltips=None, renderers=[paretoPlot], mode='vline'))

def horizontalhover(p,paretoPlot):
    p.add_tools(HoverTool(tooltips=None, renderers=[paretoPlot], mode='hline'))

def pointhover(p, paretoPlot, obja, objb):
    p.add_tools(HoverTool(tooltips=[(obja, "$x" ), (objb, "$y")],mode="mouse", point_policy="follow_mouse", renderers=[paretoPlot]))
# note that we added 'tooltips' here, this gives the values (like excel data labels) of the points when you hover over them

# plot 1
paretoPlot1 = p1.circle('x', 'y',
                        hover_fill_color="crimson", hover_alpha=0.8, # this changes the color and transparecy of the points when we hover over them
                        line_color=None, hover_line_color="white", source=source) # we will use 'source' in all our plots, because we want to link them by having them share the same sources data

p1.xaxis.axis_label = "Objective 1"
p1.yaxis.axis_label = "Objective 2"

# add the hover tools
pointhover(p1, paretoPlot1, "objective 1", "objective 2")

# add the panel for the first tab
tab1 = Panel(child=p1, title="obj1, obj2")

# repeat the same steps for tabs 2 and 3 (i.e. objective 1 vs. objective 3 and objective 2 vs. objective 3)
# plot 2

paretoPlot2 = p2.circle('x','z',
                        fill_color='grey', hover_fill_color="crimson",
                        fill_alpha=0.7, hover_alpha=0.8,
                        line_color=None, hover_line_color="white", source=source)

p2.xaxis.axis_label = "Objective 1"
p2.yaxis.axis_label = "Objective 3"

pointhover(p2, paretoPlot2, "objective 1", "objective 3")

tab2 = Panel(child=p2, title="obj1, obj3")

# plot 3

paretoPlot3 = p3.circle('y', 'z',
                        fill_color='grey', hover_fill_color="crimson",
                        fill_alpha=0.7, hover_alpha=0.8,
                        line_color=None, hover_line_color="white", source=source)

p3.xaxis.axis_label = "Objective 2"
p3.yaxis.axis_label = "Objective 3"

pointhover(p3, paretoPlot3, "objective 2", "objective 3")

tab3 = Panel(child=p3, title="obj2, obj3")

# finally create the tabs and show your plot!
tabs = Tabs(tabs=[ tab1, tab2, tab3 ])

The figure below shows the plot you will obtain:

figure 2

Now, let’s explore the tools that we added:

  • Selection across multiple tabs: use lasso or box select to select some points in your first tab. When you move to the second and third tabs, the same points will be selected in different views.figure 3
  • Vertical and Horizontal Hover: select either the vertical or horizontal hover tool from the right toolbar. Make sure you deselect all other tools. The vertical hover will color in all the points that lie in the vertical line that crosses the point you are hovering over, and the horizontal hover does the same thing horizontally.figure 4
  • Point Hover and Data Labels: Select the point hover tool, deselect all the other tools, and hover over the points. When you hover over a point, you will get the values for the objectives for that point.figure 5



Introduction to Gaussian Processes

Introduction to Gaussian Processes

In this post, I will cover the very basics of Gaussian Processes following the presentations in Murphy (2012) and Rasmussen and Williams (2006) — though hopefully easier to digest. I will approach the explanation in two ways: (1) derived from Bayesian Ordinary Linear Regression and (2) derived from the definition of Gaussian Processes in functional space (it is easier than it sounds). I am not sure why the mentioned books do not use “^” to denote estimation (if you do, please leave a comment), but I will stick to their notation assuming you may want to look into them despite running into the possibility of an statistical sin. Lastly, I would like to thank Jared Smith for reviewing this post and providing highly statistically significant insights.

If you are not familiar with Bayesian regression (see my previous post), skip the section below and begin reading from “Succinct derivation of Gaussian Processes from functional space.”

From Bayesian Ordinary Linear Regression to Gaussian Processes

Recapitulation of Bayesian Ordinary Regression

In my previous blog post, about Bayesian Ordinary Linear Regression (OLR), I showed that the predictive distribution for a point \boldsymbol{x}_* for which we are trying to estimate y_* with Bayesian OLR is

\begin{aligned}p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2) &= \int_{\boldsymbol{\beta}}\underbrace{p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2,\boldsymbol{\beta})}_\text{Likelihood}\underbrace{p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2)}_\text{\parbox{1.6cm}{Parameter Posterior}}d\boldsymbol{\beta}\\  &=\mathcal{N}(y_*|\boldsymbol{\beta}_N^T\boldsymbol{x}_*, \sigma_N^2(\boldsymbol{x}_*))\end{aligned}

where we are trying to regress a model over the data set \mathcal{D}=\{\boldsymbol{X},\boldsymbol{y}\}, \boldsymbol{X} being the independent variables and \boldsymbol{y} the dependent variable, \boldsymbol{\beta} is a vector of model parameters and \sigma_{\epsilon}^2 is the model error variance. The unusual notation for the normal distribution \mathcal{N} means “a normal distribution of y_* with (or given) the regression mean \boldsymbol{\beta}_N^T\boldsymbol{x}_* and variance \sigma_N^2(\boldsymbol{x}_*),” where N is the number of points in the data set. The estimated variance of \boldsymbol{y}_*, \sigma_N^2(\boldsymbol{x}_*) and the estimated regression parameters \boldsymbol{\beta}_N can be calculated as

\boldsymbol{V}_N = \sigma_{\epsilon}^2(\sigma_{\epsilon}^2\boldsymbol{V}_0^{-1}+\boldsymbol{X}^T\boldsymbol{X})^{-1}
\boldsymbol{\beta}_N=\boldsymbol{V}_N\boldsymbol{V}_0^{-1}\boldsymbol{\beta}_0 + \frac{1}{\sigma_{\epsilon}^2}\boldsymbol{V}_N\boldsymbol{X}^T\boldsymbol{y}
\hat{\sigma}_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{x}_*^{T}\boldsymbol{V}_N\boldsymbol{x}_*

where \boldsymbol{V}_0 and \boldsymbol{\beta}_0 are the mean and the covariance of the parameter prior distribution. All the above can be used to estimate y_* as

y_* = f(\boldsymbol{x}_*) = \boldsymbol{\beta}_N^T\boldsymbol{x}_* with an error variance of \sigma_N^2(\boldsymbol{x}_*)

If we assume a prior mean of 0 (\boldsymbol{\beta}_0=0), replace \boldsymbol{V}_N and \boldsymbol{\beta}_N by their definitions, and apply a function \phi(\boldsymbol{x}), e.g. \phi(\boldsymbol{x})=(1, x_1, x_2, x_1x_2, x_1^2 ...)^T, to add features to \boldsymbol{X} so that we can use a linear regression approach to fit a non-linear model over our data set \mathcal{D}, we would instead have that

\boldsymbol{\beta}_N = \boldsymbol{\phi}_*^T( \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_X\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{\phi}_*^{T}( \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_*

where \boldsymbol{\phi}_* = \phi(\boldsymbol{x}_*) and \boldsymbol{\phi}_X = \phi(\boldsymbol{X}). One problem with this approach is that as we add more terms (also called features) to function \phi(\cdot) to better capture non-linearities, the the dimension of the matrix we will have to invert increases. It is also not always obvious what features we should add to our data. Both problems can be handled with Gaussian Processes.

Now to Gaussian Processes

This is where we transition from Bayesian OLR to Gaussian Processes. What we want to accomplish in the rest of this derivation is to find a way of: (1) adding as many features to the data as we need to calculate \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) without increasing the size of the matrix \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T we have to invert (remember the dimensions of such matrix equals the number of features in phi(\cdot)), and of (2) finding a way of implicitly adding features to \boldsymbol{X} and \boldsymbol{x}_* without having to do so manually — manually adding features to the data may take a long time, specially if we decide to add (literally) an infinite number of them to have an interpolator.

The first step is to make the dimensions of the matrix we want to invert equal to the number of data points instead of the number of features in $\latex \phi(\cdot)$. For this, we can re-arrange the equations for \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) as

\boldsymbol{\beta}_N =\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_*( \sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_X)^{-1}\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = \boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_*-\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_X(\sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{\phi}_X)^{-1}\boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_*

We now have our feature-expanded data and points for which we want point estimates always appearing in the form of \phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x'})\boldsymbol{x}' is just another point \boldsymbol{x}_i either in the data set or for which we want to estimate y_*. From now on, \kappa(x, x') = \phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x'}) will be called a covariance or kernel function. Since the prior’s covariance \boldsymbol{V}_0 is positive semi-definite (as any covariance matrix), we can write

\kappa(\boldsymbol{x}, \boldsymbol{x}') = \phi(\boldsymbol{x})^T (\boldsymbol{V}_0^{1/2})^T(\boldsymbol{V}_0^{1/2})\phi(\boldsymbol{x}')=\psi(\boldsymbol{x})^T\psi(\boldsymbol{x}')

where \psi(\boldsymbol{x})=\boldsymbol{V}_0^{1/2}\phi(\boldsymbol{x'}). Using \kappa(\boldsymbol{x}, \boldsymbol{x}') and assuming a prior of \boldsymbol{I}, \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) then take a new shape in the predictor posterior of a Gaussian Process:

\boldsymbol{\beta}_N =K_*( \sigma_{\epsilon}^2\boldsymbol{I} + K)^{-1}\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = k_{**}-\boldsymbol{k}_*(\sigma_{\epsilon}^2\boldsymbol{I}+K)^{-1}\boldsymbol{k}_*^T

where K = \kappa(\boldsymbol{\phi}_X,\boldsymbol{\phi}_X), \boldsymbol{k}_* = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_X) and k_{**} = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_*). Now the features of the data are absorbed within the inner-products between observed \boldsymbol{x}‘s and or for \boldsymbol{x}_*, so we can add as many features as we want without impacting the dimensions of the matrix we will have to invert. Also, after this transformation, instead of \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) representing the mean and covariance between model parameters, they represent the mean and covariance of and among data points and/or points for which we want to get point estimates. The parameter posterior from Bayesian OLR would now be used to sample the values of y_* directly instead of model parameters. And now we have a Gaussian Process with which to estimate y_*. For plots of samples of \boldsymbol{y}_* from the prior and from the predictive posterior, of the mean plus or minus the error variances, and of models with different kernel parameters, see figures at the end of the post.

Kernel (or Covariance) matrices

The convenience of having \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) written in terms of \kappa(\cdot, \cdot), is that \kappa does not have to represent simply a matrix-matrix or vector-matrix multiplication of \boldsymbol{X} and \boldsymbol{x}_*. In fact, function \kappa can be any function that corresponds to an inner-product after some transformation from \boldsymbol{x}\rightarrow\phi(\boldsymbol{x}), which will necessarily return a positive semi-definite matrix and therefore be a valid covariance matrix. There are several kernels (or kernel functions or kernel shapes) available, each accounting in a different way for the nearness or similarity (at least in Gaussian Processes) between values of x_i:

  • the linear kernel: \kappa(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T\boldsymbol{x}'. This kernel is used when trying to fit a line going through the origin.
  • the polynomial kernel: \kappa(\boldsymbol{x}, \boldsymbol{x}') = (1 + \boldsymbol{x}^T\boldsymbol{x}')^d, where d is the dimension of the polynomial function. This kernel is used when trying to fit a polynomial function (such as the fourth order polynomial in the blog post about Bayesian OLR), and
  • the Radial Basis Function, or RBF (aka Squared Exponential, or SE), \kappa(\boldsymbol{x}, \boldsymbol{x}') = e^{-\frac{||\boldsymbol{x}-\boldsymbol{x}'||^2}{2\sigma_{RBF}^2}}, where \sigma_{RBF} is a kernel parameter denoting the characteristic length scale of the function to be approximated. Using the RBF kernel is equivalent to adding an infinite (!) number of features the data, meaning \phi(\boldsymbol{x})=(1, x_1, ..., x_d, x_1x_2, x_1x_3,..., x_{d-1}x_d, x_1x_2x_3, ...)^T.

But this is all quite algebraic. Luckily, there is a more straight-forward derivation shown next.

Succinct derivation of Gaussian Processes from functional space.

Most regression we study in school is a way of estimating the parameters \boldsymbol{\beta} of a model. In Bayesian OLR, for example, we have a distribution (the parameter posterior) from which to sample model parameters. What if we instead derived a distribution from which to sample functions themselves? The first (and second and third) time I heard about the idea of sampling functions I thought right away of opening a box and from it drawing exponential and sinusoid functional forms. However, what is meant here is a function of an arbitrary shape without functional form. How is this possible? The answer is: instead of sampling a functional form itself, we sample the values of y_*=f(\boldsymbol{x}_*) at discrete points, with a collection of \boldsymbol{y}_i for all \boldsymbol{x}_i in our domain called a function \boldsymbol{f}. After all, don’t we want a function more often than not solely for the purpose of getting point estimates and associated errors for given values of \boldsymbol{x}_*?

In Gaussian Process regression, we sample functions \boldsymbol{f} from a distribution by considering each value of x_i in a discretized range along the x-axis as a random variable. That means that if we discretize our x-axis range into 100 equally spaced points, we will have 100 random variables x_0,..,x_{100}. The mean of all possible functions at \boldsymbol{x}_i therefore represents the mean value of y_i in \boldsymbol{x}_i, and each term in the covariance matrix (kernel matrix, see “Kernel (or Covariance) matrices” section above) represents how similar the values of y_i and y_j will be for each pair \boldsymbol{x}_i and \boldsymbol{x}_j based on the value of a distance metric between  \boldsymbol{x}_i and \boldsymbol{x}_j: if \boldsymbol{x}_i and \boldsymbol{x}_j are close to each other, \boldsymbol{y}_i and \boldsymbol{y}_j tend to be similar and vice-versa. We can therefore turn the sampled functions y = f(\boldsymbol{x}) into whatever functional form we want by choosing the appropriate covariance matrix (kernel) — e.g. a linear kernel will return a linear model, a polynomial kernel will return a polynomial function, and an RBF kernel will return a function with an all linear and an infinite number of non-linear terms. A Gaussian Process can then be written as

f(\boldsymbol{x}) \sim \mathcal{GP}(\mu(\boldsymbol{x}), \kappa(\boldsymbol{x}, \boldsymbol{x}'))

where \mu(\boldsymbol{x}) is the mean function (e.g. a horizontal line along the x-axis at y = 0 would mean that \mu(\boldsymbol{x}) = [0,..,0]^T) and \kappa(\boldsymbol{x}, \boldsymbol{x}') is the covariance matrix, which here expresses how similar the values of y_i will be based on the distance between two values of x_i.

To illustrate this point, assume we want to sample functions over 200 discretizations \boldsymbol{X}_* = x_{*0}, ..., x_{*199} along the x-axis from x=-5 to x=5 (\Delta x = 0.05) using a Gaussian Process. Assume the parameters of our Gaussian Process from which we will sample our functions are mean 0 and that it uses an RBF kernel for its covariance matrix with parameter \sigma_{RBF}=2 (not be confused with standard deviation), meaning that k_{i,i}=1.0 ,\:\forall i \in [0, 199] and k_{i,j}=e^{-\frac{||x_i-x_j||^2}{2\cdot 2^2}},\: \forall i,j \in [0, 199], or

\begin{aligned} p(\boldsymbol{f}_*|\boldsymbol{X_*})&=\mathcal{N}(\boldsymbol{f}|\boldsymbol{\mu},\boldsymbol{K_{**}})\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[\mu_{*0},..,\mu_{*199}], \begin{bmatrix} k_{*0,0} & k_{*0,1} & \cdots & k_{*0,199} \\ k_{*1,0} & k_{*1, 1} & \cdots & k_{*1,199}\\ \vdots & \vdots & \ddots & \vdots \\ k_{*199,0} & k_{*199,1} & \cdots & k_{*199,199} \end{bmatrix} \right)\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[0,..,0], \begin{bmatrix} 1 & 0.9997 & \cdots & 4.2\cdot 10^{-6} \\ 0.9997 & 1 & \cdots & 4.8\cdot 10^{-6}\\ \vdots & \vdots & \ddots & \vdots \\ 4.2\cdot 10^{-6} & 4.8\cdot 10^{-6} & \cdots & 1 \end{bmatrix} \right) \end{aligned}

Each sample from the normal distribution above will return 200 values: \boldsymbol{f}_* = [y_0, ..., y_{199}]. A draw of 3 such sample functions would look generally like the following:


The functions above look incredibly smooth because of the RBF kernel, which adds an infinite number of non-linear features to the data. The shaded region represents the prior distribution of functions. The grey line in the middle represents the mean of an infinite number of function samples, while the upper and lower bounds represent the corresponding prior mean plus or minus standard deviations.

These functions looks quite nice but pretty useless. We are actually interested in regressing our data assuming a prior, namely on a posterior distribution, not on the prior by itself. Another way of formulating this is to say that we are interested only in the functions that go through or close to certain known values of \boldsymbol{x} and y. All we have to do is to add random variables corresponding to our data and condition the resulting distribution on them, which means accounting only for samples of functions that exactly go through (or given) our data points. The resulting distribution would then look like p(\boldsymbol{f}_*|\mathcal{D}, \boldsymbol{X}_*). After adding our data \boldsymbol{X}, \boldsymbol{f}, or \boldsymbol{X}, \boldsymbol{y}, our distribution would look like

\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}K(\boldsymbol{X},\boldsymbol{X}) & K(\boldsymbol{X},\boldsymbol{X}_*) \\ K(\boldsymbol{X}_*,\boldsymbol{X}) & K(\boldsymbol{X}_*,\boldsymbol{X}_*)\end{bmatrix}\right)

In this distribution we have random variables representing the \boldsymbol{X} from our data set \mathcal{D} and the \boldsymbol{X}_* from which we want to get point estimates and corresponding error standard deviations. The covariance matrix above accounts for correlations of all observations with all other observations, and correlations of all observations with all points to be predicted. We now just have to condition the multivariate normal distribution above over the points in our data set, so that the distribution accounts for the infinite number of functions that go through our \boldsymbol{y} at the corresponding \boldsymbol{X}. This yields

p(\boldsymbol{f}_*|\boldsymbol{X}_*,\boldsymbol{X},\boldsymbol{y})=\mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}})


\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T\boldsymbol{K}^{-1}\boldsymbol{K}_*

\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)

where \boldsymbol{K} = K(\boldsymbol{X},\boldsymbol{X})\boldsymbol{K}_* = K(\boldsymbol{X},\boldsymbol{X}_*), \boldsymbol{K}_{**} = K(\boldsymbol{X_*},\boldsymbol{X_*}), and diag\left(\bar{\boldsymbol{\Sigma}}\right) is a vector containing the elements in the diagonal of the covariance matrix \bar{\boldsymbol{\Sigma}} — the variances \bar{\sigma}^2_i of each random variable x_i. If you do not want to use a prior with \boldsymbol{\mu} = \boldsymbol{\mu}_* = [0, ... , 0]^T, the mean of the Gaussian Process would be \bar{\boldsymbol{\mu}}=\boldsymbol{\mu}_* + \boldsymbol{K}_*^T\boldsymbol{K}^{-1}(\boldsymbol{f} - \boldsymbol{\mu}). A plot of \boldsymbol{f}_* = \bar{\boldsymbol{\mu}} estimated for 200 values of x_i — so that the resulting curve looks smooth — plus or minus associated uncertainty (\pm\bar{\boldsymbol{\sigma}}^2), regressed on 5 random data points \boldsymbol{X}, \boldsymbol{y}, should look generally like



A plot of 3 sampled functions (colored lines below, the grey line is the mean) \boldsymbol{f}_* \sim \mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}}) should look generally like



Several kernel have parameters that can strongly influence the regressed model and that can be estimated — see Murphy (2012) for a succinct introduction to kernel parameters estimation. One example is parameter \sigma_{RBF}^2 of the RBF kernel, which determines the correlation length scale of the fitted model and therefore how fast with increasing separation distance between data points the model will default to the prior (here, with \boldsymbol{\mu}=0). The figure below exemplifies this effect


All figures have the same data points but the resulting models look very different. Reasonable values for \sigma_{RBF}^2 depend on the spacing between the data points and are therefore data-specific. Also, if when you saw this plot you thought of fitting RBF functions as interpolators, that was not a coincidence: the equations solved when fitting RBFs to data as interpolators are the same solved when using Gaussian Processes with RBF kernels.

Lastly, in case of noisy observation of y in \latex \mathcal{D}$ the Gaussian Process distribution and the expressions for the mean, covariance, and standard deviation become

\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I} & \boldsymbol{K}_* \\ \boldsymbol{K}_*^T & \boldsymbol{K}_{**}\end{bmatrix}\right)

\bar{\boldsymbol{\mu}}=\boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{y}

\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{K}_*

\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)

where \sigma_{\epsilon}^2 is the error variance and \boldsymbol{I} is an identity matrix. The mean and error predictions would look like the following


Concluding Remarks

What I presented is the very basics of Gaussian Processes, leaving several important topics which were not covered here for the interested reader to look into. Examples are numerically more stable formulations, computationally more efficient formulations, kernel parameter estimation, more kernels, using a linear instead of zero-mean prior (semi-parametric Gaussian Processes), Gaussian Processes for classification and Poisson regression, the connection between Gaussian Processes and other methods, like kriging and kernel methods for nonstationary stochastic processes. Some of these can be found in great detail in Rasmussen and Williams (2006), and Murphy (2012) has a succinct presentation of all of the listed topics.

I was really confused about Gaussian Processes the first time I studied it and tried to make this blog post as accessible as possible. Please leave a comment if you think the post would benefit from any particular clarification.



Murphy, Kevin P., 2012. Machine Learning: A Probabilistic Perspective. The MIT Press, ISBN 978-0262018029

Rasmussen, Carl Edward and Williams, Christopher K. I., 2006. Gaussian Processes for Machine Learning. The MIT Press, ISBN 978-0262182539.

Time series forecasting in Python for beginners

Time series forecasting in Python for beginners

This semester I am teaching Engineering Management Methods here at Cornell University. The course is aimed at introducing engineering students to systems thinking and a variety of tools and analyses they can use to analyze data. The first chapter has been on time series forecasting, where we discussed some of the simpler models one can use and apply for forecasting purposes, including Simple and Weighted Moving Average, Single and Double Exponential Smoothing, Additive and Multiplicative Seasonal Models, and Holt Winter’s Method.

The class applications as well as the homework are primarily performed in Excel, but I have been trying, with limited success, to encourage the use of programming languages for the assignments. One comment I’ve received by a student has been that it takes significantly more time to perform the calculations by coding; they feel that it’s a waste of time. I initially attributed the comment to the fact that the student was new to coding and it takes time in the beginning, but on later reflection I realized that, in fact, the student was probably simply manually repeating the same Excel operations by using code: take a set of 30 observations, create an array to store forecasts, loop through every value and calculate forecast using model formula, calculate error metrics, print results, repeat steps for next set of data. It occurred to me that of course they think it’s a waste of time, because doing it that way completely negates what programming is all about: designing and building an executable program or function to accomplish a specific computing task. In this instance, the task is to forecast using each of the models we learn in class and the advantage of coding comes with the development of some sort of program or function that performs these operations for us, given a set of data as input. Simply going through the steps of performing a set of calculations for a problem using code is not much different than doing so manually or in Excel. What is different (and beneficial) is designing a code so that it can then be effortlessly applied to all similar problems without having to re-perform all calculations. I realize this is obvious to the coding virtuosos frequenting this blog, but it’s not immediately obvious to the uninitiated who are rather confused on why Dr. Hadjimichael is asking them to waste so much time for a meager bonus on the homework.

So this blog post, is aimed at demonstrating to coding beginners how one can transition from one way of thinking to the other, and providing a small time-series-forecasting toolkit for users that simply want to apply the models to their data.

The code and data for this example can be found on my GitHub page and I will discuss it below. I will be using a wine sales dataset that lists Australian wine sales (in kiloliters) from January 1980 to October 1991. The data looks like this:

And this is what the time series looks like:


We first need to import the packages we’ll be using and load the data. I will be using Pandas in this example (but there’s other ways). I’m also defining the number of seasonal periods in a cycle, in this case 12.

import numpy as np #Package we'll use for numerical calculations
import matplotlib.pyplot as plt #From matplotlib package we import pyplot for plots
import pandas #Package to data manipulation
import scipy.optimize #Package we'll use to optimize
plt.style.use('seaborn-colorblind') #This is a pyplot style (optional)

'''Load the data into a pandas series with the name wine_sales'''
time_series = pandas.Series.from_csv("wine_sales.csv", header=0)

P=12 #number of seasonal periods in a cycle
In class, I’ve always mentioned that one should use a training and a validation set for model development, primarily to avoid overfitting our model to the specific training set. In this example, the functions are written as they apply to the training set. Should you choose to apply the functions listed here, you should apply the functions for the training set, extract forecasts and then use those to initialize your validation period. To divide the observations, you would do something like this:
training = time_series[0:108] # Up to December '88
validation = time_series[108:] # From January '89 until end

Now, if say, we wanted to apply the Naive model of the next steps forecast being equal to the current observation, i.e., \hat{y}_{t+1}=y_t, we’d do something like:

y_hat=pandas.Series().reindex_like(time_series) # Create an array to store forecasts
y_hat[0]= time_series[0] # Initialize forecasting array with first observation
''' Loop through every month using the model to forecast y_hat'''
for t in range(len(y_hat)-1): # Set a range for the index to loop through
    y_hat[t+1]= time_series[t] # Apply model to forecast time i+1

Now if we’d like to use this for any time series, so we don’t have to perform our calculations every time, we need to reformat this a bit so it’s a function:

def naive(time_series):
    y_hat[0]= time_series[0] # Initialize forecasting array with first observation
    ''' Loop through every month using the model to forecast y'''
    #This sets a range for the index to loop through
    for t in range(len(y_hat)-1): 
        y_hat[t+1]= time_series[t] # Apply model to forecast time i+1
    return y_hat

Now we can just call define this function at the top of our code and just call it with any time series as an input. The function as I’ve defined it returns a pandas.Series with all our forecasts. We can then do the same for all the other modeling methods (below). Some things to note:

  • The data we read in the top, outside the functions, as well as any parameters defined (P in this case) are global variables and do not need to be defined as an input to the function. The functions below only need a list of parameter values as inputs.
  • For the models with seasonality and/or trend we need to create separate series to store those estimates for E, S, and T.
  • Each model has its own initialization formulas and if we wanted to apply them to the validation set that follows our training set, we’d need to initialize with the last values of our training.
Using this model, y_hat(t+1)=(y(t)+y(t-1)...+y(t-k+1))/k (i.e., the predicted 
next value is equal to the average of the last k observed values).'''
def SMA(params):
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(k-1,len(y_hat)-1): #This sets a range for the index to loop through
        y_hat[t+1]= np.sum(time_series[t-k+1:t+1])/k # Apply model to forecast time i+1
    return y_hat

Using this model, y_hat(t+1)=w(1)*y(t)+w(2)*y(t-1)...+w(k)*y(t-k+1) (i.e., the 
predicted next value is equal to the weighted average of the last k observed 
def WMA(params):
    weights = np.array(params)
    y_hat[0:k]=time_series[0:k] # Initialize values
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(k-1,len(y_hat)-1): #This sets a range for the index to loop through
        y_hat[t+1]= np.sum(time_series[t-k+1:t+1].multiply(weights)) # Apply model to forecast time i+1
    return y_hat
'''This model includes the constraint that all our weights should sum to one. 
To include this in our optimization later, we need to define it as a function of our 
def WMAcon(params):
    weights = np.array(params)
    return np.sum(weights)-1

Using this model, y_hat(t+1)=y_hat(t)+a*(y(t)-y_hat(t))(i.e., the 
predicted next value is equal to the weighted average of the last forecasted value and its
difference from the observed).'''
def SES(params):
    a = np.array(params)
    y_hat[0]=time_series[0] # Initialize values
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(len(y_hat)-1): #This sets a range for the index to loop through
        y_hat[t+1]=  y_hat[t]+a*(time_series[t]-y_hat[t])# Apply model to forecast time i+1
    return y_hat

Using this model, y_hat(t+1)=E(t)+T(t) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
def DES(params):
    a,b = np.array(params)
    '''We need to create series to store our E and T values.'''
    E = pandas.Series().reindex_like(time_series)
    T = pandas.Series().reindex_like(time_series)
    y_hat[0]=E[0]=time_series[0] # Initialize values
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*time_series[t]+(1-a)*(E[t]+T[t])
        T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
        y_hat[t+1] = E[t] + T[t] # Apply model to forecast time i+1
    return y_hat

Using this model, y_hat(t+1)=E(t)+S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
appropriate seasonal factor). We first need to create an array to store our 
forecast values.'''
def ASM(params):
    a,b = np.array(params)
    p = P
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    E[:p] = np.sum(time_series[:p])/p
    S[:p] = time_series[:p]-E[:p]
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]-S[t+1-p])+(1-a)*E[t]
        S[t+1] = b*(time_series[t]-E[t])+(1-b)*S[t+1-p]
        y_hat[t+1] = E[t] + S[t+1-p] # Apply model to forecast time i+1
    return y_hat

Using this model, y_hat(t+1)=E(t)*S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series times 
the appropriate seasonal factor). We first need to create an array to store our 
forecast values.'''
def MSM(params):
    a,b = np.array(params)
    p = P
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    E[:p] = np.sum(time_series[:p])/p
    S[:p] = time_series[:p]/E[:p]
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]/S[t+1-p])+(1-a)*E[t]
        S[t+1] = b*(time_series[t]/E[t])+(1-b)*S[t+1-p]
        y_hat[t+1] = E[t]*S[t+1-p] # Apply model to forecast time i+1
    return y_hat

Using this model, y_hat(t+1)=(E(t)+T(t))*S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
trend, times the appropriate seasonal factor). We first need to create an array 
to store our forecast values.'''
def AHW(params):
    a, b, g = np.array(params)
    p = P
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    T = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    E[:p] = np.sum(time_series[:p])/p
    S[:p] = time_series[:p]-E[:p]
    T[:p] = 0
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]-S[t+1-p])+(1-a)*(E[t]+T[t])
        T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
        S[t+1] = g*(time_series[t]-E[t])+(1-g)*S[t+1-p]
        y_hat[t+1] = E[t]+T[t]+S[t+1-p] # Apply model to forecast time i+1
    return y_hat

Using this model, y_hat(t+1)=(E(t)+T(t))*S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
trend, times the appropriate seasonal factor). We first need to create an array 
to store our forecast values.'''
def MHW(params):
    a, b, g = np.array(params)
    p = P
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    T = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    S[:p] = time_series[:p]/(np.sum(time_series[:p])/p)
    E[:p] = time_series[:p]/S[:p]
    T[:p] = 0
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]/S[t+1-p])+(1-a)*(E[t]+T[t])
        T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
        S[t+1] = g*(time_series[t]/E[t])+(1-g)*S[t+1-p]
        y_hat[t+1] = (E[t]+T[t])*S[t+1-p] # Apply model to forecast time i+1
    return y_hat

Having defined this, I can then, for example, call the Multiplicative Holt Winters method by simply typing:


This will produce a forecast using the Multiplicative Holt Winters method with those default parameters, but we would like to calibrate them to get the “best” forecasts from our model. To do so, we need to define what we mean by “best”, and in this example I’m choosing to use Mean Square Error as my performance metric. I define it below as a function that receives the parameters and some additional arguments as inputs. I only need to set it up this way because my optimization function is trying to minimize the MSE function by use of those parameters. I’m using the “args” array to simply tell the function which model it’s using to forecast.

def MSE(params, args):
    model, = args
    t_error = np.zeros(len(time_series))
    forecast = model(params)
    for t in range(len(time_series)):
        t_error[t] = time_series[t]-forecast[t]
    MSE = np.mean(np.square(t_error))
    return MSE

To perform the optimization in Excel, we’d use Solver, but in Python we have other options. SciPy is a Python package that allows us, among many other things, to optimize such single-objective problems. What I’m doing here is that I define a list of all the models I want to optimize, their default parameters, and the parameters’ bounds. I then use a loop to go through my list of models and run the optimization. To store the minimized MSE values as well as the parameter values that produce them, we can create an array to store the MSEs and a list to store the parameter values for each model. The optimization function produces a “dictionary” item that contains the minimized MSE value (under ‘fun’), the parameters that produce it (under ‘x’) and other information.

''' List of all the models we will be optimizing'''
models = [SES, DES, ASM, MSM, AHW, MHW]
''' This is a list of all the default parameters for the models we will be 
optimizing. '''
                      #SES,  DES,     ASM
default_parameters = [[0.5],[0.5,0.5],[0.5,0.5],
                      #MSM,        AHW,            MHW
''' This is a list of all the bounds for the default parameters we will be 
optimizing. All the a,b,g's are weights between 0 and 1. '''
bounds = [[(0,1)],[(0,1)]*2, [(0,1)]*2,
min_MSEs = np.zeros(len(models)) # Array to store minimized MSEs
opt_params = [None]*len(models) # Empty list to store optim. parameters
for i in range(len(models)):
    res = scipy.optimize.minimize(MSE, # Function we're minimizing (MSE in this case)
                                  default_parameters[i], # Default parameters to use
                                  # Additional arguments that the optimizer 
                                  # won't be changing (model in this case)
                                  method='L-BFGS-B', # Optimization method to use
                                  bounds=bounds[i]) # Parameter bounds
    min_MSEs[i] = res['fun'] #Store minimized MSE value
    opt_params[i] = res['x'] #Store parameter values identified by optimizer

Note: For the WMA model, the weights should sum to 1 and this should be input to our optimization as a constraint. To do so, we need to define the constraint function as a dictionary and include the following in our minimization call: constraints=[{‘type’:’eq’,’fun’: WMAcon}]. The number of periods to consider cannot be optimized by this type of optimizer.

Finally, we’d like to present our results. I’ll do so by plotting the observations and all my models as well as their minimized MSE values:

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1) # Create figure
ax.set_title("Australian wine sales (kilolitres)") # Set figure title
l1 = ax.plot(time_series, color='black', linewidth=3.0, label='Observations') # Plot observations
for i in range(len(models)):
    ax.plot(time_series.index,models[i](opt_params[i]), label = models[i].__name__)
ax.legend() # Activate figure legend
print('The estimated MSEs for all the models are:')
for i in range(len(models)):
    print(models[i].__name__ +': '+str(min_MSEs[i]))

This snippet of code should produce this figure of all our forecasts, as well as a report of all MSEs:Figure_1

The estimated MSEs for all the models are:
SES: 133348.78
DES: 245436.67
ASM: 80684.00
MSM: 64084.48
AHW: 72422.34
MHW: 64031.19

The Multiplicative Holt Winters method appears to give the smallest MSE when applied to these data.

From Frequentist to Bayesian Ordinary Linear Regression

In this post, I will briefly review the basic theory about Ordinary Linear Regression (OLR) using the frequentist approach and from it introduce the Bayesian approach. This will lay the terrain for a later post about Gaussian Processes. The reader may ask himself why “Ordinary Linear Regression” instead of “Ordinary Least Squares.” The answer is that least squares refers to the objective function to be minimized, the sum of the square errors, which is not used the Bayesian approach presented here. I want to thank Jared Smith and Jon Lamontagne for reviewing and improving this post. This post closely follows the presentation in Murphy (2012), which for some reason does not use “^” to denote estimated quantities (if you know why, please leave a comment).

Ordinary Linear Regression

OLR is used to fit a model that is linear in the parameters to a data set \mathcal{D}=\{\boldsymbol{X},\boldsymbol{y}\}, where \boldsymbol{X} represents the independent variables and \boldsymbol{y} the dependent variable, and has the form

p(y|\boldsymbol{x}, \boldsymbol{\beta}) = \mathcal{N}(y|\boldsymbol{\beta}^T\boldsymbol{x}, \sigma_{\epsilon}^2)

where \boldsymbol{\beta} is a vector of model parameters and \sigma_{\epsilon}^2 is the model error variance. The unusual notation for the normal distribution \mathcal{N} should be read as “a normal distribution of \boldsymbol{y} with (or given) mean \boldsymbol{\beta}^T\boldsymbol{x} and variance \sigma_{\epsilon}^2.” One frequentist method of estimating the parameters of a linear regression model is the method of maximum likelihood estimation (MLE). MLE provides point estimates \boldsymbol{\hat{\beta}} for each of the regression model parameters \boldsymbol{\beta} by maximizing the likelihood function with respect to \boldsymbol{\beta} and \sigma_{\epsilon}^2 by solving the maximization problem

\boldsymbol{\hat{\beta}} = arg\,\underset{\boldsymbol{\beta}}{max}\;\mathcal{L}(\boldsymbol{\beta})

where the likelihood function \mathcal{L}(\boldsymbol{\beta}) is defined as

\mathcal{L}(\boldsymbol{\beta})=p(\mathcal{D}|\boldsymbol{\beta})={\displaystyle \prod_{i=1}^{N} p(y_i|\boldsymbol{x}_i, \boldsymbol{\beta}, \mu, \sigma_{\epsilon}^2)} = \mathcal{N}(\boldsymbol{y}|\mu + \boldsymbol{X}\boldsymbol{\beta},\sigma_{\epsilon}^2\boldsymbol{I})

where \mu is the mean of the model error (normally \mu=0), N is the number of points in the data set, \boldsymbol{I} is an identity matrix, and \boldsymbol{x} is a vector of values of the independent variables for an individual data point of matrix \boldsymbol{X}. OLR assumes independence and the same model error variance among observations, which is expressed by the covariance matrix \sigma_{\epsilon}^2\boldsymbol{I}. Digression: in Generalized Linear Regression (GLR), observations may not be independent and may have different variances (heteroscedasticity), so the covariance matrix may have off-diagonal terms and the diagonal terms may not be equal.

If a non-linear function shape is sought, linear regression can be used to fit a model linear in the parameters over a function \phi(\cdot) of the data — I am not sure why the machine learning community chose \phi for this function, but do not confuse it with a normal distribution. This procedure is called basis function expansion and has a likelihood function of the form

\mathcal{L}(\boldsymbol{\beta}) = \mathcal{N}(\boldsymbol{y}|\mu + \phi(\boldsymbol{X})\boldsymbol{\beta},\sigma_{\epsilon}^2\boldsymbol{I})

where \phi(\boldsymbol{x}) can have, for example, the form

\phi(\boldsymbol{x})=\begin{pmatrix}1\\ x_1\\ x_1^2\end{pmatrix}

for fitting a parabola to a one-dimensional \boldsymbol{X} over \boldsymbol{y}. When minimizing the squared residuals we get to the famous Ordinary Least Squares regression. Linear regression can be further developed, for example, into Ridge Regression to better handle multicollinearity by introducing bias to the parameter estimates, and into Kernel Ridge regression to implicitly add non-linear terms to the model. These formulations are beyond the scope of this post.

What is important to notice is that the standard approaches for linear regression described here, although able to fit linear and non-linear functions, do not provide much insight into the model errors. That is when Bayesian methods come to play.

Bayesian Ordinary Linear Regression

In Bayesian Linear regression (and in any Bayesian approach), the parameters \boldsymbol{\beta} are treated themselves as random variables. This allows for the consideration of model uncertainty, given that now instead of having the best point estimates for \boldsymbol{\beta} we have their full distributions from which to sample models — a sample of  \boldsymbol{\beta} corresponds to a model. The distribution of parameters \boldsymbol{\beta}, P(\boldsymbol{\beta}|\mathcal{D}), called the parameter posterior distribution, is calculated by multiplying the likelihood function used in MLE by a prior distribution for the parameters. A prior distribution is assumed from knowledge prior to analyzing new data. In Bayesian Linear Regression, a Gaussian distribution is commonly assumed for the parameters. For example, the prior on \boldsymbol{\beta} can be \mathcal{N}(\boldsymbol{\beta_0},\boldsymbol{V}_0) for algebraic simplicity. The parameter posterior distribution assuming a known \sigma_{\epsilon}^2 then has the form

p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2) =\frac{p(\mathcal{D}|\boldsymbol{\beta}, \sigma_{\epsilon}^2)p(\boldsymbol{\beta})}{p(\mathcal{D})} \propto p(\mathcal{D}|\boldsymbol{\beta}, \sigma_{\epsilon}^2)p(\boldsymbol{\beta})  ,        given p(\mathcal{D}) is a constant depending only on the data.

We now have an expression from which to derive our parameter posterior for our linear model from which to sample \boldsymbol{\beta}. If the likelihood and the prior are Gaussian, the parameter posterior will also be a Gaussian, given by

p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2) \propto \mathcal{N}(\boldsymbol{\beta}|\boldsymbol{\beta}_0, \boldsymbol{V}_0)\mathcal{N}(\boldsymbol{y}|\boldsymbol{X\beta},\sigma_{\epsilon}^2\boldsymbol{I})=\mathcal{N}(\boldsymbol{\beta}|\boldsymbol{\beta}_N,\boldsymbol{V}_N)
\boldsymbol{V}_N = \sigma_{\epsilon}^2(\sigma_{\epsilon}^2\boldsymbol{V}_0^{-1}+\boldsymbol{X}\boldsymbol{X}^T)^{-1}
\boldsymbol{\beta}_N=\boldsymbol{V}_N\boldsymbol{V}_0^{-1}\boldsymbol{\beta}_0 + \frac{1}{\sigma_{\epsilon}^2}\boldsymbol{V}_N\boldsymbol{X}\boldsymbol{y}

If we calculate the parameter posterior (distribution over parameters \boldsymbol{\beta}) for a simple linear model f(x) = \beta_0 + \beta_1x for a data set \mathcal{D} in which \boldsymbol{X} and \boldsymbol{y} are approximately linearly related given some noise \sigma_{\epsilon}^2, we can use it to sample values for \boldsymbol{\beta}. This is equivalent to sampling linear models f(x) for data set \mathcal{D}. As the number of data points increase, the variability of the sampled models should decrease, as in the figure below


This is all interesting but the parameter posterior per se is not of much use. We can, however, use the parameter posterior to find the posterior predictive distribution, which can be used to both get point estimates of y and the associated error. This is done by multiplying the likelihood by the parameter posterior and marginalizing the result over \boldsymbol{\beta}. This is equivalent to performing infinite sampling of blue lines in the example before to form density functions around a point estimate of \mu_{y_*}=f(\boldsymbol{x}_*), with (\boldsymbol{x}_*,y_*) denoting a new point that is not in \mathcal{D}. If the likelihood and the parameter posterior are Gaussian, the posterior predictive then takes the form below and will also be Gaussian (in Bayesian parlance, this means that the Gaussian distribution is conjugate to itself)!

\begin{aligned}p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2) &= \int_{\boldsymbol{\beta}}\underbrace{p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2,\boldsymbol{\beta})}_\text{Likelihood}\underbrace{p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2)}_\text{\parbox{1.6cm}{Parameter Posterior}}d\boldsymbol{\beta}\\  &=\mathcal{N}(y_*|\boldsymbol{\beta}_N^T\boldsymbol{x}_*, \sigma_N^2(\boldsymbol{x}_*))\end{aligned}


\sigma_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{x}_*^{T}\boldsymbol{V}_N\boldsymbol{x}_*

or, to put it simply,

y_* = f(\boldsymbol{x}_*) = \boldsymbol{\beta}_N^T\boldsymbol{x}_* \pm \sigma_N^2(\boldsymbol{x}_*)

The posterior predictive, meaning final linear model and associated errors (the parameter uncertainty equivalent of frequentist statistics confidence intervals) are shown below


If, instead of having a data set \mathcal{D} in which \boldsymbol{X} and \boldsymbol{y} are related approximately according to a 4th order polynomial, we use a \phi(\cdot) function to artificially create more random variables (or features, in machine learning parlance) corresponding to the non-linear terms of the polynomial function. Our function \phi(\cdot) would be \phi(x) = [1, x, x^2, x^3, x^4]^T and the resulting X would therefore have five columns instead of two ([1, x]^T), so now the task is to find \boldsymbol{\beta}=[\beta_0, \beta_1, \beta_2, \beta_3, \beta_4]^T. Following the same logic as before for X'=\phi(X) and x_*'=\phi(x_*), where prime denotes the new set random variable x from function \phi(\cdot), we get the following plots





This looks great, but there is a problem: what if we do not know the functional form of the model we are supposed to fit (e.g. a simple linear function or a 4th order polynomial)? This is often the case, such as when modeling the reliability of a water reservoir system contingent on stored volumes, inflows and evaporation rates, or when modeling topography based on samples surveyed points (we do not have detailed elevation information about the terrain, e.g. a DEM file). Gaussian Processes (GPs) provide a way of going around this difficulty.


Murphy, Kevin P., 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Intro to Machine Learning Part 2: Binary Classification With Perceptron

My last post on machine learning basics ended with a discussion of the curse of dimensionality when using K-Nearest Neighbors (K-NN) for classification. To summarize, curse of dimensionality states that as the number of features a point has increases, the more difficult it becomes to find other points that are “close” to that point. This makes classification through K-NN nearly impossible when the number of features is large: if all points are distant from each other, how can you define neighbors? While techniques such as dimensional reduction may help improve the effectiveness of K-NN for high dimensional data, other machine learning algorithms do not suffer from this curse and can allow us to perform classification regardless of dimensionality. One of the first examples of such methods was the Perceptron algorithm, invented right here at Cornell by Frank Rosenblatt in 1957.

Some history

Perceptron is a single layer neural network (though the idea of layered neural networks was not invented until many years after Perceptron). Though the algorithm could run on computers of the day, the first implementation of Perceptron was on a custom built physical machine, the “Mark 1 Perceptron” and used for image recognition. The algorithm was lauded by the press at the time as the beginning of a revolution in artificial intelligence. The New York Times wrote that: “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence” [1].  Obviously, they got a bit ahead of themselves… The success and popularity of Perceptron was initial a boon for the field of machine learning until a seminal text by Marvin Minsky and Seymour Papert demonstrated that Perceptron is incapable of classifying problems that are not linearly separable (more on this later). This led to a withdrawal of funding for AI research in a period termed the first AI winter (1974-1980).

So what is Perceptron?

While all points in a high dimensional space are distant from each other, it turns out that these same points are all close to any hyperplane drawn within this space. For a nice explanation of this concept, see the excellent course notes from CEE 4780, here (discussion starts at the “distance to hyperplane” section about 3/4 of the way down).

The Perceptron algorithm is simply a method for finding a hyperplane that bisects two classes of data (labeled positive or negative).

H(x) = dot\{x: (x,w) +b=0\}

Where w  is an orthogonal vector that defines the orientation of hyperplane H, as shown in Figure 1. The sign of the inner product of w and each point, xi, can then be used to classify xi:

h(x_i) = sign(w^Tx+b)

y_i(w^Tx_i) > 0,\: \: y \in (-1,+1)      if xis classified correctly.



Figure 1: A hyperplane, H, that divides positive points (blue dots) and negative points (red triangles). The hyperplane’s orientation is defined by an orthogonal vector, w, which can be thought of as a vector of weights across d dimensions (in this case 3).

The Perceptron algorithm

To find the proper for data set x, Perceptron uses the following steps:

  1. Start with an initial guess of w = 0
  2. Check to see if there are any misclassified points (ie. is yi(wTxi) > 0 for all xi?)
  3. If any point is found is to be misclassified, it is added to w, creating a new hyperplane
  4. With the updated w, loop through all points again checking for misclassification and update if necessary
  5. Continue until there are no misclassfied points remaining (ie the algorithm has converged)

It can be formally proven that if the data is linearly separable, Perceptron will converge in a finite number of updates.

I should note that while this discussion of perceptron has used geometric intuition to explain the algorithm (which is how professor Weinberger teaches it in his 4780 class and how I find it easiest to understand), many people conceptualize Perceptron as a step function and w as the weights for each input to this function. While this results in exactly the same thing as the above description, if you think this may be an easier way to understand, see this link.

Problems with Perceptron

While Perceptron does not suffer from the curse of dimensionality as K-NN does, it is still highly limited in its capabilities as it will never converge on data that is not linearly separable (such as the famous XOR problem demonstrated by Minsky and Papert). Luckily for us many more techniques have been developed to handle such data sets since the late 60’s when their book was written! Stay tuned for future posts for more machine learning topics.


1: “New Navy Device Learns By Doing” the New York Times, 7/8/1958 (These seem like fairly grandiose claims for such a small column, just saying)