A visual introduction to data compression through Principle Component Analysis

Principle Component Analysis (PCA) is a powerful tool that can be used to create parsimonious representations of a multivariate data set. In this post I’ll code up an example from Dan Wilks’ book Statistical Methods in the Atmospheric Sciences to visually illustrate the PCA process. All code can be found at the bottom of this post.

As with many of the examples in Dr. Wilks’ excellent textbook, we’ll be looking at minimum temperature data from Ithaca and Canandaigua, New York  (if anyone is interested, here is the distance between the two towns).  Figure 1 is a scatter plot of the minimum temperature anomalies at each location for the month of January 1987.

raw_data

Figure 1: Minimum temperature anomalies in Ithaca and Canandaigua, New York in January 1987

As you can observe from Figure 1, the two data sets are highly correlated, in fact, they have a Pearson correlation coefficient of 0.924. Through PCA, we can identify the primary mode of variability within this data set (its largest “principle component”) and use it to create a single variable which describes the majority of variation in both locations. Let x define the matrix of our minimum temperature anomalies in both locations. The eigenvectors (E) of the covariance matrix of x describe the primary modes variability within the data set. PCA uses these eigenvectors to  create a new matrix, u,  whose columns contain the principle components of the variability in x.

u = xE

Each element in u is a linear combination of the original data, with eigenvectors in E serving as a kind of weighting for each data point. The first column of u corresponds to the eigenvector associated with the largest eigenvalue of the covariance matrix. Each successive column of u represents a different level of variability within the data set, with u1 describing the direction of highest variability, u2 describing the direction of the second highest variability and so on and so forth. The transformation resulting from PCA can be visualized as a rotation of the coordinate system (or change of basis) for the data set, this rotation is shown in Figure 2.

PCA visualization

Figure 2: Geometric interpretation of PCA

As can be observed in Figure 2, each data point can now be described by its location along the newly rotated axes which correspond to its corresponding value in the newly created matrix u. The point (16, 17.8), highlighted in Figure 2, can now be described as (23, 6.6) meaning that it is 23 units away from the origin in the direction of highest variability and 6.6 in the direction of second highest variability. As shown in Figure 2, the question of “how different from the mean” each data point is can mostly be answered by looking at its  corresponding u1 value.

Once transformed, the original data can be recovered through a process known as synthesis. Synthesis  can be thought of as PCA in reverse. The elements in the original data set x, can be approximated using the eigenvalues of the covariance matrix and the first principle component, u1.

\tilde{x} = \tilde{u}\tilde{E}^T

Where:

\tilde{x}\hspace{.1cm} is\hspace{.1cm} the\hspace{.1cm} reconstructed\hspace{.1cm} data\hspace{.1cm} set

\tilde{u}\hspace{.1cm} is\hspace{.1cm} the\hspace{.1cm} PCs\hspace{.1cm} used \hspace{.1cm} for \hspace{.1cm} reconstruction\hspace{.1cm} (in\hspace{.1cm} our\hspace{.1cm} case\hspace{.1cm} the\hspace{.1cm} first\hspace{.1cm} column)

\tilde{E}\hspace{.1cm} is \hspace{.1cm} the \hspace{.1cm} eigenvector\hspace{.1cm} of \hspace{.1cm} the \hspace{.1cm} PCs \hspace{.1cm} used

For our data set, these reconstructions seem to work quite well, as can be observed in Figure 3.

final_synth.png

 

 

Data compression through PCA can be a useful alternative tolerant methods for dealing with multicollinearity, which I discussed in my previous post. Rather than running a constrained regression, one can simply compress the data set to eliminate sources of multicollinearity. PCA can also be a helpful tool for identifying patterns within your data set or simply creating more parsimonious representations of a complex set of data. Matlab code used to create the above plots can be found below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Ithaca_Canandagua_PCA
% By: D. Gold
% Created: 3/20/17
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% This script will perform Principle Component analysis on minimum
% temperature data from Ithaca and Canadaigua in January, 1987 provided in 
% Appendix A of Wilks (2011). It will then estimate minimum temperature
% values of both locations using the first principle component.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% create data sets
clear all

% data from appendix Wilks (2011) Appendix A.1
Ith = [19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0,... 
 2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23]';

Can = [28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23,... 
 13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23]';

%% center the data, plot temperature anomalies, calculate covariance and eigs

% center the data
x(:,1) = Ith - mean(Ith);
x(:,2) = Can - mean(Can);

% plot anomalies
figure
scatter(x(:,1),x(:,2),'Filled')
xlabel('Ithaca min temp anomaly ({\circ}F)')
ylabel('Canandagua min temp anomaly ({\circ}F)')

% calculate covariance matrix and it's corresponding Eigenvalues & vectors
S = cov(x(:,1),x(:,2));
[E, Lambda]=eigs(S);

% Identify maximum eigenvalue, it's column will be the first eigenvector
max_lambda = find(max(Lambda)); % index of maximum eigenvalue in Lambda
idx = max_lambda(2); % column of max eigenvalue

%% PCA
U = x*E(:,idx);

%% synthesis
x_syn = E(:,idx)*U'; % reconstructed values of x

% plot the reconstructed values against the original data
figure
subplot(2,1,1)
plot(1:31,x(:,1) ,1:31, x_syn(1,:),'--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Ithaca min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')
subplot(2,1,2)
plot(1:31, x(:,2),1:31, x_syn(2,:)','--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Canadaigua min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')

 

Sources:

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences. Amsterdam: Elsevier Academic Press.

Root finding in MATLAB, R, Python and C++

In dynamical systems, we are often interested in finding stable points, or equilibria. Some systems have multiple equilibria. As an example, take the lake problem, which is modeled by the equation below where Xt is the lake P concentration, at are the anthropogenic P inputs, Yt~LN(μ,σ2)  are random natural P inputs, b is the P loss rate, and q is a shape parameter controlling the rate of P recycling from the sediment. The first three terms on the right hand side make up the “Inputs” in the figure, while the last term represents the “Outputs.” A lake is in equilibrium when the inputs are equal to the outputs and the lake P concentration therefore is not changing over time.

lakeModel

For irreversible lakes this occurs at three locations, even in the absence of anthropogenic and natural inputs: an oligotrophic equilibrium, an unstable equilibrium (called the critical P threshold) and a eutrophic equilibrium (see figure below).

PcritDiagram

The unstable equilibrium in this case is called the critical P threshold because once it is crossed, it is impossible to return to an oligotrophic equilibrium by reducing anthropogenic and natural P inputs alone. In irreversible lakes like this, we would therefore like to keep the lake P concentration below the critical P threshold. How do we find the critical P threshold? With a root finding algorithm!

As stated earlier, the system above will be in equilibrium when the inputs are equal to the outputs and the P concentration is not changing over time, i.e. when

X_{t+1} - X_t = \frac{X^q_t}{1+X^q_t} - bX_t = 0

Therefore we simply need to find the zero, or “root” of the above equation.  Most of the methods for this require either an initial estimate or upper and lower bounds on the location of the root. These are important, since an irreversible lake will have three roots. If we are only interested in the critical P threshold, we have to make sure that we provide an estimate which leads to the unstable equilibrium, not either of the stable equilibria. If possible, you should plot the function whose root you are finding to make sure you are giving a good initial estimate or bounds, and check afterward to ensure the root that was found is the one you want! Here are several examples of root-finding methods in different programming languages.

In MATLAB, roots can be found with the function fzero(fun,x0) where ‘fun’ is the function whose root you want to find, and x0 is an initial estimate. This function uses Brent’s method, which combines several root-finding methods: bisection, secant, and inverse quadratic interpolation. Below is an example using the lake problem.

myfun = @(x,b,q) x^q/(1+x^q)-b*x;
b = 0.42;
q = 2.0;
fun = @(x) myfun(x,b,q);
pcrit = fzero(fun,0.75);

This returns pcrit = 0.5445, which is correct. If we had provided an initial estimate of 0.25 instead of 0.75, we would get pcrit = 2.6617E-19, basically 0, which is the oligotrophic equilibrium in the absence of anthropogenic and natural P inputs. If we had used 1.5 as an initial estimate, we would get pcrit = 1.8364, the eutrophic equilibrium.

MatlabScreenShot

In R, roots can be found with the function uniroot, which also uses Brent’s method. Dave uses this on line 10 of the function lake.eval in his OpenMORDM example. Instead of taking in an initial estimate of the root, this function takes in a lower and upper bound. This is safer, as you at least know that the root estimate will lie within these bounds. Providing an initial estimate that is close to the true value should do well, but is less predictable; the root finding algorithm may head in the opposite direction from what is desired.

b <- 0.42
q <- 2.0
pcrit <- uniroot(function(x) x^q/(1+x^q) - b*x, c(0.01, 1.5))$root

This returns pcrit = 0.5445145. Good, we got the same answer as we did with MATLAB! If we had used bounds of c(0.75, 2.0) we would have gotten 1.836426, the eutrophic equilibrium.

What if we had given bounds that included both of these equilibria, say c(0.5, 2.0)? In that case, R returns an error: ‘f() values at end points not of opposite sign’. That is, if the value returned by f(x) is greater than 0 for the lower bound, it must be less than 0 for the upper bound and vice versa. In this case both f(0.5) and f(2.0) are greater than 0, so the algorithm fails. What if we gave bounds for which one is greater than 0 and another less, but within which there are multiple roots, say c(-0.5,2.0)? Then R just reports the first one it finds, in this case pcrit = 0.836437, the eutrophic equilibrium. So it’s important to make sure you pick narrow enough bounds that include the root you want, but not roots you don’t!

RscreenShot

In Python, you can use either scipy.optimize.root or scipy.optimize.brentq, which is what Jon uses on line 14 here. scipy.optimize.root can be used with several different algorithms, but the default is Powell’s hybrid method, also called Powell’s dogleg method. This function only requires an initial estimate of the root.

from scipy.optimize import root
b = 0.42
q = 2.0
pcrit = root(lambda x: x**(1+x**q) - b*x, 0.75)

scipy.optimize.root returns an object with several attributes. The attribute of interest to us is the root, represented by x, so we want pcrit.x. In this case, we get the correct value of 0.54454. You can play around with initial estimates to see how pcrit.x changes.

PythonScreenShot1

Not surprisingly, scipy.optimize.brentq uses Brent’s method and requires bounds as an input.

from scipy.optimize import brentq as root
b = 0.42
q = 2.0
pcrit = root(lambda x: x**(1+x**q) - b*x, 0.01, 1.5)

This just returns the root itself, pcrit = 0.5445. Again, you can play around with the bounds to see how this estimate changes.

PythonScreenShot2

In C++, Dave again shows how this can be done in the function ‘main-lake.cpp’ provided in the Supplementary Material to OpenMORDM linked from this page under the “Publications” section. On lines 165-168 he uses the bisect tool to find the root of the function given on lines 112-114. I’ve copied the relevant sections of his code into the function ‘find_Pcrit.cpp’ below.


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>
#include <boost/math/tools/roots.hpp>

namespace tools = boost::math::tools;
using namespace std;

double b, q, pcrit;

double root_function(double x) {
  return pow(x,q)/(1+pow(x,q)) - b*x;
}

bool root_termination(double min, double max) {
  return abs(max - min) <= 0.000001;
}

int main(int argc, char* argv[])
{
  b = 0.42;
  q = 2.0;

  std::pair<double, double> result = tools::bisect(root_function, 0.01, 1.0, root_termination);
  pcrit = (result.first + result.second)/2;
  cout << pcrit << endl;
}

This yields the desired root of pcrit = 0.54454, but of course, changing the bounds may result in different estimates. In case you missed it, the take home message is to be careful about your initial estimate and bounds ;).

CPPscreenShot

 

Power Spectral Density Estimation in Matlab

Power Spectral Density Estimation in Matlab

Hi all,

This post is about power spectral density estimation (PSDE) using Matlab.  PSDE is often used in signal processing and fluid mechanics, so you may or may not be familiar with it.  The idea is to determine the frequency content of sampled time series data.  This means that we can determine the amount of the variance in the time series that is contained at different frequencies (or periodicities).  Let’s take a look at a simple example.

Suppose we are sampling a signal (in this case a sign wave) for about 3 minutes with a frequency of 100 hertz, meaning we take 100 measurements every second.  Let’s generate our signal and plot our observations using Matlab.

x = (0:0.01:60*pi);
y = sin(x);

figure(1)
plot(x,sin(x))
title('Signal with no Noise')
xlabel('Seconds')
ylabel('Observed Signal')
ylim([-1.5,1.5])
untitled

Sine wave, with no noise, sampled at 100 Hertz

From observing the data directly it’s pretty clear what the frequency and period of the data are here (1/(2*pi) and pi respectively): no need for PSDE.  However, real data is rarely so well behaved.  To simulate this, let’s add random noise to our original signal.  To make it interesting, let’s add a lot of noise: normal distributed noise with standard deviation 10 times larger than the magnitude of the signal!

for i = 1:max(size(x))
 y(i) = y(i)+10*norminv(rand);
end

figure(2)
plot(x,y)
title('Signal with Noise')
xlabel('Seconds')
ylabel('Observed Signal')
Sine wave, with random normal noise, sampled at 100 Hertz

Sine wave, with random normal noise, sampled at 100 Hertz

Now the noise completely washes out the signal, and it is not at all clear what the period or frequency of the underlying data are.  So let’s compute and plot the power spectrum of the data.  I leave it to the reader to review all of the math underlying computing the spectrum, but I’ll provide a very brief description of what we are doing here.  PSDE involves taking the discrete Fourier transform of the data.  This transforms our data from the temporal or spatial domain to the frequency domain.  The power spectral density (PSD) function is simply the magnitude of the squared Fourier transformed data (often scaled). It tells us how much of the variance in our signal is explained in each discrete frequency band.  A spike in the PSD function tells us that there’s a lot of variance explained in that frequency band.  The frequency resolution of our analysis (width of the frequency bands) is dictated by the length of our recorded observations.  Interestingly, if we integrate the PSD function over the frequency domain, we obtain the sample variance of our data.  Now, let’s look at the code.

fs = (0.01)^-1;
N =size(y,2);

Y = fft(y);
f=fs*linspace(0,1,N);
G = abs(Y.^2)*(1/(fs*N));

figure(3)
loglog(f(2:floor(N/2)),G(2:floor(N/2)))
title('Amplitude of Spectrum of y(t)')
xlabel('Frequency (Hz)')
ylabel('S(f)')

The first two lines are defining the sampling frequency (fs), the number of observations (N).  Next we use Matlab’s fft command to take the discrete Fourier transformation of the data.  We then compute the frequency bins we are considering, then we compute and plot the PSD function.  Let’s take a look at the resultamplitude_of_spectrum

 

There is a lot of noise, but there is a distinct peak at 0.1592, which is 1/(2*pi), indicates the frequency of the underlying signal.  This is neat, even though the signal contains a huge amount of noise, we’ve distinguished the correct frequency (and inversely the period) of the signal!

So how might we as systems analysts use this?

In my PhD thesis we used PSDE to determine the critical time-dynamics of a hydropower system’s operation.  This turned out to be a critical diagnostic tool.  In the hydropower system I considered, both inflows and energy prices are stochastic, so it can be difficult to determine if  the derived ‘optimal’ policy makes sense by only observing simulation results: the release in any one time step (here 6-hrs) can appear noisy and erratic in response to price or hydrologic shocks.  The power of PSDE is the ability to cut through the noise and reveal the underlying signal.

As an example, I’ve derived an implicit ‘optimal’ policy using sampling stochastic dynamic programming.  To evaluate the quality of the derived policy, I’ve simulated the system operation with this policy over the range of historical inflow series.  For simplicity, we’ll assume that the energy price profile (ranging from high on-peak to low off-peak prices) is identical on every day.  Now, let’s look at the PSD function for the reservoir release.

error_hydro

We can immediately pick out elements of the PSD function that make sense.  First, there is a lot of variability at very low frequencies, corresponding to periodicity on the order of months or seasons.  This is likely due to seasonal hydrologic variability between summer, spring, and winter inflows: releases are higher in the spring and early summer because more water is available, and generally lower in the winter and late summer when inflows to the reservoir are lower.  Similarly the peak at frequency 1/day makes sense, because we still have a daily price cycle that the system is responding to, put we are harder pressed to explain the other peaks between frequencies 0.1 and 1 (roughly corresponding to weekly and within week periodicities).  We have fixed the price profile, but hydrology shouldn’t be cycling during the week, so what is going on?

As it turns out, I had a bug in my algorithm (bad ending conditions for the SSDP) which was causing the ‘optimal’ policy to behave a bit myopically on a weekly basis.  This error was propagating through the model, but was not easy to see by simply observing the simulated results (see chapter 6 of the thesis).  Correcting the problem, I got the following spectrum.  The unexplainable spikes have gone away, and we now see that daily price cycling and seasonal hydrology explain the majority of the systems variability (as we expect).

correct_hydro

For our Cornell-based readers our new CEE faculty member Gregory McLaskey is offering a new class on time series analysis (including PSDE) which is highly recommended.  PSDE is also used extensively (with real data) in Todd Cowen‘s experimental fluid mechanics class which also recommend because you learn a lot and get to play with lasers which, let’s be honest, is why we all became engineers.

Basic NetCDF processing in Matlab

To follow up on Jon Herman’s posts on netCDF, including how to convert ascii to netCDF, using lists in netCDF, and using matplotlib with netCDF data, we also wanted to just learn how to access and plot netCDF data in native Matlab.

First of all, there is an impressive list of netCDF utilities at the NCAR website, including some that are designed for Matlab, such as the CSIRO toolbox.  However, what if you can’t install packages and extra utilities in Matlab?

Well, it turns out there are some built in functions that can handle these data.  Again, thanks to Jon Herman for finding these.  He provides a sample file that gives the instructions.  Basically, the ncdisp function gives a list of what elements are within the netCDF file.  And then, you can extract the data using ncread.

Want to find some data to play around with?  The IRI Climate Data Library is a great place to start. There are data from an impressive number of sources, including some precipitation and streamflow data from South America.

Questions/comments/etc?  Post them below! 

Invoking a Matlab wrapper while running a Parallel C process

As you can tell by my posts lately (thoughts on parallel MOEAs, and thoughts on compiling MODFLOW), I am working on a project where we are going to have to link a lot of codes together that may not like each other.  I guess it’s like little kids in the backseat of a car on a long trip!  So to continue this I wanted to talk about an interesting idea on extending the “Frankenstein’s monster” of these linked codes a step further…

The workflow

A parallel MOEA is coded in C.  It is running a master-worker paradigm in which all the workers are dutifully evaluating their simulations.  Note that each worker must work on his own individual input and output files!

The worker has to call on a MATLAB wrapper, which massages some input data and writes some files that are needed to run a simulation.  MATLAB has to somehow figure out what node it is running on!

Finally, MATLAB calls to an executable that is running a simulation code written in a Mysterious Language.  The exectuable communicates to the rest of the world via input and output text files.

A problem

We know that we can call MATLAB using a system command. Consider trying to run a MATLAB file called myMatlabFunction.m. The command to invoke MATLAB is simply:

matlab -r myMatlabFunction

but note! MATLAB doesn’t have the luxury of having the MPI setup, so it doesn’t know what processor rank it is! In fact, MATLAB is inherently ‘stupid’, in the sense that it is a new process that has been spawned off of the calculations that the worker node is doing. So how do we tell the MATLAB process its rank? This is needed, of course, because the input filename for the simulation is going to have, within it, the name of the rank of the process.

In the C code, all we need to do to determine the rank of the processor is:

#include
int rank;
MPI_Comm_rank (MPI_COMM_WORLD, &rank);/* get current process id */

My first idea was to include a parameter in the MATLAB function that would accept ‘rank’, and then the form of the call to MATLAB would simply enter the parameters in parentheses! I would post where I found out how to do this, but I’m going to because (wait for it) this is not a valid approach in Unix!

In fact, to do what we need to do, according to this helpful post on the Mathworks website, we need to set environment variables on the machine. Then, Matlab can read the environment variables.

The Solution

In your C code, first convert the rank to a char array, and then you can use the setenv command to set an environment variable. Note! For debugging purposes, it is helpful to save every input file that you create. Granted, this will mean that 1000s of files will be saved at once, but the benefit is that you get to see that the input files are created properly before you go destroying them. Therefore, there’s a second environment variable that you need to set which is the iteration. Essentially you will have a matrix of input files: ‘rank 2, iteration 5’ is the 5th iteration that rank 2 has worked on, etc. So:

sprintf(rankAsString, "%d", rank);
sprintf(iterationAsString, "%d", myCounter);

setenv("RANK", rankAsString, 1);
setenv("ITERATION", iterationAsString, 1);

To clarify, myCounter is a global variable that you increment up every time your function is called. So now, you have environment variables on your processor that can be read in by other programs to indicate the rank and iteration. This is important because your input file has those variables embedded in its name:

sprintf(theFilename, "input_rank-%d_iteration-%d.txt", rank, myCounter);

Now let’s look at the MATLAB code. To test initially, I just did something silly. All I did was have the MATLAB code read in the variables, and then spit them back out in a new file. If it can do that, it can obviously do more complicated things. Luckily MATLAB has a function that can read environment variables (getenv). You can also craft yourself some test output that you can look at to make sure the process is working:

function myMatlabFunction

rank = getenv('RANK');
myCounter = getenv('ITERATION');

disp('Rank is:');
disp(rank);

disp('Iteration is:');
disp(myCounter);

filename = ['input_rank-' num2str(rank) '_iteration-' num2str(myCounter) '.txt'];

outFilename = ['output_rank-' num2str(rank) '_iteration-' num2str(myCounter) '.txt'];

fid = fopen(filename);
for i = 1:11
	  vars(i) = fscanf(fid, '%f', 1);
end
fclose(fid);

disp(vars);

outid = fopen(outFilename, 'w');
fprintf(outid, '%f %f %f %f %f %f %f %f %f %f %f\n', vars);
fclose(outid);

exit

Note here that the number of variables is hard coded as 11, but this can be easily changed.

Conclusion

As you can see, just because you need to use source code to run a Parallel MOEA, that doesn’t mean that you can’t employ wrappers and executable simulation codes within your workflow. Granted, the use of wrappers is going to slow you down, but for a model with a long evaluation time, this won’t likely matter that much. As usual please comment and question below.

Compiling the Borg Matlab Wrapper (OSX/Linux)

The Borg MOEA is written in C, but thanks to Dave, a Matlab wrapper is now available. This post will describe how to compile and use it on OSX (and, to a lesser extent, Linux); thanks to Joe, who figured all of this out. Windows users should refer to this post instead.

Step 0: Get Files

You will need the full source code (borg.c, borg.h, mt19937.c, mt19937.h) for the serial version of Borg, as well as three additional files: borg.m, DTLZ2.m,and nativeborg.cpp. Students in CEE 6200 will have received these files in a zip folder (be sure to grab the OSX folder rather than the Windows one).

Step 1: Compile shared library

The Windows version comes with pre-compiled shared libraries, but the OSX version does not. This is ok, because we need a compiler for the rest of the steps anyway. If you don’t already have it, go install XCode, which is Apple’s software development kit. We really only need the gcc compiler, but it won’t hurt to install the whole package.

After installing XCode, we need to compile Borg as a shared library. Open a terminal and navigate to the directory where you put your Borg files. Then, type the following commands:

gcc -c borg.c
gcc -c mt19937ar.c
gcc -shared -o libborg.so borg.o mt19937ar.o

The file libborg.so is the shared library that we’re looking for. Take this file and copy it to where your Matlab files are located (CEE 6200 students, it should already be in the right directory). Note if you’ve never used a terminal before, it’s just the application called “Terminal” on your Mac. Use the command cd /to/some/directory to change directory, and pwd to print working directory (that is, to see where you are).

Step 2: Check Compiler in Matlab

Matlab needs to compile Borg into a mex file to call its functions directly. The default compiler that it looks for on OSX is XCode—which, fortunately, we installed in the previous step. You can double-check that Matlab recognizes your compiler by running mex -setup at the Matlab command line. (Note: NOT in your system terminal). The mex -setup command should locate your XCode installation, with the gcc compiler.

Step 3: Compile Borg

In the Matlab command window, navigate to the directory where you stored all of the Borg files from Step 0. Run the command mex nativeborg.cpp libborg.so. If it works, you’ll see a new file called nativeborg.mex* in your directory. If it doesn’t work, double-check that you have all of the files copied into the same directory, and that Matlab’s working directory is set properly.

A few caveats. If you’re running OSX 10.9 (Mavericks), this will most likely need a small tweak to work—read troubleshooting step 3b below. Linux users will need to add a library path to the compilation like so:

mex LDFLAGS="\$LDFLAGS -Wl,-rpath,\." nativeborg.cpp libborg.so

Again: these commands are being run in Matlab’s command windownot the system terminal.

Step 3b: Compilation Troubleshooting

If you’re running Mavericks (OSX 10.9) and thus have XCode 5 or higher, you will most likely get this error:

xcodebuild: error: SDK macosx10.7 cannot be located

The SDK for OSX 10.7 is no longer included in XCode, but Matlab still searches for it by default. Try following these instructions from Mathworks to change the SDK that Matlab searches for from 10.7 to 10.8. As per the instructions, even if you’re running Mavericks you should still try 10.8, because it is likely better-supported by Matlab. Both the 10.8 and 10.9 SDKs should be included with the latest version of XCode.

Step 3c: Troubleshooting Continued

The compilers included in the latest version of XCode may have further compatibility issues with mex. Specifically, when you try to run mex, you may get a cryptic error message about undefined type char16_t. Following these instructions, you will need to add -std=C++11 to the CXXFLAGS variable in your options file (the same mexopts.sh file that you edited in Step 3b above).

This forces the compiler to abide by a certain standard, in which the undefined type is now (apparently) defined. I have confirmed that this fix works on OSX 10.9.2 with Matlab 2013a.

Step 4: Run a test problem

The file DTLZ2.m shows an example of how an objective function should be formatted for use with the Borg Matlab wrapper. It accepts a vector of decision variables as input, and outputs a vector of objectives (and optionally, constraints). From the command line, you can optimize this function as follows:

[vars, objs] = borg(11, 2, 0, @DTLZ2, 100000, zeros(1,11), ones(1,11), 0.01*ones(1,2));

The function returns the decision variables and objectives associated with non-dominated points (the Pareto-approximate set). The first three inputs to the function are: the number of decision variables, number of objectives, and number of constraints. After that comes the handle for your objective function, the number of function evaluations to run in the optimization, the lower and upper bounds of the decision variables, and finally, the epsilon precision values for the objectives.

To see the set of nondominated solutions, try scatter plotting the columns of the objs matrix:

scatter(objs(:,1), objs(:,2));

And you should see a semicircular-looking Pareto front. Now all you need to do is swap out DTLZ2 with an objective function or model of your choice, and you’ll be ready to go!

Compile and run Borg MATLAB on JANUS

Some of the steps/information below are very similar to the information above; however, this is to clarify the exact steps necessary to compile and run Borg MATLAB on the CU Boulder supercomputer (JANUS).

Step 1: Transfer Borg source code to JANUS

Open your terminal and copy Borg source code from you computer to the appropriate folder on JANUS. Below is an example of a command you can use to copy these files. The *.c copies over all files ending in .c. If you only have one file of a specific type, you can also just use the filename.

scp /Users/elizabethhoule/Desktop/*.c elho9743@login.rc.colorado.edu:/lustre/janus_scratch/elho9743/borg

Step 2: Compile shared library

Log onto JANUS, and navigate to the JANUS directory that contains the Borg source code. For example:

ssh elho9743@login.rc.colorado.edu
cd /lustre/janus_scratch/elho9743/borg

Use the following code to compile Borg as a shared library. This should produce a file called libborg.so.

gcc -c -fPIC borg.c
gcc -c -fPIC mt19937ar.c
gcc -shared -o libborg.so borg.o mt19937ar.o

Check your folder via the ls command to ensure libborg.so now exists. Then copy libborg.so to your JANUS directory that contains your MATLAB code. For example:

cp /lustre/janus_scratch/elho9743/borg/libborg.so /lustre/janus_scratch/elho9743/matlabfiles/

Also ensure that your nativeborg.cpp and borg.m files are in this same directory. Navigate to the directory that now contains your MATLAB code and libborg.so.

cd /lustre/janus_scratch/elho9743/matlabfiles/

Step 3: Compile Borg

Use the following commands to compile Borg on JANUS.

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:.
module load matlab/matlab-2013a
matlab
mex nativeborg.cpp libborg.so

The first command, LD_LIBRARY_PATH=$LD_LIBRARY_PATH:., tells MATLAB where the .so is located. Even after you compile Borg, if you close out of your terminal, you will need to type the first command again in your new terminal before Borg will run. When you type this command ensure you are in your JANUS directory that contains the MATLAB files.

The second command, module load matlab/matlab-2013a, loads MATLAB on JANUS. The third command, matlab, opens MATLAB on JANUS in your terminal. The fourth command, mex nativeborg.cpp libborg.so, compiles Borg in MATLAB on JANUS.

Lastly, you can run a test problem to make sure Borg is properly compiled. Make sure all of the necessary files are in the directory in which you are running the test problem. Use the following command to run the test problem:

[vars, objs] = borg(11, 2, 0, @DTLZ2, 10000, [0.01,0.01], zeros(1,11), ones(1,11));

If this works, you are good to go. Navigate out of MATLAB on JANUS and load slurm to start running your Borg jobs.

Compiling the Borg Matlab Wrapper (Windows)

The Borg MOEA is written in C, but thanks to Dave, a Matlab wrapper is now available. This post will describe how to compile and use it on Windows. A separate post describes the process for OSX/Linux.

Step 0: Get Files

If you have access to the Bitbucket repository, you can grab the files there. If not, you’ll need to email someone to get them (I believe public releases are planned for the near future). The files you need are as follows: Borg.dll, Borg.lib, borg.m, DTLZ2.m, nativeborg.cpp, and borg.h. (Note: the last file is in the main directory, not the Matlab subdirectory). Students in CEE 6200 will have received these files in a zip folder; be sure to get the Windows-specific version.

Step 1: Check Compiler

Matlab needs to compile Borg into a mex file to call its functions directly. To do this, it needs access to a compiler. Unfortunately, it only looks for a few specific compiler types depending on your OS. (For example, if you’re using gcc in Cygwin, Matlab knows nothing about that). A list of compiler choices for Windows/Mac/Linux is given here: http://www.mathworks.com/support/compilers/R2013a/.

You can check if you already have any of these installed by running mex -setup at the Matlab command line. (Note: NOT in your system terminal—which will either (1) do nothing, or (2) try to “make LaTeX”, neither of which you want right now). The mex -setup command will search for compilers. If it doesn’t find any, it will let you know.

Note: these instructions assume that you are using a reasonably recent version of Matlab. We’ve tested on 2013a specifically (64-bit). If you are using an older version of Matlab, the list of accepted compilers is completely different and may cause serious headaches if your Matlab is a 32-bit version. CEE 6200 students with older Matlab versions may want to try the computer lab in Carpenter instead; thanks to Jared Smith for testing these steps on the Carpenter computers.

(If it does find a compiler, congrats, you can skip ahead to Step 3!)

Step 2: Install Compiler

Assuming Matlab couldn’t find a compiler in the previous step, you’ll want to go install one. On Windows 7/8, Matlab encourages you to grab the Windows SDK.

Step 2b: Installation Troubleshooting

Windows: If you’re installing the Windows SDK for the first time, you may run into some problems. Specifically, if you already have Visual C++ 2010 installed, scroll to the bottom of these instructions from MathWorks and read the troubleshooting section. (Note: it is very likely that you have a version of Visual C++ installed, for one reason or another). The troubleshooting steps are: (1) uninstall the existing Visual C++ 2010, (2) Reinstall the Windows SDK, but de-select the Visual C++ 2010 options, and (3) Install a patch from Microsoft. Joe and I have successfully tested these steps on Windows 7 and 8 respectively , so with any luck this will be your only troubleshooting step.

Step 3: Compile Borg

In the Matlab command window, navigate to the directory where you stored all of the Borg files from Step 0. Run the command mex nativeborg.cpp Borg.lib. If it works, you’ll see a new file called nativeborg.mex* in your directory. (The * is “w64” for 64-bit Windows). If it doesn’t work, double-check that you have all of the files copied into the same directory, and that Matlab’s working directory is set properly.

Again: these commands are being run in Matlab’s command window, not the system terminal. Right now, the pre-compiled libraries (Borg.lib and Borg.dll) will only work for Windows. We are working on a set of steps for OSX/Linux users.

Step 4: Run a test problem

The file DTLZ2.m shows an example of how an objective function should be formatted for use with the Borg Matlab wrapper. It accepts a vector of decision variables as input, and outputs a vector of objectives (and optionally, constraints). From the command line, you can optimize this function as follows:

[vars, objs] = borg(11, 2, 0, @DTLZ2, 100000, zeros(1,11), ones(1,11), 0.01*ones(1,2));

The function returns the decision variables and objectives associated with non-dominated points (the Pareto-approximate set). The first three inputs to the function are: the number of decision variables, number of objectives, and number of constraints. After that comes the handle for your objective function, the number of function evaluations to run in the optimization, the lower and upper bounds of the decision variables, and finally, the epsilon precision values for the objectives.

To see the set of nondominated solutions, try scatter plotting the columns of the objs matrix:

scatter(objs(:,1), objs(:,2));

And you should see a semicircular-looking Pareto front. Now all you need to do is swap out DTLZ2 with an objective function or model of your choice, and you’ll be ready to go!

Matlab and Matplotlib Plotting Examples

A while back, Jon published a set of Matlab plotting examples on GitHub.  I’m not sure if this has been publicized yet — I couldn’t find it with a quick search of the blog, so it couldn’t hurt to publicize it again!

I’ve started adding my own Python/Matplotlib equivalents to Jon’s Matlab examples, in the hopes of turning this into a comprehensive library of basic plotting examples.  I’m not done yet, but I’ll add more as time permits.  I take requests, so if there’s anything you’d like to see, please leave a note in the comments.

Here’s a quick list of what’s there now:

  • line plots (Matlab and Python)
  • line plots with shading between curves (Python)
  • stacked area plots (Matlab)
  • animated gifs like the one from Jon’s recent post (Matlab and Python)
  • histograms and cdf plots (Matlab)
  • parallel coordinate plots (Matlab)
  • gridded data contour and surface plots (Matlab and Python)
  • scatter plots (Matlab and Python)
  • non-gridded data contour plots (Matlab and Python)

Code Sample: Stacked Bars and Lines in Matlab

I needed to make a plot that superimposes a “stacked” bar graph with a line in Matlab.  I got started by referencing this post on Matlab central, which used “function handles” to pass plotting commands into plotyy.  This got me thinking about some other Matlab features that are explained in the following code sample.  Hope this helps in your work too!

The data is from the USGS’s estimate of U.S. Water Use for 2005.


%How to plot data from USGS, Estimated Water Use in the U.S. in 2005
%as a stacked bar/line graph in Matlab.

%Specifically, we want the US population as a line, and the different
%water use types as stacked bars.

%The data, where each data value is for a different 5 year period
population = [151, 164, 180, 194, 206, 216, 230, 242, 252, 267, 285, 301];
totalwithdrawal = [180, 240, 270, 310, 370, 420, 430, 397, 404, 399, 413, 410];
public = [14, 17, 21, 24, 27, 29, 33, 36.4, 38.8, 40.2, 43.2, 44.2];
irrigation = [89, 110, 110, 120, 130, 140, 150, 135, 134, 130, 139, 128];
thermo = [40, 72, 100, 130, 170, 200, 210, 187, 194, 190, 195, 201];

%Place the data in a matrix, where each row is a
%different data type (i.e., population)
data = [thermo; irrigation; public;
 totalwithdrawal-(public+irrigation+thermo)];

%Now, flip the data so each data type has its own column
data = data';

%To use different data types in the plotyy environment, you
%can use Matlab's 'anonymous function' feature. Stackedbar
%and prettyline below are temporary functions you can only
%use inside of this script.
stackedbar = @(x, A) bar(x, A, 'stack');
prettyline = @(x, y) plot(x, y, 'k', 'LineWidth', 5);

%There's a version of plotyy that accepts function handles as an
%argument. We'll pass stackedbar and prettyline in there as the
%last arguments to this function.
[ax, h1, h2] = plotyy(1950:5:2005, data, 1950:5:2005, population, stackedbar, prettyline);

%You'll notice the line and bar axes are not agreeing with each other.
%We can fix this using the ax variable created above.
set(ax(1), 'XLim', [1945 2010]);
set(ax(2), 'XLim', [1945 2010]);

%Get rid of one of the sets of ticks, to make sure
%the figure looks nice.
set(ax(2), 'XTick', []);

%Changing the colormap will change the colors of the
%stacked bars. For example:
colormap summer;

%Finally write the legend. Note that, for whatever reason,
%the line is first in the list of legend entries.
legend('Population', 'Thermoelectric Power', 'Irrigation', 'Public Supply', 'Other', 'Location', 'NorthWest');

Here’s the result!