Basic NetCDF processing in Matlab

To follow up on Jon Herman’s posts on netCDF (including how to convert ASCII to netCDF, use lists in netCDF, and use matplotlib with netCDF data), we also wanted to learn how to access and plot netCDF data natively in Matlab.

First of all, there is an impressive list of netCDF utilities at the NCAR website, including some that are designed for Matlab, such as the CSIRO toolbox.  However, what if you can’t install packages and extra utilities in Matlab?

Well, it turns out there are some built-in functions that can handle these data.  Again, thanks to Jon Herman for finding these; he provides a sample file that gives the instructions.  Basically, the ncdisp function lists the elements contained in a netCDF file, and then you can extract the data using ncread.
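As a minimal sketch of that workflow (the file name and variable names here are hypothetical; use ncdisp to discover the actual variable names in your file):

```matlab
% Display the contents of the file: dimensions, variables, and attributes
ncdisp('precip.nc');

% Read variables into Matlab arrays (variable names are assumptions;
% they must match what ncdisp reports for your file)
precip = ncread('precip.nc', 'precip');
t      = ncread('precip.nc', 'time');

% Plot the time series at one grid cell
plot(t, squeeze(precip(1,1,:)));
xlabel('time'); ylabel('precipitation');
```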

Want to find some data to play around with?  The IRI Climate Data Library is a great place to start. There are data from an impressive number of sources, including some precipitation and streamflow data from South America.

Questions/comments/etc?  Post them below! 

NumPy vectorized correlation coefficient

Imagine that you have a matrix X with dimensions (N x k) and a vector y of size (1 x k), and you want to calculate the linear correlation coefficient between y and each row of X to obtain a vector of correlation coefficients, r, of size (N x 1).

The first way you might consider doing this is in a loop with the np.corrcoef function, which returns the linear correlation coefficient between two vectors:

import numpy as np

r = np.zeros((N,1))
for i in range(N):
    r[i] = np.corrcoef(X[i,:], y)[0,1]

If N is large, or if you need to perform this calculation many times (with an outer loop wrapped around it), this will be very slow. We can time it as follows:

import time
import numpy as np

times = np.zeros(10)
for t in range(10): # average 10 timings

    start = time.time()
    r = np.zeros((10000,1))
    for k in range(10000):
        r[k] = np.corrcoef(vic_runoff[k,:], obs_runoff)[0,1]

    end = time.time()
    times[t] = end - start

print(np.mean(times))

In this example with N = 10000, the average time is 0.48 seconds per calculation of the r vector. Too slow!

How can we speed this up? Some NumPy functions, such as sum and mean, have an axis argument to perform the operation along a particular dimension, but corrcoef does not (presumably because people often want the correlations between the rows of the matrix X, rather than the correlation of all rows with respect to a single vector). Recall the formula for the sample correlation coefficient between 1D vectors x and y:

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / sqrt( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )

(linear correlation coefficient, from Wikipedia)

We can write a function using NumPy’s vectorized arithmetic to compute these values all at once rather than in a loop. For example, np.multiply(X,y) (also given by X*y) broadcasts the vector y across all rows of the matrix X, multiplying element-wise. The function might look something like this:

def vcorrcoef(X,y):
    # Row means of X, reshaped to a column vector so it broadcasts over X
    Xm = np.reshape(np.mean(X,axis=1),(X.shape[0],1))
    ym = np.mean(y)
    # Numerator and denominator of r for all rows at once
    r_num = np.sum((X-Xm)*(y-ym),axis=1)
    r_den = np.sqrt(np.sum((X-Xm)**2,axis=1)*np.sum((y-ym)**2))
    r = r_num/r_den
    return r
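As a quick sanity check (on small random data, not the runoff data from this post), the vectorized function agrees with the looped np.corrcoef approach:

```python
import numpy as np

def vcorrcoef(X, y):
    # Row means of X as a column vector so broadcasting works
    Xm = np.reshape(np.mean(X, axis=1), (X.shape[0], 1))
    ym = np.mean(y)
    r_num = np.sum((X - Xm) * (y - ym), axis=1)
    r_den = np.sqrt(np.sum((X - Xm)**2, axis=1) * np.sum((y - ym)**2))
    return r_num / r_den

# Hypothetical test data: 5 rows of length 20
rng = np.random.default_rng(0)
X = rng.random((5, 20))
y = rng.random(20)

r_vec = vcorrcoef(X, y)
r_loop = np.array([np.corrcoef(X[i, :], y)[0, 1] for i in range(5)])
print(np.allclose(r_vec, r_loop))  # True
```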

If we try timing it as before:

times = np.zeros(10)
for t in range(10): # average 10 timings

    start = time.time()
    r = vcorrcoef(vic_runoff, obs_runoff)
    end = time.time()
    times[t] = end - start

print(np.mean(times))

The average time is only 0.0017 seconds—a reduction of about 99.6%, or a 280x speedup compared to the non-vectorized approach. Not bad! It’s often said that Python is slow, but the matrix operations in NumPy can clearly be used to our advantage.

Send all optimization algorithms the same model when comparing performance!

While performing diagnostics on Riddhi’s formulations of the environmental economics “Lake Problem,” we had slightly different models for communicating with Borg and with the MOEAFramework algorithms. The only difference in the code was the part where the Borg/MOEAFramework connection was made, so we did not expect it to affect performance in any meaningful way, especially since command-line calls outside of optimization with identical decision variables gave essentially identical results. The only discrepancy was the truncation of a few digits well past the specified epsilon for the Borg model.

In spite of these minute differences, our preliminary results were deeply disconcerting. Borg had a reference set contribution of zero, and when the combined set found by Borg was compared to that found by a random search, the front found by the random search dominated the one found by Borg. That should not be possible, especially on a problem as simple as the two-objective deterministic formulation we were working with!

After many attempts to determine the reason for these bizarre results, Jon pointed out that we could call the model we had written for the MOEAFramework from the command line when optimizing with Borg. When I tried this, the results made sense: Borg had a reference set contribution of 1.0 for the two-objective deterministic formulation, and its Pareto-approximate front dominated that of the random search. Further investigation with decision variables returned by the optimization leads us to believe the two models may read variables in slightly different formats, and we are looking into that hypothesis.

For anyone else comparing performance of multi-objective evolutionary algorithms, make sure you send the exact same model to all algorithms!  Otherwise, the comparison will not be fair.