Profiling C++ code with Callgrind

Oftentimes we have to write code to perform tasks whose complexity varies from the mundane, such as retrieving and organizing data, to the highly complex, such as the CFD simulations that form the spine of a project. In either case, depending on the complexity of the task and the amount of data to be processed, a newborn code may leave us staring at an underscore blinking gracefully on the command prompt for hours until the results are ready, leading to project schedule delays and shortages of patience. Two standard and preferred approaches to the problem of time-intensive code are to simplify the algorithm and to make the code more efficient. In order to better select the parts of the code to work on, it is often useful to first find where the code spends most of its time by profiling it.

In this post, I will show how to use Callgrind, part of Valgrind, and KCachegrind to profile C/C++ code on Linux — unfortunately, Valgrind is not available for Windows or Mac, although it can be run on a cluster, from which results can be downloaded and visualized on Windows with QCachegrind. The first step is to install Valgrind and KCachegrind by typing the following commands in the terminal of a Debian-based distribution, such as Ubuntu (equivalent yum commands are available for Red Hat-based distributions):

$ sudo apt-get install valgrind
$ sudo apt-get install kcachegrind

Now that the required tools are installed, the next step is to compile your code with GCC/G++ (with a makefile, CMake, an IDE, or by calling the compiler directly from the terminal).
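It helps to compile with debug symbols so that KCachegrind can map costs back to your source files; a hypothetical invocation from the terminal (the program and file names below are placeholders) might look like:

$ g++ -g -O2 -o my_program main.cpp

Once your program is compiled, run the following command in a terminal (Ctrl+Shift+T opens one on most Linux desktops):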

$ valgrind --tool=callgrind path/to/your/compiled/program program_arguments

Callgrind will then run your program with instrumentation added to its execution to measure the time expenditure and cache use of each function in your code. Because of the instrumentation, your code will take considerably longer to run under Callgrind than it typically would, so be sure to profile a representative task that is as small as possible.
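If only part of a long run is of interest, another option is to start Callgrind with instrumentation disabled and toggle it on around the region you care about using the client-request macros from the Valgrind headers. A minimal sketch, assuming valgrind/callgrind.h is installed and with placeholder function names:

#include <valgrind/callgrind.h>

void expensive_region() {
  // ... the code you actually want to profile ...
}

int main() {
  // Run as: valgrind --tool=callgrind --instr-atstart=no ./my_program
  CALLGRIND_START_INSTRUMENTATION;  // begin measuring costs here
  expensive_region();
  CALLGRIND_STOP_INSTRUMENTATION;   // stop measuring here
  return 0;
}

During its execution, Callgrind will print a report similar to the one below to the terminal itself: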

==12345== Callgrind, a call-graph generating cache profiler
==12345== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==12345== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==12345== Command: path/to/your/compiled/program program_arguments
==12345==
==12345== For interactive control, run 'callgrind_control -h'.
IF YOUR CODE OUTPUTS TO THE TERMINAL, THE OUTPUT WILL BE SHOWN HERE.
==12345== 
==12345== Events    : Ir
==12345== Collected : 4171789731
==12345== 
==12345== I   refs:      4,171,789,731

The report above shows that Callgrind collected around 4.2 billion events (instruction reads, Ir) in order to generate the comprehensive report saved in the file callgrind.out.12345, where 12345 is your process ID, shown in the report above. Instead of submerging your soul in a sea of despair by trying to read the output file in a text editor, you should load the file into KCachegrind by typing:

$ kcachegrind callgrind.out.12345

You should now see a screen like the one below:

[Screenshot: KCachegrind's initial view of the profiling results]

The screenshot above shows the profiling results for my code. The left panel lists the functions called by my code, sorted by the total time spent inside each one. Because functions call each other, Callgrind reports two cost metrics as proxies for time spent in each function: "Incl.", the total (inclusive) cost of a function including its callees, and "Self", the cost of the function's own code, excluding its callees. By clicking on "Self", we sort the functions by the cost of their own code, as shown below:

[Screenshot: KCachegrind with functions sorted by Self cost]

Callgrind includes functions that are native to C/C++ in its analysis. If one of them appears near the top of the left panel, it may be worth trying a different function or data structure that performs a similar task more efficiently. Most of the time, however, our own functions occupy most of the top positions in the list. In the example above, a possible first step to improve the time performance of my code is to make the function "ContinuityModelROF::shiftStorage" more efficient. A few weeks ago, however, the function "ContinuityModel::continuityStep" was ranked first with over 30% of the cost, followed by a C++ map-related function. I then replaced a map inside that function with a pointer vector, which dropped the function's cost to less than 5% of the total cost of the code.
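To illustrate the kind of change involved (this is a hypothetical sketch, not the actual model code): when the keys are small, dense integers, lookups in a std::map can often be replaced with direct indexing into a std::vector, trading a tree traversal for contiguous, constant-time access.

#include <map>
#include <vector>

// Hypothetical "before": each lookup walks the map's tree (O(log n), pointer chasing)
double getStorageMap(const std::map<int, double> &storage, int reservoir_id) {
  return storage.at(reservoir_id);
}

// Hypothetical "after": IDs index directly into a vector (O(1), cache friendly)
double getStorageVector(const std::vector<double> &storage, int reservoir_id) {
  return storage[reservoir_id];
}

Whether such a swap pays off depends on the access pattern, which is exactly what the profiler helps you confirm.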

If KCachegrind shows that a function called from multiple places in the code is costly, you may want to know which caller is the main culprit behind the costly calls. To do this, click on the function of interest (in this case, "_memcpy_sse2_unalight") in the left panel, then click on "Callers" in the upper right panel and on "Call Graph" in the lower right panel. This will show, in list and graph form, the calls made to that function by other functions and the associated percent costs. In my case, only the function "ContinuityModelROF::calculateROF" calls "_memcpy_sse2_unalight," hence the simple graph, but the graph would be more complex if multiple functions made calls to it.

I hope this saves you at least the time you spent reading this post!

A visual introduction to data compression through Principal Component Analysis

Principal Component Analysis (PCA) is a powerful tool that can be used to create parsimonious representations of a multivariate data set. In this post I'll code up an example from Dan Wilks' book Statistical Methods in the Atmospheric Sciences to visually illustrate the PCA process. All code can be found at the bottom of this post.

As with many of the examples in Dr. Wilks' excellent textbook, we'll be looking at minimum temperature data from Ithaca and Canandaigua, New York. Figure 1 is a scatter plot of the minimum temperature anomalies at each location for the month of January 1987.


Figure 1: Minimum temperature anomalies in Ithaca and Canandaigua, New York in January 1987

As you can observe from Figure 1, the two data sets are highly correlated; in fact, they have a Pearson correlation coefficient of 0.924. Through PCA, we can identify the primary mode of variability within this data set (its largest "principal component") and use it to create a single variable that describes the majority of variation at both locations. Let x define the matrix of our minimum temperature anomalies at both locations. The eigenvectors (E) of the covariance matrix of x describe the primary modes of variability within the data set. PCA uses these eigenvectors to create a new matrix, u, whose columns contain the principal components of the variability in x.

u = xE

Each element in u is a linear combination of the original data, with the eigenvectors in E serving as weights for each data point. The first column of u corresponds to the eigenvector associated with the largest eigenvalue of the covariance matrix. Each successive column of u represents a different level of variability within the data set, with u1 describing the direction of highest variability, u2 describing the direction of second-highest variability, and so on. The transformation resulting from PCA can be visualized as a rotation of the coordinate system (or change of basis) for the data set; this rotation is shown in Figure 2.


Figure 2: Geometric interpretation of PCA

As can be observed in Figure 2, each data point can now be described by its location along the newly rotated axes, which corresponds to its value in the newly created matrix u. The point (16, 17.8), highlighted in Figure 2, can now be described as (23, 6.6), meaning that it is 23 units away from the origin in the direction of highest variability and 6.6 units away in the direction of second-highest variability. As shown in Figure 2, the question of how different each data point is from the mean can mostly be answered by looking at its corresponding u1 value.
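As a quick check of these numbers in Matlab (the eigenvector values below are rounded approximations; the exact matrix E is computed by the script at the end of this post):

% Approximate eigenvectors of the covariance matrix (columns of E)
E_approx = [0.848 -0.530;
            0.530  0.848];

point = [16 17.8];           % the highlighted anomaly pair (Ithaca, Canandaigua)
u_point = point * E_approx   % returns roughly [23.0  6.6]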

Once transformed, the original data can be recovered through a process known as synthesis. Synthesis can be thought of as PCA in reverse. The elements of the original data set x can be approximated using the eigenvectors of the covariance matrix and the first principal component, u1.

\tilde{x} = \tilde{u}\tilde{E}^T

Where:

\tilde{x} is the reconstructed data set

\tilde{u} is the matrix of PCs used for reconstruction (in our case, the first column of u)

\tilde{E} contains the eigenvectors of the PCs used

For our data set, these reconstructions seem to work quite well, as can be observed in Figure 3.

[Figure 3: Original and reconstructed minimum temperature anomalies at both locations]

Data compression through PCA can be a useful alternative to the tolerant methods for dealing with multicollinearity that I discussed in my previous post. Rather than running a constrained regression, one can simply compress the data set to eliminate sources of multicollinearity. PCA can also be a helpful tool for identifying patterns within your data set or simply for creating more parsimonious representations of a complex set of data. The Matlab code used to create the above plots can be found below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Ithaca_Canandagua_PCA
% By: D. Gold
% Created: 3/20/17
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% This script will perform Principal Component Analysis on minimum
% temperature data from Ithaca and Canandaigua in January 1987, provided in 
% Appendix A of Wilks (2011). It will then estimate minimum temperature
% values at both locations using the first principal component.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% create data sets
clear all

% data from Wilks (2011) Appendix A.1
Ith = [19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0,... 
 2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23]';

Can = [28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23,... 
 13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23]';

%% center the data, plot temperature anomalies, calculate covariance and eigs

% center the data
x(:,1) = Ith - mean(Ith);
x(:,2) = Can - mean(Can);

% plot anomalies
figure
scatter(x(:,1),x(:,2),'Filled')
xlabel('Ithaca min temp anomaly ({\circ}F)')
ylabel('Canandaigua min temp anomaly ({\circ}F)')

% calculate covariance matrix and its corresponding eigenvalues & eigenvectors
S = cov(x(:,1),x(:,2));
[E, Lambda]=eigs(S);

% Identify maximum eigenvalue; its column will be the first eigenvector
max_lambda = find(max(Lambda)); % index of maximum eigenvalue in Lambda
idx = max_lambda(2); % column of max eigenvalue

%% PCA
U = x*E(:,idx);

%% synthesis
x_syn = E(:,idx)*U'; % reconstructed values of x

% plot the reconstructed values against the original data
figure
subplot(2,1,1)
plot(1:31,x(:,1) ,1:31, x_syn(1,:),'--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Ithaca min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')
subplot(2,1,2)
plot(1:31, x(:,2),1:31, x_syn(2,:)','--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Canandaigua min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')

Sources:

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences. Amsterdam: Elsevier Academic Press.

Solving non-linear problems using linear programming

This week’s post comes from recent conversations we’ve had around the Reed group concerning tools to quickly solve (approximately) non-linear programming problems.  First, some context.

As part of a simulation model our group is building, a drinking water allocation sub-problem must be solved.   Figure 1 is a simplified example of the sort of problem we are solving.


Figure 1: Mock water distribution network

There are three utilities that each have a demand (d_{1}, d_{2}, and d_{3}). The utilities are connected via some infrastructure, as shown in Figure 1.  When our total available water (R) is in excess of the demand (d_{1}+d_{2}+d_{3}), no rationing is needed.  When we do need to ration, we want to allocate the water to minimize the percent supply deficits across the three utilities:


Equation 1

Subject to:

[constraint set]

The last constraint here describes a limitation of the distribution network.  The real problem is much more complicated, but we needn't detail that here.

This problem needs to be solved thousands, or hundreds of thousands of times in each simulation, so we want any solution technique to be fast.  The natural solution is linear programming (LP), which can solve problems with tens of thousands of variables and constraints nearly instantaneously.

We won’t discuss LP in great detail here, except to say that LP requires an objective and constraints that are linear with respect to the decision variables.  These restrictive requirements significantly reduce the number of potential optimal solutions that must be searched.  By systematically testing and pivoting between these potential optimal solutions, the popular Simplex Algorithm quickly converges to the optimal solution.

As stated in Equation 1, our rationing scheme is indifferent between imposing small deficits across all three utilities and imposing one large deficit on a single utility.  For example, the objective value in Equation 1 is the same whether each utility has a deficit of 5%, or utility 1 has a deficit of 15% while utilities 2 and 3 have no deficit.  In reality, many small deficits are likely preferable to one large one.  So what are we to do?

We could square our deficits.  In that case, our rationing scheme will prefer small distributed deficits over one large deficit:


Equation 2

BUT, we can’t use LP to solve this problem, as our objective is now non-linear! There are non-linear programming algorithms that are relatively fast, but perhaps not fast enough.  Instead we could linearize our non-linear objective, as shown in Figure 2.

The strategy here is to divide a single allocation,  x_{1} for instance, into many decision variables, representing different ranges of the actual allocation x_{1}.  In each range, a linear segment approximates the actual quadratic objective function.  Any actual release x_{1} can be achieved by assigning the appropriate values to the new decision variables (k_{1}, k_{2}, and k_{3}), and the contribution to the objective function from that release can be approximated by:


Equation 3

Subject to:

[constraint set]

If a more accurate description is needed, the range of x_{1} can be divided into more segments.  For our purposes just a few segments are probably sufficient.  A similar strategy can be adopted for x_{2} and x_{3}.  Of course the constraints from the original optimization problem would need to be translated into terms of the new decision variables.
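For concreteness, one way such a piecewise-linear formulation can be written (an illustrative sketch rather than the exact equations above; the segment widths and slopes depend on the demands and on how finely the curve is discretized) is:

x_{1} = k_{1} + k_{2} + k_{3}, \quad 0 \le k_{j} \le \Delta_{j}

f(x_{1}) \approx a k_{1} + b k_{2} + c k_{3}

Here \Delta_{j} is the width of segment j, and a, b, and c are the slopes of the three linear segments approximating the squared-deficit curve (negative, and increasing from a to c for a decreasing, convex curve).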

Now we are adding many more decision variables and constraints, but this is unlikely to slow a modern LP algorithm too much; we are still solving a relatively simple problem.  BUT, how does the LP algorithm know to increase k_{1} to its maximum threshold before applying k_{2}?  Do we need to add a number of conditional constraints to ensure this is done properly?

It turns out we don’t!  Because our squared deficit curve in Figure 2 is monotonic and convex, we know that slope of the linear segments making up the approximation are increasing (becoming less negative).  Thus, in a minimization problem, the marginal improvement in the objective is highest for the k_{1} segment, followed by the k_{2} segment, followed by the k_{3} segment, and so on.  In other words a < b < c.For this reason, the algorithm will increase k_{1} to its maximum threshold before assigning a non-zero value to k_{2}, and so forth.  No need for complicated constraints!

Now this is not always the case.  If the function were not monotonic, or if it were convex for a maximization, or concave for a minimization, this would not work.  But, this trick works for a surprising number of applications in water resources systems analysis!

If nothing else this simple example serves as a reminder that a little bit of thought in formulating problems can save a lot of time later!

Customizing color matrices in matplotlib

In this post I intend to pass on some tricks for matplotlib color matrix customization.  I am guilty of beautifying some of my color matrices with Adobe Illustrator in the past, re-arranging labels, titles, colormaps, etc.  However, this time I had to generate way too many of them, and I could see the beautifying process becoming extremely painful.  I will simply demonstrate how to generate the following three plots simultaneously with relatively few lines of code, in the hopes of providing useful elements for your own plot customization.

[Plot 1]

[Plot 2]

[Plot 3]

Plots 1-3 were generated with the following script, which I will explain in detail later in this post:

import glob
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

#Listing your files
files = glob.glob('./attainment_matrices/*.out')

#Organizing your files
data_plot1=[np.genfromtxt(f) for f in files[8:12]]
data_plot2=[np.genfromtxt(f) for f in files[0:4]]
data_plot3=[np.genfromtxt(f) for f in files[16:20]]
data=[data_plot1,data_plot2,data_plot3]

#Organizing titles and labels
plot_titles=['Plot 1','Plot 2', 'Plot 3']
subplot_titles= ['Subplot 1','Subplot 2', 'Subplot 3','Subplot 4']
labels= ['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
y_labels= ['Y Title a$\longrightarrow$','Y Title b $\longrightarrow$','Y Title c $\longrightarrow$']
cmap_labels=['Colormap label a$\longrightarrow$', 'Colormap label b$\longrightarrow$', 'Colormap label c$\longrightarrow$']

# Some variables to adjust subplots if necessary
left = 0.125 # the left side of the subplots of the figure
right = 0.9 # the right side of the subplots of the figure
bottom = 0.3 # the bottom of the subplots of the figure
top = 0.82 # the top of the subplots of the figure
wspace = 0.2 # the amount of width reserved for blank space between subplots
hspace = 0.5 # the amount of height reserved for white space between subplots

#Font sizes
plot_fontsize=40
subplot_fontsize=32
tick_label_fontsize=22 # Ticks, colormap, x and y labels use this fontsize

#x-label adjustments
rotation= 45 # rotation of labels
adjust=0 #if you want the x labels to be displayed right at the middle then adjust=0.5

x=np.arange(0,5.5)
y=np.linspace(0,100,1001)

#colormaps
colormap=['Set3_r', 'YlGnBu','Paired']

# the j is the iteration variable for each subplot, and the l is the iteration variable
# for each plot.
for l in range(len(plot_titles)):
    fig, ax=plt.subplots(1,len(subplot_titles),sharey=True)
    plt.subplots_adjust(left=left, bottom=bottom, right=right, top=top, wspace=wspace, hspace=hspace)
    #setting the titles wrapped by a transparent grey box at position=(x,y)
    fig.suptitle(plot_titles[l], fontsize=plot_fontsize,
                 bbox={'facecolor':'grey', 'alpha':0.1, 'pad':12}, position=(0.1827, .95))

    for j in range(len(subplot_titles)):
        a= ax[j].pcolor(x,y,data[l][j], cmap=colormap[l])
        ax[j].set_title(subplot_titles[j], fontsize= subplot_fontsize, y=1.03)
        #Set the y-label only in the first subplot
        ax[0].set_ylabel(y_labels[l], fontsize=tick_label_fontsize)
        ax[j].set_xticks(x + adjust, minor=False)
        #ax[j].set_xlim(left=0, right=5)
        #ax[j].set_ylim(0,100)
        ax[j].set_xticklabels(labels[:], rotation=rotation)
        ax[j].tick_params(labelsize=tick_label_fontsize)

    #colorbar settings:
    leftc= 0.12504
    bottomc=.13
    width_c=.775
    height_c=0.04
    cbar_ax= fig.add_axes([leftc,bottomc,width_c,height_c])
    #cbar= fig.colorbar(a, cax=cbar_ax, orientation='horizontal')
    cbar = fig.colorbar(a,cax=cbar_ax, ticks=[0, 0.5, 1], orientation='horizontal')
    cbar.ax.set_xticklabels(['Low', 'Medium', 'High'])
    cbar.set_label(cmap_labels[l], fontsize=tick_label_fontsize, labelpad=25)
    cbar.ax.tick_params(labelsize=tick_label_fontsize)

plt.show()

First, in lines 1 through 4 I specify the required libraries.  I use glob.glob to list the files for the analysis with their full path in line 8.  Then, if you want to see the order in which the files are listed, you can simply run the print command as follows:

print files[:]

And you should be able to see the order of the files like so:

[‘./data_directory/file1.out’, ‘./data_directory/file10.out’, … ‘./data_directory/file24.out’]

I used the numpy genfromtxt function in lines 11-13 to load the data from the specified files while organizing the data that would be used in plot 1, plot 2 and plot 3.   I then made an array of the previous data on line 14 so I could use it in a loop later on.

I organized the titles of the main plots, the subplots, the x and y labels, as well as the colormap labels in lines 17-21.  All the parameters required to adjust the aspect ratio of the subplots are listed in lines 24 to 29.  If you simply want all of your subplots to be square, you can request an equal aspect ratio when the subplots are created, as sketched below.
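For example (a slight variation on the script above, which creates its axes with plt.subplots), the equal aspect ratio can be requested through the subplot_kw argument:

import matplotlib.pyplot as plt

# Square axes for every subplot in a 1x4 grid (sketch only; plug in your own data)
fig, ax = plt.subplots(1, 4, sharey=True, subplot_kw={'aspect': 'equal'})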

The font sizes for the plots, subplots, ticks, and labels are specified in lines 32 to 34.  The x-labels can be adjusted in multiple ways.  In line 27 I set the rotation of the x-labels to 45 degrees.  If you want the labels to be completely vertical, then you would use rotation=90.  If you want horizontal labels, you don't need to specify a rotation parameter.  Then I used the adjust variable to specify the position of the x-label: adjust=0 specifies that the label will be written starting at the left corner of the bar; if you want the label to be centered, you can use adjust=0.5.

In line 44 I list the different colormaps to be used by each plot.  The outer loop in line 48 iterates through the 3 plots, while the inner loop in line 55 iterates through the 4 subplots generated in each plot.  In line 49 we specify the number of rows and columns of subplots that will be generated.  I want them to share the y axis, hence sharey=True.  If you want your subplots to also share the x axis, you would simply add sharex=True in line 49.  The plt.subplots_adjust function in line 50 allows you to specify the exact aspect ratio of your subplots, including the white space between them and their location in the figure canvas; these parameters are detailed in lines 24 to 29.  In line 52 I specify the title of the plot as a whole; since I have three different plots, I loop through the different titles.  The title is shown in a grey transparent box at the upper left corner of the canvas, at the location specified by position=(x,y).

Lines 56 to 64 show the subplots' code.  I use the pcolor function to generate the color matrices; however, there are other methods to create them, such as pcolormesh, imshow, contour, etc.  In line 57 I loop through the subplot titles and assign their font size.  Here, y=1.03 specifies the distance from the subplot title to the plot; the more distance you want to create, the larger this y value should be.  In line 59 I set the y-label.  Since I only want the y-label to be shown in the leftmost plot, I fix ax[0].set_ylabel(…); if you want each subplot to have its own y-label, you can loop through each of them with the subplot iteration variable j, as in ax[j].set_ylabel(…).  Lines 61 to 62 (commented out) show how you could set the x and y axis limits.  In line 63 I set the x tick labels; similarly, you could set the y tick labels if necessary.  The font size across all the ticks is set in line 64.

The colorbar settings are shown in lines 67 through 76.  Observe how you can specify the position of the bottom left corner of the colorbar, and from there assign its width and height.  Note that there are a couple of ways to specify the colorbar: the first one, shown in line 72, will generate a colorbar with the default ticks.  However, if you want to customize or add text to your colorbar, you would do so as shown in lines 73-74.  The ticks parameter in line 73 specifies the positions where the labels written in line 74 are displayed.  You can set the colorbar label with .set_label.  I loop through the colormap labels for each plot and assign their font size in line 75.  The labelpad allows you to specify the distance between the colorbar and its label.  Finally, the font size of the colormap ticks is specified in line 76.

I hope you can find some of the previous elements useful when designing your own color matrices ;).

A simple command for plotting autocorrelation functions in Matlab

Autocorrelation is a measure of persistence within a data set, which can be defined as the tendency for successive data points to be similar (Wilks, 2011).  In atmospheric science, temporal autocorrelation can be a helpful tool for model evaluation.  Temporal autocorrelation is also a fundamental concept for synthetic weather generation (for more detail see Julie's fantastic series of blog posts on synthetic weather generation).  Calculating autocorrelation within a sample data set can also be helpful for assessing the applicability of classical statistical methods that require independence of data points within a sample.  Should a data set prove to be strongly persistent, such methods will likely yield inaccurate results.

Autocorrelation is commonly computed by making a copy of the original data set, shifting the copy k points forward (where k is the lag over which you would like to compute the autocorrelation)  and computing the Pearson correlation coefficient between the original data set and the copy.

r_{k} = \frac{\sum_{t=1}^{n-k}(x_{t}-\bar{x}_{-})(x_{t+k}-\bar{x}_{+})}{\sqrt{\sum_{t=1}^{n-k}(x_{t}-\bar{x}_{-})^{2}}\sqrt{\sum_{t=1}^{n-k}(x_{t+k}-\bar{x}_{+})^{2}}}

Where:

r_{k} is the autocorrelation at lag k, x_{t} is the value of the data set at time t, n is the number of data points, \bar{x}_{-} is the mean of the first n-k values, and \bar{x}_{+} is the mean of the last n-k values.
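As a quick illustration of this copy-and-shift procedure, the lag-k autocorrelation can also be computed by hand in a few lines of Matlab (the data vector below is just a placeholder):

% Manual lag-k autocorrelation: correlate the series with a shifted copy of itself
T = randn(100,1);          % placeholder data set
k = 3;                     % lag of interest
original = T(1:end-k);     % first n-k points
shifted  = T(1+k:end);     % copy shifted k points forward
C = corrcoef(original, shifted);
r_k = C(1,2)               % lag-k autocorrelation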

The calculation of autocorrelation for a number of different lags at once is known as the autocorrelation function. Plotting the autocorrelation function can be a helpful way to quickly assess the presence of persistence within a data set.

You can generate such plots in Matlab using the simple command shown below:

autocorr(T,k) 
% T is your data set and k is the number of lags you would like to compute

The command generates a plot of the autocorrelation function. Below are two examples: the first is the autocorrelation function of a set of observed temperature values in Des Moines, Iowa; the second is the autocorrelation function of the temperature values at the same location as modeled by the MM5I regional climate model:


Figure 1: Temporal autocorrelation function of temperature observations from Des Moines Iowa (temperatures reported at 3 hour intervals)

Figure 2: Temporal autocorrelation function of temperature produced by the MM5I regional climate model for Des Moines Iowa (temperatures reported at 3 hour intervals)

Note the cyclical nature of the autocorrelation functions; this is a reflection of the daily temperature cycle. The autocorrelation function of the maximum or minimum temperatures would show more constant persistence.

Sources:

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences. Burlington, MA: Academic Press.

Using the Python `cf` and `esgf-python-client` packages to interact with the ESGF data portal

Working on climate change impact assessment often requires us to download a number of outputs from General Circulation Models (GCMs) to depict plausible changes in projected future conditions, as well as to use them for conventional top-down scenario analysis. Usually such scenario data are publicly available for download and are stored in NetCDF format on data portals built on the Earth System Grid Federation (ESGF) architecture. Yet the procedure for downloading these data, which typically span a lengthy time horizon, can be very cumbersome and requires the following steps:

  • create a username and credentials on the ESGF data portal;
  • define the search entries to narrow down the results;
  • generate the *.wget file for batch downloading using the secured OpenDAP protocol;
  • run the *.wget file to download all the raw data onto the local disk.

For climate projection data, the downloaded files are usually stored in NetCDF format with irregular coordinate systems (e.g., a rotated pole), segmented into pieces covering short time periods, and available only at a large continental scale, which is more than is needed for a regional case study.

In this post, I would like to share some of my experience and tools for dealing with the ESGF data portal and making the process easier. The context encompasses the concepts of NetCDF and the Python language, both of which have already been well introduced on our Water Programming blog. I strongly suggest revisiting those posts in case of any doubt, also because I am not an expert on either NetCDF or Python.

The first tool is the cf-python package, which implements comprehensive tools for reading, writing, and processing NetCDF data. Personally, I found it extremely useful for reading/writing NetCDF data, regridding, and extracting a sub-domain from an original dataset. More importantly, given an OpenDAP URL, all of the aforementioned operations can be done on the server side without explicitly downloading the original data onto the local disk! So instead of downloading a 10 GB projection dataset for the whole of Europe, which takes an hour, you can download just the part specific to your study area within a couple of seconds, at a cost of a few MB.
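For instance, a minimal sketch of this server-side subsetting (the OpenDAP URL and index ranges here are placeholders, and, as in the longer script below, cf.read may return more than one field depending on the dataset):

import cf

# Open a remote dataset through its OpenDAP url -- no full download happens here
fields = cf.read('http://some-esgf-node/thredds/dodsC/some_dataset.nc')
tas = fields[0]  # pick the field of interest

# Extract only the grid cells covering the study area; only this subset is transferred
subset = tas.subspace[:, 40:45, 49:54]  # [time, lat, lon] indices are placeholders
cf.write(subset, 'study_area.nc')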

The second tool is named esgf-python-client. This package contains the API code for calling the ESGF Search API (whose documentation, by the way, has not been updated for quite a while). With the combined use of cf-python and this client, you are saved from the user/password login and from dealing with the tedious web interface of the ESGF search page and its downloads.

Below I explain things in more detail using an example that downloads European regional downscaling data for the RCP4.5 scenario from one of the ESGF data portals. The example is essentially a few lines of code with inline comments (note that some indentation may not display correctly in the post, so just copy, paste, and reformat the code in your editor).

from pyesgf.logon import LogonManager
from pyesgf.search import SearchConnection
import cf
import numpy as np

def main():
    ## ================== To be modified by user ================================================
    # Define the search key words for facet fields in the ESGF node.
    # Note: the supported facet fields can be identified using:
    # " http://<base_search_URL>/search?facets=*&distrib=false&limit=0 ", where
    # <base_search_URL> is the domain of the esgf-node, e.g., https://pcmdi.llnl.gov/esg-search

    esgf_node='https://esgf-node.ipsl.upmc.fr/esg-search' # you may try different data portals, as many of them are frequently under maintenance longer than their actual service time

    Project='CORDEX'
    Time_frequency='day' # 'month', '6h', '3h'
    Experiment='rcp45' # 'rcp25', 'rcp85', 'history'
    Variable='tas' # 'pr' for precipitation
    Ensemble='r1i1p1' # do not change
    Domain='EUR-44' # EUR-44i, EUR-22, EUR-22i, EUR-11i

    # Define the index of the domain for the study region. Say the original domain is 100 (lon) by 60 (lat)
    # grids, and your study area falls between 50 - 54 in longitude and 41 to 45 in latitude.
    idx_lon = np.array( range(49,54) ) # 49 instead of 50 as numpy arrays use 0-based indexing
    idx_lat = np.array( range(40,45) )

    ## =========================================================================================

    # Connect to the ESGF data portal using your openid and password. May need a single run of a
    # *.wget file first to retrieve personal credentials
    lm = LogonManager()
    lm.logon_with_openid(your_openID, your_password)

    conn = SearchConnection(esgf_node, distrib=True)

    # Make a query based on the facet fields.
    ctx = conn.new_context( project=Project,
                            time_frequency=Time_frequency,
                            experiment=Experiment,
                            variable=Variable,
                            ensemble=Ensemble,
                            domain=Domain )

    # ctx.hit_count # show total number of results found

    # loop over the hit_count results
    for each_hit in range(0, ctx.hit_count):

        ds = ctx.search()[each_hit]
        files = ds.file_context().search() # use "len(files)" to see total files found

        url_list_sorted = sorted( [i.opendap_url for i in files] ) # url list sorted to get ascending timestamps, e.g., from 1/1/2005 to 12/31/2099
        filename_out = i.filename[0:-21] # get the output file name; the [0:-21] takes a subset of the filename string
        # note: 'i' here is the comprehension variable left over from the line above (Python 2 behaviour)

        counter = 0

        data_list = []
        for each_url in url_list_sorted:
            counter += 1
            meta_data = cf.read(each_url) # read the meta-data information from the OpenDAP NetCDF url
            if len(meta_data) > 1: # in some cases the returned meta_data contains two fields, i.e., len(meta_data)>1, and only the second field contains the variable of interest
                data_list.append(meta_data[1])
            else:
                data_list.append(meta_data[0])

        # Aggregating the time
        dataset_agg = cf.aggregate(data_list) # this will concatenate all files with small time periods into a single file with the complete time series

        # Slicing the dataset
        if len(dataset_agg.shape) == 4:
            dataset_sliced = dataset_agg.subspace[0,:,idx_lat,idx_lon] # for 4-D, e.g., [height, time, lat, lon], the .subspace[] indexing is used to extract the subset of the data
        elif len(dataset_agg.shape) == 3:
            dataset_sliced = dataset_agg.subspace[:,idx_lat,idx_lon] # for 3-D, e.g., [time, lat, lon], the .subspace[] indexing is used to extract the subset of the data

        # Save data in the desired format (e.g., using savemat from the 'scipy' package)
        cf.write(dataset_sliced, filename_out+'.nc', mode='w')

    lm.logoff() # log off

if __name__ == "__main__":
    main()

Sorry for the lengthy and poorly organized post, but I hope you will find it useful. For any doubts or suggestions, please feel free to contact me (likymice@gmail.com).

Using HDF5/zlib Compression in NetCDF4

Not too long ago, I posted an entry on writing NetCDF files in C and loading them in R.  In that post, I mentioned that the latest and greatest version of NetCDF includes HDF5/zlib compression, but I didn’t say much more beyond that.  In this post, I’ll explain briefly how to use this compression feature in your NetCDF4 files.

Disclaimer: I’m not an expert in any sense on the details of compression algorithms.  For more details on how HDF5/zlib compression is integrated into NetCDF, check out the NetCDF Documentation.  Also, I’ll be assuming that the NetCDF4 library was compiled on your machine to enable HDF5/zlib compression.  Details on building and installing NetCDF from source code can be found in the documentation too.

I will be using code similar to what was in my previous post.  The code generates three variables (x, y, z) each with 3 dimensions.  I’ve increased the size of the dimensions by an order of magnitude to better accentuate the compression capabilities.

  // Loop control variables
  int i, j, k;
  
  // Define the dimension sizes for
  // the example data.
  int dim1_size = 100;
  int dim2_size = 50;
  int dim3_size = 200;
  
  // Define the number of dimensions
  int ndims = 3;
  
  // Allocate the 3D vectors of example data
  float x[dim1_size][dim2_size][dim3_size]; 
  float y[dim1_size][dim2_size][dim3_size];
  float z[dim1_size][dim2_size][dim3_size];
  
  // Generate some example data
  for(i = 0; i < dim1_size; i++) {
        for(j = 0; j < dim2_size; j++) {
                for(k = 0; k < dim3_size; k++) {
                        x[i][j][k] = (i+j+k) * 0.2;
                        y[i][j][k] = (i+j+k) * 1.7;
                        z[i][j][k] = (i+j+k) * 2.4;
                }
        }
  }

Next is to set up the various IDs, create the NetCDF file, and apply the dimensions to it.  This has not changed since the last post.

  // Allocate space for netCDF dimension ids
  int dim1id, dim2id, dim3id;
  
  // Allocate space for the netcdf file id
  int ncid;
  
  // Allocate space for the data variable ids
  int xid, yid, zid;
  
  // Setup the netcdf file
  int retval;
  if((retval = nc_create(ncfile, NC_NETCDF4, &ncid))) { ncError(retval); }
  
  // Define the dimensions in the netcdf file
  if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }
  
  // Gather the dimids into an array for defining variables in the netcdf file
  int dimids[ndims];
  dimids[0] = dim1id;
  dimids[1] = dim2id;
  dimids[2] = dim3id;

Here’s where the magic happens.  The next step is to define the variables in the NetCDF file.  The variables must be defined in the file before you tag it for compression.

  // Define the netcdf variables
  if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

Now that we’ve defined the variables in the NetCDF file, let’s tag them for compression.

  // OPTIONAL: Compress the variables
  int shuffle = 1;
  int deflate = 1;
  int deflate_level = 4;
  if((retval = nc_def_var_deflate(ncid, xid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, yid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, zid, shuffle, deflate, deflate_level))) { ncError(retval); }

The function nc_def_var_deflate() performs this.  It takes the following parameters:

  • int ncid – The NetCDF file ID returned from the nc_create() function
  • int varid – The variable ID associated with the variable you would like to compress.  This is returned from the nc_def_var() function
  • int shuffle – Enables the shuffle filter before compression.  Any non-zero integer enables the filter.  Zero disables the filter.  The shuffle filter rearranges the byte order in the data stream to enable more efficient compression. See this performance evaluation from the HDF group on integrating a shuffle filter into the HDF5 algorithm.
  • int deflate – Enable compression at the compression level indicated in the deflate_level parameter.  Any non-zero integer enables compression.
  • int deflate_level – The level to which the data should be compressed.  Levels are integers in the range [0-9].  Zero results in no compression whereas nine results in maximum compression.

The rest of the code doesn’t change from the previous post.

  // OPTIONAL: Give these variables units
  if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }
  
  // End "Metadata" mode
  if((retval = nc_enddef(ncid))) { ncError(retval); }
  
  // Write the data to the file
  if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }
  
  // Close the netcdf file
  if((retval = nc_close(ncid))) { ncError(retval); }

So the question now is whether or not it’s worth compressing your data.  I performed a simple experiment with the code presented here and the resulting NetCDF files:

  1. Generate the example NetCDF file from the code above using each of the available compression levels.
  2. Time how long the code takes to generate the file.
  3. Note the final file size of the NetCDF.
  4. Time how long it takes to load and extract data from the compressed NetCDF file.

Below is a figure illustrating the results of the experiment (points 1-3).

compress_plot

Before I say anything about these results, note that individual results may vary.  I used a highly stylized data set to produce the NetCDF file, which likely benefits greatly from the shuffle filtering and compression.  These results show a reduction in file size of 97% – 99% relative to the uncompressed original.  While the run time did increase, it barely made a difference until hitting the highest compression levels (8, 9).  As for point 4, there was only a small difference in load/read times (0.2 seconds) between the uncompressed file and any of the compressed files (using ncdump and the ncdf4 package in R).  There's no noticeable difference among the load/read times for any of the compressed NetCDF files.  Again, this could be a result of the highly stylized data set used as an example in this post.

For something more practical, I can only offer anecdotal evidence about the compression performance.  I recently included compression in my current project due to the large possible number of multiobjective solutions and states-of-the-world (SOW).  The uncompressed file my code produced was on the order of 17.5 GB (for 300 time steps, 1000 SOW, and about 3000 solutions).  I enabled compression of all variables (11 variables – 5 with three dimensions and 6 with two dimensions – compression level 4).  The next run produced just over 7000 solutions, but the compressed file size was 9.3 GB.  The down side is that it took nearly 45 minutes to produce the compressed file, as opposed to 10 minutes with the previous run.  There are many things that can factor into these differences that I did not control for, but the results are promising…if you’ve got the computer time.

I hope you found this post useful in some fashion.  I’ve been told that compression performance can be increased if you also “chunk” your data properly.  I’m not too familiar with chunking data for writing in NetCDF files…perhaps someone more clever than I can write about this?
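In the meantime, for anyone who wants to experiment, below is a minimal sketch of what setting explicit chunk sizes might look like with nc_def_var_chunking() (the chunk sizes are arbitrary guesses rather than tuned recommendations, and I have not tested how they interact with the compression results above; like nc_def_var_deflate(), the call belongs in define mode after nc_def_var()):

  // OPTIONAL (untested sketch): set explicit chunk sizes for the x variable
  size_t chunks[3];
  chunks[0] = 10;   // chunk extent along dim1 (arbitrary for this example)
  chunks[1] = 50;   // chunk extent along dim2
  chunks[2] = 200;  // chunk extent along dim3
  if((retval = nc_def_var_chunking(ncid, xid, NC_CHUNKED, chunks))) { ncError(retval); }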

Acknowledgement:  I would like to acknowledge Jared Oyler for his insight and helpful advice on some of the more intricate aspects of the NetCDF library.