# Using HDF5/zlib compression in NetCDF4, part 2: testing the compression settings

There has been a previous post, courtesy of Greg Garner, on why HDF5/zlib compression matters for NetCDF4. That post featured a plot that showed how much you could compress your data when increasing the compression level. But the fine print also acknowledged that this data was for a pretty idealized dataset. So how much should you compress your data in a real-world application? How can you test what your trade-off really is between compression and computing time?

Follow this 4-step process to find out!

I’ll be illustrating this post using my own experience with the Water Balance Model (WBM), a model developed at the University of New Hampshire and that has served for several high-profile papers over the years (including Nature and Science). This is the first time that this model, written in Perl, is being ported to another research group, with the goal of exploring its behavior when running large ensembles of inputs (which I am starting to do! Exciting, but a story for another post).

# Step 1. Read the manual

There is a lot of different software for creating NetCDF data. Depending on the situation, you may have a say on which to use, or be already using the tool that comes with the software suite you are working with. Of course, in the latter case, you can always change the tools. But reasonable a first step before that is to test them. Ergo, look up the documentation for the software you are using, to see how you can control compression on them.

Here, WBM uses the PDL::NetCDF Perl library, which has useful functions for adding data to a NetCDF file after every time step the model runs. Contrary to Greg’s post that uses C and where there are two flags (“shuffle” and “deflate”) and a compression level parameter (“deflate_level”), for PDL::NetCDF there are only two parameters. The SHUFFLE flag is the equivalent in Perl of the “shuffle” flag in C. The DEFLATE Perl parameter ihas integer values from 0 to 9, with a value 0 being equivalent to the C-flag “deflate” being turned off, and any value from 1 to 9 being equivalent to the “deflate”C-flag being on, the value of DEFLATE being then equivalent to the value of the “deflate_level” parameter in Greg’s post. Therefore, the DEFLATE variable from the PDL::NetCDF library in Perl lumps together the parameters “deflate” and “deflate_level” used in C.

I then located the DEFLATE and SHUFFLE variables within the auxiliary functions of the WBM. In the function write_nc, the following two lines of codes are key:

 my \$deflate = set_default($$options{DEFLATE},1); # NetCDF4 deflate (compression) parameter</pre> my shuffle = set_default($$options{SHUFFLE},0); # NetCDF4 shuffle parameter 

# Step 2. Set up a test protocol

This builds on Greg’s idea of recording time and resulting file size for all compression level. Here we are interested in these quantities for full-scale model runs, and not just for the generation of a single NetCDF dataset.

In this case therefore, we want to contrast the default setting above with stronger compression settings, for ensemble runs of WBM on the Cube (the local HPC cluster). For a better comparison, let us place ourselves in the conditions in which ensemble runs will be made. Runs will use all 16 cores of a Cube node, therefore for each compression setting, this experiment runs 16 instances of the WBM on a single node. Each of the 16 instances runs on a single core. All WBM runs are identical so the only differences between run times and result file size come from compression settings.

Compression settings for (SHUFFLE,DEFLATE) are (0,1) by default, and we compare that with all settings from (1,1) to (1,9).

# Step 3. Run experiment, get results

Here are the results from this experiment. Results consider 47 output fields for WBM runs with a daily time-step for 8 years (2009-2016), plus 5 years of warmup (this is pretty common for hydrological models). All this in a spatial mesh of 148,500 grid cells. A folder containing binaries for a single input variable, for this time span and spatial coverage, has a size of 3.1GB. Therefore, the expected size for 47 variables in binary format is 146Go. Let us compare with our results:

As one can see the presence of the shuffle flag or the value of the deflate parameter have little influence on the size of the results files. Compressed results are 3 to 4 time smaller than binaries, which highlights the interest of compressing, but also means we do not have the order(s) of magnitude differences reported by Greg’s blog post. This is mainly because the binary format used for WBM inputs is much more efficient than the uncompressed ASCII that Greg used in his experiment. For a deflate parameter of 9, there is an apparent problem within the PDL library, and no output (note that a single-core run with shuffle=0 and deflate=9 did not lead to a similar problem).

# Step 4. Conclude on compression parameters

Here the epxerimental setup has shown that carefully selecting the output fields will save more space than fine-tuning NetCDF compression parameters. For instance, some of the 47 output fields above are fully redundant with others. Others are residual fields, and the only interest of looking them up is to verify that a major development within the WBM code did not mess up with the overall water balance.

More generally, the effects of compression are situation-specific and are not as great when there is no obvious regularity in the data (as is often the case with outputs from large models), or when the binary format used is already much better than ASCII. This said, NetCDF still occupies much less space than binaries, and is much easier to handle: WBM outputs are contained in one file per year (8 files total) with very useful metadata info…

# A visual introduction to data compression through Principle Component Analysis

Principle Component Analysis (PCA) is a powerful tool that can be used to create parsimonious representations of a multivariate data set. In this post I’ll code up an example from Dan Wilks’ book Statistical Methods in the Atmospheric Sciences to visually illustrate the PCA process. All code can be found at the bottom of this post.

As with many of the examples in Dr. Wilks’ excellent textbook, we’ll be looking at minimum temperature data from Ithaca and Canandaigua, New York  (if anyone is interested, here is the distance between the two towns).  Figure 1 is a scatter plot of the minimum temperature anomalies at each location for the month of January 1987.

Figure 1: Minimum temperature anomalies in Ithaca and Canandaigua, New York in January 1987

As you can observe from Figure 1, the two data sets are highly correlated, in fact, they have a Pearson correlation coefficient of 0.924. Through PCA, we can identify the primary mode of variability within this data set (its largest “principle component”) and use it to create a single variable which describes the majority of variation in both locations. Let x define the matrix of our minimum temperature anomalies in both locations. The eigenvectors (E) of the covariance matrix of x describe the primary modes variability within the data set. PCA uses these eigenvectors to  create a new matrix, u,  whose columns contain the principle components of the variability in x.

$u = xE$

Each element in u is a linear combination of the original data, with eigenvectors in E serving as a kind of weighting for each data point. The first column of u corresponds to the eigenvector associated with the largest eigenvalue of the covariance matrix. Each successive column of u represents a different level of variability within the data set, with u1 describing the direction of highest variability, u2 describing the direction of the second highest variability and so on and so forth. The transformation resulting from PCA can be visualized as a rotation of the coordinate system (or change of basis) for the data set, this rotation is shown in Figure 2.

Figure 2: Geometric interpretation of PCA

As can be observed in Figure 2, each data point can now be described by its location along the newly rotated axes which correspond to its corresponding value in the newly created matrix u. The point (16, 17.8), highlighted in Figure 2, can now be described as (23, 6.6) meaning that it is 23 units away from the origin in the direction of highest variability and 6.6 in the direction of second highest variability. As shown in Figure 2, the question of “how different from the mean” each data point is can mostly be answered by looking at its  corresponding u1 value.

Once transformed, the original data can be recovered through a process known as synthesis. Synthesis  can be thought of as PCA in reverse. The elements in the original data set x, can be approximated using the eigenvalues of the covariance matrix and the first principle component, u1.

$\tilde{x} = \tilde{u}\tilde{E}^T$

Where:

$\tilde{x}\hspace{.1cm} is\hspace{.1cm} the\hspace{.1cm} reconstructed\hspace{.1cm} data\hspace{.1cm} set$

$\tilde{u}\hspace{.1cm} is\hspace{.1cm} the\hspace{.1cm} PCs\hspace{.1cm} used \hspace{.1cm} for \hspace{.1cm} reconstruction\hspace{.1cm} (in\hspace{.1cm} our\hspace{.1cm} case\hspace{.1cm} the\hspace{.1cm} first\hspace{.1cm} column)$

$\tilde{E}\hspace{.1cm} is \hspace{.1cm} the \hspace{.1cm} eigenvector\hspace{.1cm} of \hspace{.1cm} the \hspace{.1cm} PCs \hspace{.1cm} used$

For our data set, these reconstructions seem to work quite well, as can be observed in Figure 3.

Data compression through PCA can be a useful alternative tolerant methods for dealing with multicollinearity, which I discussed in my previous post. Rather than running a constrained regression, one can simply compress the data set to eliminate sources of multicollinearity. PCA can also be a helpful tool for identifying patterns within your data set or simply creating more parsimonious representations of a complex set of data. Matlab code used to create the above plots can be found below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Ithaca_Canandagua_PCA
% By: D. Gold
% Created: 3/20/17
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% This script will perform Principle Component analysis on minimum
% temperature data from Ithaca and Canadaigua in January, 1987 provided in
% Appendix A of Wilks (2011). It will then estimate minimum temperature
% values of both locations using the first principle component.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% create data sets
clear all

% data from appendix Wilks (2011) Appendix A.1
Ith = [19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0,...
2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23]';

Can = [28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23,...
13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23]';

%% center the data, plot temperature anomalies, calculate covariance and eigs

% center the data
x(:,1) = Ith - mean(Ith);
x(:,2) = Can - mean(Can);

% plot anomalies
figure
scatter(x(:,1),x(:,2),'Filled')
xlabel('Ithaca min temp anomaly ({\circ}F)')
ylabel('Canandagua min temp anomaly ({\circ}F)')

% calculate covariance matrix and it's corresponding Eigenvalues & vectors
S = cov(x(:,1),x(:,2));
[E, Lambda]=eigs(S);

% Identify maximum eigenvalue, it's column will be the first eigenvector
max_lambda = find(max(Lambda)); % index of maximum eigenvalue in Lambda
idx = max_lambda(2); % column of max eigenvalue

%% PCA
U = x*E(:,idx);

%% synthesis
x_syn = E(:,idx)*U'; % reconstructed values of x

% plot the reconstructed values against the original data
figure
subplot(2,1,1)
plot(1:31,x(:,1) ,1:31, x_syn(1,:),'--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Ithaca min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')
subplot(2,1,2)
plot(1:31, x(:,2),1:31, x_syn(2,:)','--')
xlim([1 31])
xlabel('Time (days)')
legend('Original', 'Reconstruction')

Sources:

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences. Amsterdam: Elsevier Academic Press.

# Using HDF5/zlib Compression in NetCDF4

Not too long ago, I posted an entry on writing NetCDF files in C and loading them in R.  In that post, I mentioned that the latest and greatest version of NetCDF includes HDF5/zlib compression, but I didn’t say much more beyond that.  In this post, I’ll explain briefly how to use this compression feature in your NetCDF4 files.

Disclaimer: I’m not an expert in any sense on the details of compression algorithms.  For more details on how HDF5/zlib compression is integrated into NetCDF, check out the NetCDF Documentation.  Also, I’ll be assuming that the NetCDF4 library was compiled on your machine to enable HDF5/zlib compression.  Details on building and installing NetCDF from source code can be found in the documentation too.

I will be using code similar to what was in my previous post.  The code generates three variables (x, y, z) each with 3 dimensions.  I’ve increased the size of the dimensions by an order of magnitude to better accentuate the compression capabilities.

  // Loop control variables
int i, j, k;

// Define the dimension sizes for
// the example data.
int dim1_size = 100;
int dim2_size = 50;
int dim3_size = 200;

// Define the number of dimensions
int ndims = 3;

// Allocate the 3D vectors of example data
float x[dim1_size][dim2_size][dim3_size];
float y[dim1_size][dim2_size][dim3_size];
float z[dim1_size][dim2_size][dim3_size];

// Generate some example data
for(i = 0; i < dim1_size; i++) {
for(j = 0; j < dim2_size; j++) {
for(k = 0; k < dim3_size; k++) {
x[i][j][k] = (i+j+k) * 0.2;
y[i][j][k] = (i+j+k) * 1.7;
z[i][j][k] = (i+j+k) * 2.4;
}
}
}

Next is to setup the various IDs, create the NetCDF file, and apply the dimensions to the NetCDF file.  This has not changed since the last post.

  // Allocate space for netCDF dimension ids
int dim1id, dim2id, dim3id;

// Allocate space for the netcdf file id
int ncid;

// Allocate space for the data variable ids
int xid, yid, zid;

// Setup the netcdf file
int retval;
if((retval = nc_create(ncfile, NC_NETCDF4, &ncid))) { ncError(retval); }

// Define the dimensions in the netcdf file
if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }

// Gather the dimids into an array for defining variables in the netcdf file
int dimids[ndims];
dimids[0] = dim1id;
dimids[1] = dim2id;
dimids[2] = dim3id;

Here’s where the magic happens.  The next step is to define the variables in the NetCDF file.  The variables must be defined in the file before you tag it for compression.

  // Define the netcdf variables
if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

Now that we’ve defined the variables in the NetCDF file, let’s tag them for compression.

  // OPTIONAL: Compress the variables
int shuffle = 1;
int deflate = 1;
int deflate_level = 4;
if((retval = nc_def_var_deflate(ncid, xid, shuffle, deflate, deflate_level))) { ncError(retval); }
if((retval = nc_def_var_deflate(ncid, yid, shuffle, deflate, deflate_level))) { ncError(retval); }
if((retval = nc_def_var_deflate(ncid, zid, shuffle, deflate, deflate_level))) { ncError(retval); }

The function nc_def_var_deflate() performs this.  It takes the following parameters:

• int ncid – The NetCDF file ID returned from the nc_create() function
• int varid – The variable ID associated with the variable you would like to compress.  This is returned from the nc_def_var() function
• int shuffle – Enables the shuffle filter before compression.  Any non-zero integer enables the filter.  Zero disables the filter.  The shuffle filter rearranges the byte order in the data stream to enable more efficient compression. See this performance evaluation from the HDF group on integrating a shuffle filter into the HDF5 algorithm.
• int deflate – Enable compression at the compression level indicated in the deflate_level parameter.  Any non-zero integer enables compression.
• int deflate_level – The level to which the data should be compressed.  Levels are integers in the range [0-9].  Zero results in no compression whereas nine results in maximum compression.

The rest of the code doesn’t change from the previous post.

  // OPTIONAL: Give these variables units
if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }

if((retval = nc_enddef(ncid))) { ncError(retval); }

// Write the data to the file
if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }

// Close the netcdf file
if((retval = nc_close(ncid))) { ncError(retval); }

So the question now is whether or not it’s worth compressing your data.  I performed a simple experiment with the code presented here and the resulting NetCDF files:

1. Generate the example NetCDF file from the code above using each of the available compression levels.
2. Time how long the code takes to generate the file.
3. Note the final file size of the NetCDF.
4. Time how long it takes to load and extract data from the compressed NetCDF file.

Below is a figure illustrating the results of the experiment (points 1-3).

Before I say anything about these results, note that individual results may vary.  I used a highly stylized data set to produce the NetCDF file which likely benefits greatly from the shuffle filtering and compression.  These results show a compression of 97% – 99% of the original file size.  While the run time did increase, it barely made a difference until hitting the highest compression levels (8,9).  As for point 4, there was only a small difference in load/read times (0.2 seconds) between the uncompressed and any of the compressed files (using ncdump and the ncdf4 package in R).  There’s no noticeable difference among the load/read times for any of the compressed NetCDF files.  Again, this could be a result of the highly stylized data set used as an example in this post.

For something more practical, I can only offer anecdotal evidence about the compression performance.  I recently included compression in my current project due to the large possible number of multiobjective solutions and states-of-the-world (SOW).  The uncompressed file my code produced was on the order of 17.5 GB (for 300 time steps, 1000 SOW, and about 3000 solutions).  I enabled compression of all variables (11 variables – 5 with three dimensions and 6 with two dimensions – compression level 4).  The next run produced just over 7000 solutions, but the compressed file size was 9.3 GB.  The down side is that it took nearly 45 minutes to produce the compressed file, as opposed to 10 minutes with the previous run.  There are many things that can factor into these differences that I did not control for, but the results are promising…if you’ve got the computer time.

I hope you found this post useful in some fashion.  I’ve been told that compression performance can be increased if you also “chunk” your data properly.  I’m not too familiar with chunking data for writing in NetCDF files…perhaps someone more clever than I can write about this?

Acknowledgement:  I would like to acknowledge Jared Oyler for his insight and helpful advice on some of the more intricate aspects of the NetCDF library.