Using HDF5/zlib compression in NetCDF4, part 2: testing the compression settings

There is a previous post, courtesy of Greg Garner, on why HDF5/zlib compression matters for NetCDF4. That post featured a plot showing how much you can compress your data as you increase the compression level. But the fine print also acknowledged that the data used was a pretty idealized dataset. So how much should you compress your data in a real-world application? How can you test the real trade-off between compression and computing time?

Follow this 4-step process to find out!

I’ll be illustrating this post using my own experience with the Water Balance Model (WBM), a model developed at the University of New Hampshire that has served as the basis for several high-profile papers over the years (including in Nature and Science). This is the first time that this model, written in Perl, is being ported to another research group, with the goal of exploring its behavior when running large ensembles of inputs (which I am starting to do! Exciting, but a story for another post).

Step 1. Read the manual

There are many different software tools for creating NetCDF data. Depending on the situation, you may have a say in which one to use, or you may already be using the tool that comes with the software suite you are working with. Of course, in the latter case, you can always change tools, but a reasonable first step before that is to test the one you have. So look up the documentation for the software you are using to see how you can control compression with it.

Here, WBM uses the PDL::NetCDF Perl library, which has useful functions for adding data to a NetCDF file after every time step of a model run. Unlike the C interface used in Greg’s post, which has two flags (“shuffle” and “deflate”) plus a compression level parameter (“deflate_level”), PDL::NetCDF exposes only two parameters. The SHUFFLE flag is the Perl equivalent of the “shuffle” flag in C. The DEFLATE parameter takes integer values from 0 to 9: a value of 0 is equivalent to turning the C “deflate” flag off, while any value from 1 to 9 is equivalent to turning “deflate” on with “deflate_level” set to that value. In other words, the DEFLATE parameter in PDL::NetCDF lumps together the “deflate” and “deflate_level” parameters used in C.

I then located the DEFLATE and SHUFFLE variables within the auxiliary functions of the WBM. In the function write_nc, the following two lines of code are key:

  my $deflate = set_default($$options{DEFLATE},1); # NetCDF4 deflate (compression) parameter
  my $shuffle = set_default($$options{SHUFFLE},0); # NetCDF4 shuffle parameter
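
For reference, here is a minimal sketch of what these defaults (DEFLATE=1, SHUFFLE=0) correspond to in the NetCDF C interface that Greg’s post works with. The file, dimension, and variable names are made up for illustration, and error handling is omitted.

  // Minimal sketch: WBM's default PDL::NetCDF settings expressed
  // through the NetCDF C API (names below are hypothetical).
  #include <netcdf.h>

  int main(void) {
    int ncid, dimid, varid;

    // Create a NetCDF4 file
    nc_create("wbm_like.nc", NC_NETCDF4, &ncid);

    // One dimension and one variable, just for illustration
    nc_def_dim(ncid, "cell", 148500, &dimid);
    nc_def_var(ncid, "runoff", NC_FLOAT, 1, &dimid, &varid);

    // PDL::NetCDF's SHUFFLE=0, DEFLATE=1 maps to
    // shuffle=0, deflate=1 (on), deflate_level=1 in C
    nc_def_var_deflate(ncid, varid, 0, 1, 1);

    nc_enddef(ncid);
    nc_close(ncid);
    return 0;
  }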

Step 2. Set up a test protocol

This builds on Greg’s idea of recording the run time and resulting file size for every compression level. Here, though, we are interested in these quantities for full-scale model runs, not just for the generation of a single NetCDF dataset.

In this case, therefore, we want to contrast the default settings above with stronger compression settings, for ensemble runs of WBM on the Cube (the local HPC cluster). For a fair comparison, let us place ourselves in the conditions in which the ensemble runs will actually be made. Runs will use all 16 cores of a Cube node, so for each compression setting, the experiment runs 16 instances of the WBM on a single node, each instance on a single core. All WBM runs are identical, so the only differences in run times and result file sizes come from the compression settings.

The default compression settings for (SHUFFLE, DEFLATE) are (0,1); we compare them with all settings from (1,1) to (1,9).

Step 3. Run experiment, get results

Here are the results from this experiment. They cover 47 output fields for WBM runs with a daily time step over 8 years (2009-2016), plus 5 years of warm-up (pretty common for hydrological models), on a spatial mesh of 148,500 grid cells. A folder containing binaries for a single input variable, for this time span and spatial coverage, has a size of 3.1 GB, so the expected size for 47 variables in binary format is about 146 GB. Let us compare that with our results:

[Figure: experiment results — result file sizes and run times for each compression setting]

As one can see, neither the presence of the shuffle flag nor the value of the deflate parameter has much influence on the size of the result files. Compressed results are 3 to 4 times smaller than binaries, which shows that compression is worthwhile, but also means we do not see the order(s)-of-magnitude differences reported in Greg’s post. This is mainly because the binary format used for WBM inputs is much more efficient than the uncompressed ASCII that Greg used in his experiment. For a deflate parameter of 9, there is an apparent problem within the PDL library and no output is produced (note that a single-core run with shuffle=0 and deflate=9 did not lead to a similar problem).

Step 4. Conclude on compression parameters

Here the experimental setup has shown that carefully selecting the output fields will save more space than fine-tuning the NetCDF compression parameters. For instance, some of the 47 output fields above are fully redundant with others. Others are residual fields, and the only reason to look them up is to verify that a major development in the WBM code did not upset the overall water balance.

More generally, the effects of compression are situation-specific, and are not as great when there is no obvious regularity in the data (as is often the case with outputs from large models), or when the binary format used is already much more compact than ASCII. That said, NetCDF still occupies much less space than binaries, and is much easier to handle: WBM outputs are contained in one file per year (8 files total) with very useful metadata…


Using HDF5/zlib Compression in NetCDF4

Not too long ago, I posted an entry on writing NetCDF files in C and loading them in R.  In that post, I mentioned that the latest and greatest version of NetCDF includes HDF5/zlib compression, but I didn’t say much more beyond that.  In this post, I’ll explain briefly how to use this compression feature in your NetCDF4 files.

Disclaimer: I’m not an expert in any sense on the details of compression algorithms.  For more details on how HDF5/zlib compression is integrated into NetCDF, check out the NetCDF Documentation.  Also, I’ll be assuming that the NetCDF4 library was compiled on your machine to enable HDF5/zlib compression.  Details on building and installing NetCDF from source code can be found in the documentation too.

I will be using code similar to what was in my previous post.  The code generates three variables (x, y, z) each with 3 dimensions.  I’ve increased the size of the dimensions by an order of magnitude to better accentuate the compression capabilities.

  // Loop control variables
  int i, j, k;
  
  // Define the dimension sizes for
  // the example data.
  int dim1_size = 100;
  int dim2_size = 50;
  int dim3_size = 200;
  
  // Define the number of dimensions
  int ndims = 3;
  
  // Allocate the 3D vectors of example data
  float x[dim1_size][dim2_size][dim3_size]; 
  float y[dim1_size][dim2_size][dim3_size];
  float z[dim1_size][dim2_size][dim3_size];
  
  // Generate some example data
  for(i = 0; i < dim1_size; i++) {
        for(j = 0; j < dim2_size; j++) {
                for(k = 0; k < dim3_size; k++) {
                        x[i][j][k] = (i+j+k) * 0.2;
                        y[i][j][k] = (i+j+k) * 1.7;
                        z[i][j][k] = (i+j+k) * 2.4;
                }
        }
  }

The next step is to set up the various IDs, create the NetCDF file, and apply the dimensions to it.  This has not changed since the last post.

  // Allocate space for netCDF dimension ids
  int dim1id, dim2id, dim3id;
  
  // Allocate space for the netcdf file id
  int ncid;
  
  // Allocate space for the data variable ids
  int xid, yid, zid;
  
  // Setup the netcdf file
  int retval;
  if((retval = nc_create(ncfile, NC_NETCDF4, &ncid))) { ncError(retval); }
  
  // Define the dimensions in the netcdf file
  if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }
  
  // Gather the dimids into an array for defining variables in the netcdf file
  int dimids[ndims];
  dimids[0] = dim1id;
  dimids[1] = dim2id;
  dimids[2] = dim3id;

Here’s where the magic happens.  The next step is to define the variables in the NetCDF file.  The variables must be defined in the file before you tag them for compression.

  // Define the netcdf variables
  if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

Now that we’ve defined the variables in the NetCDF file, let’s tag them for compression.

  // OPTIONAL: Compress the variables
  int shuffle = 1;
  int deflate = 1;
  int deflate_level = 4;
  if((retval = nc_def_var_deflate(ncid, xid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, yid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, zid, shuffle, deflate, deflate_level))) { ncError(retval); }

The function nc_def_var_deflate() performs this.  It takes the following parameters:

  • int ncid – The NetCDF file ID returned from the nc_create() function
  • int varid – The variable ID associated with the variable you would like to compress.  This is returned from the nc_def_var() function
  • int shuffle – Enables the shuffle filter before compression.  Any non-zero integer enables the filter.  Zero disables the filter.  The shuffle filter rearranges the byte order in the data stream to enable more efficient compression (a toy illustration of the idea follows this list).  See this performance evaluation from the HDF group on integrating a shuffle filter into the HDF5 algorithm.
  • int deflate – Enable compression at the compression level indicated in the deflate_level parameter.  Any non-zero integer enables compression.
  • int deflate_level – The level to which the data should be compressed.  Levels are integers in the range [0-9].  Zero results in no compression whereas nine results in maximum compression.
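
To make the shuffle idea a bit more concrete, here is a toy sketch (not the actual HDF5 filter) that regroups the bytes of a small float array by byte position.  Because neighboring floats often share their high-order bytes, the shuffled stream contains longer runs of similar bytes, which zlib tends to compress better.

  // Toy illustration of a shuffle filter (not the HDF5 implementation):
  // regroup bytes by position across all values.
  #include <stdio.h>
  #include <string.h>

  int main(void) {
    float values[4] = {1.0f, 1.1f, 1.2f, 1.3f};
    unsigned char raw[sizeof values];
    unsigned char shuffled[sizeof values];
    size_t n = 4;                  // number of values
    size_t width = sizeof(float);  // bytes per value

    memcpy(raw, values, sizeof values);

    for(size_t i = 0; i < n; i++) {
      for(size_t b = 0; b < width; b++) {
        // byte b of value i goes into the b-th group of bytes
        shuffled[b * n + i] = raw[i * width + b];
      }
    }

    for(size_t i = 0; i < sizeof values; i++) {
      printf("%02x%c", (unsigned)shuffled[i], (i + 1) % n == 0 ? '\n' : ' ');
    }
    return 0;
  }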

The rest of the code doesn’t change from the previous post.

  // OPTIONAL: Give these variables units
  if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }
  
  // End "Metadata" mode
  if((retval = nc_enddef(ncid))) { ncError(retval); }
  
  // Write the data to the file
  if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }
  
  // Close the netcdf file
  if((retval = nc_close(ncid))) { ncError(retval); }

So the question now is whether or not it’s worth compressing your data.  I performed a simple experiment with the code presented here and the resulting NetCDF files:

  1. Generate the example NetCDF file from the code above using each of the available compression levels.
  2. Time how long the code takes to generate the file.
  3. Note the final file size of the NetCDF.
  4. Time how long it takes to load and extract data from the compressed NetCDF file.
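
As a rough sketch of how steps 1-3 could be automated, the loop below assumes the code above has been compiled into a hypothetical ./netcdf_example executable that takes the deflate level as its only argument and writes example.nc; it times each run and records the resulting file size.

  // Sketch: time each compression level and record the file size.
  // Assumes a hypothetical ./netcdf_example that takes the deflate
  // level as its only argument and writes example.nc.
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <sys/stat.h>

  int main(void) {
    for(int level = 0; level <= 9; level++) {
      char cmd[64];
      snprintf(cmd, sizeof cmd, "./netcdf_example %d", level);

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      int status = system(cmd);   // one file generation
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double seconds = (t1.tv_sec - t0.tv_sec)
                     + (t1.tv_nsec - t0.tv_nsec) / 1e9;

      struct stat st;
      if(status == 0 && stat("example.nc", &st) == 0) {
        printf("deflate_level=%d  time=%.2f s  size=%lld bytes\n",
               level, seconds, (long long)st.st_size);
      }
    }
    return 0;
  }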

Below is a figure illustrating the results of the experiment (points 1-3).

[Figure: file generation time and resulting file size for each compression level]

Before I say anything about these results, note that individual results may vary.  I used a highly stylized data set to produce the NetCDF file which likely benefits greatly from the shuffle filtering and compression.  These results show a compression of 97% – 99% of the original file size.  While the run time did increase, it barely made a difference until hitting the highest compression levels (8,9).  As for point 4, there was only a small difference in load/read times (0.2 seconds) between the uncompressed and any of the compressed files (using ncdump and the ncdf4 package in R).  There’s no noticeable difference among the load/read times for any of the compressed NetCDF files.  Again, this could be a result of the highly stylized data set used as an example in this post.

For something more practical, I can only offer anecdotal evidence about the compression performance.  I recently included compression in my current project due to the large possible number of multiobjective solutions and states-of-the-world (SOW).  The uncompressed file my code produced was on the order of 17.5 GB (for 300 time steps, 1000 SOW, and about 3000 solutions).  I enabled compression of all variables (11 variables – 5 with three dimensions and 6 with two dimensions – compression level 4).  The next run produced just over 7000 solutions, but the compressed file size was 9.3 GB.  The down side is that it took nearly 45 minutes to produce the compressed file, as opposed to 10 minutes with the previous run.  There are many things that can factor into these differences that I did not control for, but the results are promising…if you’ve got the computer time.

I hope you found this post useful in some fashion.  I’ve been told that compression performance can be increased if you also “chunk” your data properly.  I’m not too familiar with chunking data for writing in NetCDF files…perhaps someone more clever than I can write about this?
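
For anyone looking for a starting point, the C function that controls chunking is nc_def_var_chunking().  Here is a minimal sketch that reuses the ncid, xid, retval, and ncError names from the code above and would be called before nc_enddef(); the chunk sizes are illustrative guesses, not a tuned choice.

  // OPTIONAL: Set explicit chunk sizes for the "x" variable
  // (call before nc_enddef; sizes here are illustrative guesses)
  size_t chunks[3] = {10, 50, 200};
  if((retval = nc_def_var_chunking(ncid, xid, NC_CHUNKED, chunks))) { ncError(retval); }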

Acknowledgement:  I would like to acknowledge Jared Oyler for his insight and helpful advice on some of the more intricate aspects of the NetCDF library.

Symbolic Links

A symbolic link (symlink) is a file that’s a pointer to another file (or a directory). One use for symlinks is to have a big file be in two places without using up twice as much space:

ln -s /gpfs/home/abc123/scratch/big_file /gpfs/home/abc123/work/big_file

You can see another use in your home directory on the cluster: ls -l scratch gives us this directory listing:

lrwxrwxrwx 1 mjw5407 mjw5407 21 Jun 22 2011 scratch -> /gpfs/scratch/mjw5407.

The l right at the beginning of the line tells us that scratch is a symlink to /gpfs/scratch/mjw5407.

And if for some reason you see that scratch is not a symlink but a regular directory, something has gone wrong. (This happened recently to a cluster user. Check your directory listings!)

What’s taking up space on your cluster account?

A quick tip on finding and deleting big files on your cluster account:

  • Use mmlsquota to inspect your quota usage on GPFS filesystems. This will tell you whether you really need to clean up.
  • Use du -hs * to figure out how big your subdirectories are, and ls -lh to inspect the files in a directory.
  • Use (with great caution) rm -rf <directory> to remove a big directory, and rm <filename> to delete a file (both commands are irreversible).

Some ideas for your Bash submission scripts

I’ve been playing around with some design options for PBS submission scripts that may help people doing cluster work.  Some things to look for in the source code:

  • You can use a list in bash that contains multiple text entries, and then access those text entries to create strings for your submissions.  Note that you can actually display the text first (see the ‘echo ${PBS}’) before you do anything; that way you aren’t requesting thousands of jobs that have a typo in them!
  • Using “read” allows the bash programmer to interact with the user.  Well, in reality you are usually both the programmer and the user.  But lots of times, I want to write a script and try it out first, before I submit hundreds of hours of time on the cluster.  The flags below can help with that process.
  • I added commands to compile the source code before actually submitting the jobs.  Plus, by using flags and pauses intelligently, you can bail out of the script if there’s a problem with compilation.
#!/bin/bash
NODES=32
WALLHOURS=5

PROBLEMS=("ProblemA" "ProblemB")
NSEEDS=10
SEEDS=$(seq 1 ${NSEEDS}) #note there are multiple ways to declare lists and sequences in bash

NFES=1000000
echo "NFEs is ${NFES}" #echo statements can improve usability of the script, especially if you're modifying it a lot for various trials

ASSUMEPERMISSIONFLAG=No #This is for pausing the submission script later

echo "Compile? Y or N."

read COMPILEFLAG
if [ "$COMPILEFLAG" = "Y" ]; then
    echo "Cleaning.."
    make clean -f MakefileParallel
    echo "Compiling.."
    make -f MakefileParallel
else
        echo "Not compiling."
fi

for PROBINDEX in ${!PROBLEMS[*]}
do
    PROBLEM=${PROBLEMS[$PROBINDEX]} #note the syntax to pull a list member out here
    echo "Problem is ${PROBLEM}"

    for SEED in ${SEEDS}
    do
        NAME=${PROBLEM}_${SEED} #Bash is really nice for manipulating strings like this
        echo "Submitting: ${NAME}"

        #Here is the actual PBS command, with bash variables used in place of different experimental parameters.  Note the use of getopt-style command line parsing to pass different arguments into the myProgram executable.  This implementation is also designed for parallel processing, but it can also be used for serial jobs too.

        PBS="#PBS -l nodes=32\n\
        #PBS -N ${NAME}\n\
        #PBS -l walltime=05:00:00\n\
        #PBS -j oe\n\
        #PBS -o ${NAME}.out\n\
        cd \$PBS_O_WORKDIR\n\
        module load openmpi/intel\n\
        mpirun ./myProgram -b ${PROBLEM} -c combined -f ${NFES} -s ${SEED}"

        #The first echo shows the user what is about to be passed to PBS.  The second echo then pipes it to the command qsub, and actually submits the job.

        echo ${PBS}

        if [ "$ASSUMEPERMISSIONFLAG" = "No" ]; then

            echo "Continue submitting? Y or N."

            read SUBMITFLAG

            #Here, the code is designed to just keep going after the user says Y once.  You can redesign this for your own purposes.  Also note that this code is fairly brittle in that the user MUST say Y, not y or yes.  You can build that functionality into the if statements if you'd like it.

            if [ "$SUBMITFLAG" = "Y" ]; then
                 ASSUMEPERMISSIONFLAG=Yes #this way, the user won't be asked again
                 echo -e ${PBS} | qsub
                 sleep 0.5
                 echo "done."
            fi
        else
            echo -e ${PBS} | qsub
            sleep 0.5
            echo "done."
         fi
    done
done

An Alternative File Manager

Here’s another useful piece of software for those of you who are sick of the built-in options for file management, especially in Windows.  Q-Dir, located at http://www.softwareok.com/?seite=Freeware/Q-Dir, will allow you to view up to four locations at once, and it allows you to do everything that you would in a normal file manager.  Additionally, it is simply an executable, which means that you can just place it on a portable drive and create a shortcut to it on your desktop.

I have been using this for several months now and have been quite happy with it.


Software for Comparing Files and Directories

I recently needed to compare a bunch of directories with one another to ensure that I had the most up-to-date files archived on my system.  For those of you involved in the AWR Comparison Study, it was for synchronizing all of the re-run data.  I found the software available at http://winmerge.org/ quite useful for doing this.

WinMerge will help you to compare the contents of two directories to ensure that you don’t end up losing some files when you go to clean up your directories following a study.  Additionally, you can just drag both of the folders you want to compare into the main application window rather than having to browse to the directories.  It is very fast and efficient.
