Using HDF5/zlib compression in NetCDF4, part 2: testing the compression settings

There has been a previous post, courtesy of Greg Garner, on why HDF5/zlib compression matters for NetCDF4. That post featured a plot that showed how much you could compress your data when increasing the compression level. But the fine print also acknowledged that this data was for a pretty idealized dataset. So how much should you compress your data in a real-world application? How can you test what your trade-off really is between compression and computing time?

Follow this 4-step process to find out!

I’ll be illustrating this post using my own experience with the Water Balance Model (WBM), a model developed at the University of New Hampshire and that has served for several high-profile papers over the years (including Nature and Science). This is the first time that this model, written in Perl, is being ported to another research group, with the goal of exploring its behavior when running large ensembles of inputs (which I am starting to do! Exciting, but a story for another post).

Step 1. Read the manual

There is a lot of different software for creating NetCDF data. Depending on the situation, you may have a say on which to use, or be already using the tool that comes with the software suite you are working with. Of course, in the latter case, you can always change the tools. But reasonable a first step before that is to test them. Ergo, look up the documentation for the software you are using, to see how you can control compression on them.

Here, WBM uses the PDL::NetCDF Perl library, which has useful functions for adding data to a NetCDF file after every time step the model runs. Contrary to Greg’s post that uses C and where there are two flags (“shuffle” and “deflate”) and a compression level parameter (“deflate_level”), for PDL::NetCDF there are only two parameters. The SHUFFLE flag is the equivalent in Perl of the “shuffle” flag in C. The DEFLATE Perl parameter ihas integer values from 0 to 9, with a value 0 being equivalent to the C-flag “deflate” being turned off, and any value from 1 to 9 being equivalent to the “deflate”C-flag being on, the value of DEFLATE being then equivalent to the value of the “deflate_level” parameter in Greg’s post. Therefore, the DEFLATE variable from the PDL::NetCDF library in Perl lumps together the parameters “deflate” and “deflate_level” used in C.

I then located the DEFLATE and SHUFFLE variables within the auxiliary functions of the WBM. In the function write_nc, the following two lines of codes are key:

 my $deflate = set_default($$options{DEFLATE},1); # NetCDF4 deflate (compression) parameter</pre>
my $shuffle = set_default($$options{SHUFFLE},0); # NetCDF4 shuffle parameter 

Step 2. Set up a test protocol

This builds on Greg’s idea of recording time and resulting file size for all compression level. Here we are interested in these quantities for full-scale model runs, and not just for the generation of a single NetCDF dataset.

In this case therefore, we want to contrast the default setting above with stronger compression settings, for ensemble runs of WBM on the Cube (the local HPC cluster). For a better comparison, let us place ourselves in the conditions in which ensemble runs will be made. Runs will use all 16 cores of a Cube node, therefore for each compression setting, this experiment runs 16 instances of the WBM on a single node. Each of the 16 instances runs on a single core. All WBM runs are identical so the only differences between run times and result file size come from compression settings.

Compression settings for (SHUFFLE,DEFLATE) are (0,1) by default, and we compare that with all settings from (1,1) to (1,9).

Step 3. Run experiment, get results

Here are the results from this experiment. Results consider 47 output fields for WBM runs with a daily time-step for 8 years (2009-2016), plus 5 years of warmup (this is pretty common for hydrological models). All this in a spatial mesh of 148,500 grid cells. A folder containing binaries for a single input variable, for this time span and spatial coverage, has a size of 3.1GB. Therefore, the expected size for 47 variables in binary format is 146Go. Let us compare with our results:

netcdf_expe_results

As one can see the presence of the shuffle flag or the value of the deflate parameter have little influence on the size of the results files. Compressed results are 3 to 4 time smaller than binaries, which highlights the interest of compressing, but also means we do not have the order(s) of magnitude differences reported by Greg’s blog post. This is mainly because the binary format used for WBM inputs is much more efficient than the uncompressed ASCII that Greg used in his experiment. For a deflate parameter of 9, there is an apparent problem within the PDL library, and no output (note that a single-core run with shuffle=0 and deflate=9 did not lead to a similar problem).

Step 4. Conclude on compression parameters

Here the epxerimental setup has shown that carefully selecting the output fields will save more space than fine-tuning NetCDF compression parameters. For instance, some of the 47 output fields above are fully redundant with others. Others are residual fields, and the only interest of looking them up is to verify that a major development within the WBM code did not mess up with the overall water balance.

More generally, the effects of compression are situation-specific and are not as great when there is no obvious regularity in the data (as is often the case with outputs from large models), or when the binary format used is already much better than ASCII. This said, NetCDF still occupies much less space than binaries, and is much easier to handle: WBM outputs are contained in one file per year (8 files total) with very useful metadata info…


 

 

Launching Jupyter Notebook Using an Icon/Shortcut in the Current Working Directory Folder

Launching Jupyter Notebook Using an Icon/Shortcut in the Current Working Directory Folder

A petty annoyance I’ve encountered when wanting to open Jupyter Notebook (overview) is that I couldn’t find a way to instantly open it in my current Windows Explorer window. While one trick you can use to open the Command Prompt in this folder is by typing ‘cmd’ in the navigation bar above (shown below) and pressing Enter/Return, I wanted to create a shortcut or icon I could double-click in any given folder and have it open Jupyter Notebook in that same working directory. cmd.png

This method allows you to drag-and-drop the icon you create into any folder and have it launch Jupyter Notebook from the new folder. It works for Windows 7, 8, and 10. Please feel free to let me know if you encounter any errors!

A great application for this shortcut may be to include this shortcut in GitHub folders where you wish to direct someone to launch Jupyter Notebook with minimal confusion. Just direct them to double-click on the icon and away they go!

Creating Your Own Jupyter Notebook Shortcut

new_shortcut.png

To begin, we must have already installed Jupyter Notebook or Jupyter Lab. Next, navigate to the folder we want to create your shortcut. Right-click, select ‘New’, then create a shortcut. 

shortcut_34.PNG

In the Create Shortcut Windows prompt, type the location of the item you want the Shortcut Icon to direct to. In this case, we are wanting direct this shortcut to the Command Prompt and have it run the command to open Jupyter Notebook. Copy/paste or type the following into the prompt:

cmd /k “jupyter notebook”

Note that cmd will change to the location of the Command Prompt executable file (e.g. C:\Windows\System32\cmd.exe), and ‘/k’ keeps the Command Prompt window open to ensure Jypyter Notebook does not crash. You can edit the command in the quotation marks to any command you would want, but in this case ‘jupyter notebook’ launches an instance of Jupyter Notebook.

You can then save this shortcut with whatever name you wish!

At this point, double-clicking the shortcut will open Jupyter Notebook in a static default directory (e.g. ‘C:\Windows\system32’). To fix this, we need to ensure that this shortcut instead directs to the current working directory (the location of the shortcut).

Picture1321.png

Next, we need to edit the location where the Command Prompt will run in. Right-click on your newly-created icon and select ‘Properties’ at the bottom of the menu to open the window shown on the left. One thing to note is that the ‘Target’ input is where we initially put in our ‘location’ prompt from above.

At this point, change the ‘Start in:’ input (e.g. ‘C:\Windows\system32’) to the following:

%cd%

By changing this input, instead of starting the Command Prompt in a static default directory, it instead starts the command prompt  in the current working directory for the shortcut.

At this point, you’re finished! You can drag and drop this icon to any new folder and have Jupyter Notebook start in that new folder.

If you wish to download a copy of the shortcut from Dropbox. Note that for security reasons, most browsers, hosting services, and email services will rename the file from ‘jupyter_notebook_shortcut.lnk’ to ‘jupyter_notebook_shortcut.downloads’.

Many thanks to users on superuser for helping develop this solution!

Please let me know if you have any questions, comments, or additional suggestions on applications for this shortcut!

 

A Deeper Dive into Principal Component Analysis

This post is meant to be a continuation of Dave Gold’s introductory post on Principal Component Analysis, which is an excellent explanation on how to conduct a PCA and visualize the principal components. The goal of this post is to elaborate on how to proceed after you have conducted a PCA and to address some common questions and concerns associated with the method.

Performing a PCA in R

Often times, you will perform a PCA on large datasets that contain many variables and/or many observations. One such dataset that will be used as an example is the Living Blended Drought Atlas (LBDA) which is a reconstruction of the Palmer Drought Severity Index (PDSI) over the contiguous United States from 1473-2005. This dataset contains 4968 columns, or variables, each of which is a grid cell over the U.S., and 533 rows, each of which is a yearly observation. We will call this dataset, X. In matrix notation, we will denote the PCA analysis formula as:

U=XW    (1)

where X is the dataset, W is the weighting matrix, whose columns are the key patterns in the data, and U is the matrix whose columns are the resulting principal components (PCs). You can perform a PCA on this dataset with a single function in R, prcomp.

 PCA=prcomp(X, scale=TRUE/FALSE) 

The first input into the function is your data matrix and the second input is used to declare if your dataset should be scaled to have a unit variance before the PCA is conducted. There are various other inputs into the function, listed here, that can be included if necessary.

prcomp returns three sets of results in a list that we have called “PCA”:

  1. sdev: the standard deviations of the principal components (if you square them, you get the eigenvalues of the covariance/correlation matrix)
  2. rotation: the loading matrix whose columns are the eigenvectors (W in the equation above)
  3. x: the rotated data or your PCs (The columns of U in the equation above)

And that’s it! You have the results of the PCA. Now comes the more difficult part: interpreting them.

How do I choose how many PCs to keep?

The dimensions associated with equation (1) are as follows:

    U=XW

(nxk)=(nxk)(kxk) 

If the number of observations is much larger than the number of variables in the dataset, i.e. n>>k, then the PCA will return k distinct eigenvectors. In our case, since n is smaller than our number of variables, the most non-zero eigenvectors that the PCA will return is n. Either way, we have many supposedly distinct patterns. How do we decide how many of those patterns to keep?

The answer is not always clear and most often subjective and case-dependent. One common tool used is a scree plot or an eigenvalue spectrum.

Scree Plot

Each column of our W matrix is a distinct, independent pattern, also called an empirical orthogonal function (EOF). Each EOF is responsible for explaining some amount of variance in the dataset. A scree plot, shown in Figure 1, allows you visualize this variance breakdown.

Picture1

Figure 1: Scree Plot

 

On the x-axis of the scree plot is the EOF (we’ve chosen to keep 10) and on the y-axis is the total variance explained by that EOF. Each variance is equivalent to the eigenvalue associated with its respective eigenvector. You can find the eigenvalues/variances by squaring of the results of sdev that is returned by prcomp.

Generally speaking, one can look for the “elbow” of the scree plot to determine at what point to truncate the EOFs and retain up to the EOF before the elbow. At about the 5th EOF, the graph starts to level off and all subsequent EOFs start to contribute about the same amount of variance. Therefore, the elbow of the graph is located at about the 5th EOF and you will retain the first 4 EOFs.

North’s Rule of Thumb

North’s Rule of Thumb is more of a precise way of truncating and involves creating confidence intervals around your estimates of the variance. The rule states that you should truncated EOFs only when the confidence intervals of the variances start to intersect. At this point, the eigenvectors are considered too close to be interpretable and spacing might be due to sampling error rather than a clear distinction between the variances [1].

Rotated EOFs

At some point, your EOFs might start to exhibit patterns that can be hard to interpret or attribute to a physical phenomenon. It is not uncommon for these types of patterns to result from pure noise in your data, especially if you are analyzing latter EOFs that explain a very small amount of the variance [2].

Rotating EOFs is a practice that is done to simplify the patterns obtained in EOFs and make them more interpretable. A varimax orthogonal rotation can be used to determine an optimum rotation matrix that maximizes the variance in the columns of W. The variance of the columns is maximized by driving some of the loadings to zero and trying to maximize the values of other loadings. In R, this is done using the varimax function.

 my.varimax=varimax(PCA$rotation[,1:10]) 

In the above command, my input is the first 10 EOFs from the original weighting matrix. The result, my.varimax, is a list with the following components:

  1. loadings: the resulting rotated loading matrix
  2. rotmat: the rotation matrix

The new loading matrix, Wrot , is still orthogonal after the rotation, and the eigenvectors are, therefore, still orthonormal. However, multiplication of the rotated loading matrix by the original dataset, X, to obtain a new U, results in principal components are not guaranteed to be independent. This can be seen through further inspection of the correlation matrix associated with U. Unfortunately, this is a tradeoff associated with obtaining EOFs that are simpler and easier to interpret.

Sources:

[1] http://yyy.rsmas.miami.edu/users/bmapes/teaching/MPO581_2011/EOF_chapter_DelSole.pdf

[2] Hannachi, A., Jolliffe, I.T., and Stephenson, D.B. (2007), Empirical orthogonal functions and related techniques in atmospheric science: A review, International Journal of Climatology, 27, 1119-1152.

*All information or figures not specifically cited came from class notes and homework from Dr. Scott Steinschneider’s class, BEE 6300: Environmental Statistics 

Simple Bash shell scripts that have made my life easier

I’ve recently been using Bash shell scripts to improve the efficiency of my workflow when working on Linux systems and I thought I would share some of them here. I’m fairly new to Linux so this post is not meant to be a comprehensive guide on how to write shell scripts rather, I hope the scripts in this post can serve as examples for those who may also be learning Linux and unsure of where or how to start writing shell scripts. I didn’t write any of these from scratch, most of the scripts are based off files shared with me by group members Julie Quinn, Bernardo Trindade and Jazmin Zatarian Salazar. If you’re interested in learning more about any of the commands used in these scripts I’ve put some references I found useful at the end of this post. If you’re more experienced in writing shell scripts, please feel free to put tips or suggestions in the comments.

1. A simple script for making directories

For my research I’m processing results of a monte carlo simulation for several solutions found through multi-objective search and I needed to make folders in several locations to store the output from each solution. My first instinct was to make each directory separately using the mkdir command in the command line, but this quickly got tedious. Instead I used a bash script to loop through all the solution numbers and create a new directory for each. For more on using loops in Bash, check out this reference.

#!/bin/bash#!/bin/bash

# This script will create directories named "Solution_*.txt" for
# a set of numbered solutions 

# specify solution numbers
SOLUTIONS=('162' '1077' '1713' '1725' '1939' '2191' '2290' '2360')

# create a variable to store the string "Solution_"
DIRECTORY="Solution_" 

# loop over solution numbers
for i in ${SOLUTIONS[@]}
do
# create a separate directory for each solution
mkdir $DIRECTORY${i}
done

2. Calling a Java function and saving the output

The MOEA framework is a tool written in Java with all sorts of cool functions. I used it to generate 1024 latin hypercube samples across a given range for each of the 8 solutions mentioned above. Using a shell script allows for you to easily set up the arguments needed for the MOEA framework, call the Java function and save the output to your desired file format. The MOEA framework’s tool spits out a .txt file, but this script uses the “sed” command to save it as a .csv file. More on “sed” can be found in the reference at the end of this post.

#!/bin/bash#!/bin/bash
# this shell script will call the MOEA framework's Latin Hypercube
# Sampling tool to create 1024 samples from a set of
# prespecified ranges for each of 8 solutions

# create variables to store Java arguments
JAVA_ARGS="-Xmx1g -classpath MOEAFramework-1.16-Executable.jar"
NUM_SAMPLES=1024
METHOD=latin

# these are the solutions we will create samples from
SOLUTIONS=('162' '1077' '1713' '1725' '1939' '2191' '2290' '2360')

# loop through solutions
for i in ${SOLUTIONS[@]}
do
    # define names for input (ranges) and output file names
    RANGES_FILENAME=${i}ranges.txt
    OUTPUT_FILENAME=Solution${i}_Samples.txt
    CSV_FILENAME=Solution${i}_Samples.csv

    # Call MOEA framework from JAVA using specified arguments to
    # create LHS Samples, specify OUTPUT_FILENAME as output
    java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.SampleGenerator -m ${METHOD} -n ${NUM_SAMPLES} -p ${RANGES_FILENAME} -o ${OUTPUT_FILENAME}

    # Use the sed command tocreate new comma separated values file
    # from original output .txt file
    sed 's/ /,/g' ${OUTPUT_FILENAME} &amp;amp;amp;amp;gt; ${CSV_FILENAME} 

    # remove .txt files
    rm $OUTPUT_FILENAME
done

3. A piping example

Piping allows you to link together programs by making the output from one program or function the input to another. The script below was originally written by my friend Shrutarshi Basu for a class project we were working on together. This script is made to process the output from the Borg MOEA for 9 random seeds of the DTLZ2 benchmarking problem across several different algorithmic configurations, seen in the code as “masters” (for more on this see Jazmin’s post here). In addition to calling Java tools from the MOEAframework, Basu uses piping to link the Linux commands “tac”, “sed”, “grep” and “cut”.  For more on each of these commands, see the links at the bottom of this post.


# loop over each of 9 seeds
for i in {0..9}
do
obj=DTLZ2_S${i}.obj
output=dtlz2.volume

# loop over masters
for m in $(seq 0 $1)
do
runtime=DTLZ2_S${i}_M${m}.runtime
mobj=DTLZ2_S${i}_M${m}.obj

# extract objectives from output
echo "Extracting objectives"
tac ${runtime} | sed -n '1,/\/\// p' | grep -v "//" | cut -d' ' -f15-19 | tac > ${mobj};
done

# combine objectives into one file
echo "Combining objectives"
java -cp ../../moea.jar org.moeaframework.analysis.sensitivity.ResultFileSeedMerger \
-d 5 -e 0.01,0.01,0.01,0.01,0.01 \
-o ${obj} DTLZ2_S${i}_M*.obj

# calculate the hypervolume
echo "Finding final hypervolume"
hvol=$(java -cp ../../moea.jar HypervolumeEval ${obj})

printf "%s %s\n" "$i" "$hvol" >> ${output}
echo "Done with seed $i"
done

Additional References and Links