Enhance your (Windows) remote terminal experience with MobaXterm

Jazmin and Julie recently introduced me to a helpful program for Windows called “MobaXterm” that has significantly sped up my workflow when running remotely on the Cube (our cluster here at Cornell). MobaXterm bills itself as an “all in one” toolbox for remote computing. The program’s interface includes a terminal window as well as a graphical SFTP browser. You can link the terminal to the SFTP browser so that as you move through folders on the terminal the browser follows you. The SFTP browser allows you to view and edit files using your text editor of choice on your windows desktop, a feature that I find quite helpful for making quick edits to shell scripts or pieces of code as go.

mobaxtermsnip

A screenshot of the MobaXterm interface. The graphical SFTP browser is on the left, while the terminal is on the right (note the checked box in the center of the left panel that links the browser to the terminal window).

 

You can set up a remote Cube session using MobaXterm with the following steps:

  1. Download MobaXterm using this link
  2.  Follow the installation instructions
  3. Open MobaXterm and select the “Session” icon in the upper left corner.
  4. In the session popup window, select a new SSH session in the upper left, enter “thecube.cac@cornell.edu” as the name of the remote host and enter your username.
  5. When the session opens, check the box below the SFTP browser on the left to link the browser to your terminal
  6. Run your stuff!

Note that for a Linux system, you can simply link your file browser window to your terminal window and get the same functionality as MobaXterm. MobaXterm is not available for Mac, but Cyberduck and Filezilla are decent alternatives. An alternative graphical SFTP browser for Windows is WinSCP, though I prefer MobaXterm because of its linked terminal/SFTP interface.

For those new to remote computing, ssh or UNIX commands in general, I’d recommend checking out the following posts to get familiar with running on a remote cluster:

 

 

 

Advertisements
Basic Machine Learning in Python with Scikit-learn

Basic Machine Learning in Python with Scikit-learn

Machine learning has become a hot topic in the last few years and it is for a reason. It provides data analysts with efficient ways of extracting information from data, allowing it to be used for analysis and modeling purposes.

The Scikit-learn Python library has implementations of dozens of learning algorithms and is freely available for academic and commercial use under the terms of the BSD licence. Some of these algorithms can be extremely useful for our job as water systems analysts, so given the overwelming amount of algorithms implemented in Scikit-learn, I though I would mention a few I find particularly useful for my research. For each method below I included link swith an examples from the Scikit-learn’s website. Instalation and use instructions can be found in their website.

CART Trees

CART trees that can used for regression or classification. Any any tree, CART trees are considered poor (generally high variance) classifiers unless bootstrapped or boosted (see supervised learning), but the resulting rules are easily interpretable.

CART Treeshttp://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart

Dimensionality reduction

Principal component Analysis (PCA) is perhaps the most widely used dimensionality reduction technique. It works by finding the basis the maximizes the data’s variance, allowing for the elimination of axis that have low variances. Among its uses are noise reduction, data visualization, as it preserves the distances between data points, and improvement of computational efficiency of other algorithms by getting rid of redundant information. PCA can me used in its pure form or it can be kernelized to handle data sets whose variance is maximum in a non-linear direction. Manifold learning is another way of performing dimensionality reduction by unwinding the lower dimensional manifold where the information lies.

PCAhttp://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html#sphx-glr-auto-examples-decomposition-plot-pca-3d-py

Kernel PCAhttp://scikit-learn.org/stable/auto_examples/decomposition/plot_kernel_pca.html#sphx-glr-auto-examples-decomposition-plot-kernel-pca-py

Manifold learninghttp://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py

Clustering

Clustering is used to group similar points in a data set. One example is the problem of find customer niches based on the products each customer buys. The most famous clustering algorithm is k-means, which, as any other machine learning algorithm, works well on some data sets but not in others. There are several alternative algorithms, all of which exemplified in the following two links:

Clustering algorithms comparison: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

Gaussian Mixture Models (finds the same results as k-means but also provides variances): http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py

Reducing the dimentionality of a dataset with PCA or kernel PCA may speed up clustering algorithms.

Supervised learning

Supervised learning algorithms can be used for regression or classification problems (e.g. classify a point as pass/fail) based on labeled data sets. The most “trendy” one nowadays is neural networks, but support vector machines, boosted and bagged trees, and others are also options that should be considered and tested on your data set. Bellow are links to some of the supervised learning algorithms implemented in Scikit-learn:

Comparison between supervised learning algorithmshttp://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py

Neural networks: http://scikit-learn.org/stable/modules/neural_networks_supervised.html

Gaussian Processes is also a supervised learning algorithm (regression) which is also be used for Bayesian optimization:

Gaussian processes: http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-noisy-targets-py

 

Use python `cf ` and `esgf-python-client` package to interact with ESGF data portal

Working with climate change impact assessment may often require us to download a number of outputs from Global Circulation Models (GCMs) to depict the plausible changes of projected future conditions, as well as to use them for conventional top-down scenario analysis. Usually those kinds of scenario data are public available for download and are stored as NetCDF format on the data portal using Earth System Grid Federation (ESGF) architecture. Yet the procedure of downloading those data, which typically span over a lengthy time horizon, may be very cumbersome and requires following steps:

  • create the username and credential on ESGF data portal;
    • define the search entries to narrow down the results;
    • generate the *.wget file for batch downloading using secured OpenDAP protocal ;
    • run the *.wget file to download all the raw data onto the local disk;

For the climatic projection data, usually the downloaded data are stored in NetCDF format with irregular coordinate systems (e.g., rotated pole), segmented in pieces of small time periods, and available only for large continental scale, which is, however, unnecessary for regional case study.

In this post, I would like to share some of my experience and tools to tackle with the ESGF data portal, and facilitate the process. The context will encompass the concepts of NetCDF and Python language, both of which were already well introduced in our Water Programming blog. Therefore, I strongly suggest to revisit those posts in case of any doubt, and also because I am not an expert on neither NetCDF nor Python.

The first tool is called cf-python package, which implements comprehensive tools for reading, writing and processing NetCDf data. Personally I found it extremely useful for reading/writing NetCDF data, regridding and extracting the sub-domain from original dataset. More importantly, given the OpenDAP url all those aforementioned operations can be done on the server side without explicitly downloading the orignal data onto local disk! So instead of downloading a 10Gb projection data for entire Europe that takes 1 hour to finish, you can just download the part specific to your study area within a couple of seconds at a cost of a few Mb.

The second tool is named esgf-python-client. This package contains the API code for calling the ESGF Search API, whose documentation by the way has not been updated for quite a while. With the combined use of cf-python and this client, you are saved from user/password login and dealing with tedious web interface of ESGF search page (try here) and downloading.

Below I explain more in details by using an example of downloading European regional downscaling data for RCP4.5 scenario from one of the ESGF data portal. The example is essentially a few lines of codes with inline comments (notice that some indents of lines may not show correctly in the post, so just copy paste and reformat them in your editor).

from pyesgf.logon import LogonManager 
from pyesgf.search import SearchConnection
import cf
import numpy as np

def main():
 ## ================== To be modified by user ================================================
 # Define the search key words for facet fields in ESGF node. 
 # Note: the supported facet fields can be identified using: 
 # " http://<base_search_URL>/search?facets=*&distrib=false&limit=0 ", where
 # <base_search_URL> is the the domain of esgf-node, e.g., https://pcmdi.llnl.gov/esg-search

esgf_node='https://esgf-node.ipsl.upmc.fr/esg-search' # you may try different data portal, as many of them are frequently under maintenance longer than their actual service time

Project='CORDEX' 
 Time_frequency='day' # 'month', '6h', '3h' 
 Experiment='rcp45' # 'rcp25', 'rcp85', 'history'
 Variable='tas' # 'pr' for precipitation 
 Ensemble='r1i1p1' # not change 
 Domain='EUR-44' # EUR-44i, EUR-22, EUR-22i, EUR-11i

# Define the index of domain for study region. Say original domain is 100 (lon) by 60 (lat) grids, and your study area falls between 50 - 54 in longitude and 41 to 45 in latitude. 
 idx_lon = np.array( range(49,54) ) #49 instead of 50 as numpy array is 0-based indexing
 idx_lat = np.array( range(40,45) )

## =========================================================================================

 # Connect to ESGF data portal using your openid and password. May need a single run of *.wget file first to
 # retrieve personal credientials
 lm = LogonManager()
 lm.logon_with_openid(your_openID, your_password)

conn = SearchConnection(esgf_node, distrib=True)

 # Make query based on the facet fields. 
 ctx = conn.new_context( project=Project, 
 time_frequency=Time_frequency,
 experiment=Experiment, 
 variable=Variable,
 ensemble=Ensemble, 
 domain=Domain )

# ctx.hit_count # show total number of results found

 # loop over the hit_count results
 for each_hit in range(0, ctx.hit_count):

ds = ctx.search()[each_hit]
 files = ds.file_context().search() # use "len(files)" to see total files found

url_list_sorted = sorted( [i.opendap_url for i in files] ) # url list sorted to get assending timestamp, e.g., from 1/1/2005 to 12/31/2099
 filename_out = i.filename[0:-21] # get file name, the [0:21] means a subset of filename string

 counter = 0

data_list = []
 for each_url in url_list_sorted:
 counter += 1 
 meta_data = cf.read(each_url) # read the meta-data information from OpenDAP NetCDF url
 if len(meta_data) > 1: # in some cases, the returned meta_data contains two fields, i.e., len(meta_data)>1, and only the second field contains the variable of interest
 data_list.append(meta_data[1])
 else:
 data_list.append(meta_data[0])

# Aggregating the time
 dataset_agg = cf.aggregate(data_list) # this will concatenate all files with small time period into a single file with complete time series

# Slicing dataset
 if len(dataset_agg.shape) == 4:
 dataset_sliced = dataset_agg.subspace[0,:,idx_lat,idx_lon] # for 4-D , e.g., [height, time, lat, lon], .subspace() function is used to extract the subset of the data
 elif len(dataset_agg.shape) == 3:
 dataset_sliced = dataset_agg.subspace[:,idx_lat,idx_lon] # for 3-D , e.g., [time, lat, lon], .subspace() function is used to extract the subset of the data

# Save data into desired format (e.g., using savemat from 'scipy' package)
 cf.write(dataset_sliced, filename_out+'.nc', mode='w')

 lm.logoff() # logoff

if __name__ == "__main__":
 main()

Sorry for the lengthy and poor organized post, but I hope you will find it useful. For any doubts or suggestions, please feel free to contact me (likymice@gmail.com).

Using HDF5/zlib Compression in NetCDF4

Not too long ago, I posted an entry on writing NetCDF files in C and loading them in R.  In that post, I mentioned that the latest and greatest version of NetCDF includes HDF5/zlib compression, but I didn’t say much more beyond that.  In this post, I’ll explain briefly how to use this compression feature in your NetCDF4 files.

Disclaimer: I’m not an expert in any sense on the details of compression algorithms.  For more details on how HDF5/zlib compression is integrated into NetCDF, check out the NetCDF Documentation.  Also, I’ll be assuming that the NetCDF4 library was compiled on your machine to enable HDF5/zlib compression.  Details on building and installing NetCDF from source code can be found in the documentation too.

I will be using code similar to what was in my previous post.  The code generates three variables (x, y, z) each with 3 dimensions.  I’ve increased the size of the dimensions by an order of magnitude to better accentuate the compression capabilities.

  // Loop control variables
  int i, j, k;
  
  // Define the dimension sizes for
  // the example data.
  int dim1_size = 100;
  int dim2_size = 50;
  int dim3_size = 200;
  
  // Define the number of dimensions
  int ndims = 3;
  
  // Allocate the 3D vectors of example data
  float x[dim1_size][dim2_size][dim3_size]; 
  float y[dim1_size][dim2_size][dim3_size];
  float z[dim1_size][dim2_size][dim3_size];
  
  // Generate some example data
  for(i = 0; i < dim1_size; i++) {
        for(j = 0; j < dim2_size; j++) {
                for(k = 0; k < dim3_size; k++) {
                        x[i][j][k] = (i+j+k) * 0.2;
                        y[i][j][k] = (i+j+k) * 1.7;
                        z[i][j][k] = (i+j+k) * 2.4;
                }
          }
        }

Next is to setup the various IDs, create the NetCDF file, and apply the dimensions to the NetCDF file.  This has not changed since the last post.

  // Allocate space for netCDF dimension ids
  int dim1id, dim2id, dim3id;
  
  // Allocate space for the netcdf file id
  int ncid;
  
  // Allocate space for the data variable ids
  int xid, yid, zid;
  
  // Setup the netcdf file
  int retval;
  if((retval = nc_create(ncfile, NC_NETCDF4, &ncid))) { ncError(retval); }
  
  // Define the dimensions in the netcdf file
  if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }
  
  // Gather the dimids into an array for defining variables in the netcdf file
  int dimids[ndims];
  dimids[0] = dim1id;
  dimids[1] = dim2id;
  dimids[2] = dim3id;

Here’s where the magic happens.  The next step is to define the variables in the NetCDF file.  The variables must be defined in the file before you tag it for compression.

  // Define the netcdf variables
  if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

Now that we’ve defined the variables in the NetCDF file, let’s tag them for compression.

  // OPTIONAL: Compress the variables
  int shuffle = 1;
  int deflate = 1;
  int deflate_level = 4;
  if((retval = nc_def_var_deflate(ncid, xid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, yid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, zid, shuffle, deflate, deflate_level))) { ncError(retval); }

The function nc_def_var_deflate() performs this.  It takes the following parameters:

  • int ncid – The NetCDF file ID returned from the nc_create() function
  • int varid – The variable ID associated with the variable you would like to compress.  This is returned from the nc_def_var() function
  • int shuffle – Enables the shuffle filter before compression.  Any non-zero integer enables the filter.  Zero disables the filter.  The shuffle filter rearranges the byte order in the data stream to enable more efficient compression. See this performance evaluation from the HDF group on integrating a shuffle filter into the HDF5 algorithm.
  • int deflate – Enable compression at the compression level indicated in the deflate_level parameter.  Any non-zero integer enables compression.
  • int deflate_level – The level to which the data should be compressed.  Levels are integers in the range [0-9].  Zero results in no compression whereas nine results in maximum compression.

The rest of the code doesn’t change from the previous post.

  // OPTIONAL: Give these variables units
  if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }
  
  // End "Metadata" mode
  if((retval = nc_enddef(ncid))) { ncError(retval); }
  
  // Write the data to the file
  if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }
  
  // Close the netcdf file
  if((retval = nc_close(ncid))) { ncError(retval); }

So the question now is whether or not it’s worth compressing your data.  I performed a simple experiment with the code presented here and the resulting NetCDF files:

  1. Generate the example NetCDF file from the code above using each of the available compression levels.
  2. Time how long the code takes to generate the file.
  3. Note the final file size of the NetCDF.
  4. Time how long it takes to load and extract data from the compressed NetCDF file.

Below is a figure illustrating the results of the experiment (points 1-3).

compress_plot

Before I say anything about these results, note that individual results may vary.  I used a highly stylized data set to produce the NetCDF file which likely benefits greatly from the shuffle filtering and compression.  These results show a compression of 97% – 99% of the original file size.  While the run time did increase, it barely made a difference until hitting the highest compression levels (8,9).  As for point 4, there was only a small difference in load/read times (0.2 seconds) between the uncompressed and any of the compressed files (using ncdump and the ncdf4 package in R).  There’s no noticeable difference among the load/read times for any of the compressed NetCDF files.  Again, this could be a result of the highly stylized data set used as an example in this post.

For something more practical, I can only offer anecdotal evidence about the compression performance.  I recently included compression in my current project due to the large possible number of multiobjective solutions and states-of-the-world (SOW).  The uncompressed file my code produced was on the order of 17.5 GB (for 300 time steps, 1000 SOW, and about 3000 solutions).  I enabled compression of all variables (11 variables – 5 with three dimensions and 6 with two dimensions – compression level 4).  The next run produced just over 7000 solutions, but the compressed file size was 9.3 GB.  The down side is that it took nearly 45 minutes to produce the compressed file, as opposed to 10 minutes with the previous run.  There are many things that can factor into these differences that I did not control for, but the results are promising…if you’ve got the computer time.

I hope you found this post useful in some fashion.  I’ve been told that compression performance can be increased if you also “chunk” your data properly.  I’m not too familiar with chunking data for writing in NetCDF files…perhaps someone more clever than I can write about this?

Acknowledgement:  I would like to acknowledge Jared Oyler for his insight and helpful advice on some of the more intricate aspects of the NetCDF library.

From Writing NetCDF Files in C to Loading NetCDF Files in R

So much data from such little models…

It’s been my experience that even simple models can generate lots of data. If you’re a regular reader of this blog, I can imagine you’ve had similar experiences as well. My most recent experience with this is the work I’ve done with the Dynamic Integrated Climate-Economic model (DICE). I had inherited a port of the 2007 version of the model, which would print relevant output to the screen. During my initial runs with the model, I would simply redirect the output to ascii files for post-processing. I knew that eventually I would be adding all sorts of complexity to this model, ultimately leading to high-dimensional model output and rendering the use of ascii files as impractical. I knew that I would need a better way to handle all this data. So in updating the model to the 2013 version, I decided to incorporate support for netCDF file generation.

You can find details about the netCDF file format through Unidata (a University Cooperation for Atmospheric Research [UCAR] Community Program) and through some of our previous blog posts (here, here, and here). What’s important to note here is that netCDF is a self-describing file format designed to manage high-dimensional hierarchical data sets.

I had become accustomed to netCDF files in my previous life as a meteorologist. Output from complex numerical weather prediction models would often come in netCDF format. While I had never needed to generate my own netCDF output files, I found it incredibly easy and convenient to process them in R (my preferred post-processing and visualization software). Trying to incorporate netCDF output support in my simple model seemed daunting at first, but after a few examples I found online and a little persistence, I had netCDF support incorporated into the DICE model.

The goal of this post is to guide you through the steps to generate and process a netCDF file. Some of our earlier posts go through a similar process using the Python and Matlab interfaces to the netCDF library. While I use R for post-processing, I generally use C/C++ for the modeling; thus I’ll step through generating a netCDF file in C and processing the generated netCDF file in R on a Linux machine.

Edit:  I originally put a link to following code at the bottom of this post.  For convenience, here’s a link to the bitbucket repository that contains the code examples below.

Writing a netCDF file in C…

Confirm netCDF installation

First, be sure that netCDF is installed on your computing platform. Most scientific computing clusters will have the netCDF library already installed. If not, contact your system administrator to install the library as a module. If you would like to install it yourself, Unidata provides the source code and great documentation to step you through the process. The example I provide here isn’t all that complex, so any recent version (4.0+) should be able to handle this with no problem.

Setup and allocation

Include the header files

With the netCDF libraries installed, you can now begin to code netCDF support into your model. Again, I’ll be using C for this example. Begin by including the netCDF header file with your other include statements:

#include <stdlib.h>
#include <stdio.h>
#include <netcdf.h>

Setup an error handler

The netCDF library includes a nice way of handling possible errors from the various netCDF functions. I recommend writing a simple wrapper function that can take the returned values of the netCDF functions and produce the appropriate error message if necessary:

void ncError(int val)
{
  printf("Error: %s\n", nc_strerror(val));
  exit(2);
}

Generate some example data

Normally, your model will have generated important data at this point. For the sake of the example, let’s generate some data to put into a netCDF file:

  // Loop control variables
  int i, j, k;
  
  // Define the dimension sizes for
  // the example data.
  int dim1_size = 10;
  int dim2_size = 5;
  int dim3_size = 20;
  
  // Define the number of dimensions
  int ndims = 3;
  
  // Allocate the 3D vectors of example data
  float x[dim1_size][dim2_size][dim3_size]; 
  float y[dim1_size][dim2_size][dim3_size];
  float z[dim1_size][dim2_size][dim3_size];
  
  // Generate some example data
  for(i = 0; i < dim1_size; i++) {
      for(j = 0; j < dim2_size; j++) {
          for(k = 0; k < dim3_size; k++) {
              x[i][j][k] = (i+j+k) * 0.2;
              y[i][j][k] = (i+j+k) * 1.7;
              z[i][j][k] = (i+j+k) * 2.4;
          }
      }
    }

This generates three variables, each with three different size dimensions. Think of this, for example, as variables on a 3-D map with dimensions of [latitude, longitude, height]. In my modeling application, my dimensions were [uncertain state-of-the-world, BORG archive solution, time].

Allocate variables for IDs

Everything needed in creating a netCDF file depends on integer IDs, so the next step is to allocate variables for the netCDF file id, the dimension ids, and the variable ids:

// Allocate space for netCDF dimension ids
int dim1id, dim2id, dim3id;
  
// Allocate space for the netcdf file id
int ncid;
  
// Allocate space for the data variable ids
int xid, yid, zid;

Each one of these IDs will be returned through reference by the netCDF functions. While we’re at it, let’s make a variable to hold the return status of the netCDF function calls:

// Allocate return status variable
int retval;

Define the meta-data

Now we will start to build the netCDF file. This is a two-part process. The first part is defining the meta-data for the file and the second part is assigning the data.

Create an empty netCDF file

First, create the file:

// Setup the netcdf file
if((retval = nc_create("example.nc", NC_NETCDF4, &ncid))) { ncError(retval); }

Note that we store the return status of the function call in retval and test the return status for an error. If there’s an error, we pass retval to our error handler. The first parameter to the function call is the name of the netCDF file. The second parameter is a flag that determines the type of netCDF file. Here we use the latest-and-greatest type of NETCDF4, which includes the HDF5/zlib compression features. If you don’t need these features, or you need a version compatible with older versions of netCDF libraries, then use the default or 64-bit offset (NC_64BIT_OFFSET) versions. The third parameter is the netCDF integer ID used for assigning variables to this file.

 Add the dimensions

Now that we have a clean netCDF file to work with, let’s add the dimensions we’ll be using:

 // Define the dimensions in the netcdf file
 if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
 if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
 if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }
  
 // Gather the dimids into an array for defining variables in the netcdf file
 int dimids[ndims];
 dimids[0] = dim1id;
 dimids[1] = dim2id;
 dimids[2] = dim3id;

Just as before, we catch and test the function return status for any errors. The function nc_def_dim() takes four parameters. First is the netCDF file ID returned when we created the file. The second parameter is the name of the dimension. Here we’re using “dimX_size” – you would want to use something descriptive of this dimension (i.e. latitude, time, solution, etc.). The third parameter is the size of this dimension (i.e. number of latitude, number of solutions, etc.). The last is the ID for this dimension, which will be used in the next step of assigning variables. Note that we create an array of the dimension IDs to use in the next step.

 Add the variables

The last step in defining the meta-data for the netCDF file is to add the variables:

// Define the netcdf variables
if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

The nc_def_var() function takes 6 parameters. These include (in order) the netCDF file ID, the variable name to be displayed in the file, the type of data the variable contains, the number of dimensions of the variable, the IDs for each of the dimensions, and the variable ID (which is returned through reference). The type of data in our example is NC_FLOAT, which is a 32-bit floating point. The netCDF documentation describes the full set of data types covered. The IDs for each dimension are passed as that combined array of dimension IDs we made earlier.

 Optional: Add variable attributes

This part is optional, but is incredibly useful and true to the spirit of making a netCDF file. When sharing a netCDF file, the person receiving the file should have all the information they need about the data within the file itself. This can be done by adding “attributes”. For example, let’s add a “units” attribute to each of the variables:

 // OPTIONAL: Give these variables units
 if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
 if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
 if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }

The function nc_put_att_text() puts a text-based attribute onto a variable. The function takes the netCDF ID, the variable ID, the name of the attribute, the length of the string of characters for the attribute, and the text associated with the attribute. In this case, we’re adding an attribute called “units”. Variable ‘x’ has units of “cm”, which has a length of 2. Variable ‘y’ has units of “degC”, which has a length of 4 (and so on). You can apply text-based attributes as shown here or numeric-based attributes using the appropriate nc_put_att_X() function (see documentation for the full list of numeric attribute functions). You can also apply attributes to dimensions by using the appropriate dimension ID or set a global attribute using the ID “0” (zero).

 End the meta-data definition portion

At this point, we’ve successfully created a netCDF file and defined the necessary meta-data. We can now end the meta-data portion:

 // End "Metadata" mode
  if((retval = nc_enddef(ncid))) { ncError(retval); }

…and move on to the part 2 of the netCDF file creation process.

Populate the file with data

Put your data into the netCDF file

Here, all we do is put data into the variables we defined in the file:

 // Write the data to the file
 if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
 if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
 if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }

The function nc_put_var() takes three parameters: the netCDF file ID, the variable ID, and the memory address of the start of the multi-dimensional data array. At this point, the data will be written to the variable in the netCDF file. There is a way to write to the netCDF file in data chunks, which can help with memory management, and a way to use parallel I/O for writing data in parallel to the file, but I have no experience with that (yet). I refer those interested in these features to the netCDF documentation.

Finalize the netCDF file

That’s it! We’re done writing to the netCDF file. Time to close it completely:

 // Close the netcdf file
 if((retval = nc_close(ncid))) { ncError(retval); }

Compile and run the code

Let’s compile and run the code to generate the example netCDF file:

gcc -o netcdf_example netcdf_write_example.c -lnetcdf

Some common problems people run into here are not including the netCDF library flag at the end of the compilation call, not having the header files in the include-path, and/or not having the netCDF library in the library-path. Check your user environment to make sure the netCDF paths are included in your C_INCLUDE_PATH and LIBRARY_PATH:

env | grep –i netcdf

Once the code compiles, run it to generate the example netCDF file:

./netcdf_example

If everything goes according to plan, there should be a file called “example.nc” in the same directory as your compiled code. Let’s load this up in R for some post-processing.

 Reading a netCDF file in R…

Install and load the “ncdf4” package

To start using netCDF files in R, be sure to install the netCDF package “ncdf4”:

install.packages("ncdf4")
library(ncdf4)

Note that there’s also an “ncdf” package. The “ncdf” package reads and writes the classic (default) and 64-bit offset versions of netCDF file. I recommend against using this package as the new package “ncdf4” can handle the old file versions as well as the new netCDF4 version.  Turns out the “ncdf” package has been removed from the CRAN repository.  It’s just as well since the new “ncdf4” package obsoletes the “ncdf” package.


Open the netCDF file

With the library installed and sourced, let’s open the example netCDF file we just created:

 nc <- nc_open("example.nc")

This stores an open file handle to the netCDF file.

View summary of netCDF file

Calling or printing the open file handle will produce a quick summary of the contents of the netCDF file:

 print(nc)

This summary produces the names of the available variables, the appropriate dimensions, and any global/dimension/variable attributes.

Extract variables from the netCDF file

To extract those variables, use the command:

x <- ncvar_get(nc, "x")
y <- ncvar_get(nc, "y")
z <- ncvar_get(nc, "z")

At this point, the data you extracted from the netCDF file are loaded into your R environment as 3-dimensional arrays. You can treat these the same as you would any multi-dimensional array of data (i.e. subsetting, plotting, etc.). Note that the dimensions are reported in reverse order from which you created the variables.

dim(x)

 Close the netCDF file

When you’re done, close the netCDF file:

nc_close(nc)

And there you have it! Hopefully this step-by-step tutorial has helped you incorporate netCDF support into your project. The code I described here is available through bitbucket.

Happy computing!

~Greg

Simple Python script to create a command to call the Borg executable

Please see my GitHub for a new Python program that may be of interest. If the link becomes broken in the future, look up the ‘misc_scripts’ repository within ‘jrkasprzyk’ GitHub.

It saves you having to type out a long Borg command when using the ‘command line interface’, especially when there are many many decision variables. The script also contains a simple set of Python objects of interest: Objective contains the name and epsilon of an objective, and Decision contains the name, lower bound, and upper bound of an objective.

You can imagine that this would be able to facilitate complicated MOEA analyses such as random seed analysis, and the running of multiple problems. I have also used this to organize my thoughts when doing other analyses in Python such as sampling of the decision space, and different sorts of processing of MOEA output.

Useful Linux commands to handle text files and speed up work

Most of us, nice and normal human beings, tend to prefer programs with GUIs over typing commands on a command prompt because the former looks more “real” and is more visual than the latter. However, one thing we don’t know (or, sometimes, don’t want to know) is that learning a few terminal commands can dramatically increase productivity. These commands can save us a lot of time by sparing us from opening and closing programs, navigating through menus and windows, moving the mouse around, as well as moving the hand back and forth from the mouse to the keyboard.

This post will mention and briefly describe some useful “intermediate level” Linux commands (some basic commands are presented in this post by Jon Herman), which can be called from a Linux OS, Cygwin (mostly), or Mac. Among the numerous tedious tasks these commands can greatly simplify is the particularly interesting chore of handling text files, be they scripts or data files. Commands for other tasks are covered as well. Keep in mind that the symbol * is a wild card (character that can mean any string of characters when searching for files), which is really useful when the goal is to repeatedly apply one command to multiple files. For all commands listed here skip the “$” character.

DELIMITER SEPARATED FILES HANDLING

  • Remove columns 30 to 36 (starting from 0) from a comma separated file and export the output to another file.
    $ cut -d',' -f1-30,36 input.file >> output.file

    (More information on the post by Joe Kasprzyk)

  • Print only columns 2 and 4 (starting from 1) of a comma separated file.
    $ awk -F "," '{print $2,$4}' input.file >> output.file
  • Count number of columns in a file separated either by spaces or commas:
    $ head -1 input.file | sed 's/[^, ]//g' | wc -c
    or:
    $ awk -F "[, ]" 'END{print NF}' input.file
  • Print lines of a comma separated file in which the value in the 2nd column is lower than 100 and the value in the 5th column is higher than 0.3:
    $ awk -F "," '$2<100 && $5>0.3' input.file >> output.file
  • Print lines between 10th and 20th lines (not inclusive) of a file:
    $ awk 'NR>10 && NR<20' input.file >> output.file
  • Add a string to the end of multiple files:
    $ echo "your string" | tee -a *.set
  • Add a string to the end of one file:
    $ echo "your string" >> file

FILE SEARCHING

  • Find all text files in a folder that contain a certain string:
    $ grep -rn './folder' -e your_string
  • Find files recursively (* is a wildcard):
    $ find -type f -name name_of*.file

FILES INFO

  • See the contents of a zip/tar file without extracting it. Press q to quit.
    $ less file.tar
  • Count number of lines in a file:
    $ wc -l your.file
  • List all files with a certain extension in a directory:
    $ ls *.ext
  • Print files and folders in tree fashion:
    $ tree
  • Print the size of all subfolders and files in (sub)folders to a certain max depth in the folder hierarchy:
    $ du -h -a --max-depth=2

IMAGE HANDLING

  • Convert svg files to png (you need to have Inkscape installed):
    $ inkscape input.svg -d 300 -e output.png
  • Convert svg files to pdf-latex (you need to have Inkscape installed):
    $ inkscape input.svg --export-pdf output.pdf --export-latex
  • Rotate a picture:
    $ convert Fig6_prim.png -rotate 90 Fig6_prim_rotated.png

MISCELLANEOUS

  • See the history of commands you have typed:
    $ history
  • See a calendar (month and year optional):
    $ cal [month] [year]
  • Combine pdf files into one (you need to have pdftk installed):
    $ pdftk file1.pdf file2.pdf file3.pdf cat output newfile.pdf
    or, to merge all pdf files in a directory:
    $ pdftk *.pdf cat output newfile.pdf

    In order to see how to combine only certain pagers of pdf files, as well as how to splits all pages into separate pdf files, see this page.

  • See the manual of a command:
    $ man command

Another useful idea is that of piping outputs of a command to another command. For example, if you want print the number of files in a directory, you can pipe the output of the ls command (list all files in a directory) to the wc -l command (count the number of lines in a string). For this, use the “|” character:

$ ls | wc -l

However, you may want instead to check the number of lines in all the 20 files in a directory at once, which can also be achieved by combining the ls and wc commands with the command xargs. The command then would look like:

$ ls | xargs wc -l

The command xargs breaks down the output from ls into one string per line and then calls wc -l for each line of the output of ls.

Hope all this saves you some time!