This is the first of what I hope will be a series of posts and discussions about NetCDF (Network Common Data Form), a data storage and access interface created by the Unidata team at UCAR. In this post, I’ll describe recent efforts to convert a horrifying amount of plaintext data into the NetCDF format using the netCDF4-python module. But first, a quick word on installation:
Installing NetCDF requires both the underlying NetCDF and HDF5 C libraries, as well as the higher-level API for your language of choice (there are also APIs for R, Java, and I believe Matlab/Octave, in addition to the bedrock C/C++ and Fortran APIs). On Windows, you can probably just grab the appropriate binary from the link above and install it as you would NumPy/SciPy. On OSX/Linux, you may need to chase down a series of dependencies with your package manager, which I won't cover in this post. To test whether Python can find your installation, run import netCDF4 in your Python interpreter; importing without errors is a necessary but not sufficient condition for success.
Here is the situation. We have been running a global hydrology experiment in which we model the average monthly runoff at 1-degree resolution for a number of parameter ensembles. There's much more to it if you're interested, but that's the synopsis. Due to queue limits, we divided the ensembles across 200 separate jobs. I unwisely wrote the output to tab-separated text files, which means we now have one output file for every grid cell on the globe (about 16,000 for the land surface) for every job (200). I'll spare you the math: that's 3.2 million text files floating around. Each one is only about 7 KB, but that number of files is enough to drag down any filesystem, not to mention it's difficult to document, maintain, and share.
Who wants to receive a tarball with 3.2 million files? Not me. Let’s try to do better.
One thing you’ll notice about this data is that it’s hierarchical. Every grid cell contains a number of parameter sets, and every parameter set contains monthly runoff values. If only some smart people had invented a data format to handle multidimensional arrays of hierarchical data! (Enter NetCDF, whose current version 4 format is built on top of HDF5, the Hierarchical Data Format.)
A NetCDF file consists of three main building blocks: Groups, Dimensions, and Variables. I’m not using groups in this post; they are useful when, for example, you have data from different experiments, or multiple versions of the same experiment.
Dimensions describe, well, the dimensions of your multidimensional data. If you’re coming from a C background, you can think of this step like allocating memory. In Python, opening a NetCDF file for writing and creating some dimensions looks like this:
```python
from netCDF4 import Dataset
import numpy as np

root_grp = Dataset('vic_LHS_climatology.nc', 'w', format='NETCDF4')
root_grp.description = 'Results from VIC 10K Latin Hypercube ensemble, 60-year simulation on Blue Waters'

# dimensions
root_grp.createDimension('lat', 180)
root_grp.createDimension('lon', 360)
root_grp.createDimension('month', 12)
root_grp.createDimension('ensemble', 10000)
```
The call to Dataset returns an object for accessing the “root group” of the file. Keep it around, because you’ll use it for all future reads and writes. Then we create our dimensions: latitude, longitude, month, and parameter ensemble. We specify a name and a size for each. Notice that we haven’t assigned any values here; we’ve only declared how large each dimension will be.
Next we create our variables, which each use one or more of the dimensions above:
```python
# variables
latitudes = root_grp.createVariable('latitude', 'f4', ('lat',))
longitudes = root_grp.createVariable('longitude', 'f4', ('lon',))
vic_runoff = root_grp.createVariable('vic_runoff', 'f4',
                                     ('lat', 'lon', 'ensemble', 'month',),
                                     fill_value=-9999.0)
obs_runoff = root_grp.createVariable('obs_runoff', 'f4',
                                     ('lat', 'lon', 'month'),
                                     fill_value=-9999.0)

vic_runoff.units = 'mm/month'
obs_runoff.units = 'mm/month'

# set the variables we know first (1-degree cell centers).
# note the [:] -- assigning to a slice writes the values into the file,
# while plain assignment would just rebind the Python name
latitudes[:] = np.arange(-89.5, 90.5, 1.0)
longitudes[:] = np.arange(0.5, 360.5, 1.0)
```
The createVariable function allows us to specify a name, a datatype (in this case a 32-bit floating-point value, specified by 'f4'), and, importantly, the list of dimensions that define this variable. Note that latitude and longitude are both dimensions and variables: we first defined their lengths, and now we assign them values. We don't need to do this for the month and ensemble dimensions, because their values are implied by their array indices (months 1-12 and ensemble members 1-10000, respectively). The vic_runoff variable uses all four dimensions, so it will be a 4-dimensional array. The obs_runoff variable does not depend on the ensemble dimension, so it is only 3-dimensional.
Something that is both dead simple and incredibly useful: you can specify metadata like units directly in the file! So here we can specify the units of these variables, and the metadata will follow the data wherever it goes. It’s a little embarrassing to think that I haven’t been doing this all along.
Finally, the latitude and longitude variables already have values that we know (a 1-degree grid across the globe), so we can assign them straightaway. Note that NetCDF variables behave like NumPy arrays, and we manipulate them accordingly: assigning to a slice with [:] writes the values into the file.
Now comes the hard part. We need to read in all of our text data and stick it into the two multidimensional arrays vic_runoff and obs_runoff. I’ll spare you the exact details, but here’s the main idea:
```
for each grid cell:
    for each job file:
        read in the text column data
        save it as a chunk of a temporary array

    # write the temporary array to the netcdf file for this grid cell
    vic_runoff[lati, loni, :, :] = tempstore
```
The reason for the temporary array is that the full dataset will not fit in working memory, so we read it in chunks and assign each chunk to a slice of the vic_runoff variable. The line where this assignment occurs writes the data to disk, so it is much slower than a regular NumPy array assignment. In general, you want to order the dimensions the same way you plan to loop over them: I originally had the dimensions of vic_runoff as (ensemble, month, lat, lon) and, believe me, it was slow. This mirrors array storage in most programming languages, where the fastest access runs along contiguous elements; with lat and lon first, each per-cell write of the (ensemble, month) block is one contiguous chunk of the file.
At the end, a call to root_grp.close() wraps things up. And now our massive unwieldy glob of text files has been condensed into a single NetCDF file that also contains metadata! (Read: we can share the dataset without shame!) Future posts will cover working with NetCDF datasets after they’ve been created and visualizing their contents.
Here is the full gist for the more adventurous: