In the previous post I described the process of creating a NetCDF file to store model output on a lat-lon grid. The main variable in this file was named vic_runoff (runoff from the VIC hydrologic model), and I set it up as a 4-dimensional array along the following dimensions: latitude (180), longitude (360), ensemble (10000), and month (12). I included the _FillValue attribute to account for the sea surface grid cells, since we are only doing land surface hydrology here.
The problem is that the full lat-lon grid is being stored, including the sea surface grid cells (which all contain fill values of -9999.0). I mistakenly assumed that these would be compressed in the file, but they were not by default: the 180 x 360 x 10000 x 12 matrix of 32-bit floats takes up about 29 GB, exactly as one would expect. This is an awful lot of wasted space.
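The 29 GB figure can be checked with a quick back-of-the-envelope calculation. The sketch below is only arithmetic on the dimension sizes quoted above; the helper name `array_size_gib` is made up for illustration, and it assumes "GB" here means binary gigabytes (GiB). In decimal units the same array is about 31.1 GB.

```python
BYTES_PER_FLOAT32 = 4
GIB = 1024 ** 3  # binary gigabyte


def array_size_gib(*dims, itemsize=BYTES_PER_FLOAT32):
    """Raw (uncompressed) size of a dense array with the given dimensions."""
    n_elements = 1
    for d in dims:
        n_elements *= d
    return n_elements * itemsize / GIB


# latitude x longitude x ensemble x month, stored as 32-bit floats
full_grid = array_size_gib(180, 360, 10000, 12)
print(f"full lat-lon grid: {full_grid:.1f} GiB")
```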
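There are two approaches we might try here. First, when we create the variable in the NetCDF file, we can turn on gzip (zlib) compression like so (shown here with the ncells dimension that is introduced below; the zlib=True argument works the same way on the lat-lon version):

    vic_runoff = root_grp.createVariable('vic_runoff', 'f4', ('ncells', 'ensemble', 'month',), fill_value=-9999.0, zlib=True)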
I’ve found this approach to slow things down considerably, both when creating the file and when reading from it. A simpler idea is just to stop storing the sea surface grid cells. Instead, we can create a dimension called ncells that just lists our land surface grid cells (at 1-degree resolution, there are 15836 land surface cells). The setup looks like this:
    from netCDF4 import Dataset

    # New file
    root_grp = Dataset('vic_LHS_climatology_cells.nc', 'w', format='NETCDF4')
    root_grp.description = '(Cells only) Results from VIC 10K Latin Hypercube ensemble, 60-year simulation on Blue Waters'

    # dimensions
    root_grp.createDimension('ncells', 15836)
    root_grp.createDimension('month', 12)
    ensemble = root_grp.createDimension('ensemble', 10000)

    # variables
    latitudes = root_grp.createVariable('latitude', 'f4', ('ncells',))
    longitudes = root_grp.createVariable('longitude', 'f4', ('ncells',))
    vic_runoff = root_grp.createVariable('vic_runoff', 'f4', ('ncells', 'ensemble', 'month',), fill_value=-9999.0)
    obs_runoff = root_grp.createVariable('obs_runoff', 'f4', ('ncells', 'month'), fill_value=-9999.0)
Note that latitude and longitude are still variables in the NetCDF file, because we need to store the locations of these grid cells. But we replace the dimensions lat and lon with ncells, which ensures that we will not be storing a whole bunch of wasted space.
Below is a gist showing an example of copying an existing NetCDF file (with a full lat-lon grid) into a new NetCDF file containing only a list of the grid cells for which we have data. This reduces the file size from 29 GB to 7 GB, without any compression! The only downside is that it takes a bit more effort to plot the data, as I’ll describe in the next post.