Map making in Matlab

Greetings,

This week's post will cover the basics of generating maps in Matlab. Julie's recent post showed how to do some of this in Python, but Matlab is also widely used by the community. You can get a lot done with Matlab, but in this post we'll just cover a few of the basics.

We'll start off by plotting a map of the continental United States, with the states. We do this with three commands: usamap, shaperead, and geoshow. usamap creates an empty map axes with a Lambert projection covering the area of the US, or any state or collection of states. shaperead reads shapefiles (duh) and returns a Matlab geographic data structure, composed of both geographic data and attributes. This Matlab data structure then interfaces really well with various Matlab functions (duh). Finally, geoshow plots geographic data, in our case on the map axes we defined. Here's some code putting it all together.

hold on
figure1 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
tightmap
hold off

Note that ‘usastatehi’ is a shapefile containing the US states (duh) that’s distributed with Matlab. The above code generates this figure:

conus_blank

Now, suppose we wanted to plot some data, say a precipitation forecast, on our CONUS map. Let's assume our forecast is made at many points (lat, long). To interpolate between the points for plotting we'll use Matlab's griddata function. Once we've done this, we use Matlab's contourm command. This works exactly like the normal contour function, but the 'm' indicates it plots map data.

xi = min(x):0.5:max(x);
yi = min(y):0.5:max(y);
[XI, YI] = meshgrid(xi,yi);
ZI = griddata(x,y,V,XI,YI);

hold on
figure2 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])

contourm(YI,-1*XI,ZI)
tightmap
hold off

Here x, y, and V are vectors of longitude, latitude, and forecasted precipitation, respectively. This code generates the following figure:

conus_contour

Wow! Louisiana is really getting hammered! Let's take a closer look. We can do this by changing the argument to usamap to indicate we want to consider only Louisiana. Note that usamap accepts US postal code abbreviations.

ax = usamap('LA');

Making that change results in this figure:

LA_contour

Neat! We can also look at two states and add annotations. Suppose, for no reason in particular, you're interested in the location of Tufts University relative to Cornell. We can make a map to look at this with the textm and scatterm functions. As before, the 'm' indicates the functions plot on a map axes.

hold on
figure4 = figure;
ax = usamap({'MA','NY'});

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
scatterm(42.4075,-71.1190,100,'k','filled')
textm(42.4075+0.2,-71.1190+0.2,'Tufts','FontSize',30)

scatterm(42.4491,-76.4842,100,'k','filled')
textm(42.4491+0.2,-76.4842+0.2,'Cornell','FontSize',30)
tightmap
hold off

This code generates the following figure.

Cornell_Tufts

Cool! Now back to forecasts.  NOAA distributes short term Quantitative Precipitation Forecasts (QPFs) for different durations every six hours.  You can download these forecasts in the form of shapefiles from a NOAA server.  Here’s an example of a 24-hour rainfall forecast made at 8:22 AM UTC on April 29.

fill_94qwbg

Wow, that's a lot of rain! Can we plot our own version of this map using Matlab? You bet! Again we'll use usamap, shaperead, and geoshow. The for loop, (0,1) scaling, and log transform are simply to make the color map more visually appealing for the post. There's probably a cleaner way to do this, but this got the job done!

figure5 = figure;
ax = usamap('conus');
S=shaperead('94q2912','UseGeoCoords',true);

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
p = colormap(jet);

N = max(size(S));
d = zeros(N,1);
for i = 1:N
 d(i) = log(S(i).QPF);
end

y=floor(((d-min(d))/range(d))*63)+1;
col = p(y,:);
for i = 1:N
 geoshow(S(i),'FaceColor',col(i,:),'FaceAlpha',0.5)%,'SymbolSpec', faceColors)
end

This code generates the following figure:

conus_shape

If you are not plotting in the US, Matlab also has a worldmap command. This works exactly the same as usamap, but now for the world (duh). Matlab is distributed with a shapefile 'landareas.shp' which contains all of the land areas in the world (duh). Generating a global map is then trivial:

figure6 = figure;

worldmap('World')
land = shaperead('landareas.shp', 'UseGeoCoords', true);
geoshow(land, 'FaceColor', [0.15 0.5 0.15])

Which generates this figure.

globe

 

Matlab also comes with a number of other included shapefiles that might be of interest, for instance shapefiles detailing the locations of major world cities, lakes, and rivers. We can plot those with the following code:

figure7 = figure;

worldmap('World')
land = shaperead('landareas.shp', 'UseGeoCoords', true);
geoshow(land, 'FaceColor', [0.15 0.5 0.15])
lakes = shaperead('worldlakes', 'UseGeoCoords', true);
geoshow(lakes, 'FaceColor', 'blue')
rivers = shaperead('worldrivers', 'UseGeoCoords', true);
geoshow(rivers, 'Color', 'blue')
cities = shaperead('worldcities', 'UseGeoCoords', true);
geoshow(cities, 'Marker', '.', 'Color', 'red')

Which generates the figure:

globe_river

But suppose we're interested in one country or a group of countries. worldmap works the same way usamap does. You can also plot continents, for instance Europe.

worldmap('Europe')

Europe.png

Those are the basics, but there are many other capabilities, including 3-D projections. I can cover this in a later post if there is interest.

contour

That’s it for now!

Alluvial Plots

We all love parallel coordinates plots and use them all the time to display our high dimensional data and tell our audience a good story. But sometimes we have so many data points that the existence (or lack) of tradeoffs cannot be clearly seen, or the data to be plotted is categorical and therefore awkwardly displayed in a parallel coordinates plot.

One possible solution to both issues is the use of alluvial plots. Alluvial plots work similarly to parallel coordinates plots, but instead of having ranges of values on the axes, each axis contains bins whose sizes depend on how many data points belong to that bin. Data points that fall within the same categories in all axes are grouped into alluvia (stripes), whose thicknesses reflect the number of data points in each alluvium.

Next are two examples of alluvial plots, the first displaying categorical data and the second displaying continuous data that would normally be plotted in a parallel coordinates plot. After the examples, there is code available to generate alluvial plots in R (I know, I don't like using R, but creating alluvial plots in R is easier than you think).

Categorical data

The first example (Figure 1) comes from the CRAN vignette for the alluvial package. It uses alluvial plots to display data about all of the Titanic's passengers and crew, grouping them into categories according to class, sex, age, and survival status.

alluvial_cran

Figure 1 – Titanic passenger/crew data. Yellow alluvia correspond to survivors and gray to the deceased. The size of each bin represents how many data points (people) belong to that category on a given axis, while the thickness of each alluvium represents how many people fall within the same categories on all axes. Source: https://cran.r-project.org/web/packages/alluvial/vignettes/alluvial.html.

Figure 1 shows that most of the passengers were male and adults, that the crew made up a substantial share of the people on the Titanic, and that, unfortunately, there were more deceased than survivors. We can also see that a substantial number of the people on board were adult male crew members who did not survive, which can be inferred from the thickness of the grey alluvium that passes through all of these categories. The absence of any alluvium connecting the Crew and Child bins also shows that (obviously) there were no child crew members. Finally, 1st class female passengers were the group with the greatest survival rate (100%, according to the plot), while 3rd class males had the lowest (roughly 15%, comparing the yellow and gray alluvia for 3rd class males).

Continuous data

The following example shows the results of policy modeling for a fictitious water utility using three different policy formulations. Each data point represents the modeled performance of a given candidate policy in six objectives, one in each axis. Given the uncertainties associated with the models used to generate this data, the client utility company is more concerned about whether or not a candidate policy would meet certain performance criteria according to the model (Reliability > 99%, Restriction Frequency < 20%, and Financial Risk < 10%) than about the actual objective values. The utility also wants to have a general idea of the tradeoffs between objectives.

Figure 2 was created to present the modeling results to the client water utility. The colored alluvia represent candidate policies that meet the utility's criteria, and grey alluvia represent those that do not. The continuous raw data used to generate this plot was categorized into ranges whose values are meaningful to the client utility, with the best performing bin always placed at the bottom of the plot. It is important to notice that the heights of the bins represent the number of policies that belong to each bin, so the position of the gap between two stacked bins does not represent a value on an axis, but rather the fraction of the policies that belong to each bin. Figure 2 shows that it is relatively difficult for any of the formulations to meet the Reliability > 99% criterion established by the utility. It is also striking that a remarkably small number of policies from the first two formulations, and none from the third, meet all of the utility's criteria. By following the rightmost alluvia, it is also easy to see that the vast majority of the solutions with smaller net present costs of infrastructure investment, for all three formulations, perform poorly in the reliability and restriction frequency objectives, which denotes a strong tradeoff. The fact that this tradeoff can be seen even though the cost axis sits on the opposite side of the plot from the other two is a remarkable feature of alluvial plots.

tradeoffs_all3.svg.png

Figure 2 – Alluvial plot displaying modeled performance of candidate long-term planning policies. The different subplots show different formulations (1 in the top, 3 in the bottom).

The parallel coordinates plots in Figure 3 display the same information as the alluvial plot in Figure 2. The analysis performed above, especially the tradeoff analysis, is more easily done with Figure 2 than with Figure 3. However, if the actual objective values were important for the analysis, Figure 3 would be needed, either by itself or in addition to Figure 2, with the latter likely used for pre-screening or a higher level analysis of the results.

tradeoffs_all-svg

Figure 3 – Parallel coordinates plot displaying modeled performance of candidate long-term planning policies. The different subplots show different formulations (1 in the top, 3 in the bottom).

The R code used to create Figure 1 can be found here. The code below was used to create Figure 2. The packages "alluvial" and "dplyr" need to be installed before attempting to use the provided code, for example using the R command install.packages(package_name). Also, you need to convert your continuous data into categorical data, so that each row corresponds to a possible combination of bins in all axes (one column per axis) plus a column (freqs) representing the frequency with which each combination of bins is seen in the data.
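
If your raw objective values start out in Python, one rough way to build that frequency table is to bin each column with pandas before exporting a .csv for R. The sketch below is only an illustration: the input file name, bin edges, and labels are hypothetical (they just mimic the label scheme used in this example).

import pandas as pd

# Hypothetical file of continuous objective values, one row per candidate policy
df = pd.read_csv('raw_objective_values.csv')

# Bin two of the axes using the same labels as the example data file below;
# the remaining axes would be binned the same way.
df['Reliability'] = pd.cut(df['Reliability'], bins=[0, 99, 100], labels=['2<99', '0>99'])
df['Financial.Risk'] = pd.cut(df['Financial.Risk'], bins=[0, 10, 100], labels=['0<10', '2>10'])

# Count how often each combination of bins occurs and export for R
freqs = df.groupby(list(df.columns)).size().reset_index(name='freqs')
freqs.to_csv('infra_tradeoffs_strong_freqs.csv', index=False)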

# Example datafile: snippet of file "infra_tradeoffs_strong_freqs.csv"
Reliability, Net Present Cost of Inf. Investment, Peak Financial Costs, Financial Risk, Restriction Frequency, Jordan Lake Allocation, freqs
2<99,0<60,0<25,0<10,2>20,0<70,229
0>99,2>60,0<25,0<10,2>20,0<70,0
2<99,2>60,0<25,0<10,2>20,0<70,168
0>99,0<60,2>25,0<10,2>20,0<70,0
2<99,0<60,2>25,0<10,2>20,0<70,3
0>99,2>60,2>25,0<10,2>20,0<70,2
2<99,2>60,2>25,0<10,2>20,0<70,45
0>99,0<60,0<25,2>10,2>20,0<70,0
2<99,0<60,0<25,2>10,2>20,0<70,317
0>99,2>60,0<25,2>10,2>20,0<70,0
2<99,2>60,0<25,2>10,2>20,0<70,114
# load packages and prepare data
library(alluvial)
library(dplyr)

itss <- read.csv('infra_tradeoffs_strong_freqs.csv')
itsw <- read.csv('infra_tradeoffs_weak_freqs.csv')
itsn <- read.csv('infra_tradeoffs_no_freqs.csv')

# preprocess the data (convert to data frames)
itss %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
  summarise(n = sum(freqs)) -> its_strong
itsw %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
  summarise(n = sum(freqs)) -> its_weak
itsn %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
  summarise(n = sum(freqs)) -> its_no

# setup output file
svg(filename="tradeoffs_3_formulations.svg",
    width=8,
    height=8,
    pointsize=18)
p <- par(mfrow=c(3,1))
par(bg = 'white')

# create the plots
alluvial(
  its_strong[,1:6],
  freq = its_strong$n,
  col = ifelse(its_strong$Reliability == "0>99" &
               its_strong$Restriction.Frequency == "0<20" &
               its_strong$Financial.Risk == "0<10", "blue", "grey"),
  border = ifelse(its_strong$Reliability == "0>99" &
                  its_strong$Restriction.Frequency == "0<20" &
                  its_strong$Financial.Risk == "0<10", "blue", "grey"),
  # border = "grey",
  alpha = 0.5,
  hide = its_strong$n < 1
)
alluvial(
  its_weak[,1:6],
  freq = its_weak$n,
  col = ifelse(its_strong$Reliability == "0>99" &
               its_strong$Restriction.Frequency == "0<20" &
               its_weak$Financial.Risk == "0<10", "chartreuse2", "grey"),
  border = ifelse(its_strong$Reliability == "0>99" &
                  its_strong$Restriction.Frequency == "0<20" &
                  its_weak$Financial.Risk == "0<10", "chartreuse2", "grey"),
  # border = "grey",
  alpha = 0.5,
  hide = its_weak$n < 1
)
alluvial(
  its_no[,1:6],
  freq = its_no$n,
  col = ifelse(its_strong$Reliability == "0>99" &
               its_strong$Restriction.Frequency == "0<20" &
               its_no$Financial.Risk == "0<10", "red", "grey"),
  border = ifelse(its_strong$Reliability == "0>99" &
                  its_strong$Restriction.Frequency == "0<20" &
                  its_no$Financial.Risk == "0<10", "red", "grey"),
  # border = "grey",
  alpha = 0.5,
  hide = its_no$n < 1
)
dev.off()

Saving d3.parcoords to SVG

d3.parcoords is a great library for making interactive parallel coordinates plots. A major issue, however, is that it is a pain to get the resulting plots into a format suitable for publication. In this blog post, I will show how we can turn a d3.parcoords plot into an SVG document, which we can save locally. SVG is an XML-based format for vector graphics, so it is ideal for publications.

This blog post is an example of how to get the SVG data. It is however far from complete, and there might be better ways of achieving some of the steps. Any comments or suggestions on how to improve the code are welcome. I wrote this while learning javascript, without any prior experience with respect to web technology.

First, how is a d3.parcoords plot structured? It is composed of five elements: four HTML5 canvas layers and a single SVG layer. The SVG layer contains the axis for each dimension. The four canvas layers are marks, highlight, brushed, and foreground. I am not sure what the function of the first two is, but brushed contains the lines that are selected through brushing, while foreground contains all the remaining lines.

In order to export a d3.parcoords figure as pure svg, we need to somehow replace the HTML canvas with something that has the same interface, but generates SVG instead. Luckily there are several javascript libraries that do this. See http://stackoverflow.com/questions/8571294/method-to-convert-html5-canvas-to-svg for an overview. In this example, I am using http://gliffy.github.io/canvas2svg/ , which is a recent library that still appears to be maintained.

The basic idea is the following:

  • replace the normal HTML5 canvas.context for each layer with the one from canvas2svg, and render the plot
  • extract the axis svg
  • extract the SVG from the four canvas layers and combine them, together with the axis SVG, into a single SVG document
  • save it
  • reset the canvas

To make this work, we depend on several javascript libraries in addition to the default dependencies of d3.parcoords: canvas2svg (for the SVG-generating canvas context), jQuery, vkbeautify, and FileSaver.js.

Replace canvas.context

In order to replace the canvas.context for each layer, we iterate over the names of the layers. d3.parcoords saves the contexts in an internal object, indexed by name. We keep track of the old context for each layer, because this makes restoring a lot easier at the end. We instantiate the C2S context (the class provided by canvas2svg), by specifying the width and height of the plot. In this case, I have hardcoded them for simplicity, but it would be better to extract them from the HTML or CSS.

const layerNames = ["marks", "highlight", "brushed", "foreground"];

const oldLayers = {};
let oldLayerContext;
let newLayerContext;
let layerName;
for (let i=0; i<layerNames.length; i++){
    layerName = layerNames[i];

    oldLayerContext = pc0.ctx[layerName]; //pc0 is the d3.parcoords plot
    newLayerContext = new C2S(720, 200); 

    oldLayers[layerName] = oldLayerContext;
    pc0.ctx[layerName] = newLayerContext;
}
pc0.render();

Extract the Axis svg

Getting the axis svg is straightforward. We select the svg element in the dom, serialise it to a string and next use jQuery to create a nice XML document out of the string.

const svgAxis = new XMLSerializer().serializeToString(d3.select('svg').node());
const axisXmlDocument = $.parseXML(svgAxis);

The only problem with this approach is that the SVG does not contain the style information, which is provided in the CSS. So, we need to inline this information. To do so, I created two helper functions. The first helper function allows us to set an attribute on elements that have the same tag. The second does the same, but based on class name.

// helper function for saving svg
function setAttributeByTag(xmlDocument, tagName, attribute, value){
    const paths = xmlDocument.getElementsByTagName(tagName);
    for (let i = 0; i < paths.length; i++) {
        paths[i].setAttribute(attribute, value);
    }
}

// helper function for saving svg
function setAttributeByClass(xmlDocument, className, attribute, value){
    const paths = xmlDocument.getElementsByClassName(className);
    for (let i = 0; i < paths.length; i++) {
        paths[i].setAttribute(attribute, value);
    }
}

We can now use these helper functions to inline some CSS information. Note that this is an incomplete subset of all the CSS information used by d3.parcoords. A future extension would be to extract all the d3.parcoords style information from the CSS and inline it.

setAttributeByTag(axisXmlDocument, "axis", "fill", "none");
setAttributeByTag(axisXmlDocument, "path", "stroke", "#222");
setAttributeByTag(axisXmlDocument, "line", "stroke", "#222");
setAttributeByClass(axisXmlDocument, "background", "fill", "none");

Extract the SVG from each layer

We now have an XML document to which we can add the SVG data of each of our layers. In order to keep track of the structure of the SVG, I have chosen to first create a new group node, and subsequently add each layer to this new group as a child. To make sure that this group is positioned correctly, I clone the main group node of the axis SVG, remove its children, and insert this new node at the top of the XML document.

const oldNode = axisXmlDocument.getElementsByTagName('g')[0];
const newNode = oldNode.cloneNode(true);
while (newNode.hasChildNodes()){
    newNode.removeChild(newNode.lastChild);
}
axisXmlDocument.documentElement.insertBefore(newNode, oldNode);

There is some trickery involved in what I am doing here. SVG groups are rendered on top of each other, in the order in which they appear in the XML document. It appears that one can provide a z-order as well according to the SVG 2.0 specification, but I have not pursued that direction here. By adding the newly created node to the top, I ensure that the axis information is at the end of the XML document, and thus always on top of all the other layers. For the same reason, I have also deliberately sorted the canvas layer names.

Now that we have a new node, we can iterate over our canvas layers and extract the SVG data from them. Next, we parse the XML string to turn it into an XML document. We have to overwrite a transform attribute that is used when working on a retina screen; this matters for an HTML canvas but not for SVG. For convenience, I also add the layer name as a class attribute, so we can easily spot each of the canvas layers in our SVG. The XML document for a given layer contains two main nodes. The first node contains the defs tag, which we don't need. The second node contains the actual SVG data, which is what we do need.

let svgLines;
let xmlDocument;
for (let i=0; i<layerNames.length; i++){
    // get svg for layer
    layerName = layerNames[i];
    svgLines = pc0.ctx[layerName].getSerializedSvg(true);
    xmlDocument = $.parseXML(svgLines);

    // scale is set to 2,2 on retina screens, this is relevant for canvas
    // not for svg, so we explicitly overwrite it
    xmlDocument.getElementsByTagName("g")[0].setAttribute("transform", "scale(1,1)");

    // for convenience add the name of the layer to the group as class
    xmlDocument.getElementsByTagName("g")[0].setAttribute("class", layerName);

    // add the group to the node
    // each layer has 2 nodes, a defs node and the actual svg
    // we can safely ignore the defs node
    newNode.appendChild(xmlDocument.documentElement.childNodes[1]);
}

Save it

We have all our SVG data in the xml document. All that is left is to turn this back into a string, format the string properly, turn it into a blob, and save it. We can achieve this in three lines.

// turn merged xml document into string
// we also beautify the string, but this is optional
const merged = vkbeautify.xml(new XMLSerializer().serializeToString(axisXmlDocument.documentElement));

// turn the string into a blob and use FileSaver.js to enable saving it
const blob = new Blob([merged], {type: "image/svg+xml"});
saveAs(blob, "parcoords.svg");

Reset context

We have now saved our SVG file locally, but we still have to put back our old canvas contexts. We have stored these, so we can simply loop over the layer names and restore the old context. In principle, this last step might not be necessary, but I work on machines with a retina screen and ran into scaling issues when trying to use C2S contexts outside of the save function.

// we are done extracting the SVG information so
// put the original canvas contexts back
for (let i=0; i<layerNames.length; i++){
    pc0.ctx[layerNames[i]] = oldLayers[layerNames[i]]
}
pc0.render();

Putting it all together

I have a repo on github with the full code including dependencies etc: https://github.com/quaquel/parcoords .

The code shown in this blog is not complete. For example, brushed plots will not display nicely and require some post-processing of the SVG.

For those that are more familiar with D3.parcoords, note how the coloring of the lines is dependent on which axis you select. I have connected the color to a click event on the axis to make this possible.

Using HDF5/zlib Compression in NetCDF4

Not too long ago, I posted an entry on writing NetCDF files in C and loading them in R.  In that post, I mentioned that the latest and greatest version of NetCDF includes HDF5/zlib compression, but I didn’t say much more beyond that.  In this post, I’ll explain briefly how to use this compression feature in your NetCDF4 files.

Disclaimer: I’m not an expert in any sense on the details of compression algorithms.  For more details on how HDF5/zlib compression is integrated into NetCDF, check out the NetCDF Documentation.  Also, I’ll be assuming that the NetCDF4 library was compiled on your machine to enable HDF5/zlib compression.  Details on building and installing NetCDF from source code can be found in the documentation too.

I will be using code similar to what was in my previous post.  The code generates three variables (x, y, z) each with 3 dimensions.  I’ve increased the size of the dimensions by an order of magnitude to better accentuate the compression capabilities.

  // Loop control variables
  int i, j, k;
  
  // Define the dimension sizes for
  // the example data.
  int dim1_size = 100;
  int dim2_size = 50;
  int dim3_size = 200;
  
  // Define the number of dimensions
  int ndims = 3;
  
  // Allocate the 3D vectors of example data
  float x[dim1_size][dim2_size][dim3_size]; 
  float y[dim1_size][dim2_size][dim3_size];
  float z[dim1_size][dim2_size][dim3_size];
  
  // Generate some example data
  for(i = 0; i < dim1_size; i++) {
    for(j = 0; j < dim2_size; j++) {
      for(k = 0; k < dim3_size; k++) {
        x[i][j][k] = (i+j+k) * 0.2;
        y[i][j][k] = (i+j+k) * 1.7;
        z[i][j][k] = (i+j+k) * 2.4;
      }
    }
  }

Next is to setup the various IDs, create the NetCDF file, and apply the dimensions to the NetCDF file.  This has not changed since the last post.

  // Allocate space for netCDF dimension ids
  int dim1id, dim2id, dim3id;
  
  // Allocate space for the netcdf file id
  int ncid;
  
  // Allocate space for the data variable ids
  int xid, yid, zid;
  
  // Setup the netcdf file
  int retval;
  if((retval = nc_create(ncfile, NC_NETCDF4, &ncid))) { ncError(retval); }
  
  // Define the dimensions in the netcdf file
  if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }
  
  // Gather the dimids into an array for defining variables in the netcdf file
  int dimids[ndims];
  dimids[0] = dim1id;
  dimids[1] = dim2id;
  dimids[2] = dim3id;

Here’s where the magic happens.  The next step is to define the variables in the NetCDF file.  The variables must be defined in the file before you tag it for compression.

  // Define the netcdf variables
  if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

Now that we’ve defined the variables in the NetCDF file, let’s tag them for compression.

  // OPTIONAL: Compress the variables
  int shuffle = 1;
  int deflate = 1;
  int deflate_level = 4;
  if((retval = nc_def_var_deflate(ncid, xid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, yid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, zid, shuffle, deflate, deflate_level))) { ncError(retval); }

The function nc_def_var_deflate() performs this.  It takes the following parameters:

  • int ncid – The NetCDF file ID returned from the nc_create() function
  • int varid – The variable ID associated with the variable you would like to compress.  This is returned from the nc_def_var() function
  • int shuffle – Enables the shuffle filter before compression.  Any non-zero integer enables the filter.  Zero disables the filter.  The shuffle filter rearranges the byte order in the data stream to enable more efficient compression. See this performance evaluation from the HDF group on integrating a shuffle filter into the HDF5 algorithm.
  • int deflate – Enable compression at the compression level indicated in the deflate_level parameter.  Any non-zero integer enables compression.
  • int deflate_level – The level to which the data should be compressed.  Levels are integers in the range [0-9].  Zero results in no compression whereas nine results in maximum compression.

The rest of the code doesn’t change from the previous post.

  // OPTIONAL: Give these variables units
  if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }
  
  // End "Metadata" mode
  if((retval = nc_enddef(ncid))) { ncError(retval); }
  
  // Write the data to the file
  if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }
  
  // Close the netcdf file
  if((retval = nc_close(ncid))) { ncError(retval); }
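
As an aside, if you work with NetCDF from Python rather than C, the netCDF4 package exposes these same compression options as keyword arguments to createVariable(). Here is a minimal sketch mirroring the dimensions used above (treat it as an illustration, not part of the C example):

import numpy as np
from netCDF4 import Dataset

nc = Dataset('example_compressed.nc', 'w', format='NETCDF4')
nc.createDimension('dim1_size', 100)
nc.createDimension('dim2_size', 50)
nc.createDimension('dim3_size', 200)

# zlib/complevel/shuffle correspond to deflate/deflate_level/shuffle in the C API
x = nc.createVariable('x', 'f4', ('dim1_size', 'dim2_size', 'dim3_size'),
                      zlib=True, complevel=4, shuffle=True)
x[:] = np.fromfunction(lambda i, j, k: (i + j + k) * 0.2, (100, 50, 200))
nc.close()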

So the question now is whether or not it’s worth compressing your data.  I performed a simple experiment with the code presented here and the resulting NetCDF files:

  1. Generate the example NetCDF file from the code above using each of the available compression levels.
  2. Time how long the code takes to generate the file.
  3. Note the final file size of the NetCDF.
  4. Time how long it takes to load and extract data from the compressed NetCDF file.

Below is a figure illustrating the results of the experiment (points 1-3).

compress_plot

Before I say anything about these results, note that individual results may vary. I used a highly stylized data set to produce the NetCDF file, which likely benefits greatly from the shuffle filtering and compression. These results show a 97%–99% reduction relative to the original file size. While the run time did increase, it barely made a difference until hitting the highest compression levels (8, 9). As for point 4, there was only a small difference in load/read times (0.2 seconds) between the uncompressed file and any of the compressed files (using ncdump and the ncdf4 package in R). There's no noticeable difference among the load/read times for any of the compressed NetCDF files. Again, this could be a result of the highly stylized data set used as an example in this post.

For something more practical, I can only offer anecdotal evidence about the compression performance.  I recently included compression in my current project due to the large possible number of multiobjective solutions and states-of-the-world (SOW).  The uncompressed file my code produced was on the order of 17.5 GB (for 300 time steps, 1000 SOW, and about 3000 solutions).  I enabled compression of all variables (11 variables – 5 with three dimensions and 6 with two dimensions – compression level 4).  The next run produced just over 7000 solutions, but the compressed file size was 9.3 GB.  The down side is that it took nearly 45 minutes to produce the compressed file, as opposed to 10 minutes with the previous run.  There are many things that can factor into these differences that I did not control for, but the results are promising…if you’ve got the computer time.

I hope you found this post useful in some fashion.  I’ve been told that compression performance can be increased if you also “chunk” your data properly.  I’m not too familiar with chunking data for writing in NetCDF files…perhaps someone more clever than I can write about this?

Acknowledgement:  I would like to acknowledge Jared Oyler for his insight and helpful advice on some of the more intricate aspects of the NetCDF library.

Making Movies of Time-Evolving Global Maps with Python

Hi All,

These past few months I’ve been working with the Global Change Assessment Model (GCAM) which is an integrated assessment model (IAM) that combines models of the global climate and economic systems. I wrote an earlier post on compiling GCAM on a Unix cluster.  This post discusses some visualization tools I’ve developed for GCAM output.

GCAM models energy and agriculture systems at a regional level, where the world is composed of 32 regions.  We’re interested in tracking statistics (like the policy cost of stabilization) over time and across regions.  This required three things:

  1. The ability to draw a global map.
  2. The ability to shade individual political units on that map.
  3. The ability to animate this map.

Dr. Jon Herman has already posted a good example of how to do (1) in Python using matplotlib's Basemap. We'll adapt some of his example here. Basemap has the option to draw coastlines and boundaries, but these boundaries are not tied to shapes, meaning that you can't assign different colors to individual countries (task (2) above). To do that, we need a shapefile containing information about political boundaries. You can find these for free from a number of sources online, but I like Natural Earth. They provide data on many different scales; for this application I downloaded their coarsest data set. To give each country a shade tied to data, we use matplotlib's colormap. The basic plan is to generate a colored map for each time step in our data, and then to animate the maps using the convert linux command.

Now that we’ve described roughly how we’ll proceed, a word about the data we’re dealing with and how I’ve handled it.  GCAM has 32 geo-political regions, some of which are individual countries (like the USA or China), while others are groups of countries (like Australia & New Zealand). I stored this information using a list of lists (i.e. a 32-element list, where each element is a list of countries in that region). I’ve creatively named this variable list_list in this example (see code below). For each of the regions GCAM produces a time series of policy costs as a fraction of GDP every 5 years from 2020-2100. I’ve creatively named this variable data. We want to tie the color of a country in each time to its policy cost relative to costs across countries and times.  To do this, I wrote the following (clumsy!) Python function, which I explain below.


def world_plot(data,idx,MN,MX):
 from mpl_toolkits.basemap import Basemap
 import matplotlib.pyplot as plt
 from matplotlib.patches import Polygon
 from matplotlib.collections import PatchCollection
 import matplotlib.cm as cm
 import matplotlib as mpl
 import numpy as np

 norm = mpl.colors.Normalize(vmin=MN, vmax=MX)
 cmap = cm.coolwarm
 colors=cm.ScalarMappable(norm=norm, cmap=cmap)
 colors.set_array(data)
 a = np.zeros([32,4])
 a = colors.to_rgba(data)

 fig = plt.figure(figsize=(10,10))
 ax = fig.add_subplot(111)

 m = Basemap(projection='robin', lon_0=0,resolution='c')
 m.drawmapboundary(fill_color='white', zorder=-1)
 m.drawparallels(np.arange(-90.,91.,30.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)
 m.drawmeridians(np.arange(0., 360., 60.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)

 year = [1990,2005,2010,2015,2020,2025,2030,2035,2040,2045,2050,2055,2060,2065,2070,2075,2080,2085,2090,2095,2100]
 GCAM_32 = ['PRI','USA','VIR']
 GCAM_1 = ['BDI','COM','DJI','ERI','ETH','KEN','MDG','MUS','REU','RWA','SDS','SDN','SOM','UGA','SOL']
 GCAM_2 = ['DZA','EGY','ESH','LBY','MAR','TUN','SAH']
 GCAM_3 = ['AGO','BWA','LSO','MOZ','MWI','NAM','SWZ','TZA','ZMB','ZWE']
 GCAM_4 = ['BEN','BFA','CAF','CIV','CMR','COD','COG','CPV','GAB','GHA','GIN','GMB','GNB','GNQ','LBR','MLI','MRT','NER','NGA','SEN','SLE','STP','TCD','TGO']
 GCAM_6 = ['AUS','NZL']
 GCAM_7 = ['BRA']
 GCAM_8 = ['CAN']
 GCAM_9 = ['ABW','AIA','ANT','ATG','BHS','BLZ','BMU','BRB','CRI','CUB','CYM','DMA','DOM','GLP','GRD','GTM','HND','HTI','JAM','KNA','LCA','MSR','MTQ','NIC','PAN','SLV','TTO','VCT']
 GCAM_10 = ['ARM','AZE','GEO','KAZ','KGZ','MNG','TJK','TKM','UZB']
 GCAM_11 = ['CHN','HKG','MAC']
 GCAM_13 = ['BGR','CYP','CZE','EST','HUN','LTU','LVA','MLT','POL','ROM','SVK','SVN']
 GCAM_14 = ['AND','AUT','BEL','CHI','DEU','DNK','ESP','FIN','FLK','FRA','FRO','GBR','GIB','GRC','GRL','IMN','IRL','ITA','LUX','MCO','NLD','PRT','SHN','SMR','SPM','SWE','TCA','VAT','VGB','WLF']
 GCAM_15 = ['BLR','MDA','UKR']
 GCAM_16 = ['ALB','BIH','HRV','MKD','MNE','SCG','SRB','TUR','YUG']
 GCAM_17 = ['CHE','ISL','LIE','NOR','SJM']
 GCAM_18 = ['IND']
 GCAM_19 = ['IDN']
 GCAM_20 = ['JPN']
 GCAM_21 = ['MEX']
 GCAM_22 = ['ARE','BHR','IRN','IRQ','ISR','JOR','KWT','LBN','OMN','PSE','QAT','SAU','SYR','YEM']
 GCAM_23 = ['PAK']
 GCAM_24 = ['RUS']
 GCAM_25 = ['ZAF']
 GCAM_26 = ['GUF','GUY','SUR','VEN']
 GCAM_27 = ['BOL','CHL','ECU','PER','PRY','URY']
 GCAM_28 = ['AFG','ASM','BGD','BTN','LAO','LKA','MDV','NPL']
 GCAM_29 = ['KOR']
 GCAM_30 = ['BRN','CCK','COK','CXR','FJI','FSM','GUM','KHM','KIR','MHL','MMR','MNP','MYS','MYT','NCL','NFK','NIU','NRU','PCI','PCN','PHL','PLW','PNG','PRK','PYF','SGP','SLB','SYC','THA','TKL','TLS','TON','TUV','VNM','VUT','WSM']
 GCAM_31 = ['TWN']
 GCAM_5 = ['ARG']
 GCAM_12 = ['COL']

 list_list = [GCAM_1,GCAM_2,GCAM_3,GCAM_4,GCAM_5,GCAM_6,GCAM_7,GCAM_8,GCAM_9,GCAM_10,GCAM_11,GCAM_12,GCAM_13,GCAM_14,GCAM_15,GCAM_16,GCAM_17,GCAM_18,GCAM_19,GCAM_20,GCAM_21,GCAM_22,GCAM_23,GCAM_24,GCAM_25,GCAM_26,GCAM_27,GCAM_28,GCAM_29,GCAM_30,GCAM_31,GCAM_32]
 m.readshapefile('ne_110m_admin_0_countries_lakes/ne_110m_admin_0_countries_lakes','comarques')
 num = len(list_list)
 for info, shape in zip(m.comarques_info, m.comarques):
     for i in range(num):
         if info['adm0_a3'] in list_list[i]:
             patches1 = []
             patches1.append(Polygon(np.array(shape), True))
             ax.add_collection(PatchCollection(patches1, facecolor=a[i,:], edgecolor='k', linewidths=1., zorder=2))
 ax.set_title('Policy Cost', fontsize=25, y=1.01)  # GDP Adjusted Policy Cost # Policy Cost # Policy Cost Reduction from Technology
 plt.annotate('%s' % year[idx], xy=(0.1,0.2), xytext=(0.1,0.2), xycoords='axes fraction', fontsize=30)
 cb = m.colorbar(colors, 'right')
 cb.ax.tick_params(labelsize=14)
 filename = "out/map_%s.png" % (str(idx).rjust(3, "0"))
 plt.show()
 fig.savefig(filename)
 return

The function's name is world_plot and its inputs are:

  1. The raw data for a specific time step.
  2. The index of the time step for the map we are working with (e.g. idx=0 for 2020).
  3. The minimum and maximum of the data across countries and time.

(1) is plotted, (2) is used to name the resulting png figure (line 73), and (3) is used to scale the colormap (line 11). On lines 2-8 we import the necessary Python packages, which in this case are pretty standard matplotlib packages and numpy. On lines 11-16 we generate a numpy array which contains the rgba color code for each of the data points in data. On lines 18-19 we create the pyplot figure object.

On lines 21-24 we create and format the Basemap object. Note that on line 21 I've selected the Robinson projection, but Basemap provides many other options.

Lines 26-60 are specific to this application, and certainly could have been handled more compactly if I wanted to invest the time. year is a list of time steps for our GCAM experiment, and lines 27-58 contain lists of three-letter ID codes for each GCAM region, which are assembled into a list of lists (creatively called list_list) on line 60.

On line 61 we read the data from the shapefile database which was downloaded from Natural Earth. From lines 63-68 we loop through the info and shape attributes of the shapefile database, and determine which of the GCAM geo-political units each of the administrative units in the database is associated with.  Once this is determined, the polygon associated with that administrative unit is given the correct color (lines 66-68).

Lines 69-72 are doing some final formatting and labeling, and in lines 73-75 we are giving the file a unique name (tied to the time step plotted) and saving the images to some output directory.
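
Calling the function for each time step is then just a short loop. A minimal sketch is below; data is assumed to be a 32 x (number of time steps) numpy array of policy costs, which is not shown in this post, so treat the variable names as placeholders.

# data: hypothetical numpy array of shape (32 regions, number of time steps)
MN, MX = data.min(), data.max()  # one color scale across all regions and times
for idx in range(data.shape[1]):
    world_plot(data[:, idx], idx, MN, MX)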

Looping the function over time in this way generates a sequence of figures looking something like this:

test_017

To convert this sequence of PNGs to a gif file, we use the convert command in linux (or in my case Cygwin).  So, we go to the command line and cd into the directory where we’ve saved our figures and type:

convert -delay 45 -loop 0 *.png globe_Cost_Reduction_faster.gif

Here the delay flag controls the time between frames (in hundredths of a second), the loop flag controls whether the gif repeats (0 means it loops forever), next I'm using a wildcard to include all of the pngs in the output directory, and the final argument is the name of the resulting gif. The final product:

globe_GDP_Cost_Low_faster
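
If you'd rather stay in Python instead of calling convert, the frames can also be stitched together with the imageio package. A quick sketch, assuming imageio is installed, the PNGs live in out/ as above, and the output file name is arbitrary:

import glob
import imageio

# The frames sort correctly because the file names were zero-padded with rjust(3, "0")
frames = [imageio.imread(f) for f in sorted(glob.glob('out/map_*.png'))]
imageio.mimsave('globe_policy_cost.gif', frames, duration=0.45)  # seconds per frame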

 

SALib v0.7.1: Group Sampling & Nonuniform Distributions

This post discusses the changes to the Python library SALib in version 0.7.1, with some examples of how to use the new capabilities. The two major additions in this version were: group sampling for Sobol’ sensitivity analysis and specifying nonuniform distributions for variables.

Sobol’ Indices Group Sampling

Previous versions of SALib allowed one to calculate the first-order, total-order, and second-order indices for individual input parameters. These same indices can also be defined for groups of input parameters (see Saltelli (2002) for more discussion). The main change is adding an item called 'groups' to the problem dictionary, which specifies the group of each parameter. Here is some example code; notice the 'groups' entry in the problem definition.


from SALib.sample import saltelli
from SALib.analyze import sobol
from SALib.util import read_param_file
import numpy as np

# example function
def sampleModel(x):
    y = x[:,0]**1.5 + x[:,1] + 2*x[:,2] + x[:,3] + np.exp(0.3*x[:,4]) + x[:,5] \
         + 2*x[:,1]*x[:,4] + (x[:,0]*x[:,5])**2
    return y

# problem definition
prob_gps_code = {
'num_vars':6,
'names': ['P1','P2','P3','P4','P5','P6'],
'bounds':[[0.0, 1.0], [2.0, 3.0], [0.5, 1.0], [0.0, 5.0], [-0.5, 0.5], [0.0, 1.0]],
'groups':['group1','group2','group1','group1','group2','group3']
}
# generating parameter values
param_vals_gps_code = saltelli.sample(prob_gps_code, 10000,calc_second_order=True)

# calculating model output values
Y_gps_code = sampleModel(param_vals_gps_code)

# completing Sobol' sensitivity analysis
Si_gps_code = sobol.analyze(prob_gps_code,Y_gps_code,calc_second_order=True,print_to_console=True)

The output from this code is given below. In this case the first-order index (S1) reported for each group is the closed index for that group. The S1 for group1 (P1, P3, and P4) would be equivalent to summing: S1 for P1, P3, and P4; S2 for (P1 & P3), (P1 & P4), and (P3 & P4); and S3 for (P1 & P3 & P4). All of the equations used in calculating the sensitivity indices are the same, but now they are applied to groups of variables.

Group S1 S1_conf ST ST_conf
group1 0.471121 0.081845 0.472901 0.011769
group2 0.498600 0.078950 0.497005 0.013081
group3 0.030502 0.019188 0.031736 0.001041

Group_1 Group_2 S2 S2_conf
group1 group2 0.000618 0.159951
group1 group3 0.002170 0.161403
group2 group3 -0.003324 0.155224

Note: You can also use the read_param_file() function to define the problem. For the above example the problem file would look like:

P1,0.0,1.0,group1
P2,2.0,3.0,group2
P3,0.5,1.0,group1
P4,0.0,5.0,group1
P5,-.5,.5,group2
P6,0.0,1.0,group3

Nonuniform Distributions

Often the variables in a sensitivity analysis are assumed to be distributed uniformly over some interval. In the updated version of SALib it is possible to specify whether each input parameter is triangular, normal, lognormal, or uniform. Each of these distributions interprets the 'bounds' in the problem dictionary differently, as listed below.

  • Triangular, “triang” (assumed lower bound of 0)
    • first “bound” is width of distribution (scale, must be greater than 0)
    • second “bound” is location of peak as a fraction of the scale (must be on [0,1])
  • Normal, “norm”
    • first “bound” is the mean (location)
    • second “bound” is the standard deviation (scale, must be greater than 0)
  • Lognormal, “lognorm” (natural logarithms, assumed lower bound of 0)
    • first “bound” is the ln-space mean
    • second “bound” is the ln-space standard deviation (must be greater than 0)
  • Uniform, “unif”
    • first “bound” is the lower bound
    • second “bound” is the upper bound (must be greater than lower bound)

Triangular and lognormal distributions with a non-zero lower bound can be obtained by adding the lower bound to the generated parameters before sending the input data to be evaluated by the model.
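
For example, here is a minimal sketch of that shift; the parameter index, lower bound, and model function are hypothetical placeholders.

# Suppose the 4th parameter is lognormal but should start at 5.0 rather than 0
lower_bound = 5.0
param_values = saltelli.sample(problem, 1000)  # 'problem' defined as in the examples in this post
param_values[:, 3] = param_values[:, 3] + lower_bound  # shift before evaluating the model
Y = evaluate_model(param_values)  # hypothetical model function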

Building on the same example as above, the problem dictionary and related analysis would be completed as follows.

# problem definition
prob_dists_code = {
'num_vars':6,
'names': ['P1','P2','P3','P4','P5','P6'],
'bounds':[[0.0,1.0], [1.0, 0.75], [0.0, 0.2], [0.0, 0.2], [-1.0,1.0], [1.0, 0.25]],
'dists':['unif','triang','norm','lognorm','unif','triang']
}

# generating parameter values
param_vals_dists_code = saltelli.sample(prob_dists_code, 10000,calc_second_order=True)

# calculating model output
Y_dists_code = sampleModel(param_vals_dists_code)

# complete Sobol' sensitivity analysis
Si_dists_code = sobol.analyze(prob_dists_code,Y_dists_code,calc_second_order=True,print_to_console=True)

The output from this analysis is given below, which is consistent with the format in previous versions of SALib.

Parameter S1 S1_conf ST ST_conf
P1 0.106313 0.030983 0.110114 0.003531
P2 0.037785 0.027335 0.085197 0.003743
P3 0.128797 0.029834 0.128702 0.003905
P4 0.034284 0.016997 0.034193 0.001141
P5 0.579715 0.071509 0.627896 0.017935
P6 0.062791 0.021743 0.065357 0.002221
Parameter_1 Parameter_2 S2 S2_conf
P1 P2 0.001783 0.060174
P1 P3 0.001892 0.060389
P1 P4 0.001753 0.060155
P1 P5 0.001740 0.062130
P1 P6 0.004774 0.060436
P2 P3 -0.003539 0.051611
P2 P4 -0.003500 0.051186
P2 P5 0.044591 0.054837
P2 P6 -0.003585 0.051388
P3 P4 -0.000562 0.058972
P3 P5 -0.000533 0.059584
P3 P6 -0.000480 0.059923
P4 P5 -0.000364 0.034382
P4 P6 -0.000191 0.034301
P5 P6 -0.001293 0.137576

Note 1: You can also use the read_param_file() function to define the problem. The one catch is when you want to use nonuniform distributions without grouping the variables. In this case the fourth column in the input file (column for ‘groups’) must be the parameter name repeated from the first column. For the above example the problem file would look like:

P1,0.0,1.0,P1,unif
P2,1.0,.75,P2,triang
P3,0.0,.2,P3,norm
P4,0.0,.2,P4,lognorm
P5,-1.0,1.0,P5,unif
P6,1.0,.25,P6,triang

Note 2: If you are uncertain that the distribution transformation yielded the desired results, especially since the 'bounds' are interpreted differently by each distribution, you can check by plotting histograms of the data. The histograms of the data used in the example are shown below. (The data was actually saved to a .txt file for reference and then imported to R to plot these histograms, but matplotlib's hist() function will do the same thing.)
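
For instance, a quick matplotlib version of that check, reusing the sample array and problem dictionary from the example above:

import matplotlib.pyplot as plt

# One histogram per input parameter of the transformed Saltelli samples
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    ax.hist(param_vals_dists_code[:, i], bins=50)
    ax.set_title(prob_dists_code['names'][i])
plt.tight_layout()
plt.show()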

allHist

References

Saltelli, Andrea (2002). Making best use of model evaluations to compute sensitivity indices. Computer Physics Communications 145(2):280-297. doi:10.1016/S0010-4655(02)00280-1

Importing, Exporting and Organizing Time Series Data in Python – Part 1

This blog post is Part 1 of a multi-part series of posts (see here for Part 2) intended to introduce options in Python available for (1) reading (importing) data (with particular emphasis on time series data, and how to handle Excel spreadsheets); (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.

In modeling water resources and environmental systems, we frequently must import and export large quantities of data (particularly time series data), both to run models and to process model outputs. I will describe the approaches and tools I have found to be most useful for managing data in these circumstances.

This blog post focuses on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets. Future posts will cover the pandas data management/export component.

Through an example that follows, there are two main lessons I hope to convey in this post:

  1. If you can store data, especially time series data, in text (.txt) or comma separated value (.csv) files, the time required for importing and exporting data will be vastly reduced compared to attempting the same thing in an Excel workbook (.xlsx file). Hence, if you must work with an Excel workbook, I suggest you try to save each worksheet as a separate .csv file and use that instead (see the short conversion sketch after this list). A .csv file can still be viewed in the MS Excel interface, but is much faster to import. I'll show below how much time this can save you.
  2. There are many ways to analyze the data once loaded into Python, but my suggestion is you take advantage of pandas, an open-source data analysis library in Python. This is why my example below makes use of built-in pandas features for importing data. (I will have some future posts with some useful pandas code for doing time series analysis/plotting. There are other previous pandas posts on our blog as well).
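
As a side note on point 1, the worksheet-to-.csv conversion itself can be scripted. Here is a minimal pandas sketch; the workbook name is a placeholder, and depending on your pandas version the keyword is sheetname or sheet_name.

import pandas as pd

workbook = 'Reservoir_1.xlsx'  # hypothetical workbook
sheets = pd.read_excel(workbook, sheet_name=None)  # dict of {sheet name: DataFrame}
for sheet_name, df in sheets.items():
    df.to_csv('%s_%s.csv' % (workbook.replace('.xlsx', ''), sheet_name), index=False)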

It is important at this point to note that Microsoft Excel files (.xlsx or .xls) are NOT an ideal means of storing data. I have had the great misfortune of working with Excel (and Visual Basic for Applications) intensively for many years, and it can be somewhat of a disaster for data management. I am using Excel only because the input data files for the model I work with are stored as .xlsx files, and because this is the data storage interface with which others using my Python-based model are most comfortable.

An Example:

Suppose that you wish to read in and analyze time series of 100 years of simulated daily reservoir releases from 2 reservoirs (“Reservoir_1” and “Reservoir_2”) for 100 separate simulations (or realizations). Let’s assume data for the reservoirs is stored in two separate files, either .csv or .xlsx files (e.g., “Reservoir_1.xlsx” and “Reservoir_2.xlsx”, or “Reservoir_1.csv” and “Reservoir_2.csv”).

The figure below shows a screenshot of an example layout of the file in either case (.csv or .xlsx). Feel free to make your own such file, with column names that correspond to the title of each simulation realization.

Worksheet appearance

There are two import options for these data.

Option 1: Import your data directly into a dictionary of two pandas DataFrame objects, where the keys of the Python dictionary are the two reservoir names. Here is some code that does this. All you need to do is run this script in a directory where you also have your input files saved.


# Imports
import pandas as pd
from datetime import datetime
from datetime import timedelta
import time

start_time = time.time()  # Keep track of import time

start_date = datetime(1910, 1, 1)
simulation_dates = pd.date_range(start_date, start_date + timedelta(365*100 - 1))

# Specify File Type (Choices: (1) 'XLSX', or (2) 'CSV')
File_Type = 'CSV'

location_list = ['Reservoir_1', 'Reservoir_2']  # Names of data files
start = 0  # starting column of input file from which to import data
stop = 99  # ending column of input file from which to import data. Must not exceed number of columns in input file.

# Create a dictionary where keys are items in location_list, and content associated with each key is a pandas dataframe.
if File_Type == 'XLSX':
    extension = '.xlsx'
    Excel_data_dictionary = {name: pd.read_excel(name + extension, sheetname=0, usecols=[i for i in range(start, stop)]) for name in
                             location_list}  # sheetname=0 reads the first worksheet (None would return a dict of all sheets)
    # Reset indices as desired dates
    for keys in Excel_data_dictionary:
        Excel_data_dictionary[keys] = Excel_data_dictionary[keys].set_index(simulation_dates)
    print('XLSX Data Import Complete in %s seconds' % (time.time() - start_time))
elif File_Type == 'CSV':
    extension = '.csv'
    CSV_data_dictionary = {name: pd.read_csv(name + extension, usecols=[i for i in range(start, stop)]) for name in location_list}
    # Reset indices as desired dates
    for keys in CSV_data_dictionary:
        CSV_data_dictionary[keys] = CSV_data_dictionary[keys].set_index(simulation_dates)
    print('CSV Data Import Complete in %s seconds' % (time.time() - start_time))

When I run this code, the .csv data import took about 1.5 seconds on my office desktop computer (not bad), and the .xlsx import took 100 seconds (awful!)! This is why you should avoid storing data in Excel workbooks.

You can visualize the pandas DataFrame object per the image below, with a major axis that corresponds to dates, and a minor axis that corresponds to each simulation run (or realization). Note that I added code that manually sets the major axis index to date values, as my input sheet had no actual dates listed in it.

Pandas DF

Now, if I want to see the data for a particular realization or date, pandas makes this easy. For example, the following would access the DataFrame column corresponding to realization 8 for Reservoir_1:

CSV_data_dictionary['Reservoir_1']['Realization8']
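
Similarly, because the index was set to simulation_dates in the script above, rows can be pulled out by date with .loc; the dates below are only examples.

# All realizations at Reservoir_1 on a single day
CSV_data_dictionary['Reservoir_1'].loc['1950-06-15']

# One realization over a slice of dates
CSV_data_dictionary['Reservoir_1']['Realization8']['1950-01-01':'1950-12-31']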

Option 2: If you must work with .xlsx files and need to import specific sheets and cell ranges from those files, I recommend that you use the open-source Python library OpenPyXL. (There are other options, some of which are reviewed here). OpenPyXL is nice because it can both read from and write to Excel files, it can loop through worksheets and cells within those worksheets, and it does not require that you have Microsoft Office if you are working in Windows. Indeed it does not require Windows at all (it works well on Linux, though I have not tried it out in OSX). OpenPyXL documentation is available here, and includes numerous examples. This page also provides some nice examples of using OpenPyXL.
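
To give a flavor of the workflow, here is a minimal OpenPyXL sketch; the file name, sheet name, and cell range are hypothetical placeholders.

from openpyxl import load_workbook

wb = load_workbook('Reservoir_1.xlsx', data_only=True)  # data_only returns cached values rather than formulas
ws = wb['Sheet1']  # access a worksheet by name

# Read a rectangular block of cells into a list of lists
block = []
for row in ws['A2':'J101']:
    block.append([cell.value for cell in row])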

Some additional tips regarding OpenPyXL:

  • Make sure you are working with the latest version. If you install openpyxl through an IDE, be sure it’s updated to at least version 2.2.4 or 2.2.5. I’ve had problems with earlier versions.
  • I believe OpenPyXL only works with .xlsx files, not .xls files.