Generating Interactive Visuals in R

Visuals vs. Visual Analytics 

How do visuals differ from visual analytics? In a scientific sense, a visual is a broad term for any picture, illustration, or graph that can be used to convey an idea. However, visual analytics is more than just generating a graph of complex data and handing it to a decision maker. Visual analytic tools help create graphs that allow the user to interact with the data, whether that involves manipulating a graph in three-dimensional space or allowing users to filter or brush for solutions that match certain criteria. Ultimately, visual analytics seeks to help in making decisions as fast as possible and to “enable learning through continual [problem] reformulation” (Woodruff et al., 2013) by presenting large data sets in an organized way so that the user can better recognize patterns and make inferences.

My goal with this blog post is to introduce two R libraries that are particularly useful to develop interactive graphs that will allow for better exploration of a three-dimensional space. I have found that documentation on these libraries and potential errors was sparse, so this post will consolidate my hours of Stack Overflow searching into a step-by-step process to produce beautiful graphs!

R Libraries

            Use rgl to create a GIF of a 3D graph

Spinning graphs can be especially useful to visualize a 3D Pareto front and a nice visualization for a Power Point presentation. I will be using an example three-objective Pareto set from Julie’s work on the Red River Basin for this tutorial. The script has been broken down and explained in the following sections.

#Set working directory

#Read in in csv of pareto set

#Create three vectors for the three objectives

In this first block of code, the working directory is set, the data set is imported from a CSV file, and each column of the data frame is saved as a vector that is conveniently named. Now we will generate the plot.

#call the rgl library

#Adjust the size of the window

If the rgl package isn’t installed on your computer yet, simply type install.packages(“rgl”) into the console. Otherwise, use the library function in line 2 to call the rgl package. The next line of code is used to adjust the window that the graph will pop up in. The default window is very small and as such, the movie will have a small resolution if the window is not adjusted!

#Plot the set in 3D space

plot3d(hydropower,deficit,flood,col=brewer.pal(8,"Blues"), size=2, type='s', alpha=0.75)

Let’s plot these data in 3D space. The first three components of the plot3d function are the x,y, and z vectors respectively. The rest of the parameters are subject to your personal preference. I used the Color Brewer (install package “RColorBrewer”) to color the data points in different blue gradients. The first value is the number of colors that you want, and the second value is the color set. Color Brewer sets can be found here: My choice of colors is random, so I opted not to create a color scale. Creating a color scale is more involved in rgl. One option is to split your data into classes and to use legend3d and the cut function to cut your legend into color levels. Unfortunately, there simply isn’t an easy way to create a color scale in rgl. Finally, I wanted my data points to be spheres, of size 2, that were 50% transparent, which is specified with type, size, and alpha respectively. Plot3d will open a window with your graph. You can use your mouse to rotate it.

Now, let’s make  a movie of the graph. The movie3d function requires that you install ImageMagick, a software that allows you to create a GIF from stitching together multiple pictures. ImageMagick also has cool functionalities like editing, resizing, and layering pictures. It can be installed into your computer through R using the first two lines of code below. Make sure not to re-run these lines once ImageMagick is installed.  Note that ImageMagick doesn’t have to be installed in your directory, just on your computer.


#Create a spinning movie of your plot
movie3d(spin3d(axis = c(0, 0, 1)), duration = 20,
 dir = getwd())

Finally, the last line of code is used to generate the movie. I have specified that I want the plot to spin about the z axis, specified a duration (you can play around with the number to see what suits your data), and that I want the movie to be saved in my current working directory. The resulting GIF is below. If the GIF has stopped running, reload the page and scroll down to this section again.


I have found that creating the movie can be a bit finicky and the last step is where errors usually occur. When you execute your code, make sure that you keep the plot window open while ImageMagick stitches together the snapshots otherwise you will get an error. If you have errors, please feel free to share because I most likely had them at one point and was able to ultimately fix them.

Overall, I found this package to be useful for a quick overview of the 3D space, but I wasn’t pleased with the way the axes values and titles overlap sometimes when the graph spins. The way to work around this is to set the labels and title to NULL and insert your own non-moving labels and title when you add the GIF to a PowerPoint presentation.

            Use plotly to create an interactive scatter

I much prefer the plotly package to rgl for the aesthetic value, ease of creating a color scale, and the ability to mouse-over points to obtain coordinate values in a scatter plot. Plotly is an open source JavaScript graphing library but has an R API. The first step is to create a Plotly account at: Once you have confirmed your email address, head to to get an API key. Click the “regenerate key” button and you’ll get a 20 character key that will be used to create a shareable link to your chart. Perfect, now we’re ready to get started!



#Set environment variables

Sys.setenv("plotly_api_key"="insert key here")

#Read in pareto set data


Set the working directory, install the relevant libraries, set the environment variables and load in the data set. Be sure to insert your API key. You will need to regenerate a new key every time you make a new graph. Also, note that your data must be in the form of a data frame for plotly to work.

#Plot your data

plot= plot_ly(pareto, x = ~WcAvgHydro, y = ~WcAvgDeficit, z = ~WcAvgFlood, color = ~WcAvgFlood, colors = c('#00d6f7', '#ebfc2f'))

#Add axes titles
layout(title="Pareto Set", scene = list(xaxis = list(title = 'Hydropower'),yaxis = list(title = 'Deficit'), zaxis = list(title = 'Flood')))
#call the name of your plot to appear in the viewer

To correctly use the plotly command, the first input needed is the data frame, followed by the column names of the x,y, and z columns in the data frame. Precede each column name with a “~”.

I decided that I wanted the colors to scale with the value of the z variable. The colors were defined using color codes available at Use the layout function to add a main title and axis labels. Finally call the name of your plot and you will see it appear in the viewer at the lower right of your screen.If your viewer shows up blank with only the color scale, click on the viewer or click “zoom”. Depending on how large the data set is, it may take some time for the graph to load.

#Create a link to your chart and call it to launch the window
chart_link = api_create(plot, filename = "public-graph")

Finally create the chart link using the first line of code above and the next line will launch the graph in Plotly. Copy and save the URL and anyone with it can access your graph, even if they don’t have a Plotly account. Play around with the cool capabilities of my Plotly graph, like mousing over points, rotating, and zooming!



Woodruff, M.J., Reed, P.M. & Simpson, T.W. Struct Multidisc Optim (2013) 48: 201.

James J. Thomas and Kristin A. Cook (Ed.) (2005). Illuminating the Path: The R&D Agenda for Visual Analytics National Visualization and Analytics Center.

Map making in Matlab

Map making in Matlab


This weeks post will cover the basics of generating maps in Matlab.  Julie’s recent post showed how to do some of this in Python, but, Matlab is also widely used by the community.  You can get a lot done with Matlab, but in this post we’ll just cover a few of the basics.

We’ll start off by plotting a map of the continental United States, with the states.  We used three  this with three commands: usamap, shaperead, and geoshow.  usamap creates an empty map axes having the Lambert Projection covering the area of the US, or any state or collection of states.  shaperead reads shapefiles (duh) and returns a Matlab geographic data structure, composed of both geographic data and attributes.  This Matlab data structure then interfaces really well with various Matlab functions (duh).  Finally, geoshow plots geographic data, in our case on the map axes we defined.  Here’s some code putting it all together.

hold on
figure1 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
hold off

Note that ‘usastatehi’ is a shapefile containing the US states (duh) that’s distributed with Matlab. The above code generates this figure:


Now, suppose we wanted to plot some data, say a precipitation forecast, on our CONUS map.  Let’s assume our forecast is being made at many points (lat,long).  To interpolate between the points for plotting we’ll use Matlab’s griddata function.  Once we’ve done this, we use the Matlab’s contourm command.  This works exactly like the normal contour function, but the ‘m’ indicates it plots map data.

xi = min(x):0.5:max(x);
yi = min(y):0.5:max(y);
[XI, YI] = meshgrid(xi,yi);
ZI = griddata(x,y,V,XI,YI);

hold on
figure2 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])

hold off

Here x, y, and V are vectors of long, lat, and foretasted precipitation respectively.  This code generates the following figure:


Wow!  Louisiana is really getting hammered!  Let’s take a closer look.  We can do this by changing the entry to usamap to indicate we want to consider only Louisiana.  Note, usamap accepts US postal code abbreviations.

ax = usamap('LA');

Making that change results in this figure:


Neat!  We can also look at two states and add annotations.  Suppose, for no reason in particular, you’re interested in the location of Tufts University relative to Cornell.  We can make a map to look at this with the textm and scatterm functions.  As before, the ‘m’ indicates the functions  plot on a map axes.

hold on
figure4 = figure;
ax = usamap({'MA','NY'});

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])

hold off

This code generates the following figure.


Cool! Now back to forecasts.  NOAA distributes short term Quantitative Precipitation Forecasts (QPFs) for different durations every six hours.  You can download these forecasts in the form of shapefiles from a NOAA server.  Here’s an example of a 24-hour rainfall forecast made at 8:22 AM UTC on April 29.


Wow, that’s a lot of rain!  Can we plot our own version of this map using Matlab!  You bet!  Again we’ll use usamap, shaperead, and geoshow.  The for loop, (0,1) scaling, and log transform are simply to make the color map more visually appealing for the post.  There’s probably a cleaner way to do this, but this got the job done!

figure5 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
states = shaperead('usastatehi',...
 'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
p = colormap(jet);

N = max(size(S));
d = zeros(N,1);
for i = 1:N
 d(i) = log(S(i).QPF);

col = p(y,:);
for i = 1:N
 geoshow(S(i),'FaceColor',col(i,:),'FaceAlpha',0.5)%,'SymbolSpec', faceColors)

This code generates the following figure:


If you are not plotting in the US, Matlab also has a worldmap command.  This works exactly the same as usamap, but now for the world (duh).  Matlab is distibuted with a shapefile ‘landareas.shp’ which contains all of the land areas in the world (duh).  Generating a global map is then trivial:

figure6 = figure;

land = shaperead('landareas.shp', 'UseGeoCoords', true);
geoshow(land, 'FaceColor', [0.15 0.5 0.15])

Which generates this figure.



Matlab also comes with a number of other included that might be of interest.  For instance, shapefiles detailing the locations of major world cities, lakes, and rivers.  We can plot those with the following code:

figure7 = figure;

land = shaperead('landareas.shp', 'UseGeoCoords', true);
geoshow(land, 'FaceColor', [0.15 0.5 0.15])
lakes = shaperead('worldlakes', 'UseGeoCoords', true);
geoshow(lakes, 'FaceColor', 'blue')
rivers = shaperead('worldrivers', 'UseGeoCoords', true);
geoshow(rivers, 'Color', 'blue')
cities = shaperead('worldcities', 'UseGeoCoords', true);
geoshow(cities, 'Marker', '.', 'Color', 'red')

Which generates the figure:


But suppose we’re interested in one country or a group of countries.  worldmap works in the same usamap does.  Also, you can plot continents, for instance Europe.



Those are the basics, but there are many other capabilities, including 3-D projections. I can cover this in a later post if there is interest.


That’s it for now!

A visual introduction to data compression through Principle Component Analysis

Principle Component Analysis (PCA) is a powerful tool that can be used to create parsimonious representations of a multivariate data set. In this post I’ll code up an example from Dan Wilks’ book Statistical Methods in the Atmospheric Sciences to visually illustrate the PCA process. All code can be found at the bottom of this post.

As with many of the examples in Dr. Wilks’ excellent textbook, we’ll be looking at minimum temperature data from Ithaca and Canandaigua, New York  (if anyone is interested, here is the distance between the two towns).  Figure 1 is a scatter plot of the minimum temperature anomalies at each location for the month of January 1987.


Figure 1: Minimum temperature anomalies in Ithaca and Canandaigua, New York in January 1987

As you can observe from Figure 1, the two data sets are highly correlated, in fact, they have a Pearson correlation coefficient of 0.924. Through PCA, we can identify the primary mode of variability within this data set (its largest “principle component”) and use it to create a single variable which describes the majority of variation in both locations. Let x define the matrix of our minimum temperature anomalies in both locations. The eigenvectors (E) of the covariance matrix of x describe the primary modes variability within the data set. PCA uses these eigenvectors to  create a new matrix, u,  whose columns contain the principle components of the variability in x.

u = xE

Each element in u is a linear combination of the original data, with eigenvectors in E serving as a kind of weighting for each data point. The first column of u corresponds to the eigenvector associated with the largest eigenvalue of the covariance matrix. Each successive column of u represents a different level of variability within the data set, with u1 describing the direction of highest variability, u2 describing the direction of the second highest variability and so on and so forth. The transformation resulting from PCA can be visualized as a rotation of the coordinate system (or change of basis) for the data set, this rotation is shown in Figure 2.

PCA visualization

Figure 2: Geometric interpretation of PCA

As can be observed in Figure 2, each data point can now be described by its location along the newly rotated axes which correspond to its corresponding value in the newly created matrix u. The point (16, 17.8), highlighted in Figure 2, can now be described as (23, 6.6) meaning that it is 23 units away from the origin in the direction of highest variability and 6.6 in the direction of second highest variability. As shown in Figure 2, the question of “how different from the mean” each data point is can mostly be answered by looking at its  corresponding u1 value.

Once transformed, the original data can be recovered through a process known as synthesis. Synthesis  can be thought of as PCA in reverse. The elements in the original data set x, can be approximated using the eigenvalues of the covariance matrix and the first principle component, u1.

\tilde{x} = \tilde{u}\tilde{E}^T


\tilde{x}\hspace{.1cm} is\hspace{.1cm} the\hspace{.1cm} reconstructed\hspace{.1cm} data\hspace{.1cm} set

\tilde{u}\hspace{.1cm} is\hspace{.1cm} the\hspace{.1cm} PCs\hspace{.1cm} used \hspace{.1cm} for \hspace{.1cm} reconstruction\hspace{.1cm} (in\hspace{.1cm} our\hspace{.1cm} case\hspace{.1cm} the\hspace{.1cm} first\hspace{.1cm} column)

\tilde{E}\hspace{.1cm} is \hspace{.1cm} the \hspace{.1cm} eigenvector\hspace{.1cm} of \hspace{.1cm} the \hspace{.1cm} PCs \hspace{.1cm} used

For our data set, these reconstructions seem to work quite well, as can be observed in Figure 3.




Data compression through PCA can be a useful alternative tolerant methods for dealing with multicollinearity, which I discussed in my previous post. Rather than running a constrained regression, one can simply compress the data set to eliminate sources of multicollinearity. PCA can also be a helpful tool for identifying patterns within your data set or simply creating more parsimonious representations of a complex set of data. Matlab code used to create the above plots can be found below.

% Ithaca_Canandagua_PCA
% By: D. Gold
% Created: 3/20/17

% This script will perform Principle Component analysis on minimum
% temperature data from Ithaca and Canadaigua in January, 1987 provided in 
% Appendix A of Wilks (2011). It will then estimate minimum temperature
% values of both locations using the first principle component.

%% create data sets
clear all

% data from appendix Wilks (2011) Appendix A.1
Ith = [19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0,... 
 2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23]';

Can = [28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23,... 
 13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23]';

%% center the data, plot temperature anomalies, calculate covariance and eigs

% center the data
x(:,1) = Ith - mean(Ith);
x(:,2) = Can - mean(Can);

% plot anomalies
xlabel('Ithaca min temp anomaly ({\circ}F)')
ylabel('Canandagua min temp anomaly ({\circ}F)')

% calculate covariance matrix and it's corresponding Eigenvalues & vectors
S = cov(x(:,1),x(:,2));
[E, Lambda]=eigs(S);

% Identify maximum eigenvalue, it's column will be the first eigenvector
max_lambda = find(max(Lambda)); % index of maximum eigenvalue in Lambda
idx = max_lambda(2); % column of max eigenvalue

%% PCA
U = x*E(:,idx);

%% synthesis
x_syn = E(:,idx)*U'; % reconstructed values of x

% plot the reconstructed values against the original data
plot(1:31,x(:,1) ,1:31, x_syn(1,:),'--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Ithaca min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')
plot(1:31, x(:,2),1:31, x_syn(2,:)','--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Canadaigua min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')



Wilks, D. S. (2011). Statistical methods in the atmospheric sciences. Amsterdam: Elsevier Academic Press.

Alluvial Plots

Alluvial Plots

We all love parallel coordinates plots and use them all the time to display our high dimensional data and tell our audience a good story. But sometimes we may have large amounts of data points whose tradeoffs’ existence or lack thereof cannot be clearly verified, or the data to be plotted is categorical and therefore awkwardly displayed in a parallel coordinates plot.

One possible solution to both issues is the use of alluvial plots. Alluvial plots work similarly to parallel coordinates plots, but instead of having ranges of values in the axes, it contains bins whose sizes in an axis depends on how many data points belong to that bin. Data points that fall within the same categories in all axes are grouped into alluvia (stripes), whose thicknesses reflect the number of data points in each alluvium.

Next are two examples of alluvial plots, the fist displaying categorical data and the second displaying continuous data that would normally be plotted in a parallel coordinates plot. After the examples, there is code available to generate alluvial plots in R (I know, I don’t like using R, but creating alluvial plots in R is easier than you think).

Categorical data

The first example (Figure 1) comes from the cran page for the alluvial plots package page. It uses alluvial plots to display data about all Titanic’s passengers/crew and group them into categories according to class, sex, age, and survival status.


Figure 1 – Titanic passenger/crew data. Yellow alluvia correspond to survivors and gray correspond to deceased. The size of each bin represents how many data points (people) belong to that category in a given axis, while the thickness of each alluvium represent how many people fall within the same categories in all axes. Source:

Figure 1 shows that most of the passengers were male and adults, that the crew represented a substantial amount of the total amount of people in the Titanic, and that, unfortunately, there were more deceased than survivors. We can also see that a substantial amount of the people in the boat were male adult crew members who did not survive, which can be inferred by the thickness of the grey alluvium that goes through all these categories — it can also be seen by the lack of an alluvia hitting the Crew and Child bins, that (obviously) there were no children crew members. It can be also seen that 1st class female passengers was the group with the greatest survival rate (100%, according to the plot), while 3rd class males had the lowest (ballpark 15%, comparing the yellow and gray alluvia for 3rd class males).

Continuous data

The following example shows the results of policy modeling for a fictitious water utility using three different policy formulations. Each data point represents the modeled performance of a given candidate policy in six objectives, one in each axis. Given the uncertainties associated with the models used to generate this data, the client utility company is more concerned about whether or not a candidate policy would meet certain performance criteria according to the model (Reliability > 99%, Restriction Frequency < 20%, and Financial Risk < 10%) than about the actual objective values. The utility also wants to have a general idea of the tradeoffs between objectives.

Figure 2 was created to present the modeling results to the client water utility. The colored alluvia represent candidate policies that meet the utility’s criteria, and grey lines represent otherwise. The continuous raw data used to generate this plot was categorized following ranges whose values are meaningful to the client utility, with the best performing bin always put in the bottom of the plot. It is important to notice that the height of the bins represent the number of policies that belong to that bin, meaning that the position of the gap between two stacked bins does not represent a value in an axis, but the fraction of the policies that belong to each bin. It can be noticed from Figure 2 that it is relatively difficult for any of the formulations to meet the Reliability > 99% criteria established by the utility. It is also striking that a remarkably small number of policies from the first two formulations and none of the policies from the third formulation meet the criteria established by the utilities. It can also be easily seen by following the right alluvia that the vast majority of the solutions with smaller net present costs of infrastructure investment obtained with all three formulations perform poorly in the reliability and restriction frequency objectives, which denotes a strong tradeoff. The fact that such tradeoffs could be seen when the former axis is on the opposite side of the plot to the latter two is a remarkable feature of alluvial plots.


Figure 2 – Alluvial plot displaying modeled performance of candidate long-term planning policies. The different subplots show different formulations (1 in the top, 3 in the bottom).

The parallel coordinates plots in Figure 3 displays the same information as the alluvial plot in Figure 2. It can be readily seen that the analysis performed above, especially when it comes to the tradeoffs, would be more easily done with Figure 2 than with Figure 3. However, if the actual objective values were important for the analysis, Figure 3 would be needed either by itself or in addition to Figure 2, the latter being used likely as a pre-screening or for a higher level analysis of the results.


Figure 3 – Parallel coordinates plot displaying modeled performance of candidate long-term planning policies. The different subplots show different formulations (1 in the top, 3 in the bottom).

The R code used to create Figure 1 can be found here. The code below was used to create Figure 2 — The packages “alluvia”l and “dplyr” need to be installed before attempting to use the provided code, for example using the R command install.packages(package_name). Also, the user needs to convert its continuous data into categorical data, so that each row corresponds to a possible combination of bins in all axis (one column per axis) plus a column (freqs) representing the frequencies with which each combination of bins is seen in the data.

# Example datafile: snippet of file "infra_tradeoffs_strong_freqs.csv"
Reliability, Net Present Cost of Inf. Investment, Peak Financial Costs, Financial Risk, Restriction Frequency, Jordan Lake Allocation, freqs
# load packages and prepare data

itss <- read.csv('infra_tradeoffs_strong_freqs.csv')
itsw <- read.csv('infra_tradeoffs_weak_freqs.csv')
itsn <- read.csv('infra_tradeoffs_no_freqs.csv')

# preprocess the data (convert do dataframe)
itss %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
summarise(n = sum(freqs)) -> its_strong
itsw %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
summarise(n = sum(freqs)) -> its_weak
itsn %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
summarise(n = sum(freqs)) -> its_no

# setup output file
p <- par(mfrow=c(3,1))
par(bg = 'white')

# create the plots
col = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_strong$Financial.Risk == "0<10", "blue", "grey"),
border = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_strong$Financial.Risk == "0<10", "blue", "grey"),
# border = "grey",
alpha = 0.5,
hide=its_strong$n < 1
col = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_weak$Financial.Risk == "0<10", "chartreuse2", "grey"),
border = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_weak$Financial.Risk == "0<10", "chartreuse2", "grey"),
# border = "grey",
alpha = 0.5,
hide=its_weak$n < 1
col = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_no$Financial.Risk == "0<10", "red", "grey"),
border = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_no$Financial.Risk == "0<10", "red", "grey"),
# border = "grey",
alpha = 0.5,
hide=its_no$n < 1
Plotting geographic data from geojson files using Python

Plotting geographic data from geojson files using Python

Hi folks,

I’m writing today about plotting geojson files with Matplotlib’s Basemap.  In a previous post I laid out how to plot shapefiles using Basemap.

geojson is an open file format for representing geographical data based on java script notation.  They are composed of points, lines, and polygons or ‘multiple’ (e.g. multipolygons composed of several polygons), with accompanying properties.  The basic structure is one of names and vales, where names are always strings and values may be strings, objects, arrays, or logical literal.

The geojson structure we will be considering here is a collection of features, where each feature contains a geometry and properties.  Each geojson feature must contain properties and geometry.  Properties could be things like country name, country code, state, etc.  The geometry must contain a type (point, line, polygons, etc.) and coordinates (likely an array of lat-long). Below is an excerpt of a geojson file specifying Agro-Ecological Zones (AEZs) within the various GCAM regions.

"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },

"features": [
{ "type": "Feature", "id": 1, "properties": { "ID": 1.000000, "GRIDCODE": 11913.000000, "CTRYCODE": 119.000000, "CTRYNAME": "Russian Fed", "AEZ": 13.000000, "GCAM_ID": "Russian Fed-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ 99.5, 78.5 ], [ 98.33203125, 78.735787391662598 ], [ 98.85723876953125, 79.66796875 ], [ 99.901641845703125, 79.308036804199219 ], [ 99.5, 78.5 ] ] ] ] } },
{ "type": "Feature", "id": 2, "properties": { "ID": 2.000000, "GRIDCODE": 11913.000000, "CTRYCODE": 119.000000, "CTRYNAME": "Russian Fed", "AEZ": 13.000000, "GCAM_ID": "Russian Fed-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ 104.5, 78.0 ], [ 104.0, 78.0 ], [ 99.5, 78.0 ], [ 99.5, 78.5 ], [ 100.2957763671875, 78.704218864440918 ], [ 102.13778686523437, 79.477890968322754 ], [ 104.83050537109375, 78.786871910095215 ], [ 104.5, 78.0 ] ] ] ] } },
{ "type": "Feature", "id": 3, "properties": { "ID": 3.000000, "GRIDCODE": 2713.000000, "CTRYCODE": 27.000000, "CTRYNAME": "Canada", "AEZ": 13.000000, "GCAM_ID": "Canada-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ -99.5, 77.5 ], [ -100.50860595703125, 77.896504402160645 ], [ -101.76053619384766, 77.711499214172363 ], [ -104.68202209472656, 78.563323974609375 ], [ -105.71781158447266, 79.692866325378418 ], [ -99.067413330078125, 78.600395202636719 ], [ -99.5, 77.5 ] ] ] ] } }

Now that we have some understanding of the geojson structure, plotting the information therein should be as straightforward as traversing that structure and tying geometries to data.  We do the former using the geojson python package and the latter using pretty basic python manipulation.  To do the actual plotting, we’ll use PolygonPatches from the descartes library and recycle most of the code from my previous post.

We start by importing the necessary libraries and then open the geojson file.

import geojson
from descartes import PolygonPatch
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np

with open("aez-w-greenland.geojson") as json_file:
    json_data = geojson.load(json_file)

We then define a MatplotLib Figure, and generate a Basemap object as a ‘canvas’ to draw the geojson geometries on.

ax = plt.figure(figsize=(10,10)).add_subplot(111)#fig.gca()

m = Basemap(projection='robin', lon_0=0,resolution='c')
m.drawmapboundary(fill_color='white', zorder=-1)
m.drawparallels(np.arange(-90.,91.,30.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)
m.drawmeridians(np.arange(0., 360., 60.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)
m.drawcoastlines(color='0.6', linewidth=1)

Next, we iterate over the nested features in this file and pull out the coordinate list defining each feature’s geometry (line 2).  In lines 4-5 we also pull out the feature’s name and AEZ, which I can tie to GCAM data.

for i in range(2799):
    coordlist = json_data.features[i]['geometry']['coordinates'][0]
    if i < 2796:
        name = json_data.features[i]['properties']['CTRYNAME']
        aez =  json_data.features[i]['properties']['AEZ']

    for j in range(len(coordlist)):
        for k in range(len(coordlist[j])):

    poly = {"type":"Polygon","coordinates":coordlist}#coordlist
    ax.add_patch(PolygonPatch(poly, fc=[0,0.5,0], ec=[0,0.3,0], zorder=0.2 ))


Line 9 is used to convert the coordinate list from lat/long units to meters.  Depending on what projection you’re working in and what units your inputs are in, you may or may not need to do this step.

The final lines are used to add the polygon to the figure, and to make the face color of each polygon green and the border dark green. Which generates the figure:


To get a bit more fancy, we could tie the data to a colormap and then link that to the facecolor of the polygons.  For instance, the following figure shows the improvement in maize yields over the next century in the shared socio-economic pathway 1 (SSP 1), relative to a reference scenario (SSP 2).


Saving d3.parcoords to SVG

d3.parcoords is a great library for making interactive parallel coordinate plots. A major issue, however, is that it is pain to get the resulting plots into a format suitable for publication. In this blog post, I will show how we can turn a d3.parcoords plot into an SVG document, which we can save locally. SVG is an XML based format for vector graphics, so it is ideal for publications.

This blog post is an example of how to get the SVG data. It is however far from complete, and there might be better ways of achieving some of the steps. Any comments or suggestions on how to improve the code are welcome. I wrote this while learning javascript, without any prior experience with respect to web technology.

First, how is a d3.parcoords plot structured? It is composed of five elements: 4 HTML5 canvas layers, and a single SVG layer. the SVG layer contains the axis for each dimension. The 4 canvas layers are marks, highlight, brushed, and foreground. I am not sure what the function is of the first two, but brushed contains the lines that are selected through brushing, while foreground contains all the remaining lines.

In order to export a d3.parcoords figure as pure svg, we need to somehow replace the HTML canvas with something that has the same interface, but generates SVG instead. Luckily there are several javascript libraries that do this. See for an overview. In this example, I am using , which is a recent library that still appears to be maintained.

The basic idea is the following:

  • replace the normal HTML5 canvas.context for each layer with the one from canvas2svg, and render the plot
  • extract the axis svg
  • extract the SVG from the 5 canvas layers, and combine the 5 layers into a single svg document
  • save it
  • reset the canvas

To make this work, we are depending on several javascript libraries in addition to the default dependencies of d3.parcoords. These are

Replace canvas.context

In order to replace the canvas.context for each layer, we iterate over the names of the layers. d3.parcoords saves the contexts in an internal object, indexed by name. We keep track of the old context for each layer, because this makes restoring a lot easier at the end. We instantiate the C2S context (the class provided by canvas2svg), by specifying the width and height of the plot. In this case, I have hardcoded them for simplicity, but it would be better to extract them from the HTML or CSS.

const layerNames = ["marks", "highlight", "brushed", "foreground"];

const oldLayers = {};
let oldLayerContext;
let newLayerContext;
let layerName;
for (let i=0; i<canvasLayers.length; i++){
    layerName = layerNames[i];

    oldLayerContext = pc0.ctx[layerName]; //pc0 is the d3.parcoords plot
    newLayerContext = new C2S(720, 200); 

    oldLayers[layerName] = oldLayerContext;
    pc0.ctx[layerName] = newLayerContext;

Extract the Axis svg

Getting the axis svg is straightforward. We select the svg element in the dom, serialise it to a string and next use jQuery to create a nice XML document out of the string.

const svgAxis = new XMLSerializer().serializeToString('svg').node());
const axisXmlDocument = $.parseXML(svgAxis);

The only problem with this approach is that the SVG does not contain the style information, which is provided in the CSS. So, we need to inline this information. To do so, I created two helper functions. The first helper function allows us to set an attribute on elements that have the same tag. The second does the same, but based on class name.

// helper function for saving svg
function setAttributeByTag(xmlDocument, tagName, attribute, value){
    const paths = xmlDocument.getElementsByTagName(tagName);
    for (let i = 0; i < paths.length; i++) {
        paths[i].setAttribute(attribute, value);

// helper function for saving svg
function setAttributeByClass(xmlDocument, className, attribute, value){
    const paths = xmlDocument.getElementsByClassName(className);
    for (let i = 0; i < paths.length; i++) {
        paths[i].setAttribute(attribute, value);

We can now  use  these helper functions to inline some CSS information. Note that this is an incomplete subset of all the CSS information used by d3.parcoords. A future extension would be to extract all the d3.parcoord style information from the CSS and inline it.

setAttributeByTag(axisXmlDocument, "axis", "fill", "none");
setAttributeByTag(axisXmlDocument, "path", "stroke", "#222");
setAttributeByTag(axisXmlDocument, "line", "stroke", "#222");
setAttributeByClass(axisXmlDocument, "background", "fill", "none");

Extract the SVG from each layer

We now  have an XML document to which we can add the SVG data of each of our layers. In order to keep track of the structure of the SVG, I have chosen to first create a new group node, and subsequently add each layer to this new group as a child. To make sure that this group is positioned correctly, I clone the main group node of the axis svg, remove it’s children, and insert this new node at the top of the XML document.

const oldNode = axisXmlDocument.getElementsByTagName('g')[0];
const newNode = oldNode.cloneNode(true);
while (newNode.hasChildNodes()){
axisXmlDocument.documentElement.insertBefore(newNode, oldNode);

There is some trickery involved in what I am doing here. SVG groups are rendered on top of each other, in the order in which they appear in the XML document. It appears that one can provide a z-order as well according to the SVG 2.0 specification, but I have not pursued that direction here. By adding the newly created node to the top, I ensure that the axis information is at the end of the XML document, and thus always on top of all the other layers. For the same reason, I have also deliberately sorted the canvas layer names.

Now  that we have a new node, we can iterate over our canvas layers and extract the svg data from them. Next, we parse the xml string to turn it into an XML document. We have to overwrite a transform attribute that is used when working on a retina screen, this matters for a html canvas but not for svg. For convenience, I also add the layer name as a class attribute, so in our SVG, we can easily spot each of the canvas layers. The XML document for a given layer contains two main nodes. The first node contains the defs tag, which we don’t need. The second node contains the actual SVG data, which is what we do need.

let svgLines;
let xmlDocument;
for (let i=0; i<layerNames.length; i++){
    // get svg for layer
    layerName = layerNames[i];
    svgLines = pc0.ctx[layerName].getSerializedSvg(true);
    xmlDocument = $.parseXML(svgLines);

    // scale is set to 2,2 on retina screens, this is relevant for canvas
    // not for svg, so we explicitly overwrite it
    xmlDocument.getElementsByTagName("g")[0].setAttribute("transform", "scale(1,1)");

    // for convenience add the name of the layer to the group as class
    xmlDocument.getElementsByTagName("g")[0].setAttribute("class", layerName);

    // add the group to the node
    // each layers has 2 nodes, a defs node and the actual svg
    // we can safely ignore the defs node

Save it

We have all our SVG data in the xml document. All that is left is to turn this back into a string, format the string properly, turn it into a blob, and save it. We can achieve this in three lines.

// turn merged xml document into string
// we also beautify the string, but this is optional
const merged = vkbeautify.xml(new XMLSerializer().serializeToString(axisXmlDocument.documentElement));

// turn the string into a blob and use FileSaver.js to enable saving it
const blob = new Blob([merged], {type:"application/svg+xml"});
saveAs(blob, "parcoords.svg");

Reset context

We now  have saver our SVG file locally, but we have to still put back our old canvas context’s. We have stored these, so we can simply loop over the layer names and put back the old context. In principle, this last step might not be necessary, but I work on machines with a retina screen and ran into scaling issues when trying to use C2s context’s outside of the save function.

// we are done extracting the SVG information so
// put the original canvas contexts back
for (let i=0; i<layerNames.length; i++){
    pc0.ctx[layerNames[i]] = oldLayers[layerNames[i]]

Putting it all together

I have a repo on github with the full code including dependencies etc: .

The code shown in this blog is not complete. For example, brushed plots will not display nice and require some post processing of the SVG.

For those that are more familiar with D3.parcoords, note how the coloring of the lines is dependent on which axis you select. I have connected the color to a click event on the axis to make this possible.

Plotting Probability Ellipses for Bivariate Normal Distributions

Plotting probability ellipses can be a useful way to visualize bivariate normal distributions. A probability ellipse represents a contour of constant probability within which a certain percentage of the distribution lies. The width and orientation of probability ellipses can yield information about the correlation between the two data points of interest.

To plot probability ellipses of a bivariate normal distribution, you need to have a vector containing the means of both data sets of interest as well as the covariance matrix for the two data sets. Each probability ellipse is centered around the means of the two data sets and oriented in the direction of the first eigenvector of the covariance matrix. The length of the primary axis of each ellipse is proportional to the value of the percentile of the Chi Squared distribution for the given percentile the ellipse represents.

Below, I’ve coded an example (using Matlab) that is presented in the very informative textbook Statistical Methods in the Atmospheric Sciences by Dan Wilks. The example uses one month of daily minimum temperature data from the towns of Ithaca and Canandaigua, New York. Notice that the 50% ellipse in the center of the plot encloses roughly half the data points, indicating that the contours were plotted correctly.


Probability Ellipses for Ithaca and Canandaigua Minimum Temperatures


% Plotting  probability ellipses of the bivariate normal distribution
% By: D. Gold
% Created: 10/28/16
% This script will plot the multivariate cumulative probability contours 
% of two data sets that are fit to a multivariate normal distribution

clear all; close all; clc;

%%% Create Data Sets %%%
Ith_MinT = [19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0, 2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23];

Can_MinT = [28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23, 13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23];

%%% Calculate moments of the data sets %%%

% Observed
Ith_mean = mean(Ith_MinT); % T mean
Can_mean= mean(Can_MinT); % Td mean
CV = cov(Ith_MinT,Can_MinT); % covariance of T and Td
[Evec, Eval]=eig(CV); % Eigen values and vectors of covariance matrix

%%% Plot observed multivariate contours %%

% Observed data
xCenter = Ith_mean; % ellipses centered at sample averages
yCenter = Can_mean;
theta = 0 : 0.01 : 2*pi; % angles used for plotting ellipses

% compute angle for rotation of ellipse
% rotation angle will be angle between x axis and first eigenvector
x_vec= [1 ; 0]; % vector along x-axis
cosrotation =dot(x_vec,Evec(:,1))/(norm(x_vec)*norm(Evec(:,1))); 
rotation =pi/2-acos(cosrotation); % rotation angle
R  = [sin(rotation) cos(rotation); ...
      -cos(rotation)  sin(rotation)]; % create a rotation matrix

% create chi squared vector
chisq = [1.368 4.605 3.2188  5.991]; % percentiles of chi^2 dist df=2

% size ellipses for each quantile
for i = 1:length(chisq)
    % calculate the radius of the ellipse
    xRadius(i)=(chisq(i)*Eval(1,1))^.5; % primary
    yRadius(i)=(chisq(i)*Eval(2,2))^.5; % secondary
    % lines for plotting ellipse
    x{i} = xRadius(i)* cos(theta);
    y{i} = yRadius(i) * sin(theta);
    % rotate ellipse
    rotated_Coords{i} = R*[x{i} ; y{i}];
    % center ellipse

% Set up plot
xlabel('Ithaca Minimum Temperature (F)')
ylabel('Canandagua Minimum Temperature (F)(F)')
hold on

% Plot data points

% Plot contours
for j = 1:length(chisq)
legend('Data points','50%', '80%', '90%', '99%')