Networks on maps: exploring spatial connections using NetworkX and Basemap

This blogpost is about generating network graphs interlaid on spatial maps. I’ll be using the data provided by this paper (in the supplementary material) which estimates flows of food across US counties. All the code I’m using here can be found here.

The dataset included in erl_14_8_084011_sd_3.csv of the supplementary material lists the tons of food transported per food category, using the standard classification of transported goods (SCTG) food categories included in the study. The last two columns, ori and des, indicate the origin and destination counties of each flow, using FIPS codes.

To draw the network nodes (the counties) in their geographic locations I had to identify lat and lon coordinates for each county using its FIPS code, which can be found here 1.

Now, let’s these connections in Python, using NetworkX and Basemap. The entire script is here, I’ll just be showing the important snippets below. In the paper, they limit the visualization to the largest 5% of food flows, which I can confirm is necessary otherwise the figure would be unreadable. We first load the data using pandas (or other package that reads csv files), identify the 95th percentile and restrict the data to only those 5% largest flows.

data = pd.read_csv('erl_14_8_084011_sd_3.csv')
threshold = np.percentile(data['total'], 95)
data = data.loc[(data['total'] > threshold)]

Using NetworkX, we can directly create a network out of these data. The most important things I need to define are the dataframe column that lists my source nodes, the column that lists my destination nodes and which attribute makes up my network edges (the connections between nodes), in this case the total food flows.

G = nx.from_pandas_edgelist(df=data, source='ori', target='des', edge_attr='total',create_using = nx.DiGraph())

Drawing this network without the spatial information attached (using the standard nx.draw(G)) looks something like below, which does hold some information about the structure of this network, but misses the spatial information we know to be associated with those nodes (counties).

To associate the spatial information with those nodes, we’ll employ Basemap to create a map and use its projection to convert the lat and lon values of each county to x and y positions for our matplotlib figure. When those positions are estimated and stored in the pos dictionary, I then draw the network using the specific positions. I finally also draw country and state lines. You’ll notice that I didn’t draw the entire network but only the edges (nx.draw_networkx_edges) in an effort to replicate the style of the figure from the original paper and to declutter the figure.

plt.figure(figsize = (12,8))
m = Basemap(projection='merc',llcrnrlon=-160,llcrnrlat=15,urcrnrlon=-60,
urcrnrlat=50, lat_ts=0, resolution='l',suppress_ticks=True)
mx, my = m(pos_data['lon'].values, pos_data['lat'].values)
pos = {}
for count, elem in enumerate(pos_data['nodes']):
     pos[elem] = (mx[count], my[count])
nx.draw_networkx_edges(G, pos = pos, edge_color='blue', alpha=0.1, arrows = False)
m.drawcountries(linewidth = 2)
m.drawstates(linewidth = 0.2)
plt.savefig("map.png", dpi = 300)

The resulting figure is the following, corresponding to Fig. 5B from the original paper.

I was also interested in replicating some of the analysis done in the paper, using NetworkX, to identify the counties most critical to the structure of the food flow network. Using the entire network now (not just the top 5% of flows) we can use NetworkX functions to calculate each node’s degree and between-ness centrality. The degree indicates the number of nodes a node is connected to, between-ness centrality is an indicator of the fraction of shortest paths between two nodes that pass through a specific node. These are network metrics that are unrelated to the physical distance between two counties and can be used (along with several other metrics) to make inferences about the importance and the position of a specific node in a network. We can calculate them in NetworkX as shown below and plot them using simple pyplot commands:

connectivity = list(
connectivity_values = [n[1] for n in connectivity]
centrality = nx.betweenness_centrality(G).values()

plt.figure(figsize = (12,8))
plt.plot(centrality, connectivity_values,'ro')
plt.xlabel('Node centrality', fontsize='large')
plt.ylabel('Node connectivity', fontsize='large')
plt.savefig("node_connectivity.png", dpi = 300)

The resulting figure is shown below, matching the equivalent Fig. 6 of the original paper. As the authors point out, there are some counties in this network, those with high connectivity and high centrality, that are most critical to its structure: San Berndardino, CA; Riverside, CA; Los Angeles, CA; Shelby, TN; San Joaquin, CA; Maricopa, AZ; San Diego, CA; Harris, TX; and Fresno, CA.

1 – If you are interested in how this is done, I used the National Counties Gazetteer file from the US Census Bureau and looked up each code to get its lat and lon.

Plotting geographic data from geojson files using Python

Plotting geographic data from geojson files using Python

Hi folks,

I’m writing today about plotting geojson files with Matplotlib’s Basemap.  In a previous post I laid out how to plot shapefiles using Basemap.

geojson is an open file format for representing geographical data based on java script notation.  They are composed of points, lines, and polygons or ‘multiple’ (e.g. multipolygons composed of several polygons), with accompanying properties.  The basic structure is one of names and vales, where names are always strings and values may be strings, objects, arrays, or logical literal.

The geojson structure we will be considering here is a collection of features, where each feature contains a geometry and properties.  Each geojson feature must contain properties and geometry.  Properties could be things like country name, country code, state, etc.  The geometry must contain a type (point, line, polygons, etc.) and coordinates (likely an array of lat-long). Below is an excerpt of a geojson file specifying Agro-Ecological Zones (AEZs) within the various GCAM regions.

"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },

"features": [
{ "type": "Feature", "id": 1, "properties": { "ID": 1.000000, "GRIDCODE": 11913.000000, "CTRYCODE": 119.000000, "CTRYNAME": "Russian Fed", "AEZ": 13.000000, "GCAM_ID": "Russian Fed-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ 99.5, 78.5 ], [ 98.33203125, 78.735787391662598 ], [ 98.85723876953125, 79.66796875 ], [ 99.901641845703125, 79.308036804199219 ], [ 99.5, 78.5 ] ] ] ] } },
{ "type": "Feature", "id": 2, "properties": { "ID": 2.000000, "GRIDCODE": 11913.000000, "CTRYCODE": 119.000000, "CTRYNAME": "Russian Fed", "AEZ": 13.000000, "GCAM_ID": "Russian Fed-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ 104.5, 78.0 ], [ 104.0, 78.0 ], [ 99.5, 78.0 ], [ 99.5, 78.5 ], [ 100.2957763671875, 78.704218864440918 ], [ 102.13778686523437, 79.477890968322754 ], [ 104.83050537109375, 78.786871910095215 ], [ 104.5, 78.0 ] ] ] ] } },
{ "type": "Feature", "id": 3, "properties": { "ID": 3.000000, "GRIDCODE": 2713.000000, "CTRYCODE": 27.000000, "CTRYNAME": "Canada", "AEZ": 13.000000, "GCAM_ID": "Canada-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ -99.5, 77.5 ], [ -100.50860595703125, 77.896504402160645 ], [ -101.76053619384766, 77.711499214172363 ], [ -104.68202209472656, 78.563323974609375 ], [ -105.71781158447266, 79.692866325378418 ], [ -99.067413330078125, 78.600395202636719 ], [ -99.5, 77.5 ] ] ] ] } }

Now that we have some understanding of the geojson structure, plotting the information therein should be as straightforward as traversing that structure and tying geometries to data.  We do the former using the geojson python package and the latter using pretty basic python manipulation.  To do the actual plotting, we’ll use PolygonPatches from the descartes library and recycle most of the code from my previous post.

We start by importing the necessary libraries and then open the geojson file.

import geojson
from descartes import PolygonPatch
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np

with open("aez-w-greenland.geojson") as json_file:
    json_data = geojson.load(json_file)

We then define a MatplotLib Figure, and generate a Basemap object as a ‘canvas’ to draw the geojson geometries on.

ax = plt.figure(figsize=(10,10)).add_subplot(111)#fig.gca()

m = Basemap(projection='robin', lon_0=0,resolution='c')
m.drawmapboundary(fill_color='white', zorder=-1)
m.drawparallels(np.arange(-90.,91.,30.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)
m.drawmeridians(np.arange(0., 360., 60.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)
m.drawcoastlines(color='0.6', linewidth=1)

Next, we iterate over the nested features in this file and pull out the coordinate list defining each feature’s geometry (line 2).  In lines 4-5 we also pull out the feature’s name and AEZ, which I can tie to GCAM data.

for i in range(2799):
    coordlist = json_data.features[i]['geometry']['coordinates'][0]
    if i < 2796:
        name = json_data.features[i]['properties']['CTRYNAME']
        aez =  json_data.features[i]['properties']['AEZ']

    for j in range(len(coordlist)):
        for k in range(len(coordlist[j])):

    poly = {"type":"Polygon","coordinates":coordlist}#coordlist
    ax.add_patch(PolygonPatch(poly, fc=[0,0.5,0], ec=[0,0.3,0], zorder=0.2 ))


Line 9 is used to convert the coordinate list from lat/long units to meters.  Depending on what projection you’re working in and what units your inputs are in, you may or may not need to do this step.

The final lines are used to add the polygon to the figure, and to make the face color of each polygon green and the border dark green. Which generates the figure:


To get a bit more fancy, we could tie the data to a colormap and then link that to the facecolor of the polygons.  For instance, the following figure shows the improvement in maize yields over the next century in the shared socio-economic pathway 1 (SSP 1), relative to a reference scenario (SSP 2).


Making Movies of Time-Evolving Global Maps with Python

Making Movies of Time-Evolving Global Maps with Python

Hi All,

These past few months I’ve been working with the Global Change Assessment Model (GCAM) which is an integrated assessment model (IAM) that combines models of the global climate and economic systems. I wrote an earlier post on compiling GCAM on a Unix cluster.  This post discusses some visualization tools I’ve developed for GCAM output.

GCAM models energy and agriculture systems at a regional level, where the world is composed of 32 regions.  We’re interested in tracking statistics (like the policy cost of stabilization) over time and across regions.  This required three things:

  1. The ability to draw a global map.
  2. The ability to shade individual political units on that map.
  3. The ability to animate this map.

Dr. Jon Herman has already posted a good example of how to do (1) in python using matplotlib’s Basemap.  We’ll appropriate some of his example for this example.  The Basemap has the option to draw coastlines and boundaries, but these boundaries are not tied to shapes, meaning that you can’t assign different colors to individual countries (task (2) above).  To do that, we need a shapefile containing information about political boundaries.  You can find these for free from a number of sources online, but I like Natural Earth.  They provide data on many different scales. For this application I downloaded their coarsest data set.  To give each country a shade which is tied to data, we use matplotlib’s color map.  The basic plan is to generate a colored map for each time-step in our data, and then to animate the maps using the convert linux command.

Now that we’ve described roughly how we’ll proceed, a word about the data we’re dealing with and how I’ve handled it.  GCAM has 32 geo-political regions, some of which are individual countries (like the USA or China), while others are groups of countries (like Australia & New Zealand). I stored this information using a list of lists (i.e. a 32-element list, where each element is a list of countries in that region). I’ve creatively named this variable list_list in this example (see code below). For each of the regions GCAM produces a time series of policy costs as a fraction of GDP every 5 years from 2020-2100. I’ve creatively named this variable data. We want to tie the color of a country in each time to its policy cost relative to costs across countries and times.  To do this, I wrote the following (clumsy!) Python function, which I explain below.

def world_plot(data,idx,MN,MX):
 from mpl_toolkits.basemap import Basemap
 import matplotlib.pyplot as plt
 from matplotlib.patches import Polygon
 from matplotlib.collections import PatchCollection
 import as cm
 import matplotlib as mpl
 import numpy as np

 norm = mpl.colors.Normalize(vmin=MN, vmax=MX)
 cmap = cm.coolwarm
 colors=cm.ScalarMappable(norm=norm, cmap=cmap)
 a = np.zeros([32,4])
 a = colors.to_rgba(data)

 fig = plt.figure(figsize=(10,10))
 ax = fig.add_subplot(111)

 m = Basemap(projection='robin', lon_0=0,resolution='c')
 m.drawmapboundary(fill_color='white', zorder=-1)
 m.drawparallels(np.arange(-90.,91.,30.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)
 m.drawmeridians(np.arange(0., 360., 60.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)

 year = [1990,2005,2010,2015,2020,2025,2030,2035,2040,2045,2050,2055,2060,2065,2070,2075,2080,2085,2090,2095,2100]
 GCAM_32 = ['PRI','USA','VIR']
 GCAM_1 = ['BDI','COM','DJI','ERI','ETH','KEN','MDG','MUS','REU','RWA','SDS','SDN','SOM','UGA','SOL']
 GCAM_2 = ['DZA','EGY','ESH','LBY','MAR','TUN','SAH']
 GCAM_3 = ['AGO','BWA','LSO','MOZ','MWI','NAM','SWZ','TZA','ZMB','ZWE']
 GCAM_4 = ['BEN','BFA','CAF','CIV','CMR','COD','COG','CPV','GAB','GHA','GIN','GMB','GNB','GNQ','LBR','MLI','MRT','NER','NGA','SEN','SLE','STP','TCD','TGO']
 GCAM_6 = ['AUS','NZL']
 GCAM_7 = ['BRA']
 GCAM_8 = ['CAN']
 GCAM_9 = ['ABW','AIA','ANT','ATG','BHS','BLZ','BMU','BRB','CRI','CUB','CYM','DMA','DOM','GLP','GRD','GTM','HND','HTI','JAM','KNA','LCA','MSR','MTQ','NIC','PAN','SLV','TTO','VCT']
 GCAM_10 = ['ARM','AZE','GEO','KAZ','KGZ','MNG','TJK','TKM','UZB']
 GCAM_11 = ['CHN','HKG','MAC']
 GCAM_13 = ['BGR','CYP','CZE','EST','HUN','LTU','LVA','MLT','POL','ROM','SVK','SVN']
 GCAM_14 = ['AND','AUT','BEL','CHI','DEU','DNK','ESP','FIN','FLK','FRA','FRO','GBR','GIB','GRC','GRL','IMN','IRL','ITA','LUX','MCO','NLD','PRT','SHN','SMR','SPM','SWE','TCA','VAT','VGB','WLF']
 GCAM_15 = ['BLR','MDA','UKR']
 GCAM_16 = ['ALB','BIH','HRV','MKD','MNE','SCG','SRB','TUR','YUG']
 GCAM_17 = ['CHE','ISL','LIE','NOR','SJM']
 GCAM_18 = ['IND']
 GCAM_19 = ['IDN']
 GCAM_20 = ['JPN']
 GCAM_21 = ['MEX']
 GCAM_22 = ['ARE','BHR','IRN','IRQ','ISR','JOR','KWT','LBN','OMN','PSE','QAT','SAU','SYR','YEM']
 GCAM_23 = ['PAK']
 GCAM_24 = ['RUS']
 GCAM_25 = ['ZAF']
 GCAM_26 = ['GUF','GUY','SUR','VEN']
 GCAM_27 = ['BOL','CHL','ECU','PER','PRY','URY']
 GCAM_28 = ['AFG','ASM','BGD','BTN','LAO','LKA','MDV','NPL']
 GCAM_29 = ['KOR']
 GCAM_30 = ['BRN','CCK','COK','CXR','FJI','FSM','GUM','KHM','KIR','MHL','MMR','MNP','MYS','MYT','NCL','NFK','NIU','NRU','PCI','PCN','PHL','PLW','PNG','PRK','PYF','SGP','SLB','SYC','THA','TKL','TLS','TON','TUV','VNM','VUT','WSM']
 GCAM_31 = ['TWN']
 GCAM_5 = ['ARG']
 GCAM_12 = ['COL']

 list_list = [GCAM_1,GCAM_2,GCAM_3,GCAM_4,GCAM_5,GCAM_6,GCAM_7,GCAM_8,GCAM_9,GCAM_10,GCAM_11,GCAM_12,GCAM_13,GCAM_14,GCAM_15,GCAM_16,GCAM_17,GCAM_18,GCAM_19,GCAM_20,GCAM_21,GCAM_22,GCAM_23,GCAM_24,GCAM_25,GCAM_26,GCAM_27,GCAM_28,GCAM_29,GCAM_30,GCAM_31,GCAM_32]
 num = len(list_list)
 for info, shape in zip(m.comarques_info,m.comarques):
 for i in range(num):
 if info['adm0_a3'] in list_list[i]:
 patches1 = []
 patches1.append( Polygon(np.array(shape), True) )
 ax.set_title('Policy Cost',fontsize=25,y=1.01)#GDP Adjusted Policy Cost#Policy Cost#Policy Cost Reduction from Technology
 plt.annotate('%s'%year[idx],xy=(0.1,0.2),xytext=(0.1,0.2),xycoords='axes fraction',fontsize=30)
 cb = m.colorbar(colors,'right')
 filename = &amp;quot;out/map_%s.png&amp;quot; %(str(idx).rjust(3,&amp;quot;0&amp;quot;))

The function’s name is world_plot and it’s inputs are:

  1. The raw data for a specific time step.
  2. The index of the time step for the map we are working with (e.g. idx=0 for 2020).
  3. The minimum and maximum of the data across countries and time.

(1) is plotted, (2) is used to name the resulting png figure (line 73), and (3) is used to scale the color colormap (line 11).  On lines 2-8 we import the necessary Python packages, which in this case are pretty standard Matplotlib packages and numpy.  On lines 11-16 we generate a numpy array which contains the rgba color code for each of the data points in data.  In lines 18-19 we create the pyplot figure object.

On lines 21-24 we create and format the Basemap object.  Note that on line 21 I’ve selected the Robinson projection, but that the Basemap provides many options.

Lines 26-60 are specific for this application, and certainly could have been handled more compactly if I wanted to invest the time.  year is a list of time steps for our GCAM experiment, and lines 27-58 contain lists of three letter ID codes for each GCAM region, which are assembled into a list of lists (creatively called list_list) on line 60.

On line 61 we read the data from the shapefile database which was downloaded from Natural Earth. From lines 63-68 we loop through the info and shape attributes of the shapefile database, and determine which of the GCAM geo-political units each of the administrative units in the database is associated with.  Once this is determined, the polygon associated with that administrative unit is given the correct color (lines 66-68).

Lines 69-72 are doing some final formatting and labeling, and in lines 73-75 we are giving the file a unique name (tied to the time step plotted) and saving the images to some output directory.

When we put this function into a loop over time, we generate a sequence of figures looking something like this:


To convert this sequence of PNGs to a gif file, we use the convert command in linux (or in my case Cygwin).  So, we go to the command line and cd into the directory where we’ve saved our figures and type:

convert -delay 45 -loop 0 *.png globe_Cost_Reduction_faster.gif

Here the delay flag controls the framerate of the gif (in milliseconds), the loop flag controls whether the gif repeats, next I’m using a wildcat to include all of the pngs in the output directory, and the final input is the resulting name of the gif. The final product:



Python – Clip raster data with a shapefile

The problem: you have some georeferenced raster data, but you’re only interested in the points that lie inside of a shapefile boundary. This could be a country, state, watershed, etc. — it doesn’t matter, as long as you have a shapefile to define the boundary.

(Aside: georeferenced raster data can come in a number of formats, some are listed here.)

The solution: we’re going to clip the raster data to remove points outside the shapefile boundary, and then plot it with Python.

The ingredients: If you’re following along, we’ll be using some CMIP5 precipitation projections at 5-minute spatial resolution, which you can find here. (In particular, the 2070 projections from the GFDL-CM3 model under the rcp45 emissions scenario, but any of them will do). You’ll also need a shapefile, so if you don’t have one handy you can grab the US states shapefile.

(If you’re interested in watershed boundary shapefiles, here are the latest HUC classifications from USGS. If you grab the file, it contains all of HUC{2,4,6,8,10,12} shapefiles. Be warned, it is large, and you may be able to find your specific watershed elsewhere.)

If we plot the raster data—in this case, projected average August precip in millimeters—on top of the shapefile, we’ll get something like this, where the data covers all of North America:


I’ll put an example of making this plot with Python+Basemap at the end of the post.

To clip the raster with our US shapefile, we’ll be using GDAL, a collection of tools for manipulating spatial data. On Windows, the GDAL binaries and Python wrapper libraries all come bundled with the Python(x,y) installation, which is what I’m using. On *nix, pip install gdal should achieve the same thing, but you may need to fiddle around with base packages and installation settings.

Anyway, once you have the GDAL binaries, go to your command line and run this:

gdalwarp -cutline path/to/states.shp \
-crop_to_cutline -dstnodata "-9999.0" \
my_raster_file.tif \

The -dstnodata option specifies the “nodata” value to use in the destination file, where points outside the shapefile will be masked. If all goes well, you should now have my_raster_file_usa.tif which you can plot:


Which is the same data, of course, but clipped! You should be able to do this with any state, county, or watershed you want, as long as the resolution of the raster isn’t coarser than the area of the shapefile (in which case you would only be clipping one pixel).

As promised, here’s how to make these plots in Python. The line cmap.set_under('1.0') is deceptively important, because it sets all the “nodata” values in the file to a white background (remember, we used -9999.0 as the nodata value). Thanks for reading!

Update 3/14/17: Ben asked a good question about how this approach will treat pixels along the edge of the shapefile. Some visual investigation suggests that it (1) removes all pixels that extend beyond the border, and (2) adds a “border” of zero-valued pixels along the edge. It does not seem to aggregate the fractional pixels.

[gist jdherman/7434f7431d1cc0350dbe]