Visualizing large directed networks with ggraph in R

How you choose to visualize complex multidimensional data can significantly shape the insights your audience derives from the plots. My colleague Antonia has written a couple of excellent blog posts on analyzing geospatial networks in Python using the NetworkX library which can be found here and here. I generally lean towards Python for coding but have recently come around on R, mostly because of how easy it is in R to make pretty network visualizations. I will go over some basic network visualizations in R using the igraph and ggraph libraries in this blog post. All the code and data I am using can be found here.

The data I will be using in this post is a processed and cleaned csv-file of Upper Colorado River Basin (UCRB) user interactions obtained from CDSS. The Colorado River is one of the most important river systems in North America, supplying water to millions of Americans. The river has been facing a record 20 year-drought — a situation that is further complicated by prior appropriation doctrine where senior users can put junior ones out of priority until their own needs are met.

Shown below is a snippet of the dataset. Column A shows the user whose water right was put out of priority in 2002 by a water right of column C. Column E shows the total number of days that a water right of column A was put out priority by one of column C in 2002. The rest of the columns contain user attributes.

Now on to turning this lifeless spreadsheet into some pretty pictures. We begin by importing all the necessary libraries.

library(tidyverse)
library(igraph)
library(ggraph)
library(dplyr)
library(ggplot2)

Next we will import the csv-file shown above, and create a list of nodes and edges. This information can be used by the igraph library to create a directed network object. In this directed network, the source node of each edge is priorityWdid (column C) while the destination node is analysisWdid (column A), since the former is putting a call on the latter to divert flow away.

# read network .csv file
data <- read.csv('network_csv_files/priorityWdid/v2_network_2002.csv')

# create nodes
from <- unique(data[c('priorityWdid','priorityStructure', 'priorityNetAbs', 
                      'priorityStreamMile')]) %>% rename(wdid = priorityWdid) %>%
  rename(structure = priorityStructure) %>% rename(netAbs = priorityNetAbs) %>%
  rename(streamMile = priorityStreamMile)
to <- unique(data[c('analysisWdid','analysisStructure', 'analysisNetAbs',
                    'analysisStreamMile')]) %>% rename(wdid = analysisWdid) %>% 
  rename(structure = analysisStructure) %>% rename(netAbs = analysisNetAbs) %>%
  rename(streamMile = analysisStreamMile)
nodes <- unique(rbind(from, to))

# create edges
edges <- data[c('priorityWdid', 'analysisWdid', 'sumWtdCount')] %>% rename(from = priorityWdid) %>% rename(to = analysisWdid)

# create network using igraph package
network <- graph_from_data_frame(d = edges, vertices = nodes, directed = TRUE)

This network has over 1400 nodes! That would be an unreadable mess of a visualization, so let us filter down this network to only include interactions between the 200 senior-most water rights in the UCRB.

# only include top 200 users by seniority
users <- read.csv('../data/CDSS_WaterRights.csv')
users <- head(users[order(users$Priority.Admin.No),], 200)

top_users <- users$WDID
for (i in 1:length(top_users)){
  top_users[i] <- toString(top_users[i])
}

# create subnetwork with only top 200 users by seniority
sub_nodes <- intersect(nodes$wdid, top_users)
subnet <- induced.subgraph(network, sub_nodes)

Excellent! Only 37 of the top 200 users in the UCRB interacted with each other in 2002. This is a much more manageable number for plotting. Let us start with the most basic visualization. In keeping with ggplot syntax, we can conveniently just keep adding plot specifications with a “+”.

# basic network graph
ggraph(subnet, layout = 'stress') +
  ggtitle('2002') + 
  geom_edge_link() +
  geom_node_point() +
  theme_graph()

Alright this is nice, but not particularly insightful. We have no idea which user each node corresponds to, and this figure would have us believe that all nodes and edges were created equal. When we constructed the network we actually added in a bunch of node and edge attributes, which we can now use to make our visuals more informative.

Shown below is a more helpful visualization. I went ahead and added some attributes to the network, and labelled all the nodes with their structure names. The nodes are sized by their out-degree and colored by their stream mile. The edge widths are determined by the total number of days the source node put the destination out of priority. In order to do this, I leveraged an amazing feature of ggplot2 and ggraph called aesthetic mapping which is quick way to map variables to visual cues on a plot. It automatically scales the data and creates a legend, which we can then further customize.

# plot graph in circular layout
ggraph(subnet, layout = "circle") +
  ggtitle('2002: "circle" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue', 
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") + 
  geom_node_point(aes(size = deg, colour=streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = structure), repel = TRUE, size=2) +
  theme_graph()

The above network has a circle layout because it’s the easiest to read and is replicable. But there are actually a number of layouts available to choose from. Here is another one of my favorites, graphopt. While this layout is harder to read, it does a better job of revealing clusters in the network. The only change I had to make to the code above was swap out the word ‘circle’ for ‘graphopt’!

# plot graph in graphopt layout
set.seed(1998)
ggraph(subnet, layout = "graphopt") +
  ggtitle('2002: "graphopt" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue', 
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") + 
  geom_node_point(aes(size = deg, colour=streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = structure), repel = TRUE, size=2) +
  theme_graph()

The above graph would be a lot easier to read if it weren’t for the long labels cluttering everything up. One way to deal with this is to adjust the opacity (alpha) of the text by the degree of the node. This way only the important and central nodes will have prominent labels. Again, all I have to do is add two extra words in line 12 of the code block above. Notice that I did set show.legend to False because I don’t want a legend entry for text opacity in my plot.

ggraph(subnet, layout = "graphopt") +
  ggtitle('2002: "graphopt" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue', 
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") + 
  geom_node_point(aes(size = deg, colour=streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = structure, alpha = deg), repel = TRUE, size=2, show.legend = F) +
  theme_graph()

This is just a small sampling of the possibilities for network visualization in R. I have only just begun exploring the igraph and ggraph libraries, but the syntax is fairly intuitive, and the resultant plots are highly customizable. The data-to-viz blog is a pretty incredible resource to look at other network visualizations in R, if you are interested.

Basic network analysis on a directed network using NetworkX

This post is a follow up from my last one, where I’ll be demonstrating some of the basic network analysis capabilities of the Python library NetworkX. I will be using the same data and all my scripts can be found in the same repository. The data we’re using represent flows of food between US counties, which I am limiting to the 95th percentile of largest flows so the network is of a reasonable size for this simple analysis. Given that these are flows (i.e., from one place to another) this is referred to as a directed network, with every edge (link) having a source and a destination. NetworkX allows the analysis and visualization of several types of networks, illustrated below.

Source

Undirected Networks: Edges have no direction, the relationships are always reciprocal or symmetric, for example A is friends with B.
Directed Networks: Edges have direction and relationships don’t have to be reciprocal, for example B sends an email to A.
Weighted Networks: Edges contain some quantitative information indicating the weight of a relationship, for example A owes $6 dollars to B, B owes $13 dollars to C, etc.
Signed Networks: Edges in these networks also carry information about the type of interaction between the two nodes, positive or negative, for example A and B are friends but B and C are enemies.
Multi Networks: Several connections or attributes might exist between two nodes, for example A gave some 6 apples and 3 pears to B, B gave 4 pears and 8 peaches to C, etc.

I will use the rest of this blogpost to demonstrate some simple analysis that can be performed on a directed network like this one. This analysis is only demonstrative of the capabilities – obviously, US counties have several other connections between them in real life and the food network is only used here as a demonstration testbed, not to solve actual connectivity problems.

We’ll be answering questions such as:

  • How connected are counties to others?
  • Are there counties that are bigger ‘exporters’ than ‘importers’ of goods?
  • Can I send something from any one county to any other using only the already established connections in this network?
  • If yes, what is the largest number of connections that I would need? Are there counties with no connections between them?

Node connectivity is typically measured by the node’s degree. In undirected networks, this is simply the number of connections a node has. In directed networks, we can also distinguish between connections where the node is the source and where the node is the destination. To estimate them using NetworkX, we can use G.out_degree() and G.in_degree(), respectively. We can also calculate the average (in or out) degree by dividing by the total number of nodes. In this case they’re both around 3.08, i.e., on average, every node has three connections. Of course this tells us very little about our network, which is why most often people like to see the distribution of degrees. This is easily generated by sorting the degree values and plotting them with matplotlib.

nnodes = G.number_of_nodes()
degrees_in = [d for n, d in G.in_degree()]
degrees_out = [d for n, d in G.out_degree()]
avrg_degree_in = sum(degrees_in) / float(nnodes)
avrg_degree_out = sum(degrees_out) / float(nnodes)

in_values = sorted(set(degrees_in))
in_hist = [degrees_in.count(x) for x in in_values]
out_values = sorted(set(degrees_out))
out_hist = [degrees_out.count(x) for x in out_values]

plt.figure()
plt.plot(in_values,in_hist,'ro-') # in-degree
plt.plot(out_values,out_hist,'bo-') # out-degree
plt.legend(['In-degree','Out-degree'])
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.title('Food distribution network')
plt.close()

This shows that this network is primarily made up of nodes with few connections (degree<5) and few nodes with larger degrees. Distributions like this are common for real-world networks [1, 2], often times they follow an exponential or a log-normal distribution, sometimes a power law distribution (also referred to as “scale free”).

We can also compare the in- and out-degrees of the nodes in this network which would give us information about counties that export to more counties than they import from and vice versa. For example, in the figure below, points below the diagonal line represent counties that import from more places than they export to.

To address the last two prompt questions, we are essentially concerned with network connectness. In directed networks such as this one, we can distinguish between strongly connected and weakly connected notions. A network is weakly connected if there is an undirected path between any pair of nodes (i.e., ignoring edge direction), and strongly connected if there is a directed path between every pair of vertices (i.e., only following edge direction) [3]. The networks below are all weakly but not strongly connected:

NetworkX can help answer these questions for our network, using existent and intuitive functionality. Executing:

nx.is_strongly_connected(G)
nx.is_weakly_connected(G)

will return False for both. This means that using the already established connections and directions, not all nodes can be reached by all other nodes. If we ignore the directions (weak connectedness) this remains the case. This implies that our network is made up of more than one components, i.e., connected subgraphs of our network. For example the undirected graph below has three components:

Strongly connected components in directed graphs also consider the direction of each edge. For example the directed graph below also has three components:

Weakly connected components in directed graphs are identified by ignoring the direction of the edges, so in the above example, the graph has one weakly connected component.

To examine these components for our network we can use nx.strongly_connected_components(G) and nx.weakly_connected_components(G) which would produce lists of all strongly or weakly connected components in the network, respectively, in this case 1156 strongly connected and 111 weakly connected components. In both cases this includes one giant component including most of the network nodes, 1220 in the strongly connected and 2348 in the weakly connected case, and hundreds of small components with fewer than 10 nodes trading between them.

Finally, we can draw these strongly and weakly connected components. In the top row of figure below, I show the largest components identified by each definition and in the bottom row all other components in the network, according to each definition.

References:
[1] Broido, A.D., Clauset, A. Scale-free networks are rare. Nat Commun 10, 1017 (2019). https://doi.org/10.1038/s41467-019-08746-5
[2] http://networksciencebook.com/
[3] Skiena, S. “Strong and Weak Connectivity.” §5.1.2 in Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Reading, MA: Addison-Wesley, pp. 172-174, 1990.

Networks on maps: exploring spatial connections using NetworkX and Basemap

This blogpost is about generating network graphs interlaid on spatial maps. I’ll be using the data provided by this paper (in the supplementary material) which estimates flows of food across US counties. All the code I’m using here can be found here.

The dataset included in erl_14_8_084011_sd_3.csv of the supplementary material lists the tons of food transported per food category, using the standard classification of transported goods (SCTG) food categories included in the study. The last two columns, ori and des, indicate the origin and destination counties of each flow, using FIPS codes.

To draw the network nodes (the counties) in their geographic locations I had to identify lat and lon coordinates for each county using its FIPS code, which can be found here 1.

Now, let’s these connections in Python, using NetworkX and Basemap. The entire script is here, I’ll just be showing the important snippets below. In the paper, they limit the visualization to the largest 5% of food flows, which I can confirm is necessary otherwise the figure would be unreadable. We first load the data using pandas (or other package that reads csv files), identify the 95th percentile and restrict the data to only those 5% largest flows.

data = pd.read_csv('erl_14_8_084011_sd_3.csv')
threshold = np.percentile(data['total'], 95)
data = data.loc[(data['total'] > threshold)]

Using NetworkX, we can directly create a network out of these data. The most important things I need to define are the dataframe column that lists my source nodes, the column that lists my destination nodes and which attribute makes up my network edges (the connections between nodes), in this case the total food flows.

G = nx.from_pandas_edgelist(df=data, source='ori', target='des', edge_attr='total',create_using = nx.DiGraph())

Drawing this network without the spatial information attached (using the standard nx.draw(G)) looks something like below, which does hold some information about the structure of this network, but misses the spatial information we know to be associated with those nodes (counties).

To associate the spatial information with those nodes, we’ll employ Basemap to create a map and use its projection to convert the lat and lon values of each county to x and y positions for our matplotlib figure. When those positions are estimated and stored in the pos dictionary, I then draw the network using the specific positions. I finally also draw country and state lines. You’ll notice that I didn’t draw the entire network but only the edges (nx.draw_networkx_edges) in an effort to replicate the style of the figure from the original paper and to declutter the figure.

plt.figure(figsize = (12,8))
m = Basemap(projection='merc',llcrnrlon=-160,llcrnrlat=15,urcrnrlon=-60,
urcrnrlat=50, lat_ts=0, resolution='l',suppress_ticks=True)
mx, my = m(pos_data['lon'].values, pos_data['lat'].values)
pos = {}
for count, elem in enumerate(pos_data['nodes']):
     pos[elem] = (mx[count], my[count])
nx.draw_networkx_edges(G, pos = pos, edge_color='blue', alpha=0.1, arrows = False)
m.drawcountries(linewidth = 2)
m.drawstates(linewidth = 0.2)
m.drawcoastlines(linewidth=2)
plt.tight_layout()
plt.savefig("map.png", dpi = 300)
plt.show()

The resulting figure is the following, corresponding to Fig. 5B from the original paper.

I was also interested in replicating some of the analysis done in the paper, using NetworkX, to identify the counties most critical to the structure of the food flow network. Using the entire network now (not just the top 5% of flows) we can use NetworkX functions to calculate each node’s degree and between-ness centrality. The degree indicates the number of nodes a node is connected to, between-ness centrality is an indicator of the fraction of shortest paths between two nodes that pass through a specific node. These are network metrics that are unrelated to the physical distance between two counties and can be used (along with several other metrics) to make inferences about the importance and the position of a specific node in a network. We can calculate them in NetworkX as shown below and plot them using simple pyplot commands:

connectivity = list(G.degree())
connectivity_values = [n[1] for n in connectivity]
centrality = nx.betweenness_centrality(G).values()

plt.figure(figsize = (12,8))
plt.plot(centrality, connectivity_values,'ro')
plt.xlabel('Node centrality', fontsize='large')
plt.ylabel('Node connectivity', fontsize='large')
plt.savefig("node_connectivity.png", dpi = 300)
plt.show()

The resulting figure is shown below, matching the equivalent Fig. 6 of the original paper. As the authors point out, there are some counties in this network, those with high connectivity and high centrality, that are most critical to its structure: San Berndardino, CA; Riverside, CA; Los Angeles, CA; Shelby, TN; San Joaquin, CA; Maricopa, AZ; San Diego, CA; Harris, TX; and Fresno, CA.

1 – If you are interested in how this is done, I used the National Counties Gazetteer file from the US Census Bureau and looked up each code to get its lat and lon.