# Visualizing large directed networks with ggraph in R

How you choose to visualize complex multidimensional data can significantly shape the insights your audience derives from the plots. My colleague Antonia has written a couple of excellent blog posts on analyzing geospatial networks in Python using the NetworkX library which can be found here and here. I generally lean towards Python for coding but have recently come around on R, mostly because of how easy it is in R to make pretty network visualizations. I will go over some basic network visualizations in R using the igraph and ggraph libraries in this blog post. All the code and data I am using can be found here.

The data I will be using in this post is a processed and cleaned CSV file of Upper Colorado River Basin (UCRB) user interactions obtained from CDSS. The Colorado River is one of the most important river systems in North America, supplying water to millions of Americans. The river has been facing a record 20-year drought, a situation that is further complicated by the prior appropriation doctrine, under which senior users can put junior ones out of priority until their own needs are met.

Shown below is a snippet of the dataset. Column A shows the user whose water right was put out of priority in 2002 by a water right in column C. Column E shows the total number of days in 2002 that the water right in column A was put out of priority by the one in column C. The rest of the columns contain user attributes.

Now on to turning this lifeless spreadsheet into some pretty pictures. We begin by importing all the necessary libraries.

```r
library(tidyverse)
library(igraph)
library(ggraph)
library(dplyr)
library(ggplot2)
```

Next we will import the csv-file shown above, and create a list of nodes and edges. This information can be used by the igraph library to create a directed network object. In this directed network, the source node of each edge is priorityWdid (column C) while the destination node is analysisWdid (column A), since the former is putting a call on the latter to divert flow away.

```r
# read network .csv file
# NOTE: placeholder path; the actual file name is not shown in this post
data <- read.csv('path/to/network.csv')

# create nodes: one row per unique user, with a common set of column names
from <- unique(data[c('priorityWdid', 'priorityStructure', 'priorityNetAbs',
                      'priorityStreamMile')]) %>%
  rename(wdid = priorityWdid, structure = priorityStructure,
         netAbs = priorityNetAbs, streamMile = priorityStreamMile)
to <- unique(data[c('analysisWdid', 'analysisStructure', 'analysisNetAbs',
                    'analysisStreamMile')]) %>%
  rename(wdid = analysisWdid, structure = analysisStructure,
         netAbs = analysisNetAbs, streamMile = analysisStreamMile)
nodes <- unique(rbind(from, to))

# create edges: priority (senior) user -> analysis (junior) user
edges <- data[c('priorityWdid', 'analysisWdid', 'sumWtdCount')] %>%
  rename(from = priorityWdid, to = analysisWdid)

# create network using igraph package
network <- graph_from_data_frame(d = edges, vertices = nodes, directed = TRUE)
```
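To make the edge direction concrete, here is a minimal sketch on made-up data. The three users "A", "B" and "C" and their values are hypothetical and are not taken from the CDSS dataset; only the column names mirror the ones used above.

```r
library(igraph)

# Hypothetical toy data: three made-up users, not from the CDSS dataset
toy_nodes <- data.frame(
  wdid = c("A", "B", "C"),
  streamMile = c(10, 25, 40)
)
toy_edges <- data.frame(
  from = c("A", "A", "B"),      # senior right placing the call
  to = c("B", "C", "C"),        # junior right put out of priority
  sumWtdCount = c(30, 12, 5)    # made-up number of days
)

# The first two columns of the edge table are read as source and destination;
# any remaining columns become edge attributes
toy_net <- graph_from_data_frame(d = toy_edges, vertices = toy_nodes, directed = TRUE)

vcount(toy_net)                  # number of nodes
ecount(toy_net)                  # number of edges
degree(toy_net, mode = "out")    # number of users each node puts out of priority
E(toy_net)$sumWtdCount           # edge attribute carried over from toy_edges
```

In this toy network, "A" is the source of two edges because it puts both "B" and "C" out of priority, which is the same convention as priorityWdid pointing to analysisWdid in the real data.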

This network has over 1400 nodes! That would be an unreadable mess of a visualization, so let us filter down this network to only include interactions between the 200 senior-most water rights in the UCRB.

```r
# only include top 200 users by seniority
# NOTE: placeholder path; the actual file name is not shown in this post
users <- read.csv('path/to/top_users.csv')
top_users <- as.character(users$WDID)

# create subnetwork with only top 200 users by seniority
# (vertex names are character, so compare the IDs as strings)
sub_nodes <- intersect(as.character(nodes$wdid), top_users)
subnet <- induced_subgraph(network, sub_nodes)
```
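As a small illustration of what the filtering step does, the sketch below builds a made-up four-node network and keeps only the nodes that appear in a hypothetical "top users" list. The node names and the list are invented for illustration. Note that "Z" is in the list but not in the network, which is why the intersect step is needed before subsetting.

```r
library(igraph)

# Hypothetical toy network with made-up user IDs
toy_edges <- data.frame(
  from = c("A", "A", "B", "D"),
  to = c("B", "C", "C", "A")
)
toy_net <- graph_from_data_frame(toy_edges, directed = TRUE)

# Hypothetical list of senior users; "Z" does not appear in the network
toy_top <- c("A", "B", "D", "Z")

# Keep only the IDs that exist as vertices, then take the induced subgraph,
# which retains just the edges whose endpoints are both in the kept set
keep <- intersect(V(toy_net)$name, toy_top)
toy_sub <- induced_subgraph(toy_net, keep)

vcount(toy_sub)    # nodes that survived the filter
ecount(toy_sub)    # edges between surviving nodes only
```

Edges that touch a dropped node (here, anything involving "C") disappear from the subnetwork, so the result only shows interactions among the selected users.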

Excellent! Only 37 of the top 200 users in the UCRB interacted with each other in 2002. This is a much more manageable number for plotting. Let us start with the most basic visualization. In keeping with ggplot syntax, we can conveniently just keep adding plot specifications with a “+”.

```r
# basic network graph
ggraph(subnet, layout = 'stress') +
  ggtitle('2002') +
  geom_edge_link() +    # draw the edges referred to in the text
  geom_node_point() +   # draw the nodes
  theme_graph()
```

Alright this is nice, but not particularly insightful. We have no idea which user each node corresponds to, and this figure would have us believe that all nodes and edges were created equal. When we constructed the network we actually added in a bunch of node and edge attributes, which we can now use to make our visuals more informative.

Shown below is a more helpful visualization. I went ahead and added some attributes to the network, and labelled all the nodes with their structure names. The nodes are sized by their out-degree and colored by their stream mile. The edge widths are determined by the total number of days the source node put the destination out of priority. In order to do this, I leveraged an amazing feature of ggplot2 and ggraph called aesthetic mapping, which is a quick way to map variables to visual cues on a plot. It automatically scales the data and creates a legend, which we can then further customize.
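The plots in this post size nodes by a vertex attribute called deg, but the step that attaches it to the network is not shown. One plausible way to do it, sketched here on a made-up toy network, is to compute the out-degree with igraph and store it on the vertices. This is an assumption about how the attribute was created; on the real data the equivalent line would presumably be V(subnet)$deg <- degree(subnet, mode = "out").

```r
library(igraph)

# Hypothetical toy network with made-up user IDs
toy_edges <- data.frame(
  from = c("A", "A", "B"),
  to = c("B", "C", "C")
)
toy_net <- graph_from_data_frame(toy_edges, directed = TRUE)

# Out-degree: how many other users each node puts out of priority.
# Storing it as a vertex attribute makes it available to aes() in ggraph.
V(toy_net)$deg <- degree(toy_net, mode = "out")

vertex_attr_names(toy_net)    # now includes "deg" alongside "name"
V(toy_net)$deg                # out-degree per vertex
```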

```r
# plot graph in circular layout
ggraph(subnet, layout = "circle") +
  ggtitle('2002: "circle" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue',
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") +
  geom_node_point(aes(size = deg, colour = streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = structure), repel = TRUE, size = 2) +
  theme_graph()
```

The above network has a circle layout because it’s the easiest to read and is replicable. But there are actually a number of layouts available to choose from. Here is another one of my favorites, graphopt. While this layout is harder to read, it does a better job of revealing clusters in the network. The only change I had to make to the code above was swap out the word ‘circle’ for ‘graphopt’!

```r
# plot graph in graphopt layout
set.seed(1998)
ggraph(subnet, layout = "graphopt") +
  ggtitle('2002: "graphopt" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue',
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") +
  geom_node_point(aes(size = deg, colour = streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = structure), repel = TRUE, size = 2) +
  theme_graph()
```
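The set.seed() call in the graphopt block is there because, unlike the circle layout, graphopt starts from random node positions by default. The sketch below illustrates the difference using igraph's layout functions directly on a made-up ring graph; the graph is hypothetical, and I am assuming that ggraph's "circle" and "graphopt" layouts delegate to these igraph routines.

```r
library(igraph)

# Hypothetical toy graph: a directed ring of 10 nodes
toy_net <- make_ring(10, directed = TRUE)

# The circle layout is deterministic: no seed needed
circle_a <- layout_in_circle(toy_net)
circle_b <- layout_in_circle(toy_net)
identical(circle_a, circle_b)

# graphopt is force-directed with random starting positions by default,
# so the seed is reset before each call to get matching coordinates
set.seed(1998)
opt_a <- layout_with_graphopt(toy_net)
set.seed(1998)
opt_b <- layout_with_graphopt(toy_net)
all.equal(opt_a, opt_b)
```

Each layout is just a two-column matrix of x and y coordinates, one row per node, so fixing the seed is enough to make a force-directed figure reproducible between runs.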

The above graph would be a lot easier to read if it weren’t for the long labels cluttering everything up. One way to deal with this is to map the opacity (alpha) of the text to the degree of the node. This way only the important and central nodes will have prominent labels. Again, all I have to do is add alpha = deg inside the aes() call of geom_node_text in the code block above. Notice that I also set show.legend to FALSE because I don’t want a legend entry for text opacity in my plot.

```r
ggraph(subnet, layout = "graphopt") +
  ggtitle('2002: "graphopt" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue',
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") +
  geom_node_point(aes(size = deg, colour = streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  # label opacity follows out-degree; suppress the extra alpha legend
  geom_node_text(aes(label = structure, alpha = deg), repel = TRUE, size = 2,
                 show.legend = FALSE) +
  theme_graph()
```

This is just a small sampling of the possibilities for network visualization in R. I have only just begun exploring the igraph and ggraph libraries, but the syntax is fairly intuitive, and the resultant plots are highly customizable. The data-to-viz blog is a pretty incredible resource to look at other network visualizations in R, if you are interested.