Basic network analysis on a directed network using NetworkX

This post is a follow up from my last one, where I’ll be demonstrating some of the basic network analysis capabilities of the Python library NetworkX. I will be using the same data and all my scripts can be found in the same repository. The data we’re using represent flows of food between US counties, which I am limiting to the 95th percentile of largest flows so the network is of a reasonable size for this simple analysis. Given that these are flows (i.e., from one place to another) this is referred to as a directed network, with every edge (link) having a source and a destination. NetworkX allows the analysis and visualization of several types of networks, illustrated below.

Source

Undirected Networks: Edges have no direction, the relationships are always reciprocal or symmetric, for example A is friends with B.
Directed Networks: Edges have direction and relationships don’t have to be reciprocal, for example B sends an email to A.
Weighted Networks: Edges contain some quantitative information indicating the weight of a relationship, for example A owes $6 dollars to B, B owes $13 dollars to C, etc.
Signed Networks: Edges in these networks also carry information about the type of interaction between the two nodes, positive or negative, for example A and B are friends but B and C are enemies.
Multi Networks: Several connections or attributes might exist between two nodes, for example A gave some 6 apples and 3 pears to B, B gave 4 pears and 8 peaches to C, etc.

I will use the rest of this blogpost to demonstrate some simple analysis that can be performed on a directed network like this one. This analysis is only demonstrative of the capabilities – obviously, US counties have several other connections between them in real life and the food network is only used here as a demonstration testbed, not to solve actual connectivity problems.

We’ll be answering questions such as:

  • How connected are counties to others?
  • Are there counties that are bigger ‘exporters’ than ‘importers’ of goods?
  • Can I send something from any one county to any other using only the already established connections in this network?
  • If yes, what is the largest number of connections that I would need? Are there counties with no connections between them?

Node connectivity is typically measured by the node’s degree. In undirected networks, this is simply the number of connections a node has. In directed networks, we can also distinguish between connections where the node is the source and where the node is the destination. To estimate them using NetworkX, we can use G.out_degree() and G.in_degree(), respectively. We can also calculate the average (in or out) degree by dividing by the total number of nodes. In this case they’re both around 3.08, i.e., on average, every node has three connections. Of course this tells us very little about our network, which is why most often people like to see the distribution of degrees. This is easily generated by sorting the degree values and plotting them with matplotlib.

nnodes = G.number_of_nodes()
degrees_in = [d for n, d in G.in_degree()]
degrees_out = [d for n, d in G.out_degree()]
avrg_degree_in = sum(degrees_in) / float(nnodes)
avrg_degree_out = sum(degrees_out) / float(nnodes)

in_values = sorted(set(degrees_in))
in_hist = [degrees_in.count(x) for x in in_values]
out_values = sorted(set(degrees_out))
out_hist = [degrees_out.count(x) for x in out_values]

plt.figure()
plt.plot(in_values,in_hist,'ro-') # in-degree
plt.plot(out_values,out_hist,'bo-') # out-degree
plt.legend(['In-degree','Out-degree'])
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.title('Food distribution network')
plt.close()

This shows that this network is primarily made up of nodes with few connections (degree<5) and few nodes with larger degrees. Distributions like this are common for real-world networks [1, 2], often times they follow an exponential or a log-normal distribution, sometimes a power law distribution (also referred to as “scale free”).

We can also compare the in- and out-degrees of the nodes in this network which would give us information about counties that export to more counties than they import from and vice versa. For example, in the figure below, points below the diagonal line represent counties that import from more places than they export to.

To address the last two prompt questions, we are essentially concerned with network connectness. In directed networks such as this one, we can distinguish between strongly connected and weakly connected notions. A network is weakly connected if there is an undirected path between any pair of nodes (i.e., ignoring edge direction), and strongly connected if there is a directed path between every pair of vertices (i.e., only following edge direction) [3]. The networks below are all weakly but not strongly connected:

NetworkX can help answer these questions for our network, using existent and intuitive functionality. Executing:

nx.is_strongly_connected(G)
nx.is_weakly_connected(G)

will return False for both. This means that using the already established connections and directions, not all nodes can be reached by all other nodes. If we ignore the directions (weak connectedness) this remains the case. This implies that our network is made up of more than one components, i.e., connected subgraphs of our network. For example the undirected graph below has three components:

Strongly connected components in directed graphs also consider the direction of each edge. For example the directed graph below also has three components:

Weakly connected components in directed graphs are identified by ignoring the direction of the edges, so in the above example, the graph has one weakly connected component.

To examine these components for our network we can use nx.strongly_connected_components(G) and nx.weakly_connected_components(G) which would produce lists of all strongly or weakly connected components in the network, respectively, in this case 1156 strongly connected and 111 weakly connected components. In both cases this includes one giant component including most of the network nodes, 1220 in the strongly connected and 2348 in the weakly connected case, and hundreds of small components with fewer than 10 nodes trading between them.

Finally, we can draw these strongly and weakly connected components. In the top row of figure below, I show the largest components identified by each definition and in the bottom row all other components in the network, according to each definition.

References:
[1] Broido, A.D., Clauset, A. Scale-free networks are rare. Nat Commun 10, 1017 (2019). https://doi.org/10.1038/s41467-019-08746-5
[2] http://networksciencebook.com/
[3] Skiena, S. “Strong and Weak Connectivity.” §5.1.2 in Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Reading, MA: Addison-Wesley, pp. 172-174, 1990.

Networks on maps: exploring spatial connections using NetworkX and Basemap

This blogpost is about generating network graphs interlaid on spatial maps. I’ll be using the data provided by this paper (in the supplementary material) which estimates flows of food across US counties. All the code I’m using here can be found here.

The dataset included in erl_14_8_084011_sd_3.csv of the supplementary material lists the tons of food transported per food category, using the standard classification of transported goods (SCTG) food categories included in the study. The last two columns, ori and des, indicate the origin and destination counties of each flow, using FIPS codes.

To draw the network nodes (the counties) in their geographic locations I had to identify lat and lon coordinates for each county using its FIPS code, which can be found here 1.

Now, let’s these connections in Python, using NetworkX and Basemap. The entire script is here, I’ll just be showing the important snippets below. In the paper, they limit the visualization to the largest 5% of food flows, which I can confirm is necessary otherwise the figure would be unreadable. We first load the data using pandas (or other package that reads csv files), identify the 95th percentile and restrict the data to only those 5% largest flows.

data = pd.read_csv('erl_14_8_084011_sd_3.csv')
threshold = np.percentile(data['total'], 95)
data = data.loc[(data['total'] > threshold)]

Using NetworkX, we can directly create a network out of these data. The most important things I need to define are the dataframe column that lists my source nodes, the column that lists my destination nodes and which attribute makes up my network edges (the connections between nodes), in this case the total food flows.

G = nx.from_pandas_edgelist(df=data, source='ori', target='des', edge_attr='total',create_using = nx.DiGraph())

Drawing this network without the spatial information attached (using the standard nx.draw(G)) looks something like below, which does hold some information about the structure of this network, but misses the spatial information we know to be associated with those nodes (counties).

To associate the spatial information with those nodes, we’ll employ Basemap to create a map and use its projection to convert the lat and lon values of each county to x and y positions for our matplotlib figure. When those positions are estimated and stored in the pos dictionary, I then draw the network using the specific positions. I finally also draw country and state lines. You’ll notice that I didn’t draw the entire network but only the edges (nx.draw_networkx_edges) in an effort to replicate the style of the figure from the original paper and to declutter the figure.

plt.figure(figsize = (12,8))
m = Basemap(projection='merc',llcrnrlon=-160,llcrnrlat=15,urcrnrlon=-60,
urcrnrlat=50, lat_ts=0, resolution='l',suppress_ticks=True)
mx, my = m(pos_data['lon'].values, pos_data['lat'].values)
pos = {}
for count, elem in enumerate(pos_data['nodes']):
     pos[elem] = (mx[count], my[count])
nx.draw_networkx_edges(G, pos = pos, edge_color='blue', alpha=0.1, arrows = False)
m.drawcountries(linewidth = 2)
m.drawstates(linewidth = 0.2)
m.drawcoastlines(linewidth=2)
plt.tight_layout()
plt.savefig("map.png", dpi = 300)
plt.show()

The resulting figure is the following, corresponding to Fig. 5B from the original paper.

I was also interested in replicating some of the analysis done in the paper, using NetworkX, to identify the counties most critical to the structure of the food flow network. Using the entire network now (not just the top 5% of flows) we can use NetworkX functions to calculate each node’s degree and between-ness centrality. The degree indicates the number of nodes a node is connected to, between-ness centrality is an indicator of the fraction of shortest paths between two nodes that pass through a specific node. These are network metrics that are unrelated to the physical distance between two counties and can be used (along with several other metrics) to make inferences about the importance and the position of a specific node in a network. We can calculate them in NetworkX as shown below and plot them using simple pyplot commands:

connectivity = list(G.degree())
connectivity_values = [n[1] for n in connectivity]
centrality = nx.betweenness_centrality(G).values()

plt.figure(figsize = (12,8))
plt.plot(centrality, connectivity_values,'ro')
plt.xlabel('Node centrality', fontsize='large')
plt.ylabel('Node connectivity', fontsize='large')
plt.savefig("node_connectivity.png", dpi = 300)
plt.show()

The resulting figure is shown below, matching the equivalent Fig. 6 of the original paper. As the authors point out, there are some counties in this network, those with high connectivity and high centrality, that are most critical to its structure: San Berndardino, CA; Riverside, CA; Los Angeles, CA; Shelby, TN; San Joaquin, CA; Maricopa, AZ; San Diego, CA; Harris, TX; and Fresno, CA.

1 – If you are interested in how this is done, I used the National Counties Gazetteer file from the US Census Bureau and looked up each code to get its lat and lon.