Writing sharable Python code

Most of us in academia do not have formal training in computer science, yet for many of us writing and sharing code is an essential part of our research. While the quality of code produced by many graduate students can be quite impressive, many of us never learned the basic CS principles that allow programs to be sharable and inheritable by other programmers. Last summer I proctored an online class in Python fundamentals. The course was for beginner programmers and, though it covered material simpler than the scripts I write for my PhD, I learned a lot about how to properly document and structure Python functions to make them usable by others. In this post I’ll discuss two important concepts I took away from this course: writing proper function specifications and enforcing preconditions using assert statements.

Function Specifications

One of the most important aspects of writing a quality Python function is proper specification. While the term may sound generic, a specification actually has a very precise definition and implementation for a Python function. In practice, a specification is a docstring, a “string literal” that occurs as the first statement in a function, module, class, or method and is delimited by a pair of triple quotes ("""). Here is an example of a simple function I wrote with a specification:


def radians_to_degrees(theta):
    """
    Returns: theta converted to degrees
    
    Value return has type float
    
    Parameter theta: the angle in radians
    Precondition: theta is a float
    """
    return theta * (180.0/3.14159)

The function specification is everything between the sets of """. When Python sees this docstring at the front of a function definition, it is automatically stored as the function’s “__doc__” attribute. With this specification in place, any user who loads this function can access its __doc__ by typing help(radians_to_degrees), which will print the following to the terminal:

Help on function radians_to_degrees in module __main__:

radians_to_degrees(theta)
    Returns: theta converted to degrees
    
    Value return has type float
    
    Parameter theta: the angle in radians
    Precondition: theta is a float
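
The same specification is also available programmatically through the function's __doc__ attribute, which can be convenient inside a script or notebook:

print(radians_to_degrees.__doc__)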

The help function will print anything in the docstring at the start of a function, but it is good practice for the specification to have the following elements:

  1. A one-line summary of the function at the very beginning. If the function is a “fruitful function” (meaning it returns something), this line should state what it returns. In my example I note that my function returns theta converted to degrees.
  2. A more detailed description of the function. In my example this simply provides the type of the return value.
  3. A description of the function’s parameter(s)
  4. Any preconditions that are necessary for the code to run properly (more on this later)

I should note that officially the Python programming language has a more flexible set of requirements for function specifications, which can be found here, but the attributes above are a good starting point for writing clear specifications.

Properly specifying a Python function will clarify the function’s intended use and provide instructions for how new users can utilize it. It will also help you document your code for formal release if you ever publish it. Google any of your favorite Python functions and you’ll likely be brought to a page that has a fancy-looking version of the function’s specification. These pages can be automatically generated by tools such as Sphinx directly from the function’s definition.

Aside from clarifying and providing instructions for your function, specifications provide a means of creating a chain of accountability for any problems with your code. This chain of accountability is created through precondition statements (element four above). A precondition statement dictates requirements for the function to run properly. Preconditions may specify the type of a parameter (e.g. x is a float) or a more general statement about the parameter (e.g. x < 0).

For large teams of many developers and users of functions, precondition statements create a chain of accountability for code problems. If the preconditions are violated and a code crashes, then it is the responsibility of the user, who did not use the code properly. On the other hand, if the preconditions were met and the code crashes, it is the responsibility of the developer, who did not properly specify the code.

Asserting preconditions to catch errors early

One way to improve the usability of code is to catch precondition violations with statements that throw specific errors. In Python, this may be done using assert statements. Assert statements evaluate a boolean expression and raise an AssertionError with a custom message if the expression evaluates to False. For example, I can modify my radians_to_degrees function with the following assert statement, placed right after the docstring:

def radians_to_degrees(theta):
    """
    Returns: theta converted to degrees
    
    Value return has type float
    
    Parameter theta: the angle in radians
    Precondition: theta is a float
    """
    assert type(theta) == float, repr(theta) + ' is not float'
    return theta * (180.0/3.14159)

A helpful function in my assert statement above is “repr”, which returns a printable representation of a given object (e.g. for a string it will keep the quotation marks around it). Now if I were to pass ‘a’ into my function, I would get the following error:

AssertionError: 'a' is not float

This is trivial for my example, but can save a lot of time when running large and complex functions. Rather than seeing 30 lines of “traceback…”, the user will be immediately brought to the source of the error. In general, you should try to make assert statements as precise as possible to pinpoint the violation.
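
As a sketch of what “precise” can look like, the hypothetical function below (invented purely for illustration) gives each precondition its own assert and its own message, so a violation immediately identifies the offending argument:

def scale_angle(theta, factor):
    """
    Returns: theta scaled by factor and converted to degrees
    
    Value return has type float
    
    Parameter theta: the angle in radians
    Precondition: theta is a float
    
    Parameter factor: the scaling factor
    Precondition: factor is a float and factor > 0
    """
    # one assert per precondition, each with its own message
    assert type(theta) == float, repr(theta) + ' is not float'
    assert type(factor) == float, repr(factor) + ' is not float'
    assert factor > 0, repr(factor) + ' is not positive'
    return theta * factor * (180.0/3.14159)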

It’s important to note that not every conceivable precondition violation requires an assert statement. There is a tradeoff between code efficiency and the number of assert statements (checking a precondition takes time). Determining which preconditions to strictly enforce with assert statements is a balancing act and will be different for each application. It’s important to remember, though, that even if you do not have an assert statement for a precondition, you should still list all preconditions in the function specification to preserve the chain of accountability.

Further resources

This post concerns specifications for Python functions. However, the use of docstrings is also important when coding modules, classes and public methods. Guidance on how these docstrings may be formatted can be found here.

While this post focused on Python, informative function specifications are important for all programming languages. Some resources on documentation in other languages can be found below:

Basic network analysis on a directed network using NetworkX

This post is a follow-up to my last one; here I’ll be demonstrating some of the basic network analysis capabilities of the Python library NetworkX. I will be using the same data, and all my scripts can be found in the same repository. The data we’re using represent flows of food between US counties, which I am limiting to the 95th percentile of largest flows so the network is of a reasonable size for this simple analysis. Given that these are flows (i.e., from one place to another), this is referred to as a directed network, with every edge (link) having a source and a destination. NetworkX allows the analysis and visualization of several types of networks, illustrated below.


Undirected Networks: Edges have no direction, the relationships are always reciprocal or symmetric, for example A is friends with B.
Directed Networks: Edges have direction and relationships don’t have to be reciprocal, for example B sends an email to A.
Weighted Networks: Edges contain some quantitative information indicating the weight of a relationship, for example A owes B $6, B owes C $13, etc.
Signed Networks: Edges in these networks also carry information about the type of interaction between the two nodes, positive or negative, for example A and B are friends but B and C are enemies.
Multi Networks: Several connections or attributes might exist between two nodes, for example A gave 6 apples and 3 pears to B, B gave 4 pears and 8 peaches to C, etc.

I will use the rest of this blogpost to demonstrate some simple analysis that can be performed on a directed network like this one. This analysis is only demonstrative of the capabilities – obviously, US counties have several other connections between them in real life and the food network is only used here as a demonstration testbed, not to solve actual connectivity problems.

We’ll be answering questions such as:

  • How connected are counties to others?
  • Are there counties that are bigger ‘exporters’ than ‘importers’ of goods?
  • Can I send something from any one county to any other using only the already established connections in this network?
  • If yes, what is the largest number of connections that I would need? Are there counties with no connections between them?

Node connectivity is typically measured by the node’s degree. In undirected networks, this is simply the number of connections a node has. In directed networks, we can also distinguish between connections where the node is the source and connections where the node is the destination. To compute these using NetworkX, we can use G.out_degree() and G.in_degree(), respectively. We can also calculate the average (in or out) degree by summing the degrees and dividing by the total number of nodes. In this case they’re both around 3.08, i.e., on average, every node has about three connections. Of course this tells us very little about our network, which is why most often people like to see the distribution of degrees. This is easily generated by sorting the degree values and plotting them with matplotlib.

import matplotlib.pyplot as plt

# G is the directed food-flow graph built in the previous post
nnodes = G.number_of_nodes()
degrees_in = [d for n, d in G.in_degree()]
degrees_out = [d for n, d in G.out_degree()]
avrg_degree_in = sum(degrees_in) / float(nnodes)
avrg_degree_out = sum(degrees_out) / float(nnodes)

# count how many nodes have each in- and out-degree value
in_values = sorted(set(degrees_in))
in_hist = [degrees_in.count(x) for x in in_values]
out_values = sorted(set(degrees_out))
out_hist = [degrees_out.count(x) for x in out_values]

plt.figure()
plt.plot(in_values, in_hist, 'ro-')    # in-degree
plt.plot(out_values, out_hist, 'bo-')  # out-degree
plt.legend(['In-degree', 'Out-degree'])
plt.xlabel('Degree')
plt.ylabel('Number of nodes')
plt.title('Food distribution network')
plt.show()

This shows that the network is primarily made up of nodes with few connections (degree < 5) and a few nodes with larger degrees. Distributions like this are common for real-world networks [1, 2]; they often follow an exponential or log-normal distribution, and sometimes a power-law distribution (such networks are also referred to as “scale free”).

We can also compare the in- and out-degrees of the nodes in this network, which tells us which counties export to more counties than they import from and vice versa. For example, in the figure below, points below the diagonal line represent counties that import from more places than they export to.
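
A minimal sketch of how such a comparison plot could be generated, again assuming G is the directed food-flow graph:

import matplotlib.pyplot as plt

# pair each node's in-degree with its out-degree
in_deg = dict(G.in_degree())
out_deg = dict(G.out_degree())
x = [in_deg[n] for n in G.nodes()]
y = [out_deg[n] for n in G.nodes()]

plt.figure()
plt.scatter(x, y, alpha=0.5)
plt.plot([0, max(x)], [0, max(x)], 'k--')  # diagonal: in-degree equals out-degree
plt.xlabel('In-degree')
plt.ylabel('Out-degree')
plt.show()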

To address the last two prompt questions, we are essentially concerned with network connectedness. In directed networks such as this one, we can distinguish between two notions: strong and weak connectivity. A network is weakly connected if there is an undirected path between any pair of nodes (i.e., ignoring edge direction), and strongly connected if there is a directed path between every pair of vertices (i.e., only following edge direction) [3]. The networks below are all weakly but not strongly connected:

NetworkX can help answer these questions for our network using existing and intuitive functionality. Executing:

nx.is_strongly_connected(G)
nx.is_weakly_connected(G)

will return False for both. This means that using the already established connections and directions, not all nodes can be reached from all other nodes, and this remains the case even if we ignore the directions (weak connectedness). This implies that our network is made up of more than one component, i.e., connected subgraphs of our network. For example, the undirected graph below has three components:

Strongly connected components in directed graphs also consider the direction of each edge. For example the directed graph below also has three components:

Weakly connected components in directed graphs are identified by ignoring the direction of the edges, so in the above example, the graph has one weakly connected component.
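
As a concrete check of the same idea, a tiny directed path graph can be built and tested: every node forms its own strongly connected component, but ignoring edge direction the whole graph is a single weakly connected component. A sketch, assuming NetworkX is imported as nx:

G_small = nx.DiGraph()
G_small.add_edges_from([('A', 'B'), ('B', 'C')])        # A -> B -> C
print(list(nx.strongly_connected_components(G_small)))  # three singleton components
print(list(nx.weakly_connected_components(G_small)))    # one component: {'A', 'B', 'C'}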

To examine these components for our network we can use nx.strongly_connected_components(G) and nx.weakly_connected_components(G), which generate all of the strongly or weakly connected components in the network (as sets of nodes), respectively; in this case there are 1156 strongly connected and 111 weakly connected components. In both cases this includes one giant component containing most of the network nodes (1220 in the strongly connected and 2348 in the weakly connected case) and hundreds of small components with fewer than 10 nodes trading between them.
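
A minimal sketch of this component bookkeeping, assuming as before that G is the directed food-flow graph and NetworkX is imported as nx:

scc = list(nx.strongly_connected_components(G))
wcc = list(nx.weakly_connected_components(G))
print(len(scc), 'strongly connected components')
print(len(wcc), 'weakly connected components')

# size of the giant component under each definition
print(len(max(scc, key=len)), 'nodes in the largest strongly connected component')
print(len(max(wcc, key=len)), 'nodes in the largest weakly connected component')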

Finally, we can draw these strongly and weakly connected components. In the top row of the figure below, I show the largest component identified under each definition, and in the bottom row all of the other components in the network, according to each definition.
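
A sketch of one way the largest strongly connected component could be drawn (the weakly connected case is analogous), using the same G and imports as above:

largest_scc = max(nx.strongly_connected_components(G), key=len)
sub = G.subgraph(largest_scc)   # subgraph induced by the largest component

plt.figure()
nx.draw_networkx(sub, pos=nx.spring_layout(sub), node_size=10,
                 with_labels=False, arrows=False)
plt.axis('off')
plt.show()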

References:
[1] Broido, A.D., Clauset, A. Scale-free networks are rare. Nat Commun 10, 1017 (2019). https://doi.org/10.1038/s41467-019-08746-5
[2] Barabási, A.-L. Network Science. http://networksciencebook.com/
[3] Skiena, S. “Strong and Weak Connectivity.” §5.1.2 in Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Reading, MA: Addison-Wesley, pp. 172-174, 1990.

ggplot (Part 4) – Animated Geospatial Maps

Most of the time, we need to deal with complex systems that have several components, each with different properties and behavior. Usually, these properties vary with time and space, and understanding their spatiotemporal dynamics plays a key role in our understanding of the system performance and its vulnerabilities. Therefore, aggregation in time and space would hide variations that might be important in our analysis and understanding of the system behavior. In this blog post, I am going to show how we can add bar plots onto a map at different locations for better visualization of variables. Some examples for when this type of visualization is useful are irrigation districts that have different water rights or cropping patterns that affect their crop production and water requirements in dry or wet years; another example would be the sensitivity indices of specific model parameters that are clustered based on their location and varying based on the wetness and dryness of the system. To demonstrate this, I am going to create an animated graph that shows annual crop yield variations in different counties in the State of Washington. You can download the Washington counties’ data layer (WA_County_Boundaries-shp.zip) and NASS historical yield data for corn, winter wheat, and spring wheat (NASS.txt) from here.

library(rgdal)  
shp_counties <- readOGR(dsn = file.path("(your directory)/WA_County_Boundaries-shp/WA_County_Boundaries.shp"))

“shp_counties” is a SpatialPolygonsDataFrame object and has a different format than a regular dataframe. It usually has a few slots that contain different types of information. For example, “data” holds the non-geographic attributes. We can explore the information for each polygon with:

head(shp_counties@data)
#The "JURISDIC_2" and "JURISDIC_3" columns both contain the names of counties.

To visualize it with ggplot, it first has to be converted into a dataframe. Then, we can use it with the geom_polygon() function:

library(ggplot2)
counties <- fortify(shp_counties, region="JURISDIC_2")
head(counties)
#long     lat order  hole piece    id   group
#1 -13131351 5984710     1 FALSE     1 Adams Adams.1
#2 -13131340 5983102     2 FALSE     1 Adams Adams.1
ggplot() +  geom_polygon(data = counties, aes(x = long, y = lat, group = group), colour = "dark red", fill = NA)

Now, I am going to extract the center of each polygon so that I can later add the bar plots to these coordinates:

library(rgeos)
counties_centroids<- as.data.frame(gCentroid(shp_counties, byid=T))

We are going to extract just a few years of data from the yield dataset to show on the map:

library(data.table)
yield<- fread("(your directory)/NASS.txt")
#   Year  Ag_District     Ag_District_Code  Data_Item  Value  County  JURISDIC_5 Group
#1: 1947 EAST CENTRAL               50     CORN, GRAIN  41.0   Adams    53001   CORN
#2: 1948 EAST CENTRAL               50     CORN, GRAIN  40.0   Adams    53001   CORN
years<- as.data.frame(unique(yield$Year))
years<- as.data.frame(years[34:42,])

For the final map, I am going to need a common legend for all of the bar plots in all of the counties and years in which I am interested. So, I need to know all of the categories that are available:

unique(yield$Group)
unique(yield$Data_Item)

Based on the “Group” column, we know that there are three main groups of crops: corn, winter wheat, and spring wheat. Based on the “Data_Item” column, we know that there are three different types of winter and spring wheat (non-irrigated, non-irrigated with continuous cropping (CC), and non-irrigated with summer fallow (SF)). Note that we do not have all of these crop types for every county and year, and the common legend should be created from a location where all the crop types are available, so that it is applicable to all of the counties and years that I am eventually going to plot. For this reason, I am going to subset the dataset and create a bar plot for one county and year in which all of the crop types are available:

sample_yield<- subset(yield, Year=="1981" & County=="Adams")

I want to use custom colors so that each crop Group has a different color and each Data_Item corresponding to a Group shares the same Group color theme, gradually changing from darker to brighter. First, I create three color functions, one for each of the three Groups in my dataset.

colfunc1 <- colorRampPalette(c("darkRED"))
colfunc2 <- colorRampPalette(c("darkBLUE", "lightBLUE"))
colfunc3 <- colorRampPalette(c("darkGREEN", "lightgreen"))

Now I am going to use these functions to create a color code for each Group and save it in “colors”:

colors <- c(colfunc1(nrow(subset(sample_yield, sample_yield$Group=="CORN"))),
            colfunc2(nrow(subset(sample_yield, sample_yield$Group=="SW"))),
            colfunc3(nrow(subset(sample_yield, sample_yield$Group=="WW"))))

Next, in the template bar plot, I can use the customized “colors” in the scale_fill_manual() function:

ggplot(sample_yield,aes(x=Group,y=Value,fill=factor(Data_Item)))+
  scale_fill_manual(values=colors)+
  geom_bar(stat='identity', color = "black")

From this plot, we just need to extract the legend. Also, we need to change its font size since we are going to add it to another plot. You should try different sizes to find out which one is more appropriate for your dataset/figure. I am saving this plot in “tmp” while I remove the legend title and change the font size. Then, I extract the legend section by using the get_legend() function and convert it to a ggplot by using the as_ggplot() function.

tmp <- ggplot(sample_yield,aes(x=Group,y=Value,fill=factor(Data_Item)))+
  scale_fill_manual(values=colors)+
  geom_bar(stat='identity', color = "black")+theme(legend.title = element_blank(),legend.text =element_text(size=19,face="bold",colour="black") )
library(cowplot)
legend<- get_legend(tmp)
library(ggpubr)
tmp_legend<- as_ggplot(legend)

Now, I am going to save the counties map with the appropriate font size and title and add the legend extracted from the yield data to it with the geom_subview() function. You need to adjust the coordinates at which the legend is shown on the map based on your data.

tmp_map <- ggplot(counties)+
  geom_polygon(aes(long, lat, group=group), fill="white")+
  geom_path(color="black", mapping=aes(long, lat, group=group), size=1.5)+
  theme(axis.text.y=element_text(size=17,face="bold",colour="black"),
        axis.text.x=element_text(size=17,face="bold",colour="black"),
        axis.title.y =element_text(size=17,face="bold",colour="black"),
        axis.title.x =element_text(size=17,face="bold",colour="black"),
        plot.title =element_text(size=25,face="bold",colour="black"))+
  labs(title = "NASS Yield Data (1980-1995)")
library(ggimage)
tmp_map_final<- tmp_map+ geom_subview(x=-13830000, y=5730000, subview=tmp_legend)

In the section below, I am going to show how we can loop through all of the years and then all of the counties in the county map, use the center coordinate of each county (polygon), plot the corresponding bar plot, and print it at the center of each polygon. We can use the ggplotGrob() function to extract different pieces of the bar plot created with ggplot. With this function, we can treat each part of the bar plot (background, axis, labels, main plot panel, etc.) as a graphical object and move it to the coordinates we are interested in. For example, if we just want to use the main panel and not any other components, we can extract the panel and adjust the other components as we wish to present in the final graph.

In this example, I am adding all of the bar plots for all counties in one year as graphical objects in a list, “barplot_list”. Then, by using the annotation_custom() function, I add each item in “barplot_list” at the coordinates of the center of the corresponding polygon (county) that I already extracted. Note that the center coordinates and the plots in “barplot_list” are in the same order.

At the end, I add the base map with the customized legend (“tmp_map_final”) together with the list of all bar plots at their customized locations (“barplots_list”). Then, I add them all to a list (“all_list”) and repeat this process for every year. The last step is to save this list with saveGIF() to create an animated gif. Note that we can use the same procedure but replace the bar plots with other types of plots, such as pie charts.

counties_list <- as.data.frame(unique(counties$id))    # list of counties
all_list <- list()
for (y in 1:nrow(years)){                               # loop through all of the years
  nass_oneyear <- subset(yield, yield$Year==years[y,])  # extract one year
  ## create a bar plot for each county
  barplot_list <- lapply(1:length(shp_counties$JURISDIC_2), function(i) {
    # extract one county
    nass_oneyear_onecounty <- subset(nass_oneyear, nass_oneyear$County==counties_list[i,])
    # For each county and year the number of crop types might be different, so the color
    # for each sub-dataset is customized using the manual color ramp previously defined
    # for each item.
    nass_oneyear_onecounty$itemcolor <- "NA"
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="CORN, GRAIN"] <- colors[1]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED"] <- colors[2]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED, CC"] <- colors[3]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED, SF"] <- colors[4]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED"] <- colors[5]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED, CC"] <- colors[6]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED, SF"] <- colors[7]
    plots_comp <- ggplotGrob(
      ggplot(nass_oneyear_onecounty, aes(x=Group, y=Value, group=(itemcolor), fill=factor(Data_Item)))+
        scale_fill_manual(values=nass_oneyear_onecounty$itemcolor)+
        geom_bar(stat='identity', color = "black") +
        labs(x = NULL, y = NULL) +
        theme(legend.position = "none", rect = element_blank(),
              axis.title.x = element_blank(),
              axis.title.y = element_blank(),
              axis.text.x = element_blank(),
              axis.ticks = element_blank(),
              axis.text.y = element_text(size=14, face="bold", colour="black")) +
        coord_cartesian(expand=FALSE)
    )})
  # xmin, ymin, xmax and ymax adjust the horizontal and vertical placement of each bar plot
  barplots_list <- lapply(1:length(shp_counties), function(i)
    annotation_custom(barplot_list[[i]],
                      xmin = counties_centroids[i,1] - 28000, ymin = counties_centroids[i,2] - 28000,
                      xmax = counties_centroids[i,1] + 28000, ymax = counties_centroids[i,2] + 28000))
  all_list[[y]] <- list(tmp_map_final + barplots_list)
}

library(animation)
saveGIF(
  {lapply(all_list, print)}
  , "(your directory)/final.gif", interval = 2, ani.width = 1600, ani.height = 1200)

If we want to have labels for our bar plots (instead of only having yield values on the y-axis), we may want to show the yield value corresponding to each item in a group. In this case I can use the command lines below within the loop, before creating the graphical object of the ggplot (i.e., before plots_comp <- ggplotGrob(….)), to add a label column that shows the cumulative yield value in each group.

library(dplyr)   # for the pipe, group_by() and mutate()
nass_oneyear_onecounty <- nass_oneyear_onecounty %>%
  group_by(Group) %>%
  mutate(label_y = cumsum(Value) - 10)  # subtract 10 from the cumulative values just to print the labels inside of the bar plot sections

Or we can just have one label that shows the total yield in each group:

nass_oneyear_onecounty <- nass_oneyear_onecounty %>%
  group_by(Group) %>%
  mutate(total = sum(Value))

Then we can add these labels to the bar plots by adding the geom_text() function in the ggplot section within ggplotGrob(), and specifying the column of interest: 

geom_text(aes(y = label_y, label = round(Value)),colour = "white",size=3,fontface='bold')

Instead of manually adjusting the position of the label (as in the first example), “vjust” can be added to geom_text() to modify the text alignment:

geom_text(aes(y = total, label = round(total)),colour = "black",size=3,fontface='bold',vjust = -0.35)