Taylor Diagram

The evaluation of model performance is a widely discussed and yet controversial topic in hydrology, atmospheric science, and environmental studies. Generally, it is not straightforward to quantify the accuracy of tools that simulate complex systems. One reason is that these systems usually comprise various sub-systems. For example, a model of a complex river system might simulate regulated streamflow, water allocation, dam operations, and more. This means there are multiple goodness-of-fit objectives that cannot all be optimized simultaneously, so there will always be tradeoffs in the accuracy of the system representation.

The other reason is that these complex systems are often state-dependent, and their behavior changes non-linearly as the system evolves. Finally, there are variables in the system that have several properties, and it is usually impossible to improve all of those properties at the same time (Gupta et al., 1998). Take streamflow as an example. In calculating model accuracy, we can focus on the daily fluctuations of streamflow or look into the changes at weekly, monthly, and annual scales. We can also concentrate on the seasonality of streamflow, extreme low and high flows, and the persistence of these extreme events. Therefore, our results are not fully defensible if we focus on only one performance metric and make inferences that ignore the many other components of the system. In this blog post, I explain a visualization technique that allows us to draw three or more metrics of the system on the same plot: the Taylor Diagram. The Taylor Diagram is not really a solution to the problem described above, but it does provide a systematic and mathematically sound way of presenting a few popular goodness-of-fit measures together. The original version of the Taylor Diagram includes three performance metrics: i) the standard deviation of the simulated vs. observed series; ii) the correlation coefficient between observed and simulated; and iii) the centered root mean square error (CRMSE). However, there are other versions of the Taylor Diagram that include more than these three metrics (e.g., here). The Taylor Diagram was developed by Karl E. Taylor (original paper) and has been used frequently by meteorologists and atmospheric scientists. In this blog post, I focus on streamflow.

Underlying Mathematical Relationships

As mentioned above, the Taylor Diagram draws the following three statistical metrics on a single plot: standard deviation, CRMSE, and correlation. The equation below relates these three:

Equation – 1

$$\mathrm{CRMSE}^2 = \sigma_o^2 + \sigma_s^2 - 2\,\sigma_o\,\sigma_s\,R$$

In this equation, $\sigma_o$ and $\sigma_s$ are the standard deviations of the observed and simulated time series, respectively, and $R$ is the correlation coefficient between the two series.

This relationship can be easily derived using the definition of CRMSE, which is:

Equation – 2

$$\mathrm{CRMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\Big[(s_i - \bar{s}) - (o_i - \bar{o})\Big]^2}$$

In this equation, $s_i$ and $o_i$ are the simulated and observed values at time step $i$, $\bar{s}$ and $\bar{o}$ are their respective means, and $N$ is the number of time steps.

We can expand this equation to the following:

Equation – 3

$$\mathrm{CRMSE}^2 = \frac{1}{N}\sum_{i=1}^{N}(s_i - \bar{s})^2 + \frac{1}{N}\sum_{i=1}^{N}(o_i - \bar{o})^2 - \frac{2}{N}\sum_{i=1}^{N}(s_i - \bar{s})(o_i - \bar{o})$$

The first two terms of this equation are the variances (squared standard deviations) of the simulated and observed time series. To understand the third term, we need to recall the mathematical definition of the correlation coefficient.

Equation – 4

$$R = \frac{\frac{1}{N}\sum_{i=1}^{N}(s_i - \bar{s})(o_i - \bar{o})}{\sigma_s\,\sigma_o}$$

If we multiply both sides of Equation 4 by the standard deviations of the observed and simulated series, we see that the result is exactly the quantity that appears in the third term of Equation 3; substituting it back yields Equation 1. The Taylor Diagram uses polar coordinates to visualize all of these components. Readers can refer to the original paper for more discussion and the trigonometric reasoning behind this form of plot.
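To make this relationship concrete, here is a quick numerical check (the package example later in this post is in R; this is just a minimal Python/NumPy sketch with two made-up arrays standing in for observed and simulated streamflow):

import numpy as np

# Hypothetical observed and simulated series; any two equal-length arrays work
obs = np.array([120.0, 150.0, 90.0, 200.0, 170.0, 110.0])
sim = np.array([130.0, 140.0, 100.0, 210.0, 150.0, 120.0])

sigma_o = obs.std()                # standard deviation of observed (1/N normalization)
sigma_s = sim.std()                # standard deviation of simulated
R = np.corrcoef(obs, sim)[0, 1]    # correlation coefficient

# Centered RMSE: RMSE after removing each series' mean (Equation 2)
crmse = np.sqrt(np.mean(((sim - sim.mean()) - (obs - obs.mean())) ** 2))

# Equation 1: CRMSE^2 = sigma_o^2 + sigma_s^2 - 2 * sigma_o * sigma_s * R
lhs = crmse ** 2
rhs = sigma_o ** 2 + sigma_s ** 2 - 2 * sigma_o * sigma_s * R
print(lhs, rhs)   # the two values agree up to floating-point error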

Code Example

In this blog post, I am presenting a simple R package that can be used to create Taylor Diagrams. In this example, I use the same streamflow datasets that I explained in the previous blog post. First, you need to download the time series from GitHub and use the following commands to read these two time series into R.

Observed_Arrow<-read.table("~/Taylor Diagram/Arrow_observed.txt", header = T)
Simulated_Arrow<-read.table("~/Taylor Diagram/Arrow_simulated.txt", header = T)

In this post, I use the “openair” R package, which was originally developed for atmospheric chemistry data analysis. In addition to the Taylor Diagram, the package provides many helpful visualization options that can be accessed from here. Note that I was able to find at least one more R package that can be used to create Taylor Diagrams (i.e., plotrix). There are also Python and MATLAB libraries that can be used to create Taylor Diagrams.

You can use the following to install the “openair” package and activate the library.

install.packages("openair")
library(openair)

The following commands subset the two series so they cover the same period, combine them into a single data frame, and create a Taylor Diagram.

# Combine observed and simulated flows into one data frame
# (the row indices below are specific to these two example files)
combined_dataset <- cbind(Observed_Arrow[18263:length(Observed_Arrow[,1]), ],
                          Arrow_sim = Simulated_Arrow[1:10592, 4])

TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim")

Interpretation of the Taylor Diagram

The Taylor Diagram marks the baseline observed point, where the correlation is 1 and the CRMSE is 0 (shown in purple). If the simulation point is close to the observed point, the two series are similar in terms of standard deviation, their correlation is high, and their CRMSE is close to zero. There is also a black dashed curve that represents the standard deviation of the observed time series; if the red point lies beyond it (farther from the origin), the simulated dataset has higher variability than the observed one. The other helpful information we gain from the Taylor Diagram is the correlation between observed and simulated values: higher correlation indicates closer agreement, and the correlation decreases as a point rotates away from the horizontal axis toward larger azimuthal angles. The centered root mean square error also reflects the quality of the simulation, although it puts more weight on large errors. The golden contour lines on the polar plot show the value of CRMSE. In this figure, the red point has a higher standard deviation than the observed series, the correlation is around 0.9, and the CRMSE is close to 20,000.

The package also allows us to look at the performance of the model during different months or years, and we can compare different simulation scenarios with the observed data. Here, I use the function to draw a point for each month. The “group” argument can be used to do that.

TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim", group = "Month")

The function also provides a simple way to break down the data into different plots for other attributes. For example, I created four panels for four different annual periods:

TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim", group = "Month", type = "Year")

Networks on maps: exploring spatial connections using NetworkX and Basemap

This blog post is about generating network graphs overlaid on spatial maps. I’ll be using the data provided by this paper (in the supplementary material), which estimates flows of food across US counties. All the code I’m using here can be found here.

The dataset included in erl_14_8_084011_sd_3.csv of the supplementary material lists the tons of food transported per food category, using the Standard Classification of Transported Goods (SCTG) food categories included in the study. The last two columns, ori and des, indicate the origin and destination counties of each flow, using FIPS codes.

To draw the network nodes (the counties) in their geographic locations I had to identify lat and lon coordinates for each county using its FIPS code, which can be found here 1.
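For reference, the lookup I used boils down to something like the following sketch; the Gazetteer file name and column names here are assumptions, so check them against the file you actually download. The goal is a pos_data table with nodes (FIPS code), lat, and lon columns, which is used further down.

import pandas as pd

# Hypothetical file/column names for the Census Gazetteer counties file (tab-delimited)
gaz = pd.read_csv('2019_Gaz_counties_national.txt', sep='\t')
gaz.columns = gaz.columns.str.strip()   # some Gazetteer headers carry trailing whitespace

# Keep the FIPS code and the internal-point coordinates, renamed for the rest of the script
pos_data = gaz[['GEOID', 'INTPTLAT', 'INTPTLONG']].rename(
    columns={'GEOID': 'nodes', 'INTPTLAT': 'lat', 'INTPTLONG': 'lon'})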

Now, let’s draw these connections in Python, using NetworkX and Basemap. The entire script is here; I’ll just be showing the important snippets below. In the paper, the authors limit the visualization to the largest 5% of food flows, which I can confirm is necessary, as otherwise the figure would be unreadable. We first load the data using pandas (or another package that reads csv files), identify the 95th percentile, and restrict the data to only those 5% largest flows.

import numpy as np
import pandas as pd

data = pd.read_csv('erl_14_8_084011_sd_3.csv')
threshold = np.percentile(data['total'], 95)   # 95th percentile of total flow
data = data.loc[data['total'] > threshold]     # keep only the largest 5% of flows

Using NetworkX, we can directly create a network from these data. The most important things to define are the dataframe column that lists the source nodes, the column that lists the destination nodes, and the column that provides the edge attribute (the connections between nodes), in this case the total food flows.

G = nx.from_pandas_edgelist(df=data, source='ori', target='des',
                            edge_attr='total', create_using=nx.DiGraph())

Drawing this network without the spatial information attached (using the standard nx.draw(G)) looks something like below, which does hold some information about the structure of this network, but misses the spatial information we know to be associated with those nodes (counties).

To associate the spatial information with those nodes, we’ll employ Basemap to create a map and use its projection to convert the lat and lon values of each county to x and y positions for our matplotlib figure. When those positions are computed and stored in the pos dictionary, I then draw the network using those specific positions. Finally, I also draw the country, state, and coastline boundaries. You’ll notice that I didn’t draw the entire network but only the edges (nx.draw_networkx_edges), in an effort to replicate the style of the figure from the original paper and to declutter the figure.

import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(12, 8))
# Mercator projection roughly covering the contiguous US
m = Basemap(projection='merc', llcrnrlon=-160, llcrnrlat=15, urcrnrlon=-60,
            urcrnrlat=50, lat_ts=0, resolution='l', suppress_ticks=True)
# Convert each county's lon/lat to map (x, y) coordinates
mx, my = m(pos_data['lon'].values, pos_data['lat'].values)
pos = {}
for count, elem in enumerate(pos_data['nodes']):
    pos[elem] = (mx[count], my[count])
# Draw only the edges, placed at the positions computed above
nx.draw_networkx_edges(G, pos=pos, edge_color='blue', alpha=0.1, arrows=False)
m.drawcountries(linewidth=2)
m.drawstates(linewidth=0.2)
m.drawcoastlines(linewidth=2)
plt.tight_layout()
plt.savefig("map.png", dpi=300)
plt.show()

The resulting figure is the following, corresponding to Fig. 5B from the original paper.

I was also interested in replicating some of the analysis done in the paper, using NetworkX, to identify the counties most critical to the structure of the food flow network. Using the entire network now (not just the top 5% of flows), we can use NetworkX functions to calculate each node’s degree and betweenness centrality. The degree indicates the number of nodes a node is connected to; betweenness centrality indicates the fraction of shortest paths between pairs of nodes that pass through a specific node. These are network metrics that are unrelated to the physical distance between two counties and can be used (along with several other metrics) to make inferences about the importance and position of a specific node in a network. We can calculate them in NetworkX as shown below and plot them using simple pyplot commands:

# Node degree (number of connections) and betweenness centrality for every county
connectivity = list(G.degree())
connectivity_values = [n[1] for n in connectivity]
centrality = list(nx.betweenness_centrality(G).values())

plt.figure(figsize=(12, 8))
plt.plot(centrality, connectivity_values, 'ro')
plt.xlabel('Node centrality', fontsize='large')
plt.ylabel('Node connectivity', fontsize='large')
plt.savefig("node_connectivity.png", dpi=300)
plt.show()

The resulting figure is shown below, matching the equivalent Fig. 6 of the original paper. As the authors point out, there are some counties in this network, those with high connectivity and high centrality, that are most critical to its structure: San Bernardino, CA; Riverside, CA; Los Angeles, CA; Shelby, TN; San Joaquin, CA; Maricopa, AZ; San Diego, CA; Harris, TX; and Fresno, CA.

1 – If you are interested in how this is done, I used the National Counties Gazetteer file from the US Census Bureau and looked up each code to get its lat and lon.

Making the Most of Predictor Data for Machine Learning Applications

This blog post demonstrates some machine learning tips that I have picked up during my PhD so far and that I hope can be helpful to others, especially if you are working on water resources applications, in basins with complex hydrology, or with limited data.

Addressing Correlation and Multicollinearity

The first step in any machine learning problem is often to identify the information that will be most relevant to predicting your output. If you are in the rare situation where you have a large number of inputs available, the best course of action is not to assume that every single one of those inputs improves the predictability of your model. If your inputs are highly collinear, then ultimately you are adding dimensionality to your model without receiving any predictive power in return. A couple of suggestions:

  1. Assess the correlation of your inputs with your outputs in the training set. It’s a good idea to note which variables are most correlated with your output, because those will likely be important for prediction.
  2. As you add inputs, check for multicollinearity among the selected input variables. You can calculate a correlation matrix that shows the correlation among the variables and develop a criterion for dropping highly correlated ones. You can also calculate the Variance Inflation Factor (VIF), which quantifies how much the variance of an estimated regression coefficient is inflated because that input is correlated with the other inputs (a short sketch of both checks follows this list).
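As a minimal sketch of these two checks, the Python snippet below assumes a pandas DataFrame X of candidate inputs and a Series y of training outputs (both hypothetical) and uses statsmodels for the VIF calculation:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def screen_inputs(X, y, corr_threshold=0.9):
    # Correlation of each candidate input with the output
    input_output_corr = X.corrwith(y).abs().sort_values(ascending=False)

    # Correlation matrix among the inputs themselves
    input_corr = X.corr().abs()

    # VIF for each input: how much collinearity with the other inputs
    # inflates the variance of its estimated regression coefficient
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns)

    # Flag input pairs whose mutual correlation exceeds the chosen threshold
    flagged = [(a, b) for a in X.columns for b in X.columns
               if a < b and input_corr.loc[a, b] > corr_threshold]
    return input_output_corr, vif, flagged

In practice, you would lean toward keeping inputs that correlate well with the output while dropping one member of each flagged pair or inputs with very large VIF values.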

There are many algorithms available for input selection, including the Iterative Input Selection (IIS) algorithm, which uses a regression-tree approach to iteratively select candidate inputs based on correlation and returns the most predictive subset of inputs.

Incorporation of Time as an Input Variable

Water resources applications tend to be characterized by some periodicity due to seasonality, so it may be obvious that including the day or month of the year as inputs can provide information about the cyclic nature of the data. What may be less obvious is making sure that the machine learning model understands that cyclic nature. If raw day or month values (1-365 or 1-12) are used as predictor variables, this gives the impression that day 1 is vastly different from day 365, or that January and December are the least related months, which is not actually the case and can send the wrong message to the algorithm. The trick is to create two new features for each of the day and month time series by taking the sine and cosine of each value (scaled onto a full circle). Both sine and cosine are needed to get a unique mapping for each day or month. If we plot these features, we can see that the cyclic nature is preserved, compared to the uneven jumps in the raw values.
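A minimal sketch of this encoding (the DataFrame and column names below are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical daily index for one year, just to illustrate the encoding
dates = pd.date_range('2020-01-01', '2020-12-31', freq='D')
df = pd.DataFrame({'day_of_year': dates.dayofyear, 'month': dates.month})

# Map day-of-year and month onto a circle so that the ends of the cycle meet
df['doy_sin'] = np.sin(2 * np.pi * df['day_of_year'] / 366)
df['doy_cos'] = np.cos(2 * np.pi * df['day_of_year'] / 366)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

With this encoding, December 31 and January 1 (or December and January) map to nearly identical (sin, cos) pairs, which is exactly the adjacency we want the model to see.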

Aggregation of Variables

When working with rainfall-runoff prediction, you might see peaks in runoff that are lagged relative to the precipitation event of interest. This can create difficulties for the machine learning model, which could see a zero value for precipitation on a day with increased outflow. The lag can be due to residence time and different types of storage in the basin. Using a memory-based model such as an LSTM and/or aggregating past precipitation over various timescales (a short sketch follows the list below) can help to capture these effects:

  1. Immediate Effects: Aggregating precipitation over the last few hours to a few days will likely capture runoff responses and the first few ‘buckets’ of storage (e.g., canopy interception, initial abstraction, etc.).
  2. Medium-Term Effects: Aggregating over a few days to weeks might capture the bulk of the soil storage layers (and the movement between them and the stream (interflow) or deep groundwater).
  3. Long-Term Effects: Ideally, we would have snow information such as SWE or snowpack depth where it is relevant, but in my test cases this information was not available. Aggregating precipitation over the past water year is a proxy that can be used to capture snow storage.
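One simple way to build these aggregated predictors is with pandas rolling sums over a daily precipitation series; the window lengths, column names, and October 1 water-year start below are illustrative assumptions, not prescriptions:

import numpy as np
import pandas as pd

# Hypothetical daily precipitation series; replace with your own data
dates = pd.date_range('2015-10-01', '2020-09-30', freq='D')
precip = pd.Series(np.random.gamma(0.5, 4.0, len(dates)), index=dates, name='precip')

features = pd.DataFrame({'precip': precip})
features['precip_3d'] = precip.rolling(3).sum()      # immediate effects
features['precip_30d'] = precip.rolling(30).sum()    # medium-term effects
# Long-term effects: cumulative precipitation since the start of the water year
# (here assumed to begin on October 1)
water_year = np.where(dates.month >= 10, dates.year + 1, dates.year)
features['precip_wy'] = precip.groupby(water_year).cumsum()
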
Interaction Variables

Oftentimes, classifiers consider variables to be independent, but the interaction of variables can explain specific events that would not be captured by the variables alone. Think of an instance in which you are trying to predict crop yield based on water and fertilizer input variables. If you have no water but you have fertilizer, the yield will be zero. If you have water and no fertilizer, the yield will be some value greater than zero. However, if you have both water and fertilizer, the yield will be greater than what each variable can produce alone. This is a demonstration of interaction effects, and the way to encode it is to add another input that is the amount of water multiplied by the amount of fertilizer. With this term, the model can attempt to capture these secondary effects. In a snow-driven watershed with limited information, interaction terms can be used to help predict runoff as follows (a short sketch follows the list):

  1. For a given amount of precipitation, the flow will change based on temperature. For instance, if a watershed receives 100 mm of precipitation at 1°C, there will be lots of runoff. If a watershed receives 100 mm of precipitation at -10°C, there will be very little runoff because that precipitation will be stored as snow. However, 100 mm of precipitation at 10°C will equate to lots of runoff plus lots of snowmelt. (interaction term = precipitation × temperature)
  2. For a given amount of precipitation and temperature, the flow will change based on the time of year. For instance, 100 mm of precipitation at 10°C in October will lead to lots of runoff. However, that same amount of precipitation at 10°C in February can create flooding events, not only from the precipitation itself but also from the associated rain-on-snow melt. (interaction term = precipitation × temperature × sin(month) × cos(month))
  3. Even independent of precipitation, the response of the watershed to various temperatures will be affected by the month of the water year or day of year. A temperature of 30°C in April will probably lead to a flood event just from the snowmelt it would trigger. That same temperature in September will likely lead to no change in runoff. (interaction term = temperature × sin(month) × cos(month))
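Building on the cyclic month features from earlier, here is a minimal sketch of these three interaction terms; the input DataFrame below is made up for illustration, so replace it with your own predictors:

import numpy as np
import pandas as pd

# Hypothetical daily inputs; replace with your own predictor DataFrame
dates = pd.date_range('2020-01-01', '2020-12-31', freq='D')
features = pd.DataFrame({
    'precip': np.random.gamma(0.5, 4.0, len(dates)),                          # mm/day
    'temp': 10 + 15 * np.sin(2 * np.pi * (dates.dayofyear - 100) / 366),      # deg C
    'month_sin': np.sin(2 * np.pi * dates.month / 12),
    'month_cos': np.cos(2 * np.pi * dates.month / 12),
}, index=dates)

# Interaction terms corresponding to the list above
features['precip_x_temp'] = features['precip'] * features['temp']
features['precip_x_temp_x_season'] = (features['precip_x_temp']
                                      * features['month_sin'] * features['month_cos'])
features['temp_x_season'] = (features['temp']
                             * features['month_sin'] * features['month_cos'])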

Hopefully these tips can be useful to you on your quest for making the most of your machine learning information!