Sankey diagrams for USGS gauge data in python(?)

This post was inspired by the Sankey diagram in Figure 1 of the pre-print led by Dave Gold, “Exploring the Spatially Compounding Multi-sectoral Drought Vulnerabilities in Colorado’s West Slope River Basins” (Gold, Reed & Gupta, In Review), which features a Sankey diagram of flow contributions to Lake Powell. I like the figure, and thought I’d make an effort to produce similar diagrams using USGS gauge data.

Sankey diagrams show data flows between different source and target destinations. Lots of people use them to visualize their personal or business cashflows. They are an obvious visualization option for streamflows.

To explain the “(?)” in my title: when I started this, I quickly realized that I had to choose between two popular plotting packages: matplotlib or plotly.

I am a frequent matplotlib user and definitely appreciate the level of control it provides during figure generation. However, designing highly customized figures in matplotlib can be time- and line-intensive. On the other hand, in my experience, plotly can often produce appealing graphics with less code. I am also drawn to the fact that plotly graphics are interactive objects rather than static figures.

I decided to go with plotly to try something new. If you want to hear my complaints and thoughts on use context, you can skip to the conclusions below.

In the sections below, I provide some code which will:

  • Define a network of USGS gauge stations to include in the plot
  • Retrieve data from USGS gauge stations
  • Create a Sankey diagram using plotly showing streamflows across the network

Here, I focus on the Rio Grande upstream of Albuquerque, NM. However, you can plot a different streamflow network by modifying the dictionary of upstream nodes that defines the network.


Plotting a Sankey streamflow network with plotly

The code used here requires both plotly and the pygeohydro package (for USGS data retrieval).

from pygeohydro import NWIS
import plotly.graph_objects as go

With that out of the way, we can get started.

Defining the flow network & data retrieval

I start by defining a dictionary called upstream_stations which defines the relationships between different gauges of interest.

This dictionary contains pairs of the form: {"GAUGE_ID" : ["LIST_OF", "UPSTREAM", "GAUGE_IDs"]}

If there is no upstream site, then include an empty list. For the Rio Grande network, this looks like:

# define relationships between each gauge and upstream sites
upstream_stations = {
    '08329918' : ['08319000', '08328950'], 
    '08319000' : ['08317400', '08317200'],
    '08328950' : [],
    '08317200' : [],
    '08317400' : ['08313000'],
    '08313000' : ['08290000', '08279500'],
    '08287000' : [],
    '08279500' : [],
    '08290000' : ['08287000', '08289000'],
    '08289000' : [],
}

# Get list of all stations from upstream_stations
all_stations = list(upstream_stations.keys())
for station, upstream in upstream_stations.items():
    all_stations += upstream
all_stations = list(set(all_stations))

Notice that I also made a list containing all of the station IDs. I use the pygeohydro package from the HyRiver suite of tools to retrieve the gauge station data (Chegini, Li, & Leung, 2021). I often cite this package, and have written about it in a past post (“Efficient hydroclimatic data accessing with HyRiver for Python”).

Using the list of all_stations, I use the following code to pull daily streamflow data for each site from 2015-2020 (or some other specified dates):

def get_usgs_gauge_data(stations, dates):
    """
    Get streamflow data from USGS gauge stations using NWIS.
    """
    nwis = NWIS()
    df = nwis.get_streamflow(stations, dates, mmd=False)
    
    # get rid of USGS- in columns
    df.columns = df.columns.str.replace('USGS-', '')
    return df

# Get USGS flows
flows = get_usgs_gauge_data(all_stations, ('2015-01-01', '2020-12-31'))

For the Sankey diagram, we need a single flow value for each station. In this case I calculate an average of the annual total flows at each station:

# Average of the annual total flows at each station
agg_flows = flows.resample('Y').sum().agg('mean')

Creating the Sankey figure

At its core, a Sankey diagram is a visualization of a weighted network (also referred to as a graph) defined by:

  • Nodes
  • Links (aka Edges)
  • Weights

In our case, the nodes are the USGS gauge stations, the links are the connections between upstream and downstream gauges, and the weights are the average volumes of water flowing from one gauge to the next.

Each link is defined by a source and target node and a value. This is where the upstream_stations dictionary comes in. In the code block below, I set up the nodes and links, looping through upstream_stations to define all of the source-target relationships:

## Define nodes and links
# Nodes are station IDs
nodes = all_stations
node_indices = {node: i for i, node in enumerate(nodes)}

# make links based on upstream-downstream relationships
links = {
    'source': [],
    'target': [],
    'value': [],
}

# loop through upstream_stations dict
for station, upstream_list in upstream_stations.items():
    for stn in upstream_list:
        if stn in agg_flows and station in agg_flows:
            links['source'].append(node_indices[stn])
            links['target'].append(node_indices[station])
            links['value'].append(agg_flows[stn])

Lastly, I define some node labels and assign colors to each node. In this case, I want to make the nodes black if they represent reservoir releases (gauges at reservoir outlets) or blue if they are simple gauge stations.

labels = {
    '08329918' : 'Rio Grande at Alameda', 
    '08319000' : 'San Felipe Gauge',
    '08328950' : 'Jemez Canyon Reservoir',
    '08317200' : 'Santa Fe River',
    '08317400' : 'Cochiti Reservoir',
    '08313000' : 'Rio Grande at Otowi Bridge',
    '08287000' : 'Abiquiu Reservoir',
    '08279500' : 'Rio Grande',
    '08290000' : 'Rio Chama',
    '08289000' : 'Rio Ojo Caliente',
}

# Create nodes labels and colors lists
node_labels = [labels[node] for node in nodes]
node_colors = ['black' if 'Reservoir' in label else 'dodgerblue' for label in node_labels]


Finally, the function to generate the figure:

def create_sankey_diagram(node_labels, links, node_colors, 
						  orientation='h',
                          size=(2000, 700)):
    """
    Create a Sankey diagram using Plotly.
    
    Parameters
    ----------
    node_labels : list
        List of node labels.
    links : dict
        Dictionary with keys 'source', 'target', and 'value'.
    node_colors : list
        List of node colors.
    orientation : str
        Orientation of the diagram, 'h' for horizontal and 'v' for vertical.
    size : tuple
        Figure (width, height) in pixels.
        
    Returns
    -------
    sankey_fig : plotly.graph_objects.Figure
        Plotly figure object.
    """
    sankey_fig = go.Figure(go.Sankey(
        orientation=orientation,
        node=dict(
            pad=70,
            thickness=45,
            line=dict(color='dodgerblue', width=0.5),
            label=node_labels,
            color=node_colors
        ),
        link=dict(
            source=links['source'],
            target=links['target'],
            value=links['value'],
            color='cornflowerblue'
        )
    ))
    
    sankey_fig.update_layout(
        title_text="Rio Grande Streamflow ",
        font=dict(size=23),
        width=size[0],
        height=size[1]
    )
    return sankey_fig

There are some options for manipulating this figure script to better suit your needs. Specifically you may want to modify:

  • pad=70 : this is the padding (spacing) between nodes
  • thickness=45 : this is the thickness of the node elements

With our pre-prepped data from above, we can use the function like so:

sankey_fig = create_sankey_diagram(node_labels, 
								   links, 
								   node_colors, 
								   orientation='v', size=(1000, 1200))
sankey_fig

And here we have it:

I’d say it looks… okay. And admittedly this is the appearance after manipulating the node placement using the interactive interface.

It’s a bit squished vertically (which can be improved by making the figure much taller). However, my biggest issue is that the text is difficult to read.

Changing the orientation to horizontal (orientation='h') results in a slightly better-looking figure, which makes sense, since Sankey diagrams are usually shown horizontally. However, this does not preserve the relationship to the actual north-south flow direction of the Rio Grande, so I don’t like it as much.

Conclusions

To answer the question posed by the title, “Sankey diagrams for USGS gauge data in python(?)”: Yes, sometimes. And sometimes something else.

Plotly complaints: While working on this post, I developed a few complaints with the plotly Sankey tools. Specifically:

  • It appears that the label text coloring cannot be modified. I don’t like the white edgecolor/blur effect, but could not get rid of this.
  • The font is very difficult to read… I had to make the text size very large for it to be reasonably legible.
  • You can only assign a single node thickness. I had wanted to make the reservoirs thick, and shrink the size of the gauge station nodes. However, it seems this cannot be done.
  • The diagrams appear low-resolution and I don’t see a way to save a high res version.

Ultimately, the plotly tools are very restrictive in the design of the graphic. However, this is a tradeoff in order to get the benefit of interactive graphics.

Plotly praise: The plotly Sankey tools have some advantages, specifically:

  • The plots are interactive
  • Plots can be saved as HTML and embedded in websites

These advantages make the plotly tools good for anyone who might want to have a dynamic and maybe frequently updated dashboard on a site.
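
For example, saving the interactive figure as a standalone HTML file is a one-liner (the file name here is just a placeholder):

# Save an interactive HTML version of the Sankey figure
sankey_fig.write_html("rio_grande_sankey.html", include_plotlyjs="cdn")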

On the other hand, if I wanted to prepare a publication-quality figure, where I had absolute control over the design elements, I’d likely turn to matplotlib. That way the figure could be saved as an SVG and further manipulated in a vector art program like Inkscape or Illustrator.

Thanks for reading!

References

Chegini, T., Li, H. Y., & Leung, L. R. (2021). HyRiver: Hydroclimate data retriever. Journal of Open Source Software, 6(66), 3175.


Introduction to Bayesian Regression using PyMC

Motivation

Fans of this blog will know that uncertainty is often a focus for our group. When approaching uncertainty, Bayesian methods might be of interest since they explicitly provide uncertainty estimates during the modeling process.

PyMC is the best tool I have come across for Bayesian modeling in Python; this post gives a super brief introduction to this toolkit.

Introduction to PyMC

PyMC, described in their own words:
“… is a probabilistic programming library for Python that allows users to build Bayesian models with a simple Python API and fit them using Markov chain Monte Carlo (MCMC) methods.”

In my opinion, the best part of PyMC is the flexibility and breadth of its model design features. The space of possible model configurations is massive. It allows you to build models ranging from simple linear regressions (shown here) to more complex hierarchical models, copulas, Gaussian processes, and more.

Regardless of your model formulation, PyMC lets you generate posterior estimates of the model parameter distributions. These parameter distributions reflect the uncertainty in the model, and can be used to propagate uncertainty into your final predictions.

The posterior estimates of model parameters are generated using Markov chain Monte Carlo (MCMC) methods. A detailed overview of MCMC is outside the scope of this post (maybe in a later post…).

In the simplest terms, MCMC is a method for estimating posterior parameter distributions for a Bayesian model. It generates a sequence of samples from the parameter space (which can be huge and complex), where the probability of each sample is proportional to its likelihood given the observed data. By collecting enough samples, MCMC generates an approximation of the posterior distribution, providing insights into the probable values of the model parameters along with their uncertainties. This is key when the models are very complex and the posterior cannot be directly defined.

The PyMC example gallery has lots of cool stuff to get you inspired, with examples that go far and beyond the simple linear regression case.


Demonstration:

When drafting this post, I wanted to include a demonstration which is (a) simple enough to cover in a brief post, and (b) relatively easy for others to replicate. I settled on the simple linear regression model described below, since it can be built using readily retrievable CAMELS data.

The example attempts to predict mean streamflow as a linear function of basin catchment area (both in log space). As you’ll see, it’s not the worst model, but it’s far from a good one; there is a lot of uncertainty!

CAMELS Data

For a description of the CAMELS dataset, see Addor, Newman, Mizukami and Clark (2017).

I pulled all of the national CAMELS data using the pygeohydro package from HyRiver which I have previously recommended on this blog. This is a convenient single-line code to get all the data:

import pygeohydro as gh

### Load camels data
camels_basins, camels_qobs = gh.get_camels()

The camels_basins variable is a dataframe with the different catchment attributes, and camels_qobs is an xarray.Dataset. In this case we will only be using the camels_basins data.

The CAMELS data span the continental US, but I want to focus on a specific region (since hydrologic patterns will be regional). Before going further, I filter the data to keep only sites in the Northeastern US:

# filter by mean long lat of geometry: NE US
camels_basins['mean_long'] = camels_basins.geometry.centroid.x
camels_basins['mean_lat'] = camels_basins.geometry.centroid.y
camels_basins = camels_basins[(camels_basins['mean_long'] > -80) & (camels_basins['mean_long'] < -70)]
camels_basins = camels_basins[(camels_basins['mean_lat'] > 35) & (camels_basins['mean_lat'] < 45)]

I also convert the mean flow data (q_mean) from mm/day to cubic meters per second:

# convert q_mean from mm/day to m3/s
camels_basins['q_mean_cms'] = camels_basins['q_mean'] * (1e-3) *(camels_basins['area_gages2']*1000**2) * (1/(60*60*24)) 

And this is all the data we need for this crude model!

Bayesian linear model

The simple linear regression model (hello my old friend):
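
Written out in notation consistent with the PyMC model implemented further below (the equation from the original post is not reproduced here):

y = alpha + beta * x + epsilon,     epsilon ~ Normal(0, sigma)

where y is the log mean flow, x is the log catchment area, alpha is the intercept, beta is the slope, and epsilon is the error term.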

Normally you might assume that there is a single best value for each of the model parameters (alpha and beta). This is the Frequentist perspective, and it is a common approach. In these cases, the best parameters can be estimated by minimizing the errors associated with a particular set of parameters (see least squares, for example).

However, we could take a different approach and assume that the parameters (intercept and slope) are random variables themselves, and have some corresponding distribution. This would constitute a Bayesian perspective.

Keeping with simplicity in this example, I will assume that the intercept and slope each come from a normal distribution with a mean and variance such that:
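
In symbols (my notation; in the implementation below, the means are set from a least-squares fit and the standard deviations are set to 10):

alpha ~ Normal(mu_alpha, sigma_alpha)
beta  ~ Normal(mu_beta, sigma_beta)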

When it comes time to make inferences or predictions using our model, we can create a large number of predictions by sampling different parameter values from these distributions. Consequently, we will end up with a distribution of uncertain predictions.

PyMC implementation

I recommend you see the PyMC installation guide to help you get set up.

NOTE: The MCMC sampler used by PyMC is written in C and will be SIGNIFICANTLY faster if you have access to a GCC compiler and specify its directory using the following:

import pymc as pm

import os
os.environ["THEANO_FLAGS"] = "gcc__cxxflags=-C:\mingw-w64\mingw64\bin"

You will get a warning if you don’t have this properly set up.

Now, onto the demo!

I start by retrieving our X and Y data from the CAMELS dataset we created above:

import numpy as np

# Pull out X and Y of interest
x_ftr= 'area_gages2'
y_ftr = 'q_mean_cms'
xs = camels_basins[x_ftr] 
ys = camels_basins[y_ftr]

# Take log-transform 
xs = np.log(xs)
ys = np.log(ys)

At a glance, we see there is a reasonable linear relationship when working in the log space:

Two of the key features when building a model are:

  • The random variable distribution constructions
  • The deterministic model formulation

There are lots of different distributions available, and each one simply takes a name and set of parameter values as inputs. For example, the normal distribution defining our intercept parameter is:

alpha = pm.Normal('alpha', mu=intercept_prior, sigma=10)

The parameter priors that you specify when constructing the model may have a big impact, depending on the complexity of your model. For simple models you may get away with uninformative priors (e.g., setting mu=0); however, if you have some initial guesses, these can help with getting reliable convergence.

In this case, I used a simple least squares estimate of the linear regression as the parameter priors:

slope_prior, intercept_prior = np.polyfit(xs.values.flatten(), ys.values.flatten(), 1)

Once we have our random variables defined, then we will need to formulate the deterministic element of our model prediction. This is the functional relationship between the input, parameters, and output. For our linear regression model, this is simply:

y_mu = alpha + beta * xs

In the case of our Bayesian regression, this can be thought of as the mean of the regression outputs. The final estimates are going to be distributed around the y_mu with the uncertainty resulting from the combinations of our different random variables.

Putting it all together now:

### PyMC linear model
with pm.Model() as model:
    
    # Priors
    alpha = pm.Normal('alpha', mu=intercept_prior, sigma=10)
    beta = pm.Normal('beta', mu=slope_prior, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # mean/expected value of the model
    mu = alpha + beta * xs

    # likelihood
    y = pm.Normal('y', mu=mu, sigma=sigma, observed=ys)

    # sample from the posterior
    trace = pm.sample(2000, cores=6)
 

With our model constructed, we can use the pm.sample() function to begin the MCMC sampling process and estimate the posterior distribution of model parameters. Note that this process can be very computationally intensive for complex models! (Definitely make sure you have the GCC set up correctly if you plan on needing to sample complex models.)

Using the sampled parameter values, we can create posterior estimates of the predictions (log mean flow) using the posterior parameter distributions:

## Generate posterior predictive samples
ppc = pm.sample_posterior_predictive(trace, model=model)

Let’s go ahead and plot the range of the posterior distribution, to visualize the uncertainty in the model estimates:

### Plot the posterior predictive interval
import matplotlib.pyplot as plt
import arviz as az

fig, ax = plt.subplots(ncols=2, figsize=(8,4))

# log space
az.plot_hdi(xs, ppc['posterior_predictive']['y'], 
            color='cornflowerblue', ax=ax[0], hdi_prob=0.9)
ax[0].scatter(xs, ys, alpha=0.6, s=20, color='k')
ax[0].set_xlabel('Log ' + x_ftr)
ax[0].set_ylabel('Log Mean Flow (m3/s)')

# original dim space
az.plot_hdi(np.exp(xs), np.exp(ppc['posterior_predictive']['y']), 
            color='cornflowerblue', ax=ax[1], hdi_prob=0.9)
ax[1].scatter(np.exp(xs), np.exp(ys), alpha=0.6, s=20, color='k')
ax[1].set_xlabel(x_ftr)
ax[1].set_ylabel('Mean Flow (m3/s)')
plt.suptitle('90% Posterior Prediction Interval', fontsize=14)
plt.show()

And there we have it! The figure on the left shows the data and posterior prediction range in log-space, while the figure on the right is in non-log space.

As mentioned earlier, it’s not the best model (way too much uncertainty in the large-basin mean flow estimates), but at least we have the benefit of knowing the uncertainty distribution since we took the Bayesian approach!

That’s all for now; this post was really meant to bring PyMC to your attention. Maybe you have a use case or will be more likely to consider Bayesian approaches in the future.

If you have other Bayesian/probabilistic programming tools that you like, please do comment below. PyMC is one (good) option, but I’m sure other people have their own favorites for different reasons.



References

Addor, N., Newman, A. J., Mizukami, N. and Clark, M. P. The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, doi:10.5194/hess-21-5293-2017, 2017.

An introduction to linear programming for reservoir operations (part 1)

Motivation:

If you are like me, you may have been given an overview of linear programming (LP) methods but craved a more hands-on demonstration, as opposed to the abstract/generic mathematical formulations that are often presented in lectures. This post is for the hydrology enthusiasts out there who want to improve their linear programming understanding by stepping through a contextual example.

In this post, I provide a very brief introduction to linear programming, then go through the process of applying an LP approach to a simple hydroelectric reservoir problem focused on determining a release rate that considers both storage targets and energy generation. In a later post, I will come back to show how the formulation generated here can be implemented within simulation code.

Content:

  • An introduction to linear programming
    • Formulating an LP
    • Solving LP problems: Simplex
    • Relevance for water resource managers
  • A toy reservoir demonstration
    • The problem formulation
      • Constraints
      • Objective
  • Conclusions and what’s next

An introduction to linear programming

Linear programming (LP) is a mathematical optimization technique for making decisions under constraints. The aim of an LP problem is to find the best outcome (either maximizing or minimizing) given a set of linear constraints and a linear objective function.

Formulating an LP

The quality of an LP solution is only as good as the formulation used to generate it.

An LP formulation consists of three main components:

1. Decision Variables: These are the variables that you have control over. The goal here is to find the specific decision variable values that produce optimal outcomes. There are two decision variables shown in the figure below, x1 and x2.

2. Constraints: These are the restrictions which limit what decisions are allowed. (For our reservoir problem, we will use constraints to make sure total mass is conserved, storage levels stay within limits, etc.) The constraint functions are required to be linear and are expressed as inequality relationships.

3. Objective Function: This is the (single) metric used to measure outcome performance. This function is required to be linear, and its value is denoted Z in the figure below, where it depends on the two decision variables (x1 and x2).

In practice, the list of constraints on the system can get very large. Fortunately, matrix operations can be used to solve these problems quickly, although this requires that the problem be written in the “standard form” shown above. The standard form arranges the coefficient values for the constraints within matrices A and b.
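
Since the standard-form figure is not reproduced here, a generic way to write it is:

minimize    c^T x
subject to  A x <= b
            x >= 0

where x is the vector of decision variables, c holds the objective coefficients, and A and b hold the constraint coefficients and bounds.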

Solving LP problems: keep it Simplex

Often the number of potential solutions that satisfy your constraints will be very large (infinitely large for continuous variables), and you will want to find the one solution in this region that maximizes/minimizes your objective of interest (the “optimal solution”).

The set of all potential solutions is referred to as the “feasible space” (see the graphic below), with the constraint functions forming the boundary edges of this region. Note that by changing the constraints, you are inherently changing the feasible space and thus are changing the set of solutions that you are or are not considering when solving the problem.

The fundamental theorem of linear programming states that the optimal solution to an LP problem will exist at one of the corners (vertices) of the feasible space.

Recognizing this, George Dantzig proposed the Simplex algorithm, which strategically moves between vertices on the boundary of the feasible region, testing for optimality until the best solution is identified.

A detailed review of the Simplex algorithm warrants its own blog post, and won’t be given here. Fortunately, there are various open-source LP solvers. For example, one option in Python is scipy.optimize.linprog().
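
As a quick, made-up sketch (a generic two-variable problem, not the reservoir formulation below), calling the solver looks like this:

from scipy.optimize import linprog

# minimize c^T x  subject to  A_ub @ x <= b_ub  and  x >= 0
c = [-1.0, -2.0]                   # objective coefficients
A_ub = [[1.0, 1.0],
        [2.0, 1.0]]                # constraint coefficient matrix
b_ub = [10.0, 15.0]                # constraint right-hand sides
bounds = [(0, None), (0, None)]    # non-negativity of x1 and x2

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(result.x, result.fun)        # optimal decision variables and objective value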

Relevance for water resource managers

Why should you keep reading this post?

If you work in water resource systems, you will likely need to make decisions in highly constrained settings. In these cases, LP methods can be helpful. In fact, there are many scenarios in water resource management in which LP methods can be (and historically have been) useful. Some general application contexts include:

  • Resource allocation problems: LP can be used to allocate water efficiently among different users like agriculture, industry, and municipalities.
  • Operations optimization: LP can guide the operations decisions of water reservoirs, deciding on storage levels, releases, and energy production to maximize objectives like water availability or energy generation from hydropower (the topic of this demonstration below).

Toy Reservoir Demonstration: Problem Formulation

The following demo aims to provide a concrete example of applying linear programming (LP) in the realm of water resource management. We will focus on a very simple reservoir problem, which, despite its simplicity, captures the essence of many real-world challenges.

Imagine you are managing a small hydroelectric reservoir that provides water and energy to a nearby town. Your goal is to operate the reservoir, choosing how much water to release each day, such that (1) the water level stays near a specific target and (2) energy generation is maximized. Additionally, there is a legally mandated minimum flow requirement downstream to sustain local ecosystems.

Here, I will translate this problem context into formal LP notation, then in a later post I will provide Python code to implement the LP decision making process within a simulation model.

Problem formulation

We want to translate our problem context into formal mathematical notation that will allow us to use an LP solver. To help us get there, I’ve outlined the following table with all of the variables being considered here.

Decision variables

In this case, the things we are controlling are the reservoir releases (Rt) and the storage levels (St) at each timestep.

Constraints:

Constraints are the backbone of any LP problem, defining the feasible space which ultimately dictates the optimal solution. Without constraints, an LP problem would have infinite solutions, rendering it practically meaningless. Constraints are meant to represent any sort of physical limitations, legal requirements, or specific conditions that must be satisfied by your decision. In the context of water resource management, constraints could signify capacity limits of a reservoir, minimum environmental flow requirements, or regulatory water diversion limits.

  • Mass balance requirement:

Naturally, you want to make sure that mass is conserved through the reservoir by balancing all of the inflows, outflows, and storage states at each timestep (t):

We need to separate this into two separate inequalities in order to meet the “standard form” for an LP formulation. I am also going to move the decision variables (Rt and St) to the left side of the inequalities.
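
A hedged sketch of what this looks like, using I_t to denote the inflow during timestep t (the symbol is mine, since the original equations are not reproduced here):

S_t = S_(t-1) + I_t - R_t

which, split into two inequalities with the decision variables on the left-hand side, becomes:

S_t - S_(t-1) + R_t <= I_t
-S_t + S_(t-1) - R_t <= -I_t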

  • Storage limitations:

The most obvious constraints are that we don’t want the reservoir to overflow or run out of water. We do this by requiring the storage to stay between minimum (Smin) and maximum (Smax) values specified by the operator.

Again, we need to separate this into two separate inequalities:
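
That is (sketching the pair of inequalities described above):

S_t <= S_max
-S_t <= -S_min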

  • Maximum and minimum release limitations:

The last two constraints represent the physical limit of how much water can be discharged out of the reservoir at any given time (Rmax) and the minimum release requirement that is needed to maintain ecosystem quality downstream of the reservoir (Rmin).
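
In the same standard form, these become:

R_t <= R_max
-R_t <= -R_min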

Objective:

The objective function in an LP represents your goals. In the case of our toy reservoir, we have two goals:

  1. Maximize water storage
  2. Maximize energy production.

Initially, when I was setting this demonstration up, there was no hydroelectric component. However, I chose to add energy generation because it introduces more complexity in the decisions (as we will see). This is because the two objectives conflict with one another. When generating energy, the LP is encouraged to release a lot of water to maximize energy generation; however, it must balance this with the goal of raising the storage to the target level. This tension between the two objectives makes for much more interesting decision dynamics.

1. Objective #1: Maximize storage

Since our reservoir managers want to make sure the town’s water supply is reliable, they want to maximize storage. This demo scenario assumes that flooding is not a concern for this reservoir. We can define objective one simply as maximizing the total storage (the equations for this and the next two steps are sketched together below):

Again, we can replace the unknown value (St) with the mass-balance equation described above to generate a form of the objective function which includes the decision variable (Rt):

We can also drop the non-decision variables (all except Rt) since these cannot influence our solution:
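
A hedged reconstruction of that chain (again using I_t for inflow):

maximize  sum_t S_t
→ maximize  sum_t ( S_(t-1) + I_t - R_t )     (substituting the mass balance)
→ maximize  sum_t ( -R_t )                    (dropping terms the release decision cannot influence)

In other words, the storage goal discourages large releases.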

2. Objective #2: Maximize energy generation:

Here I introduce a second component of the objective function associated with the amount of hydroelectric energy generated by releases. Whereas the storage component of the objective function encourages minimizing releases, the energy generation component incentivizes maximizing releases to generate as much energy each day as possible.

I will define the energy generated during each timestep as:
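
The original expression is not shown here; one common simplification, assuming a constant hydraulic head h and a combined efficiency/conversion factor (eta * rho * g), is:

E_t = eta * rho * g * h * R_t

so the energy generated in a timestep is proportional to the release R_t.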

(1+2). Aggregate minimization objective:
Lastly, LP problems can only solve for a single objective function, so we need to aggregate the storage and energy generation objectives described above. In this case I take a simple sum of the two objective scores. Of course, this is not always appropriate, and the weighting should reflect the priorities of the decision makers and stakeholders.

This requires a priori specification of objective preferences, which will determine the solution to the optimization. In later posts I plan on showing an alternative method which allows for a posteriori interpretation of objective tradeoffs. But for now, we stick with the basic LP and equal objective weighting.

Also, we want to formulate the optimization as a minimization problem. Since we are trying to maximize both of our objectives, we will make them negative in the final objective function.

The final aggregate objective is calculated as the sum of objectives of the N days of operation:
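
Generically (the exact terms follow the storage and energy derivations above; the notation here is mine):

minimize  Z = sum over t = 1..N of ( -O_storage,t - O_energy,t )

where O_storage,t and O_energy,t are the day-t storage and energy objective terms.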

Final formulation:

As we prepare to solve the LP problem, it is best to lay out the complete formulation in a way that will easily allow us to transform the information into the form needed for the solver:

Conclusions and what’s next

If you’ve made it this far, good on you, and I hope you found some benefit to the process of formulating a reservoir LP problem from scratch.

This post was meant to include a reservoir simulation which implements the LP solutions. However, the post has grown quite long at this point, so I will save the code for another post. In the meantime, you may find it helpful to try coding the reservoir simulator yourself and to identify some weaknesses of the proposed operational LP formulation.

Thanks for reading, and see you here next time where we will use this formulation to implement release decisions in a reservoir simulation.

Using drawdata and Mercury in Jupyter Notebooks

I wrote a post last year on Enhancing Jupyter Notebooks for Teaching. Through my time at Cornell as a TA and mentor to our undergrad researchers, I’ve learned how important it is to find fun and interesting ways for students to play with data and to feel more comfortable diving into traditionally difficult topics like machine learning. This post features two cool new libraries that were recently brought to my attention and that can help jazz up your tutorials:

(1) drawdata: allows a student to interactively draw a dataset (lines, histograms, scatterplots) with up to four different labels. The dataset and labels can be saved as JSONs and CSVs and also directly copied into Pandas dataframes. This can be useful to facilitate interactive machine learning tutorials.

(2) Mercury: converts a Jupyter notebook into a web app. This is especially useful for classroom tutorials because it doesn’t require that students have Jupyter, or even Python, installed.

I’ve combined both of these functionalities into a notebook that is focused on classification of a dataset with 2 labels using support vector machines (SVMs) that can be found here.

drawdata

First let’s install drawdata by typing pip install drawdata into our command line. Next, let’s follow through the steps of the Jupyter notebook. We’ll import the rest of our libraries and draw out a simple linearly-separable dataset.

By clicking “copy csv” we can copy this exact dataset into a Pandas dataframe. Upon inspection, we see that each datapoint has 2 features (an x and y coordinate) and a label that is either “a” or “b”. Let’s also adjust the labels to be [-1,1] for the purposes of using some classifiers from scikit-learn. We then fit a very basic support vector classifier and plot the decision boundary.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm

#Copy the dataset drawn in drawdata from the clipboard
data=pd.read_clipboard(sep=",")

#Rename the labels to integers
for i in range(0, len(data)):
    #relabel 'a' as -1 and 'b' as 1
    if(data.iloc[i,2] == 'a'):
        data.iloc[i,2] = -1
    else:
        data.iloc[i,2] = 1
data.iloc[:,2]=data.iloc[:,2].astype('int')


#Create our datasets

X=np.array(data.iloc[:,0:2])
y=np.array(data.iloc[:,2])


#Create a 60/40 training and test split 

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42)
    
#Fit classifier 

clf=svm.SVC(kernel='linear',C=0.025)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("The classification accuracy is", score)
#The classification accuracy is 1.0


#Plot original dataset that you drew
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.inspection import DecisionBoundaryDisplay

cm = plt.cm.get_cmap('cool_r')
cm_bright = ListedColormap(["purple", "cyan"])
figure = plt.figure(figsize=(8, 6))
ax = plt.subplot(1,1,1)
#Plot decision boundary
ax.set_title("Classification Boundary")
DecisionBoundaryDisplay.from_estimator(clf, X, cmap=cm, alpha=0.8, ax=ax, eps=0.5)
# Plot the training points
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train,cmap=cm_bright , edgecolors="k")
# Plot the testing points
ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6, edgecolors="k")

We can also plot the margin and support vectors.

This is a straightforward classification example, but we can also look at non-linearly separable examples.

Ruh roh! Our linear classifier is never going to work for this dataset! We’ll have to use the kernel trick: that is, we need to map our data to a higher-dimensional space with an additional feature that may better help us separate the two classes. We can create an additional feature that captures the distance from a central point (300, 300). This allows us to find a hyperplane that can separate the points in the higher-dimensional space.

from ipywidgets import interact, fixed

#Create an additional dimension
r = np.sqrt((X[:, 0]-300)**2+(X[:, 1]-300)**2)

def plot_3D(elev=30, azim=30, X=X, y=y):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap=cm_bright,edgecolors="k")
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('r')
    

interact(plot_3D, elev=(-90, 90), azip=(-180, 180),
         X=fixed(X), y=fixed(y));

Thus, we can use a radial basis function (RBF) kernel in our SVC, rather than a linear kernel, to define a non-linear classification boundary.
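
As a minimal sketch (reusing X_train, X_test, y_train, and y_test from the earlier snippet; C is kept at the same value as the linear example and gamma='scale' is simply the scikit-learn default):

from sklearn import svm

#Fit a non-linear classifier with an RBF kernel
clf_rbf = svm.SVC(kernel='rbf', C=0.025, gamma='scale')
clf_rbf.fit(X_train, y_train)
print("The RBF classification accuracy is", clf_rbf.score(X_test, y_test))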

Mercury

Once you have a notebook that you are satisfied with, you can make it into an interactive web app by adding a YAML header. The YAML header also makes it possible to interact with certain variables in the code and then run the notebook. For example, users can select a classifier from a drop-down menu or slide through different values of C. Finally, they click the run button to execute the notebook.

A YAML header is added to the first RAW cell in the notebook. Here I want the user to be able to slide through the hardness of the margin (between 0 and 100).

---
title: SVM Classifier
description: Implement linear and non-linear classifiers
show-code: True
params:
    C_margin:
        input: slider 
        label: Value of Margin 
        value: 0.025
        min: 0
        max: 100
---

Then we install Mercury (pip install mljar-mercury), go to the command line, and type mercury watch Classification_Tutorial.ipynb. The prompt will print a local address that you can open in your browser. Users can then draw their own dataset, adjust the hardness of the margin, and run the notebook.

Resources

Some of the SVM plotting code was borrowed from: https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html

More info on Mercury can be found in this post: https://medium.com/@MLJARofficial/mercury-convert-jupyter-notebook-to-a-web-app-by-adding-yaml-header-872ce4e53676

Structuring a Python Project: Recommendations and a Template Example

Motivation:

The start of a new year is a good (albeit, relatively arbitrary) time to reassess aspects of your workflow.

I, like many people, taught myself Python by jumping into different projects. The consequence of this ad-hoc learning was that I did not learn some of the fundamentals until much later in my project development.

At the end of the Fall ’23 semester, I found myself spending a lot of time cleaning up repositories and modules that I had constructed in the preceding months. I ended up restructuring and reorganizing significant portions of the code base, implementing organizational practices that I had learned after its conception.

This post is intended to save the reader from making the same mistake. Here, I present a recommended structure for new Python projects, and discuss the main components. This is largely targeted at Python users who have not had a formal Python training, or who are working with their first significantly sized project.

I provide an example_python_project, hosted on GitHub, as a demonstration and to serve as a template for beginners.

Content:

  1. Recommended structure
  2. Example project repository
  3. Project components (with example project)
    1. Modules, packages, __init__.py
    2. Executable
    3. README
    4. Documentation
    5. Tests
    6. Requirements
    7. LICENSE
  4. Conclusion

Recommended project structure

I’ll begin by presenting a recommended project structure. Further down, I provide some explanation and justification for this structure.

My example_python_project follows recommendations from Kenneth Reitz, co-author of The Hitchhiker’s Guide to Python (1), while also drawing from a similar demo project, samplemod, by Navdeep Gill (2).

The project folder follows this structure:

example_python_project/ 
├── sample_package/ 
│   ├── subpackage/ 
│   │   ├── __init__.py 
│   │   └── subpackage_module.py 
│   ├── __init__.py 
│   ├── helpers.py 
│   └── module.py 
├── docs/ 
│   └── documentation.md 
├── tests/ 
│   ├── __init__.py 
│   └── basic_test.py 
├── main.py 
├── README.md 
├── requirements.txt 
└── LICENSE

If you are just starting a project, it may seem unnecessarily complex to begin with so much modularity. It may seem easier to open a .py file and start freewheeling. Here, I highlight several reasons why it is important to take care when initially constructing a Python project. Some of these reasons include:

  1. Maintenance: A well-structured project makes it easier to understand the code, fix bugs, and add new features. This is especially important as the project grows in size and complexity.
  2. Collaboration: When working on a project with multiple developers, a clear structure makes it easier for everyone to understand how the code is organized and how different components interact with each other.
  3. Scalability: A well-structured project allows you to scale up the codebase, adding new features and sub-components, without making it hard to understand or maintain.
  4. Testing: A well-structured project makes it easier to write automated tests for the code. This helps to ensure that changes to the code do not break existing functionality.
  5. Distribution: A well-structured project makes it easier to distribute the code as a package. This allows others to easily install and use the code in their own projects.

Overall, taking the time to structure a Python project when starting can save a lot of time and heartache in the long run, by making the project easier to understand, maintain, and expand.


Example Project Repository

The repository containing this example project is available on GitHub here: example_python_project.

The project follows the recommended project structure above, and is designed to use modular functions from the module, helpers, and subpackage_module. It is intended to be a skeleton upon which you can build up your own project.

If you would like to experiment with your own copy of the code, you can fork a copy of the repository, or Download a ZIP version.

Project overview

The project is a silly riddle program with no real usefulness other than forming the structure of the project. The bulk of the work is done in the main_module_function() which first prints a riddle on the screen, then iteratively uses the helper_function() and subpackage_function() to try and “solve” the riddle. Both of these functions simply return a random True/False, and are repeatedly called until the riddle is solved (when status == True).

Below is a visual representation of how the different functions are interacting. The green-box functions are contained within the main sample_package, while the blue-box function is stored in the subpackage.

The program can then be executed from a command line using the main.py executable:

C:\<your-local-directory>\example_python_project> python main.py

The output will first print out the riddle, then print statements indicating which functions are being used to “solve” the riddle. This is simply a means of demonstrating how the different functions are being activated, and not necessarily a recommended “Best Practice”.

A normal output should resemble something similar to the below, although there may be more or less print statements depending upon how many times it takes the random generator to produce a “True” solution:

Here is a riddle, maybe `sample_package` can help solve it:

   What runs but has no feet, roars but has no mouth?

Lets see if the helper can solve the riddle.
The helper_function is helping!
The helper could not solve it.
Maybe the subpackage_module can help.
The subpackage_function is being used now.
The subpackage solved it, the answer is "A River"!

Project components

Modules, packages, __init__.py, oh my!

Before going any further, I want to take time to clarify some vocabulary which is helpful for understanding component interactions.

Module:
A module is simply a file ending in .py, which contains functions and or variables.

Package:
A package is a collection of modules (.py files) which relate to one another, and which contains an __init__.py file.

__init__.py
Inclusion of an __init__.py file (pronounced “dunder init”) within a folder indicates to Python that the folder is a package. Often, the __init__ module is empty; however, it can be used to import other modules or functions, which are then stored in the namespace and made available for use later.

For example, in my sample_package/__init__.py, I import all contents of the module.py and subpackage_module.py:

# Import the all functions from main and sub modules
from .module import *
from .subpackage.subpackage_module import *

This allows all of the functions stored within the module to be called from the primary sample_package directly, rather than specifying the various sub-structures needed to access them. For example, by including from .subpackage.subpackage_module import *, I am able to run:

# IF __init__ imports all content from main and sub modules then you can do this:
import sample_package
sample_package.subpackage_module_function()

Rather than requiring the following fully-nested call, which is necessary when the __init__.py is empty:

# IF __init__ is EMPTY, then you need to do this:
import sample_package
sample_package.subpackage.subpackage_module.subpackage_module_function()

Notably, an __init__.py is not necessary to use modules and functions within a folder; however, customizing the imports present in the package’s __init__.py provides increased control over your project’s use. As the project increases in complexity, strategic usage of imports within the __init__ can keep your main executable functions cleaner.

Executables

So, you’ve crafted a Python project with a sleek, modular package design. The next step is to set up a single file which will execute the package.

Inclusion of a single executable has the benefit of providing a single-entry point for other users who want to run the program without getting lost in the project.

In the example_python_project, this is done with main.py:

# Import the main package
import sample_package

def run():
    solved = sample_package.main_module_function()
    return solved

# Run the function if this is the main file executed
if __name__ == "__main__":
    run()

The program can then be executed from a command line:

C:\<your-local-directory>\example_python_project> python main.py

README

The README.md file is typically someone’s first encounter with your project. This is particularly true if the project is hosted on GitHub, where the README.md is used as the home-page of a repository.

A README.md file should include, at minimum, a brief description of the project, its purpose, and clear instructions on how to use the code.

Often, README files are written in Markdown, which includes simple text-formatting options. You can find a basic Markdown Cheat Sheet here. reStructuredText is also commonly used, and even a plain .txt file may be suitable.

Documentation

Great code requires great documentation. Initializing a new project with a dedicated docs/ folder may help hold you accountable for documenting the code along the way.

For information on how to use Sphinx and reStructuredText to create clean webpage-based documentation, you can see Rohini Gupta’s post on Using Python, Sphinx, and reStructuredText to Create a Book (and Introducing our eBook: Addressing Uncertainty in Multisector Dynamics Research!).

Tests

Bugs aren’t fun. They are even less fun when code that was bug-free yesterday contains bugs today. Implementing automated tests in your project can help verify functionality throughout the development process and catch bugs when they arise.

It is recommended to implement unit tests which verify individual components of the project. These tests should assert that function output properties align with expectations. As you develop your project in a modular way, you can progressively add tests, then run all of them before sharing or pushing the project to others.

A standard Python installation comes with the unittest package, which provides a framework for these tests. I provide an example test below; a deeper dive into the unittest framework may require a dedicated future post.

In the example_python_project, I include the basic_test.py to verify that the solution generated by main_module_function() is True using the unittest package:

import sample_package
import unittest

# Define a test suite targeting specific functionality
class BasicTestSuite(unittest.TestCase):
    """Basic test cases."""
    def test_that_riddle_is_solved(self):
        solved = sample_package.module.main_module_function()
        self.assertTrue(solved)


if __name__ == '__main__':
    unittest.main()

Running the basic_test module from the command line produces an “OK” if everything runs smoothly; otherwise it will provide information regarding which tests are failing.

----------------------------------------------------------------------
Ran 1 test in 0.004s
OK

Currently, the example_python_project requires the basic_test module to be executed manually. To learn more about automating this process, you can see Andrew Dirck’s 2020 post: Automate unit testing with Github Actions for research codes.

Requirements

The requirements.txt is a simple text file which lists the dependencies, or necessary packages that are required to run the code.

This can be particularly important if your code requires a specific version of a package, since the package version can be specified in the requirements.txt. Specifying a particular package version (e.g., numpy==1.24.1) can improve the reliability of your code, since different versions of these packages may behave differently in the future.

Here is an example of what might be inside a requirements.txt, if the numpy and pandas packages are necessary:

numpy==1.24.1
pandas==1.5.3

Users can easily install all the packages listed in requirements.txt using the command:

pip install -r requirements.txt

License

I’ll keep this section brief, since I am far from qualified to comment on the legal side of licensing. However, general advice suggests that if you are sharing code publicly, it is safest to include a license of some sort.

Inclusion of an open-source license allows other users to comfortably use and modify your code for their own purposes, allowing you to contribute to and benefit the broader community, while also protecting the original author from future liabilities associated with its use by others.

The GNU General Public License is one of the most common open-source licenses; however, if you would like to know more about the different options, you can find some guidance here: https://choosealicense.com/

Conclusions

If you are an experienced Python user, there may not be anything new for you here, but at the least I hope it serves as a reminder to take care in your project design this year.

Additionally, this is likely to be one part in a multi-part Introduction to Python series that I will be writing for future members of our research group. With that in mind, check back here later this spring for the subsequent parts if that interests you.

Best of luck!

References

(1) Reitz, K., & Schlusser, T. (2016). The Hitchhiker’s Guide to Python: Best Practices for Development. O’Reilly Media, Inc.
(2) Gill, N. (2019). samplemod. https://github.com/navdeep-G/samplemod (accessed 2023).

Copula Coding Exercise

This Halloween, you may be scared by many things, but don’t let one of them be copulas. Dave Gold wrote an excellent post on the theory behind copulas and I’ll just build off of that post with a simple coding exercise. When I first heard about copulas, they sounded really relevant to helping me come up with a way to capture joint behavior across variables (which I then wanted to use to capture joint hydrologic behavior across watersheds). However, because I hadn’t learned about them formally in a class, I had a hard time translating what I was reading in dense statistics textbooks into practice and found myself blindly trying to use R packages and not making much progress. It was very helpful for me to start to wrap my head around copulas by just starting with coding a simple trivariate Gaussian copula from scratch. If copulas are new to you, hopefully this little exercise will be helpful to you as well.

Let’s pose a scenario that will help orient the reason why a copula will be helpful. Let’s imagine that we have historical daily full natural flow for three gauged locations in the Central Valley region of California: Don Pedro Reservoir, Millerton Lake, and New Melones Lake. Because these three locations are really important for water supply for the region, it’s helpful to identify the likelihood of joint flooding or drought tendencies across these basins. For this post, let’s focus on joint flooding.

First, let’s fit distributions for each basin individually. I calculate the water year, create another column for it, and then take the daily flow and determine the peak three-day flow for each year and each basin. Given the different residence times in each basin, a three-day aggregate flow can better capture concurrent events than a one-day flow. Let’s term the flows in each basin [xt,1 … xt,n], the 3-day peak annual flows in each of the n basins in year t.

library(evd)
library(mvtnorm)
library(extRemes)
library(mnormt)

#Read in annual observed flows 
TLG_flows=read.table("C:/Users/rg727/Downloads/FNF_TLG_mm.txt")
MIL_flows=read.table("C:/Users/rg727/Downloads/FNF_MIL_mm.txt")
NML_flows=read.table("C:/Users/rg727/Downloads/FNF_NML_mm.txt")

#Calculate water year column
TLG_flows$water_year=0

for (i in 1:dim(TLG_flows)[1]){
  if (TLG_flows$V2[i]==10|TLG_flows$V2[i]==11|TLG_flows$V2[i]==12){
    TLG_flows$water_year[i]=TLG_flows$V1[i]+1
  } else{
    TLG_flows$water_year[i]=TLG_flows$V1[i]}
}


MIL_flows$water_year=0

for (i in 1:dim(MIL_flows)[1]){
  if (MIL_flows$V2[i]==10|MIL_flows$V2[i]==11|MIL_flows$V2[i]==12){
    MIL_flows$water_year[i]=MIL_flows$V1[i]+1
  } else{
    MIL_flows$water_year[i]=MIL_flows$V1[i]}
}


NML_flows$water_year=0

for (i in 1:dim(NML_flows)[1]){
  if (NML_flows$V2[i]==10|NML_flows$V2[i]==11|NML_flows$V2[i]==12){
    NML_flows$water_year[i]=NML_flows$V1[i]+1
  } else{
    NML_flows$water_year[i]=NML_flows$V1[i]}
}



#Calculate 3-day total flow


TLG_flows$three_day=0

for (i in 1:length(TLG_flows$V4)){
  if (i == 1){
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i],TLG_flows$V4[i+1])
  }
  else
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i-1],TLG_flows$V4[i],TLG_flows$V4[i+1])
}

MIL_flows$three_day=0

for (i in 1:length(MIL_flows$V4)){
  if (i == 1){
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i],MIL_flows$V4[i+1])
  }
  else
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i-1],MIL_flows$V4[i],MIL_flows$V4[i+1])
}

NML_flows$three_day=0

for (i in 1:length(NML_flows$V4)){
  if (i == 1){
    NML_flows$three_day[i]=sum(NML_flows$V4[i],NML_flows$V4[i+1])
  }
  else
    NML_flows$three_day[i]=sum(NML_flows$V4[i-1],NML_flows$V4[i],NML_flows$V4[i+1])
}



#Calculate peak annual flow
TLG_flow_peak_annual=aggregate(TLG_flows,by=list(TLG_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)
MIL_flow_peak_annual=aggregate(MIL_flows,by=list(MIL_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)
NML_flow_peak_annual=aggregate(NML_flows,by=list(NML_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)

#Remove extra year (1986) from TLG
TLG_flow_peak_annual=TLG_flow_peak_annual[2:33,]


Next, we need to fit individual marginal distributions for each location. Since we are interested in capturing flood risk across the basins, a common choice of marginal distribution is the Generalized Extreme Value (GEV) distribution. We fit a separate GEV distribution for each basin and, from there, use a bit of code to determine the peak three-day flows associated with a historical 10-year return period event.

#Determine level of a 10-year return period

getrlpoints <- function(fit){
  
  xp2 <- ppoints(fit$n, a = 0)
  ytmp <- datagrabber(fit)
  y <- c(ytmp[, 1])
  sdat <- sort(y)
  npy <- fit$npy
  u <- fit$threshold
  rlpoints.x <- -1/log(xp2)[sdat > u]/npy
  rlpoints.y <- sdat[sdat > u]
  rlpoints <- data.frame(rlpoints.x, rlpoints.y)
  
  return(rlpoints)
}


getcidf <- function(fit){
  
  rperiods = c(2,5,10,20,50,100,500,1000)
  bds <- ci(fit, return.period = rperiods)
  c1 <- as.numeric(bds[,1])
  c2 <- as.numeric(bds[,2])
  c3 <- as.numeric(bds[,3])
  ci_df <- data.frame(c1, c2, c3, rperiods)
  
  return(ci_df)
}


gevfit_MIL <- fevd(MIL_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_MIL)
ci_df <- getcidf(gevfit_MIL)

gevfit_NML <- fevd(NML_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_NML)
ci_df <- getcidf(gevfit_NML)

gevfit_TLG <- fevd(TLG_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_TLG)
ci_df <- getcidf(gevfit_TLG)



#These are the historical 10-year return levels (thresholds) from the GEV fits to the peak annual three-day flows
ten_yr_event_TLG=58.53738	
ten_yr_event_MIL=40.77821
ten_yr_event_NML=58.95369

Now that we have our marginals fit, we need to use them to fit a Gaussian copula. By definition, a copula is a multivariate cumulative distribution function for which the marginal probability distribution of each variable is uniform on the interval [0, 1]. So we need to transform our observations to be pseudo-uniform. To do so, we push the values of [xt,1…xt,n] through the CDF of the respective fitted GEV distribution (pgev in the code below). The transformed values are now uniformly distributed between 0 and 1. Let’s term these values [ut,1…ut,n].
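Written out (again my notation), with F̂_i denoting the CDF of the GEV fitted to basin i:

$$ u_{t,i} = \hat{F}_i\left(x_{t,i}\right) $$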

#Now we want to fit the copula on our historical flows. First create u's

u_TLG = pgev(TLG_flow_peak_annual$three_day, loc=gevfit_TLG$results$par[1], scale=gevfit_TLG$results$par[2], shape=gevfit_TLG$results$par[3],lower.tail = TRUE)
u_MIL = pgev(MIL_flow_peak_annual$three_day, loc=gevfit_MIL$results$par[1], scale=gevfit_MIL$results$par[2], shape=gevfit_MIL$results$par[3],lower.tail=TRUE)
u_NML = pgev(NML_flow_peak_annual$three_day, loc=gevfit_NML$results$par[1], scale=gevfit_NML$results$par[2], shape=gevfit_NML$results$par[3],lower.tail=TRUE)

#Let's also convert the thresholds to u's for each basin 
u_TLG_event = pgev(ten_yr_event_TLG, loc=gevfit_TLG$results$par[1], scale=gevfit_TLG$results$par[2], shape=gevfit_TLG$results$par[3], lower.tail = TRUE)
u_MIL_event = pgev(ten_yr_event_MIL, loc=gevfit_MIL$results$par[1], scale=gevfit_MIL$results$par[2], shape=gevfit_MIL$results$par[3],lower.tail=TRUE)
u_NML_event = pgev(ten_yr_event_NML, loc=gevfit_NML$results$par[1], scale=gevfit_NML$results$par[2], shape=gevfit_NML$results$par[3],lower.tail=TRUE)

Now we address the Gaussian part of the copula. Recall the following in Dave’s post:

$$ C(u_1, \dots, u_n) = \Phi_R\left(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_n)\right) $$

Where Φ_R is the joint standard normal CDF with:

$$ \text{mean} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix} \qquad \text{and} \qquad \text{covariance } R = \begin{bmatrix} 1 & \rho_{1,2} & \cdots & \rho_{1,n} \\ \rho_{2,1} & 1 & \cdots & \rho_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n,1} & \rho_{n,2} & \cdots & 1 \end{bmatrix} $$

ρ_(i,j) is the correlation between random variables X_i and X_j.

Φ^-1 is the inverse standard normal CDF.

So we take our u vectors and push them through the inverse standard normal CDF (Φ^-1) to get Z-scores.

#Now take these u's and push them through the inverse standard normal CDF (qnorm) to get Z-scores
 
Z_TLG=qnorm(u_TLG)
Z_MIL=qnorm(u_MIL)
Z_NML=qnorm(u_NML)

#Let's do the same for the historical events
Z_TLG_event=qnorm(u_TLG_event)
Z_MIL_event=qnorm(u_MIL_event)
Z_NML_event=qnorm(u_NML_event)

The last piece that we need for our copula is a Spearman rank correlation matrix that captures the dependence structure of the peak flows across the three basins.

cor_matrix=cor(cbind(Z_TLG,Z_MIL,Z_NML),method = "spearman")

Finally, we have all the components of our copula. What question can we ask now that we have it? How about: "What is the likelihood of exceeding the historical 10-year event in all three basins simultaneously?" We code up the formula which expresses this exact likelihood. More details can be found in this very helpful book chapter.
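Written out in the notation above, with z_i denoting the Z-score of the 10-year event in basin i, the joint exceedance probability follows from inclusion-exclusion:

$$ P(X_1 > x_1, X_2 > x_2, X_3 > x_3) = 1 - \Phi(z_1) - \Phi(z_2) - \Phi(z_3) + \Phi_R(z_1, z_2) + \Phi_R(z_1, z_3) + \Phi_R(z_2, z_3) - \Phi_R(z_1, z_2, z_3) $$

where each Φ_R term uses the correlation sub-matrix for the corresponding pair or triple of basins; this is what the pnorm and pmnorm calls below evaluate.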

We calculate this with the following code and find an exceedance probability of 0.0564.

#Now calculate the probability of exceeding the historical threshold in all basins

exceedance = 1 - pnorm(Z_TLG_event) - pnorm(Z_MIL_event) - pnorm(Z_NML_event) +
  pmnorm(c(Z_TLG_event,Z_MIL_event), mean=rep(0,2), varcov=cor(cbind(Z_TLG,Z_MIL))) +
  pmnorm(c(Z_TLG_event,Z_NML_event), mean=rep(0,2), varcov=cor(cbind(Z_TLG,Z_NML))) +
  pmnorm(c(Z_MIL_event,Z_NML_event), mean=rep(0,2), varcov=cor(cbind(Z_MIL,Z_NML))) -
  pmnorm(c(Z_TLG_event,Z_MIL_event,Z_NML_event), mean=rep(0,3), varcov=cor(cbind(Z_TLG,Z_MIL,Z_NML)))

Now this intuitively makes sense, because a 10-year event has a 10% (0.1) probability of being exceeded in any given year. The joint likelihood of exceeding this event across multiple basins must be even smaller, so our probability of 0.0564 tracks. Note that you can create synthetic realizations of correlated peak flows by simulating from a multivariate normal distribution with a mean of 0 and the Spearman rank correlation matrix defined above, and then pushing the simulated Z-scores back through pnorm and the fitted GEV quantile functions (qgev) to recover flows. Hopefully this was a simple but helpful example to demonstrate the key components of the Gaussian copula. The script and corresponding datasets can be found in this repo.

Markdown-Based Scientific and Computational Note Taking with Obsidian

Motivation

Over the last year, being in the Reed Research Group, I have been exposed to new ideas more rapidly than I can manage. During this period, I was filling my laptop storage with countless .docx, .pdf, .txt, and .html files scattered semi-sporadically across different cloud and local storage locations.

Finding a good method for organizing my notes and documentation has been very beneficial for my productivity, processing of new ideas, and ultimately staying sane. For me, this has been done through Obsidian.

Obsidian is a free markdown (.md) based note taking and organization tool, with strengths being:

  • Connected and relational note organization
  • Rendering code and mathematical (LaTeX) syntax
  • Community-developed plugins
  • Light-weight
  • A nice aesthetic

The fact that all notes are simply .md files is very appealing to me. I don’t need to waste time searching through different local and cloud directories for my notes, and can avoid having OneNote, Notepad, and 6 Word windows open at the same time.

Also, I like having localized storage of the notes and documentation that I am accumulating. I store my Obsidian directory on GitHub, and push-pull copies between my laptop and desktop, giving me the security of having three copies of my precious notes at all times.

I’ve been using Obsidian as my primary notetaking app for a few months now, and it has become central to my workflow. In this post, I show how Obsidian extends the utility of Markdown, helps to organize topical .md libraries, and generally makes the documentation process more enjoyable. I also try to explain why I believe Obsidian is so effective for researchers or code developers.

Note Taking in Markdown

Markdown (.md) is an efficient and effective format not just for documenting code (e.g., README.md files), but also for writing tutorials (e.g., in Jupyter Notebooks), taking notes, and even designing a website (as demonstrated in Andrew’s Blog Post last week).

In my opinion, the beauty is that only a few syntax tricks are necessary for producing a very clean .md document. This is particularly relevant as a programmer, when notes often require mathematical and code syntax.

I am not going to delve into Markdown syntax. For those who have not written in Markdown, I suggest you see the Markdown Documentation here. Instead, I focus on how Obsidian enables a pleasant environment in which to write .md documentation.

Obsidian Key Features

Below, I hit on what I view as the key strengths of Obsidian:

  • Connected and relational note organization
  • Rendering code and mathematical (LaTeX) syntax
  • Community-developed plugins

Relational Note Organization

If you take a quick look at any of the Obsidian forums, you will come across this word: Zettelkasten.

Zettelkasten, meaning "slip-box" in German and also referred to as card file, is a note taking strategy which encourages the use of many highly-specific notes which are connected to one another via references.

Obsidian is designed to facilitate this style of note taking, with the aim of building what its developers refer to as a "second brain". The goal of this relational note taking is to store not just information about a single topic, but rather a network of connected information.

To this end, Obsidian helps to easily navigate these connections and visualize relationships stored within a Vault (which is just a fancy word for one large folder directory). Below is a screen cap of my note network developed over the last ~4 months. On the left is a visualization of my entire note directory, with a zoomed-in view on the right.

Notice the scenario discovery node, which has direct connections to methodological notes on Logistic Regression, Boosted Trees, PRIM, and literature review notes on Bryant & Lempert 2010, a paper which was influential in advocating for participatory, computer-assisted scenario discovery.

Each of these nodes is then connected to the broader network of notes and documentation.

These relations are easily constructed by linking other pages within the directory, or subheadings within those pages. When writing the notes, you can then click through the links to flip between files in the directory.

Links to Internal Pages and Subheadings

Referencing other pages (.md files) in your library is done by wrapping the page name in double square brackets: [[Markdown cheatsheet]]

You can get down to a finer resolution and specifically reference sub-headings within a page by adding a hashtag to the internal link, such as: [[Markdown cheatsheet#Basics#Bold]]

Tags and Tag Searches

Another tool that helps facilitate relational note taking is the use of #tags. Attach a # to any word within any of your notes, and that word becomes connected to other instances of the word throughout the directory.

Once tags have been created, they can be searched. The following Obsidian syntax generates a live list of occurrences of that tag across your entire vault:

```query
tag: #scenarios
```

Which produces the window:

Rendering Code and Math Syntax

Language-Specific Code Snippets

Obsidian will beautifully stylize code snippets using language-specific formatting, and if you don’t agree with the default look, you can change your style settings.

A block of code for a specific language is specified using the triple-backtick syntax, as such:

```language
<Enter code here>
```

The classic HelloWorld() function can be stylistically rendered in Python or C++:
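For reference, here is what the Python version’s source looks like in a fenced block (my own minimal example); Obsidian renders it with Python syntax highlighting:

```python
def hello_world():
    # Obsidian applies Python-specific highlighting to this block
    print("Hello, world!")

hello_world()
```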

LaTeX

As per usual, the $ characters are used to render LaTeX equations. Single-$ characters will result in in-line equations ($<Enter LaTeX>$), while double-$$ characters are used for centered equations.

Obsidian is not limited to short LaTeX equations, and has plugins designed to allow for the inclusion of other LaTeX packages or HTML syntax.

```latex
$$ \phi_{t,i} = \left\{ \begin{array}{ll} \phi_{t-1,i} + 1 & \text{if } z_t < \text{limit}\\ 0 & \text{otherwise.} \end{array} \right. $$
```

will produce:

Plugins

Obsidian boasts an impressive 667 community-developed plugins with a wide variety of functionality. A glimpse at the webpage shows plugins that give more control over the visual interface, allow for alternative LaTeX environments, or allow for pages to be exported to various file formats.

Realistically, I have not spent a lot of time working with the plugins. But, if you are someone who likes the idea of a continuously evolving and modifiable documentation environment then you may want to check them out in greater depth.

Conclusion: Why did I write this?

This is not a sponsored post in any way; I just like the app.

When diving into a new project or starting a research program, it is critical to find a way of taking notes, documenting resources, and storing ideas that works for you. Obsidian is one tool which has proven to be effective for me. It may help you.

Best of luck with your learning.

Enhancing Jupyter Notebooks for Teaching

In our research group, we often use Jupyter Notebooks as teaching tools in the classroom and to train new students. Jupyter Notebooks are useful because they integrate code and its output into a single document that also supports visualizations, narrative text, mathematical equations, and other forms of media, which often helps students better contextualize what they are learning.

There are some aspects of Jupyter Notebooks that students sometimes struggle with as they shift from IDEs to this new environment. Many have noted difficulty tracking the value of variables through the notebook and getting stuck in a mindset of just pressing “Shift+Enter”. Since the advent of Project Jupyter, the community has put out a variety of “unofficial” extensions and widgets that I have found can enhance the learning experience and make Jupyter Notebooks less frustrating. It’s just not apparent that these extensions exist until you go searching for them. I will demonstrate some below that have been particularly helpful when working with our undergrads, and hopefully they will be helpful for you too!

Variable Inspector

One very easy way to remedy the lack of a variable display is to enable the Variable Inspector extension, which allows you to see the values of all of your variables in an external floating window. In your command prompt, type:

#Installation
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

#Enable extension
jupyter nbextension enable varInspector/main

Here we are installing a large collection of extensions called Nbextensions, several of which we will enable in this post. When you open your notebooks, you will now see a small icon at the top that you can click to create a variable inspector window that will reside in the margins of the notebook.

Variable Inspector Icon

As you step through the notebook, your variable inspector will become populated.

Variable Inspector Floating Window

Snippets of Code

The Snippets menu is pretty fantastic for students who are new to coding, or for those of us who go back and forth between languages and keep having to look up how to make arrays or grab code for example plots.

We can just enable this extension as follows right in Jupyter Notebook under the Nbextensions tab:

Enabling the Snippets Menu

This gif provides an example of some of the options that are available for NumPy, Matplotlib, and Pandas.

Jupyter Snippets Menu

If for instance, you want to create a histogram but can’t remember the form of the code, a snippet for this exists under the Matplotlib submenu.

Making a histogram with Matplotlib Snippets

Rubberband (Highlight Multiple Cells)

If you enable the Rubberband extension in the Nbextensions tab, you can highlight and execute multiple cells, which just gives a bit more precision in selecting a subset of cells that you want to run at once. You hold down “Ctrl+Shift” while dragging the mouse. Then you can execute the highlighted cells using “Shift+Enter”.

Execution Time

You can wrap your code in a variety of timing functions, or just enable the Execution Time extension in the Nbextensions tab. For every cell that you execute, the elapsed time will now be recorded under the cell. For me, this is particularly useful for cells that take a long time to execute. If you are creating training material, it’s helpful for a user to get a breakdown of exactly how long the code should take to run (so that they know whether the timing is normal or there is an issue with the code).

Elapsed time is shown below each cell
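As an aside (not part of the Nbextensions setup), IPython’s built-in %%time cell magic gives a similar per-cell readout without installing anything:

```python
%%time
# %%time is a built-in IPython cell magic: it prints CPU and wall time for the whole cell
total = sum(i**2 for i in range(1_000_000))
```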

Creating a Jupyter Notebook Slideshow with RISE (a reveal.js extension)

The last teaching tool that I want to demonstrate is how to turn your Jupyter Notebook into an interactive slideshow, which has been infinitely more useful as a teaching tool than just scrolling through the Jupyter Notebook at a meeting.

First, installation is easy:

pip install RISE

Now, we have to prepare our notebook for a slideshow. Go to View -> Cell Toolbar -> Slideshow. We need to select the Slide Type for each of our cells.

Selecting the slide type for the slideshow

At the top of the screen, we now have a new icon that allows you to enter into slideshow mode when you are ready. We can now step through the slideshow that we have made and interactively execute the coding boxes as well.

I hope these tips help you to have a more pleasant experience while using Jupyter Notebooks. If there are other extensions that you enjoy using regularly, please post below!

Efficient hydroclimatic data accessing with HyRiver for Python

This tutorial highlights the HyRiver software stack for Python, which is a very powerful tool for acquiring large sets of data from various web services.

I have uploaded a Jupyter-Notebook version of this post here if you would like to execute it yourself.

HyRiver Introduction

The HyRiver software suite was developed by Taher Chegini who, in their own words, describes HyRiver as:

“… a software stack consisting of seven Python libraries that are designed to aid in hydroclimate analysis through web services.”

This description does not do justice to the capability of this software. Through my research I have spent significant amounts of time wrangling various datasets – making sure that dates align, or accounting for spatial misalignment of available data. The HyRiver suite streamlines this process, and makes acquisition of different data from various sources much more efficient.

Here, I am going to walk through a demonstration of how to easily access large amounts of data (streamflow, geophysical, and meteorological) for a basin of interest.

Before going through the code, I will highlight the three libraries from the HyRiver stack which I have found most useful: PyGeoHydro, PyNHD, and PyDaymet.

PyGeoHydro

PyGeoHydro allows for interaction with eight different online datasets.

In this tutorial, I will only be interacting with the USGS NWIS, which provides daily streamflow data.

PyNHD

The PyNHD library is designed to interact with the National Hydrography Dataset (NHD) and the Hydro Network-Linked Data Index (NLDI).

NHDPlus (National Hydrography Dataset)

The NHD defines a high-resolution network of stream linkages, each with a unique identifier (ComID).

NLDI (Network-Linked Data Index)

The NLDI aids in the discovery of indexed information along some NHD-specified geometry (ComIDs). The NLDI essentially traverses the linkages specified by the NHD geometry and returns either local or basin-aggregated data relative to a specific linkage (ComID).

As will be seen later in the tutorial, the NLDI is able to retrieve at least 126 different types of data for a given basin…

PyDaymet

The PyDaymet GitHub repository summarizes the package as:

“[providing] access to climate data from Daymet V4 database using NetCDF Subset Service (NCSS). Both single pixel (using get_bycoords function) and gridded data (using get_bygeom) are supported which are returned as pandas.DataFrame and xarray.Dataset, respectively.”

Tutorial outline:

  1. Installation
  2. Retrieving USGS Water Data
  3. Retrieving Geophysical (NLDI) Data
  4. Retrieving Daymet Data

The HyRiver repository contains various examples demonstrating the use of the different libraries. I would definitely recommend digging deeper into these, and other HyRiver documentation, if this post piques your interest.

Step 0: Installation

In this tutorial, I only interact with the PyNHD, PyGeoHydro, and PyDaymet libraries, so I do not need to install all of the HyRiver suite.

If you operate through pip, you can install these libraries using:

pip install pynhd pygeohydro pydaymet

If you use Anaconda package manager, you can install these packages using:

conda install -c conda-forge pynhd pygeohydro pydaymet

For more information on installation, refer to the HyRiver GitHub repository and related documentation.

Now, onto the fun part!

Step 1: Retrieving USGS Water Data

I am beginning here because streamflow data is typically the first point of interest for most hydrologic engineers or modelers.

Personally, I have gone through the process of trying to download data manually from the USGS NWIS website… My appreciation for the USGS prevents me from saying anything too negative, but let’s just say it was not a pleasant experience.

Pygeohydro allows for direct requests from the USGS National Water Information System (NWIS), which provides daily streamflow data from all USGS gages. The data is conveniently output as a Pandas DataFrame.

With this functionality alone, the PyGeoHydro library is worth learning.

1.1 Initialize PyGeoHydro NWIS tool

# Import common libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the PyGeohydro libaray tools
import pygeohydro as gh
from pygeohydro import NWIS, plot

# Use the national water info system (NWIS)
nwis = NWIS()

1.2 Requesting USGS Streamflow Data

The get_streamflow() function does exactly what the name entails and will retrieve daily streamflow timeseries; however, USGS gage station IDs must be provided. If you are only interested in a single location, you can enter the 8-digit gage ID number along with a specified date range to generate the data:

get_streamflow('########', dates = ('Y-M-D', 'Y-M-D'))

However, I want to explore larger sets of data over an entire region. Thus, I am going to use PyGeoHydro's get_info() function to identify all gages within some region of interest.

First, I specify a region via (latitude, longitude) bounds, then I send a query which retrieves meta-data information on all the gages in the specified region. In this case, I am exploring the data available near Ithaca, NY.

# Query specifications
region = (-76.7, 42.3, -76, 42.6) # Ithaca, NY

# Send a query for all gage info in the region
query = {"bBox": ",".join(f"{b:.06f}" for b in region),
         "hasDataTypeCd": "dv",
         "outputDataTypeCd": "dv"}

info_box = nwis.get_info(query)

print(f'PyGeoHydro found {len(set(info_box.site_no))} unique gages in this region.')

# [Out]: PyGeoHydro found #N unique gages in this region.

However, this info_box identifies many gages in the region which have very old or very brief data records. Knowing this, I want to filter out gages which do not have a suitable record length.

For the sake of this tutorial, I am considering data between January 1st, 2020 and December 31st, 2020.

# Specify date range of interest
dates = ("2020-01-01", "2020-12-31") 

# Filter stations to have only those with proper dates
stations = info_box[(info_box.begin_date <= dates[0]) & (info_box.end_date >= dates[1])].site_no.tolist()

# Remove duplicates by converting to a set
stations = set(stations)

Now, I am ready to use the gage IDs contained in stations to request the streamflow data!

# Retrieve the flow data
flow_data = nwis.get_streamflow(stations, dates, mmd=False)

# Remove gages with nans
flow_data = flow_data.dropna(axis = 1, how = 'any')

After removing duplicates and gages with nans, I have data from five unique gages in this region.

Additionally, PyGeoHydro has a convenient plotting feature to help quickly visualize the streamflow data.

from pygeohydro import plot

# Plot flow data summary
plot.signatures(flow_data)
Summary of flow data for the 5 gages found with PyGeoHydro.

There is a lot more to be explored in the PyGeoHydro library, but I will leave that up to the curious reader.

Step 2: Retrieving Geophysical (NLDI) Data

So, you’ve got some streamflow data but you don’t know anything about the physical watershed…

This is where the PyNHD library comes in. Using this library, I can identify the entire upstream network from a gage, then extract the NLDI data associated with the watershed linkages.

# Import the PyNHD library
import pynhd as pynhd
from pynhd import NHD
from pynhd import NLDI, WaterData

First, we can take a look at all possible local basin characteristic data that are available:

# Get list of local data types (AKA characteristics, or attributes)
possible_attributes = NLDI().get_validchars("local").index.to_list()

There are 126 characteristics available from the NLDI! These characteristics range from elevation, to reservoir capacity, to bedrock depth. Many of these are not of immediate interest to me, so I will specify a subset of select_attributes to retrieve (basin area, max elevation, and stream slope).

I then loop through all of my USGS stations for which I have data in flow_data, identifying the upstream basin linkages using NLDI().navigate_byid(). Once the basin is identified, I extract the ComID numbers for each linkage and use that number to retrieve the NLDI data of interest. I then store the data in nldi_data. This process is done by the following:

# Specify characteristics of interest
select_attributes = ['CAT_BASIN_AREA', 'CAT_ELEV_MAX', 'CAT_STREAM_SLOPE']

# Initialize a storage matrix
nldi_data = np.zeros((len(flow_data.columns), len(select_attributes)))

# Loop through all gages, and request NLDI data near each gage
for i, st in enumerate(flow_data.columns):

    # Navigate up all flowlines from gage
    flowlines = NLDI().navigate_byid(fsource = 'nwissite',
                                    fid = f'{st}',
                                    navigation="upstreamTributaries",
                                    source = 'flowlines',
                                    distance = 10)

    # Get the nearest comid
    station_comid = flowlines.nhdplus_comid.to_list()[0]

    # Source NLDI local data
    nldi_data[i,:] = NLDI().getcharacteristic_byid(station_comid, "local", char_ids = select_attributes)

So far, I have timeseries streamflow data for five locations in the Ithaca, NY area, along with the basin area, max basin elevation, and stream slope for each stream. If I can access hydro-climate data, maybe I could begin studying the relationships between streamflow and physical basin features after some rain event.

Step 3: Meteorological data

The PyDaymet library allows for direct requests of meteorological data across an entire basin.

The available data includes:

  • Minimum and maximum temperature (tmin, tmax)
  • Precipitation (prcp)
  • Vapor pressure (vp)
  • Snow-Water Equivalent (swe)
  • Shortwave radiation (srad)

All data are reported daily at a 1km x 1km resolution. Additionally, the PyDaymet library has the ability to estimate potential evapotranspiration, using various approximation methods.

Here, I choose to only request precipitation (prcp) and max temperature (tmax).

NOTE:
So far, the Daymet data retrieval process has been the slowest aspect of my HyRiver workflow. Due to the high resolution and the potential for large basins, this may be computationally intensive if you try to request data for many gages with long time ranges.

# Import the PyDaymet library
import pydaymet as daymet

## Specify which data to request
met_vars = ["prcp", "tmax"]
met_data_names = np.array(['mean_prcp','sd_prcp','mean_tmax','sd_tmax'])

## Initialize storage
daymet_data = np.zeros((len(flow_data.columns), len(met_data_names)))

Similar to the NLDI() process, I loop through each gage (flow_data.columns) and (1) identify the up-gage basin, (2) source the Daymet data within the basin, (3) aggregate and store the data in daymet_data.

## Loop through stations from above
for i, st in enumerate(flow_data.columns):

    # Get the up-station basin geometry
    geometry = NLDI().get_basins(st).geometry[0]

    # Source Daymet data within basin
    basin_met_data = daymet.get_bygeom(geometry, dates, variables= met_vars)

    ## Pull values, aggregate, and store
    # Mean and std dev precipitation
    daymet_data[i, 0] = np.nan_to_num(basin_met_data.prcp.values).mean()
    daymet_data[i, 1] = np.nan_to_num(basin_met_data.prcp.values).std()

    # Mean and std dev of max temperature
    daymet_data[i, 2] = np.nan_to_num(basin_met_data.tmax.values).mean()
    daymet_data[i, 3] = np.nan_to_num(basin_met_data.tmax.values).std()

daymet_data.shape

# [Out]: (5, 4)

Without having used a web browser, I have been able to access a set of physical basin characteristics, various climate data, and observed streamflow relevant to my region of interest!

Now this data can be exported to a CSV and used in any other project.
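As a minimal sketch (assuming the arrays from the previous steps are still in memory; the filenames are my own choice):

```python
# Combine basin characteristics and Daymet summaries into one table, indexed by gage ID
basin_summary = pd.DataFrame(
    np.hstack([nldi_data, daymet_data]),
    index=flow_data.columns,
    columns=select_attributes + list(met_data_names),
)

# Export the summary table and the daily streamflow records to CSV
basin_summary.to_csv("basin_summary.csv")
flow_data.to_csv("streamflow_2020.csv")
```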

Conclusion

I hope this introduction to HyRiver has encouraged you to go bigger with your hydroclimate data ambitions.

If you are curious to learn more, I’d recommend you see the HyRiver Examples which have various in-depth Jupyter Notebook tutorials.

Citations

Chegini, Taher, et al. “HyRiver: Hydroclimate Data Retriever.” Journal of Open Source Software, vol. 6, no. 66, 27 Oct. 2021, p. 3175, 10.21105/joss.03175. Accessed 15 June 2022.

Viewing your Scientific Landscape with VOSviewer

In this post, we’re taking a look at a cool tool for visualizing bibliometric networks called VOSviewer. In our group, we have been thinking more about our scientific landscape (i.e. what fields are we reaching with our work and who are we collaborating with most). VOSviewer provides a great way to interactively engage with this network in maps like this:

https://app.vosviewer.com/?json=https://drive.google.com/uc?id=19qfFyRLmgG9Zh7S9oss2Xo8vcqj-GjG7

(The web app cannot always accommodate high traffic, so if the link appears broken, please check back in a couple hours!)

In this blog post, I’m going to highlight this tool by looking at the reach of the Borg Multi-Objective Evolutionary Algorithm (Borg MOEA) developed by Dave Hadka and Patrick Reed and first published in 2013. I also really like that these conceptual networks can be embedded into a website. Slight caveat: if you use WordPress, you must have a Business account with plugins enabled to embed these maps directly into your website. You’ll see that in order to view the interactive maps, I’ll have links to VosViewer’s online application, but unfortunately you will not be able to interact with the maps directly in this post.

Step 1: Download bibliographic data

The first step to creating these maps is to get our bibliographic network information. I’m using the Dimensions database which has a free web app here. It should be noted that Vosviewer also supports exports from Scopus, Web of Science, Lens, and PubMed. Once I get to the Dimensions web app, I type in “Borg MOEA” to find the original paper and download the citations associated with that paper.

You can choose the Borg paper as shown above and then scroll down and click on “show all” next to “Publication Citations”.

Then we export this by clicking the save/export button at the top of the page.

Make sure to choose “bibliographic mapping” so that the data are compatible with VosViewer. Dimensions will send an email with the link to download a zipped file which contains a .csv file with all the meta data needed to create the network.

Step 2: Create the map in VOSviewer

The user interface for the desktop version of VosViewer is fairly intuitive but also quite powerful. First we choose “Create” on the left  and then specify that we want to create a map based on bibliographic data.

Next, we specify that we want to read our data from a database file. Note that VOSviewer also supports RIS files and has an API that you can query directly. On the right, we choose the Dimensions tab and upload the metadata .csv file.

Now comes the fun part! We can create maps based on various components of the metadata that we downloaded. For example, let’s do the co-authorship analysis. This will show the strength of co-authorship among all the authors that have cited the Borg paper.

Click “Next>” and then fill out some narrowing thresholds. I decrease the number of documents that the author must have from 5 to 1 and then choose to show only connected authors to create a clean map. The result is this:

Here the colors represent the different clusters of authors and the size represents the number of documents that have been published by that author that have cited Borg. If we switch to the “Overlay Visualization” tab, we get more information on the years that the authors have published most together.

Step 3: Clean-Up and Customization

One thing you might have noticed is that there are duplicates of some names. For example, Patrick M. Reed and Patrick Reed. We want to merge these publications so that we only have one bubble to represent Pat. For this, we need to create a thesaurus file. In the folder that contains the VOSviewer application, navigate to the “data” folder and look for “thesaurus_authors.txt”. Here we can specify which names to merge.

Now, I can recreate my map, but this time, I supply a thesaurus file when creating the map.

Below, we have our final merged map.

We can create another map that shows the journals the citing publications come from. This time, we create a map and choose “Citation” and “Sources”.

So we are seeing strong usage of Borg in journals pertaining to water resources, evolutionary computation, and some miscellaneous IEEE journals.

Note that you can also create maps associated with countries from where the authors’ institutions are located.

We see that the US is the strongest hub, followed by China, Italy, and the UK. We can finally show the institutions that the authors are affiliated with. Here we see lots of publications from Cornell, Penn State, and Politecnico di Milano.

Step 4: Creating a shared link or embedding in a website

Screenshots are not an ideal way to view these maps, so let’s see how we can share them as interactive applications that others can use. We can go to the right-hand side of the application and choose “share” to generate a Google Drive link. After going through authentication, you now have a shareable link as follows:

Institution: https://app.vosviewer.com/?json=https://drive.google.com/uc?id=18q9f8LvLNbATl5rmP28te5MB0R8bn3xx

Co-Author: https://app.vosviewer.com/?json=https://drive.google.com/uc?id=19qfFyRLmgG9Zh7S9oss2Xo8vcqj-GjG7

Country: https://app.vosviewer.com/?json=https://drive.google.com/uc?id=1pCgv2JVpqXBxEjDIIckBni9HUihQrwWA

Journal: https://app.vosviewer.com/?json=https://drive.google.com/uc?id=174togoYCbxZMm7fcU459p8mYbG3sAxje

If your website supports it, you can also embed these applications directly into a webpage using the following html code. You would just need to swap out the source location, src, with your own links. When swapping out, be sure to include &simple_ui=true after your link as shown below.

<iframe
  allowfullscreen="true"
  src="https://app.vosviewer.com/?json=https://drive.google.com/uc?id=1mYU4RNuhUQP__sKZWRnHzn_IkzrfvJMV&simple_ui=true"
  width="100%"
  height="75%"
  style="border: 1px solid #ddd; max-width: 1200px; min-height: 500px"
>
</iframe>

This is just scraping the surface of what VOSviewer is capable of, so definitely take time to explore the website, the examples, and the manual. If you make any cool maps, please link them in the comments below!