Ensemble forecasting – theory

‘Does the flap of a butterfly’s wing in Brazil stir up a tornado in Texas?’ These words reflect the so-called ‘butterfly effect’ typically ascribed to Edward Lorenz (MIT), who was a pioneer in the field of numerical weather prediction (NWP). (Interestingly, Lorenz himself first referred to the ‘flap of a seagull’s wings’; the butterfly swap-out was the work of a creative conference session convener. Somewhat coincidentally, the shape of the ‘Lorenz attractor’ is also reminiscent of a butterfly; see Figure 1.) Motivated by the very salient challenge in the post-WWII era to improve weather prediction capabilities for strategic purposes, he developed a theoretical framework for the predictability of non-linear dynamical systems in a seminal paper, ‘Deterministic nonperiodic flow’ (Lorenz, 1963), that would come to be known as ‘chaos theory’ or ‘chaotic dynamics’. This theory was foundational to the development of ensemble forecasting (and a whole host of other things), and still underpins our current theoretical understanding of the inherent predictability of complex dynamical systems like the atmosphere (Lorenz, 2006).


Figure 1. Book cover (left panel) and the characteristic ‘butterfly wing’ shape of the Lorenz attractor (right panel) from his seminal 1963 work. (Palmer, 1993)

In actuality, the whole ‘flap of a butterfly’s wing creating a tornado in Texas’ relates specifically to only one aspect of the theoretical framework developed by Lorenz. In this post, I will attempt to give a slightly fuller treatment of this theoretical framework and its development. My hope is that this post will be a theoretical primer for two follow-on practically oriented posts: 1) a discussion of the state-of-the-science of hydrometeorological ensemble forecasting and its application in emerging water resources management practices like forecast informed reservoir operations, and 2) a deeper dive into ensemble verification techniques. While the techniques for (2) are grounded in the theoretical statistical properties of stochastic-dynamic predictive ensembles, they have broad applications to any scenario where one needs to diagnose the performance of an ensemble.

The Lorenz model and ‘chaos’ theory

When most people hear the word ‘chaos’, they tend to think of the dictionary definition: ‘complete disorder and confusion’ (Oxford), which in a scientific context might aptly describe some sort of white noise process. As you will see, this is somewhat of a far cry from the ‘chaos’ described in Lorenz’s theory. The ‘chaos’ terminology was actually coined by a later paper building on Lorenz’s work (Li & Yorke, 1975) and as noted by Wilks (2019): ‘…this label is somewhat unfortunate in that it is not really descriptive of the sensitive-dependence phenomenon’. The sensitive-dependence phenomenon is one of a set of 4 key properties (to be discussed later) that characterize the behaviors of the sort of non-linear, non-periodic deterministic systems that Lorenz argued were most representative of the atmosphere. In contrast to ‘disorder’ and ‘confusion’, these properties, in fact, lend some sense of order to the evolution of these dynamical systems, although the nature of that order is highly dependent on the system state.

A deterministic, non-periodic systems model

First, let’s dive into a few details of Lorenz’s 1963 work, using some figures and descriptions from a later paper (Palmer, 1993) that are easier to follow. While the 1963 paper is quite dense, much of the mathematical logic is dedicated to justifying the use of a particular system of equations that forms the basis of the study. This system of 3 equations and 3 variables (X, Y, Z) describes a non-linear, dissipative model of convection in a fluid of uniform depth, where there is a temperature difference between the upper and lower surfaces. Lorenz derived the set of 3 equations shown in the upper left panel of Figure 2 from earlier work by Rayleigh (1916) on this particular problem. In short, X relates to the intensity of convective motion, Y relates to the temperature difference between ascending and descending currents, and Z relates to the distortion of the vertical temperature profile from linearity; the details of these variables are not actually all that important to the study. What is important is that this system of equations has no general solutions (aside from the steady state solution) and must be numerically integrated in the time dimension to determine the convective flow dynamics. The ‘trajectories’ of these integrations shown in the phase space diagrams in Figure 2 exhibit the sorts of unstable, non-periodic behaviors that Lorenz thought were the most useful analog to atmospheric dynamics, albeit in a much simpler system. (‘Much’ is an understatement here; modern NWP models have a phase space on the order of 10⁹ dimensions, in contrast to the 3-dimensional phase space of this problem; Wilks, 2019.) Of note, the use of phase-space diagrams (i.e. plots where each axis corresponds to a dependent variable in the system of equations) preceded Lorenz, but his use of them is perhaps one of the best-known early instances of this kind of visualization. Other uses of phase space relationships can be found in Rohini’s post or Dave’s post.
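If you want to build some intuition for this behavior yourself, the Lorenz system is straightforward to integrate numerically. Below is a minimal sketch (my own, not code from the original paper) that integrates the system with SciPy from two nearly identical initial conditions, using Lorenz’s commonly cited parameter values; plotting the X timeseries of the two runs reproduces the qualitative divergence seen in Figure 2b-c.

# Minimal sketch: integrate the Lorenz-63 system from two nearly identical
# initial conditions (assumed standard parameters sigma=10, rho=28, beta=8/3).
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t_span = (0.0, 40.0)
t_eval = np.linspace(*t_span, 4000)
ic_1 = [1.0, 1.0, 1.0]
ic_2 = [1.0 + 1e-6, 1.0, 1.0]   # tiny perturbation to the initial X value

sol_1 = solve_ivp(lorenz, t_span, ic_1, t_eval=t_eval)
sol_2 = solve_ivp(lorenz, t_span, ic_2, t_eval=t_eval)

# Absolute difference in the X variable: tiny at first, then order-10 once
# the two trajectories end up in different regimes.
divergence = np.abs(sol_1.y[0] - sol_2.y[0])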

Figure 2. a) Lorenz equations and the XY phase space diagram of their integrations. b-c) X variable timeseries of two separate integrations of the Lorenz equations from very similar initial conditions. (Palmer, 1993)

Regime structure and multiple distinct timescales

What ‘behaviors’ then, are we talking about? Referring again to Figure 2a, we see the projection of a long series of model integration trajectories onto the XY-plane of the phase space diagram, where each dot is a timestep in the integration. What is immediately apparent in the form of these integrations is the two-lobed ‘butterfly’ shape of the diagram. Each lobe has a central attractor where the trajectories often reside for multiple revolutions, but can then transition to the other attractor when passing near the center of the diagram. These so-called ‘Lorenz attractors’ comprise one of the key properties of chaotic dynamics, namely regime structure, which is the tendency of the trajectories to reside around phase space attractors for some period of time. This residence time in a regime is generally quite a bit longer than the timescales of the trajectories’ individual revolutions around an attractor. This attribute is referred to as multiple distinct timescales and is evidenced in Figure 2b-c, where smaller-amplitude sections of the timeseries show residence in one or the other regime and large-amplitude sections describe transitions between regimes. Often, but not always, the residence in these regimes occurs for multiple revolutions, suggesting that there are shorter-timescale evolutions of the system that take place within these regimes, while infrequent, random shifts to the other regimes occur at longer timescales.

Sensitive-dependence and state-dependent variation in predictability

Figure 3. a-c) Different trajectories through the XY phase space dependent on the initial condition state-space (black circle). (Palmer, 1993)

Returning now to the ‘butterfly effect’; what, then, is this whole sensitive-dependence thing mentioned earlier? Figure 3a-c provides a nice graphical representation of this phenomenon. In each panel, a set of nearby initial states is chosen at a different location in the phase space diagram and then followed through its set of trajectories. In 3a, these trajectories neatly transition from one regime to the other, remaining relatively close to each other at the trajectories’ end. In 3b, a set of initial states not so far from 3a is chosen, and instead of neatly transitioning to the other regime, the trajectories diverge strongly near the center of the diagram, with some remaining in the original regime and some transitioning. However, for about half of the timespan, the trajectories remain very close to one another. Finally, an initial state chosen near the center of the diagram (3c) diverges quickly into both regimes, with trajectories ending up at nearly opposite ends of the phase space (black tails at upper right and left of diagram). Figures 3b-c, in particular, showcase the sensitive-dependence-on-initial-conditions attribute of the system. In other words, from a set of very close-by initial states, the final trajectories in phase space can yield strongly divergent results. Importantly, this divergence in trajectories can occur right away (3c), at some intermediate time (3b), or not at all (3a).

This is the basic idea behind the last core property of these systems, state-dependent variation in predictability. Clearly, a forecast initialized from the starting point in 3a could be a little bit uncertain about the exact starting state and still end up in about the right spot for an accurate future prediction at the end of the forecast horizon. At medium ranges, this is also the case for 3b, but the longer range forecast is highly uncertain. For 3c, all forecast ranges are highly uncertain; in other words, the flap of a butterfly’s wing can mean the difference between one trajectory and the other! Importantly, one can imagine in this representation that an average value of 3c’s trajectories (i.e. the ensemble mean) would fall somewhere in the middle of the phase space and be representative of neither of the two physically plausible sets of trajectories into the right or left regimes. This is an important point that we’ll return to at the end of this post.

From the Lorenz model to ensemble forecasting

The previous sections have highlighted this idealized dynamical system (Lorenz model) that theoretically has properties of the actual atmosphere. Those 4 properties (the big 4!) were: 1) sensitive-dependence on initial conditions, 2) regime structure, 3) multiple distinct timescales, and 4) state-dependent variation in predictability. In this last section, I will tie these concepts into a theoretical discussion of ensemble forecasting. Notably, in the previous sections, I discussed ‘timescales’ without any reference to the real atmosphere. The dynamics of the Lorenz system prove to be quite illuminating about the actual atmosphere when timescales are attached to the system that are roughly on the order of the evolution of synoptic scale weather systems. If one assumes that a lap around the Lorenz attractor equates to a synoptic timescale of ~5 days, then the Lorenz model can be conceptualized in terms of relatively homogeneous synoptic weather progressions occurring within two distinct weather regimes that typically persist on the order of weeks to months (see Figure 3b-c). This theoretical foundation jibes nicely with the empirically derived weather-regime-based approach (see Rohini’s post or Nasser’s post) where, incidentally, 4 or more weather regimes are commonly found (the real atmosphere is, after all, quite a bit more complex than the Lorenz model). This discussion of the Lorenz model has hopefully given you some intuition that conceptualizing real world weather progressions outside the context of an appropriate regime structure could lead to some very unphysical representations of the climate system.

Ensemble forecasting as a discrete approximation of forecast uncertainty

In terms of weather prediction, though, the big 4 make things really tough. While there are certainly conditions in the atmosphere that can lead to excellent long range predictability (i.e. ‘forecasts of opportunity’, see Figure 3a), the ‘typical’ dynamics of the atmosphere yield the potential for multiple regimes and associated transitions within the short-to-medium range timeframe (1-15 days) where synoptic-scale hydro-meteorological forecasting is theoretically possible. (Note, by ‘synoptic-scale’ here, I am referring to the capability to predict actual weather events, not prevailing conditions. Current science puts the theoretical dynamical limit to predictability at ~2 weeks, with current NWP technology achieving ‘usable’ skill out to ~7 days, give or take a few days depending on the region.)

Early efforts to bring the Lorenz model into weather prediction sought to develop analytical methods to propagate initial condition uncertainty into a useful probabilistic approximation of the forecast uncertainty at various lead times. This can work for simplified representations of the atmosphere like the Lorenz model, but it quickly becomes intractable as the complexity and scale of the governing equations and boundary conditions increase.

Figure 4. Conceptual depiction of ensemble forecasting including sampling of initial condition uncertainty, forecasts at different lead times, regime structure, and ensemble mean comparison to individual ensemble members. Solid arrow represents a ‘control’ member of the ensemble. Histograms represent an approximate empirical distribution of the ensemble forecast. (Wilks, 2019)

Thankfully, rapid advances in computing power led to an alternate approach, ensemble forecasting! Ensemble forecasting is a stochastic-dynamic approach that couples a discrete, probabilistic sampling of the initial condition uncertainty (Figure 4 ‘Initial time’) and propagates each of those initial condition state vectors through a dynamical NWP model. Each of these NWP integrations is a unique trajectory through the state space of the dynamical model that yields a discrete approximation of the forecast uncertainty (Figure 4 ‘Intermediate/Final forecast lead time’). This discrete forecast uncertainty distribution theoretically encompasses the full space of potential hydro-meteorologic trajectories and allows a probabilistic representation of forecast outcomes through analysis of the empirical forecast ensemble distribution. These forecast trajectories highlight many of the big 4 properties discussed in previous sections, including regime structure and state-dependent predictability (Figure 4 trajectories are analogous to the Figure 3b trajectories for the Lorenz model). The ensemble mean prediction is an accurate and dynamically consistent prediction at the intermediate lead time, but at the final lead time where distinct regime partitioning has occurred, it is no longer dynamically consistent and delineates a region of low probability in the full ensemble distribution. I will explore properties of ensemble averaging, both good and bad, in a future post.
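As a toy illustration of this procedure (my own sketch, using the Lorenz-63 system as a stand-in for an NWP model and an assumed perturbation size for the initial condition uncertainty), one can sample perturbed initial states, integrate each member, and inspect the resulting discrete forecast distribution and its mean at a final lead time:

import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

rng = np.random.default_rng(1)
n_members = 50
analysis = np.array([1.0, 1.0, 1.0])                   # best-guess initial state
perturbations = rng.normal(0.0, 0.01, (n_members, 3))  # sampled IC uncertainty (assumed size)

lead_time = 15.0
members = np.array([
    solve_ivp(lorenz, (0.0, lead_time), analysis + dx, t_eval=[lead_time]).y[:, -1]
    for dx in perturbations
])                                                     # discrete forecast distribution

ensemble_mean = members.mean(axis=0)
# When members split between the two regimes (positive vs. negative X), the
# ensemble mean can land between the lobes and resemble no physically
# realizable trajectory -- the caveat noted for Figure 4.
print(ensemble_mean, members[:, 0].min(), members[:, 0].max())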

Lastly, I will note that the ensemble forecasting approach is a type of Monte Carlo procedure. Like other Monte Carlo approaches with complex systems, the methodology for sampling the initial condition uncertainty has a profound effect on the uncertainty quantification contained within the final NWP ensemble output, especially when considering the high degree of spatiotemporal relationships within the observed climatic variables that form the initial state vector. This is a key area of continued research and improvement in ensemble forecasting models.

Final thoughts

I hope that you find this discussion of the theoretical underpinnings of chaotic dynamics and ensemble forecasting to be useful. I have always found these foundational concepts to be particularly fascinating. Moreover, the basic theory has strong connections outside of ensemble forecasting, including the ties to weather regime theory mentioned in this post. I also think these foundational concepts are necessary to understand how much actual ensemble forecasting techniques can diverge from the theoretical stochastic-dynamic framework. This will be the subject of some future posts that will delve into more practical and applied aspects of ensemble forecasting with a focus on water resources management.

References

Lorenz, E. N. (1963). Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20, 130–141.

Lorenz, E. N. (2006). Reflections on the Conception, Birth, and Childhood of Numerical Weather Prediction. Annual Review of Earth and Planetary Sciences, 34, 37–45. https://doi.org/10.1146/annurev.earth.34.083105.102317

Palmer, T. N. (1993). Extended-range atmospheric prediction and the Lorenz model. Bulletin of the American Meteorological Society, 74(1), 49–65. https://doi.org/10.1175/1520-0477(1993)074<0049:ERAPAT>2.0.CO;2

Rayleigh, L. (1916). On convective currents in a horizontal layer when the higher temperature is on the underside. Phil. Mag., 32, 529-546.

Wilks, D. S., (2019). Statistical Methods in the Atmospheric Sciences, 4th ed. Cambridge, MA: Elsevier.

An introduction to linear programming for reservoir operations (part 2) – Implementation with Pyomo

Introduction

Previously, in Part 1 I used a very simple reservoir operations scenario to demonstrate some linear programming (LP) concepts.

After some feedback on my initial formulation I went back and revised the formulation to make sure that (1) both reservoir releases and storage levels are optimized simultaneously and (2) the LP handles decisions over multiple timesteps (1,…,N) during optimization. Definitely look back at Part 1 for more context.

The current LP formulation is as follows:
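As a rough sketch (reconstructed from the implementation shown later in this post rather than transcribed from the Part 1 figures), the revised formulation looks something like:

$$
\begin{aligned}
\min_{R_t,\,S_t} \quad & -\sum_{t=0}^{N-1} \left( \eta_0 R_t + 100\,\frac{S_t}{S_{max}} \right) \\
\text{s.t.} \quad & S_t = S_{t-1} + I_t - D_t - R_t, \quad t = 0, \dots, N-1 \;\; (S_{-1} \equiv \text{initial storage}) \\
& S_{min} \le S_t \le S_{max}, \qquad R_{min} \le R_t \le R_{max} \\
& S_{N-1} = \text{initial storage}
\end{aligned}
$$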

In this post, I show a simple implementation of this LP using the Pyomo package for solving optimization problems in Python.

I have shared the code used in this demo in a repository here: Reservoir-LP-Demo


Constructing the LP model with Pyomo

While Pyomo can help us construct the LP model, you will need access to a separate solver in order to actually run the optimization. I don’t get into the details here on how to set up these solvers (see their specific installation instructions), but generally you will need the solver to be accessible on your PATH.

Two solvers that I have had good experience with are:

As always, it’s best to consult the Pyomo documentation for any questions you might have. Here, I highlight a few things that are needed for our implementation.

We start by importing the pyomo.environ module, along with NumPy (which is used later to package the optimized results):

import numpy as np
import pyomo.environ as pyo

From this module we will need to use the following classes to help build our model:

  • pyo.ConcreteModel()
  • pyo.RangeSet()
  • pyo.Var()
  • pyo.Objective()
  • pyo.Constraint()
  • pyo.SolverFactory()

The nice thing about using pyomo rather than trying to manage the LP matrices yourself is that you can specify objectives and constraints as functions.

For example, the objective function is defined as:

# Objective Function
def objective_rule(m):
    return -sum((eta_0 * m.R[t]) + (m.S[t]/S_max*100) for t in m.T)

And a constraint used to enforce the lower limit of the storage mass balance can be defined as:

def S_balance_lower(m, t):
    if t == 0:
        return m.S[t] + m.R[t] <= initial_storage + I[t] - D[t]
    return m.S[t] + m.R[t] <= m.S[t-1] + I[t] - D[t]

Rather than picking the full implementation apart, I present the entire function below, and encourage you to compare the code implementation with the problem definition above.

def pyomo_lp_reservoir(N, S_min, S_max, R_min, R_max, 
                       eta_0, I, D,  
                       initial_storage):

    # Model
    model = pyo.ConcreteModel()

    # Time range
    model.T = pyo.RangeSet(0, N-1)  

    # Decision Variables
    model.S = pyo.Var(model.T, bounds=(S_min, S_max))  # Storage
    model.R = pyo.Var(model.T, bounds=(R_min, R_max))  # Release

    # Objective Function
    def objective_rule(m):
        return -sum((eta_0 * m.R[t]) + (m.S[t]/S_max*100) for t in m.T)
    model.objective = pyo.Objective(rule=objective_rule, sense=pyo.minimize)

    # Constraints
    def S_balance_lower(m, t):
        if t == 0:
            return m.S[t] + m.R[t] <= initial_storage + I[t] - D[t]
        return m.S[t] + m.R[t] <= m.S[t-1] + I[t] - D[t]

    def S_balance_upper(m, t):
        if t == 0:
            return -(m.S[t] + m.R[t]) <= -(initial_storage + I[t] - D[t])
        return -(m.S[t] + m.R[t]) <= -(m.S[t-1] + I[t] - D[t])
    model.S_lower = pyo.Constraint(model.T, rule=S_balance_lower)
    model.S_upper = pyo.Constraint(model.T, rule=S_balance_upper)
    model.S_final = pyo.Constraint(expr=model.S[N-1] == initial_storage)

    # Solve
    solver = pyo.SolverFactory('scip')
    results = solver.solve(model)

    if results.solver.status == pyo.SolverStatus.ok:
        S_opt = np.array([pyo.value(model.S[t]) for t in model.T])
        R_opt = np.array([pyo.value(model.R[t]) for t in model.T])
        return S_opt, R_opt
    else:        
        raise ValueError('Not able to solve.')

Note that in this implementation, pyomo will optimize all of the reservoir release and storage decisions simultaneously, returning the vectors of length N which prescribe the next N days of operations.

Usage

Now we are ready to use our LP reservoir simulator. In the code block below, I set some specifications for our operational constraints, generate fake inflow and demand timeseries, run the LP solver, and plot the simulated results:

# specifications
N_t = 30
S_min = 2500
S_max = 5000
R_min = 10
R_max = 1000
eta = 1.2

# Generate simple inflow and demand data
I, D = generate_data(N_t, correlation_factor = -0.70,
                     inflow_mean=500, inflow_std=100,
                     lag_correlation=0.2)


# Run LP operation simulation
S_sim, R_sim = pyomo_lp_reservoir(N_t, S_min, S_max, R_min, R_max, eta, I, D, 
                                  initial_storage=S_max)

# Plot results
plot_simulated_reservoir(I, D,
                         R_sim, S_sim, 
                         S_max, eta=eta)

The simulated operations are shown below:

Under this LP formulation, with a perfect inflow forecast, the reservoir operates as a “run of river” with the release rates being very close to the inflow rate.

In practice, operators may need to limit the day-to-day change in release volume. I added an optional parameter (R_change_limit) which adds a constraint on the difference between releases on consecutive days.
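A sketch of how such a constraint might look in Pyomo is below (it uses the R_change_limit parameter named above; the exact implementation in the linked repository may differ):

# Hypothetical sketch: limit the day-to-day change in release to +/- R_change_limit
def R_change_rule(m, t):
    if t == 0:
        return pyo.Constraint.Skip            # no prior release to compare against
    return (-R_change_limit, m.R[t] - m.R[t-1], R_change_limit)

if R_change_limit is not None:
    model.R_change = pyo.Constraint(model.T, rule=R_change_rule)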

The operations, with the daily release change limited to 50, are shown below.

Conclusions

The way you formulate an LP problem dictates the “optimal” decisions that you will generate. The purpose of these past two posts was to give a practical demo of some basic LP concepts. I hope it might be useful as a starting point for some of you.

An introduction to linear programming for reservoir operations (part 1)

Motivation:

If you were like me, you may have been provided an overview of linear programming (LP) methods but craved a more hands-on demonstration, as opposed to the abstract/generic mathematical formulation that is often presented in lectures. This post is for the hydrology enthusiasts out there who want to improve their linear programming understanding by stepping through a contextual example.

In this post, I provide a very brief intro to linear programming then go through the process of applying a linear programming approach to a simple hydroelectric reservoir problem focused on determining a release rate that will consider both storage targets and energy generation. In a later post, I will come back to show how the formulation generated here can be implemented within simulation code.

Content:

  • An introduction to linear programming
    • Formulating an LP
    • Solving LP problems: Simplex
    • Relevance for water resource managers
  • A toy reservoir demonstration
    • The problem formulation
      • Constraints
      • Objective
  • Conclusions and what’s next

An introduction to linear programming

Linear programming (LP) is a mathematical optimization technique for making decisions under constraints. The aim of an LP problem is to find the best outcome (either maximizing or minimizing) given a set of linear constraints and a linear objective function.

Formulating an LP

The quality of an LP solution is only as good as the formulation used to generate it.

An LP formulation consists of three main components:

1. Decision Variables: These are the variables that you have control over. The goal here is to find the specific decision variable values that produce optimal outcomes. There are two decision variables shown in the figure below, x1 and x2.

2. Constraints: These are the restrictions which limit what decisions are allowed. (For our reservoir problem, we will use constraints to make sure total mass is conserved, storage levels stay within limits, etc.) The constraints functions are required to be linear, and are expressed as inequality relationships.

3. Objective Function: This is the (single) metric used to measure outcome performance. This function is required to be linear and its value is denoted Z in the figure below, where it depends on two decision variables (x1 and x2).

In practice, the list of constraints on the system can get very large. Fortunately, matrix operations can be used to solve these problems quickly, although this requires that the problem is formatted in “standard form”. The standard form arranges the coefficient values for the constraints within the matrix A and the vector b.
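For reference, one common way of writing this standard form, with decision vector x, objective coefficients c, and the constraint data collected in A and b, is:

$$
\min_{x} \; c^{\top} x \quad \text{subject to} \quad A x \le b, \; x \ge 0
$$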

Solving LP problems: keep it Simplex

Often the number of potential solutions that satisfy your constraints will be very large (infinitely large for continuous variables), and you will want to find the one solution in this region that maximizes/minimizes your objective of interest (the “optimal solution”).

The set of all potential solutions is referred to as the “feasible space” (see the graphic below), with the constraint functions forming the boundary edges of this region. Note that by changing the constraints, you are inherently changing the feasible space and thus are changing the set of solutions that you are or are not considering when solving the problem.

The fundamental theorem of linear programming states that the optimal solution for an LP problem will exist at one of the corners (vertices) of the feasible space.

Recognizing this, George Dantzig proposed the Simplex algorithm, which strategically moves between vertices along the boundary of the feasible region, testing for optimality until the best solution is identified.

A detailed review of the Simplex algorithm warrants its own blog post, and won’t be explained here. Fortunately, there are various open LP solvers. For example, one option in Python is scipy.optimize.linprog().
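As a quick illustration of how little code such a solver requires, here is a small LP solved with scipy.optimize.linprog (the numbers are made up purely for demonstration):

# Hypothetical example: minimize -x1 - 2*x2
# subject to x1 + x2 <= 4, x1 + 3*x2 <= 6, x1 >= 0, x2 >= 0
from scipy.optimize import linprog

c = [-1, -2]              # objective coefficients (minimization form)
A_ub = [[1, 1], [1, 3]]   # inequality constraint coefficients
b_ub = [4, 6]             # inequality constraint bounds

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, res.fun)     # optimal decision variables and objective value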

Relevance for water resource managers

Why should you keep reading this post?

If you work in water resource systems, you will likely need to make decisions in highly constrained settings. In these cases, LP methods can be helpful. In fact, there are many scenarios in water resource management in which LP methods can be (and historically have been) useful. Some general application contexts include:

  • Resource allocation problems: LP can be used to allocate water efficiently among different users like agriculture, industry, and municipalities.
  • Operations optimization: LP can guide the operations decisions of water reservoirs, deciding on storage levels, releases, and energy production to maximize objectives like water availability or energy generation from hydropower (the topic of this demonstration below).

Toy Reservoir Demonstration: Problem Formulation

The following demo aims to provide a concrete example of applying linear programming (LP) in the realm of water resource management. We will focus on a very simple reservoir problem, which, despite its simplicity, captures the essence of many real-world challenges.

Imagine you are managing a small hydroelectric reservoir that provides water and energy to a nearby town. Your goal is to operate the reservoir, choosing how much water to release each day, such that (1) the water level stays near a specific target and (2) you maximize energy generation. Additionally, there is a legally mandated minimum flow requirement downstream to sustain local ecosystems.

Here, I will translate this problem context into formal LP notation, then in a later post I will provide Python code to implement the LP decision making process within a simulation model.

Problem formulation

We want to translate our problem context into formal mathematical notation that will allow us to use an LP solver. In order to help us get there, I’ve outlined all of the variables being considered here in the list below.
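The main variables, as used in the formulation below (and consistent with the Part 2 implementation), are roughly:

  • St : reservoir storage at timestep t (decision variable)
  • Rt : reservoir release at timestep t (decision variable)
  • It : inflow to the reservoir during timestep t (known input)
  • Dt : downstream demand during timestep t (known input)
  • Smin, Smax : minimum and maximum allowable storage
  • Rmin, Rmax : minimum and maximum allowable release
  • eta_0 : hydropower generation coefficient
  • N : number of timesteps (days) in the operating horizon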

Decision variables

In this case, the things we are controlling are the reservoir releases (Rt) and the storage levels (St) at each timestep.

Constraints:

Constraints are the backbone of any LP problem, defining the feasible space which ultimately dictates the optimal solution. Without constraints, an LP problem would have infinite solutions, rendering it practically meaningless. Constraints are meant to represent any sort of physical limitations, legal requirements, or specific conditions that must be satisfied by your decision. In the context of water resource management, constraints could signify capacity limits of a reservoir, minimum environmental flow requirements, or regulatory water diversion limits.

  • Mass balance requirement:

Naturally you want to make sure that mass is conserved through the reservoir, by balancing all of the inflows, outflows and storage states, for each timestep (t):
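In equation form (my reconstruction, consistent with the Part 2 implementation), the mass balance is roughly:

$$
S_t = S_{t-1} + I_t - D_t - R_t, \qquad t = 1, \dots, N
$$

with S0 equal to the initial storage.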

However, we need to separate this into two separate inequalities in order to meet the “standard form” for an LP formulation. I am also going to move the decision variables (Rt and St) to the left side of the inequalities.
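That yields the pair of inequalities:

$$
S_t + R_t \le S_{t-1} + I_t - D_t, \qquad -\left( S_t + R_t \right) \le -\left( S_{t-1} + I_t - D_t \right)
$$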

  • Storage limitations:

The most obvious constraints are that we don’t want the reservoir to overflow or run out of water. We do this by requiring the storage to be between some minimum (Smin) and maximum (Smax) values specified by the operator.

Again we need to separate this into two separate inequalities:
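Written out, the storage limits (Smin ≤ St ≤ Smax) become:

$$
S_t \le S_{max}, \qquad -S_t \le -S_{min}
$$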

  • Maximum and minimum release limitations:

The last two constraints represent the physical limit of how much water can be discharged out of the reservoir at any given time (Rmax) and the minimum release requirement that is needed to maintain ecosystem quality downstream of the reservoir (Rmin).
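Written as standard-form inequalities (using Rmin and Rmax as above), these are:

$$
R_t \le R_{max}, \qquad -R_t \le -R_{min}
$$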

Objective:

The objective function in an LP represents your goals. In the case of our toy reservoir, we have two goals:

  1. Maximize water storage
  2. Maximize energy production.

Initially, when I was setting this demonstration up, there was no hydroelectric component. However, I chose to add the energy generation because it introduces more complexity in the actions (as we will see). This is because the two objectives conflict with one another. When generating energy, the LP is encouraged to release a lot of water and maximize energy generation; however, it must balance this with the goal of raising the storage to the target level. This tension between the two objectives makes for much more interesting decision dynamics.

1. Objective #1: Maximize storage

Since our reservoir managers want to make sure the Town’s water supply is reliable, they want to maximize the storage. This demo scenario assumes that flooding is not a concern for this reservoir. We can define objective one simply as:
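A plausible way to write this (my reconstruction, summing storage over the operating horizon) is:

$$
\max \; Z_1 = \sum_{t=1}^{N} S_t
$$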

Again, we can replace the unknown value (St) with the mass-balance equation described above to generate a form of the objective function which includes the decision variable (Rt):
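Substituting the mass balance for St gives, roughly:

$$
\max \; Z_1 = \sum_{t=1}^{N} \left( S_{t-1} + I_t - D_t - R_t \right)
$$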

We can also drop the non-decision variables (all except Rt) since these cannot influence our solution:
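Treating St-1, It, and Dt as known inputs, the storage objective reduces (up to terms that do not involve Rt) to:

$$
\max \; Z_1 \;\; \Longleftrightarrow \;\; \min \; \sum_{t=1}^{N} R_t
$$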

2. Objective #2: Maximize energy generation:

Here I introduce a second component of the objective function associated with the amount of hydroelectric energy being generated by releases. Whereas the storage component of the objective function will encourage minimizing releases, the energy generation component of the objective function will incentivize maximizing releases to generate as much energy each day as possible.

I will define the energy generated during each timestep as:
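A simple constant-head, constant-efficiency form, consistent with the eta_0 coefficient used in the Part 2 code, is:

$$
E_t = \eta_0 \, R_t
$$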

(1+2). Aggregate minimization objective:
Lastly, LP problems are only able to solve for a single objective function. Therefore, we need to aggregate the storage and energy generation objectives described above. In this case, I take a simple sum of the two different objective scores. Of course, this is not always appropriate, and the weighting should reflect the priorities of the decision makers and stakeholders.

In this case, this requires a priori specification of objective preferences, which will determine the solution to the optimization. In later posts, I plan on showing an alternative method which allows for a posteriori interpretation of objective tradeoffs. But for now, we will stick with the basic LP and equal objective weighting.

Also, we want to formulate the optimization as a minimization problem. Since we are trying to maximize both of our objectives, we will make them negative in the final objective function.

The final aggregate objective is calculated as the sum of the objectives over the N days of operation:
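One way to write this (my reconstruction, using the per-timestep storage and energy terms from above, with the sign flipped to turn the maximization into a minimization) is:

$$
\min \; Z = -\sum_{t=1}^{N} \left( Z_{1,t} + E_t \right)
$$

where Z1,t is the per-timestep storage term and Et the energy generation term.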

Final formulation:

As we prepare to solve the LP problem, it is best to lay out the complete formulation in a way that will easily allow us to transform the information into the form needed for the solver:

Conclusions and what’s next

If you’ve made it this far, good on you, and I hope you found some benefit to the process of formulating a reservoir LP problem from scratch.

This post was meant to include simulated reservoir dynamics which implement the LP solutions in simulation. However, the post has grown quite long at this point and so I will save the code for another post. In the meantime, you may find it helpful to try to code the reservoir simulator yourself and try to identify some weaknesses with the proposed operational LP formulation.

Thanks for reading, and see you here next time where we will use this formulation to implement release decisions in a reservoir simulation.

An overview of the National Hydrologic Model

Over the past several months, I have been working with data from the US Geological Survey’s (USGS) National Hydrologic Model (NHM), a valuable resource that required some time to become familiar with. The goal of this post is to provide an overview of the NHM, incorporating as many links as possible, with the hope of encouraging others to utilize these resources and serving as a springboard for further investigation.

Why should you care about the NHM?

Water systems modelers are consistently in need of data. You might find that the specific streamflow data you seek does not exist, or perhaps you have some data but want to expand your training set. You may also wish to broaden your data to include not only streamflow but also estimates of various geophysical, hydrological, or environmental variables such as catchment area, vegetation classifications, and evapotranspiration.

The NHM can quench your data thirst by offering Continental US (CONUS) scale estimates of different geo-hydro-climatic variables synthesized from multiple datasets.

You can access these NHM datasets (simulated streamflow, land-cover parameter values, etc.) yourself through the ScienceBase website. However, it is first beneficial to understand the various components of the NHM infrastructure more broadly.


Introduction

The National Hydrologic Model (NHM) infrastructure is designed with the goal…

“… to facilitate hydrologic model parameterization on the basis of datasets that characterize geology, soil, vegetation, contributing area, topography, growing season, snow dynamics, climate, solar radiation, and evapotranspiration across the CONUS.”

The NHM includes several different components

  1. The Geospatial Fabric
  2. Model input datasets
  3. Physical models used to simulate hydrologic processes

In this post, I will delve into each of these components, providing more information, and conclude by pointing out ways you can access the model outputs yourself.

The Geospatial Fabric

The geospatial fabric contains spatial data that serves as the skeletal structure of the NHM, facilitating the routing of modeled streamflow across catchments. The image below shows the CONUS-scale geospatial fabric.

The geospatial fabric contains individual stream reaches (called “segments”), delineated sub-catchments (called “Hydrologic Response Units”, or HRUs), and many specific points of interest which correspond to USGS observational gauge locations.

The raw data for version 1.1 of the Geospatial Fabric can be found here.

The spatial data is largely drawn from the National Hydrography Dataset (NHD) which defines the high-resolution stream linkages and provides a unique identifier (called “ComID”) for each stream segment.

If you are looking to set up a workflow which uses NHM streamflow data, you will need to specify the ComID numbers for your locations of interest. You can then retrieve the data for each ComID location.

If you are doing this with Python, then I suggest you check out the PyNHD package, which I previously highlighted on this blog, and which can identify NHD ComID numbers based on coordinate points.

For more information on the geospatial fabric, you can see Bock et al. (2020).

PRMS

The Precipitation-Runoff Modeling System (PRMS) is described by the USGS as being:

“a deterministic, distributed-parameter, physical process based modeling system developed to evaluate the response of various combinations of climate and land use on streamflow and general watershed hydrology.”

The PRMS simulates many different hydrologic processes such as snowmelt, infiltration, groundwater recharge, surface runoff, and finally streamflow. A conceptual representation of the PRMS, taken from Markstrom et al. (2015) shows the modeled relationships between these processes:

The input data for the PRMS is flexible, but requires some combination of precipitation, air temperature, and solar radiation timeseries. Historic Daymet data provide the climate forcings for the historic period, but future climate projections can also be used.

Additionally, the model requires a set of catchment-delineated parameter values which quantify things such as soil types, types of vegetation coverage, percentage of impervious surfaces, etc. These data can be provided by the National Land Cover Database (NLCD), or alternative land cover change scenarios can be used to assess the impact of land surface on hydrology.

The PRMS is thus able to provide a strong physical processes basis when coupled with the NHM. The combination of these two models is simply referred to as the NHM-PRMS.

You can access the user manual for the PRMS here, and a report describing the latest version (PRMS-IV) from Markstrom et al (2015) here.

Accessing NHM datasets

The NHM infrastructure is great in the sense that it is flexible and incorporates so many different datasets. However, this may introduce difficulty in running the models yourself (I have not attempted this, and don’t have guidance on that).

Fortunately, there are some datasets which have been shared by the USGS, and which conveniently provide various model outputs or calibrated parameters without the need to interact with the model directly.

You can access and explore available datasets from the NHM through the ScienceBase website.

A few notable datasets that you might be interested in using are:

NHM-PRMS simulated streamflow from 1980-2016

Here I want to highlight one of these datasets that I have the most experience working with, and which I believe may be the most interesting to the WaterProgramming crowd: the “Application of the National Hydrologic Model Infrastructure with the Precipitation-Runoff Modeling System (NHM-PRMS), 1980-2016, Daymet Version 3 calibration”.

This dataset contains daily streamflow values across CONUS for the period from 1980-2016, at each HRU and segment contained in the geospatial fabric, and is available through the link here.

Note: Given the scale of this dataset, the file is rather large (92 GB).

Here I want to show how easy it can be to access the streamflow timeseries from this dataset. Assuming that you have downloaded the full NHM-PRMS dataset file, for the fully observed-calibrated and Muskingum routing simulation (byHRU_musk_obs.tar), you can extract the segment streamflow timeseries using just a few lines of Python code:

## Example of how to extract streamflow data from NHM-PRMS
import tarfile
import netCDF4 as nc
import pandas as pd

# Open file and extract just segment outflow
tar = tarfile.open('your_data_directory/byHRU_musk_obs.tar')
tar.extract('seg_outflow', path= './extracted/')

# Open the extracted netCDF
segment_outflow = nc.Dataset(f'./extracted/seg_outflow.nc')

# Store values and segment IDs
vals = segment_outflow['seg_outflow'][:]
segment_ids = segment_outflow['segment'][:]

# Make a dataframe
segment_outflow_df = pd.DataFrame(vals, columns = segment_ids)

At this point, segment_outflow_df will contain data from all of CONUS, and you will likely want to choose a subset of this data using the ComID numbers for each segment; I’ll have to leave that part up to you!

I hope this post helped to shine some light on this great resource, and encourages you to consider leveraging the NHM in your own work. As always, thanks for reading!

References

  1. Bock, A.E., Santiago, M., Wieczorek, M.E., Foks, S.S., Norton, P.A., and Lombard, M.A., 2020, Geospatial Fabric for National Hydrologic Modeling, version 1.1 (ver. 3.0, November 2021): U.S. Geological Survey data release, https://doi.org/10.5066/P971JAGF.
  2. Regan, R.S., Markstrom, S.L., Hay, L.E., Viger, R.J., Norton, P.A., Driscoll, J.M., LaFontaine, J.H., 2018, Description of the National Hydrologic Model for use with the Precipitation-Runoff Modeling System (PRMS): U.S. Geological Survey Techniques and Methods, book 6, chap B9, 38 p., https://doi.org/10.3133/tm6B9.
  3. Leavesley, G. H. (1984). Precipitation-runoff modeling system: User’s manual (Vol. 83, No. 4238). US Department of the Interior.
  4. Markstrom, S. L., Regan, R. S., Hay, L. E., Viger, R. J., Webb, R. M., Payn, R. A., & LaFontaine, J. H. (2015). PRMS-IV, the precipitation-runoff modeling system, version 4. U.S. Geological Survey Techniques and Methods, book 6, chap. B7.
  5. Hay, L.E., and LaFontaine, J.H., 2020, Application of the National Hydrologic Model Infrastructure with the Precipitation-Runoff Modeling System (NHM-PRMS),1980-2016, Daymet Version 3 calibration: U.S. Geological Survey data release, https://doi.org/10.5066/P9PGZE0S.

Highlighting the Excellence in Systems Analysis Teaching and Innovative Communication (ECSTATIC) Repository

*This post was written in collaboration with Dr. Chris Chini (Air Force Institute of Technology) and Dr. David Rosenberg (Utah State University)

The EWRI Congress and EWRS committee meeting that a good majority of us in water resources engineering attend every year serves as a reminder to me that there exist some really wonderful free resources to aid in teaching and communicating in our field. Most of us in EWRS are familiar with the ECSTATIC repository, but I wanted to use our blog as a forum to formally discuss the repository and provide guidance on how to contribute.

Background  

David: In 2013, several of us among the 100+ members of the Environmental Water Resources Systems (EWRS) committee saw the need to develop, share, and disseminate excellent and innovative methods to teach water systems content. There were organized channels such as journals and conferences to communicate research. Individual committee members were mostly using teaching resources they developed themselves. In most cases, those materials were not accessible to other faculty, students, or life learners. We wanted a channel – the ECSTATIC repository – to share excellent teaching videos, audio clips, lecture materials, readings, problems, games, code, etc.

A second purpose for the task committee was to submit a peer-reviewed article that reviewed the state of systems analysis education, discussed what a systems analysis curriculum should include, and assessed the impact of the interactive website. We ended up authoring and publishing two peer-reviewed articles. The second article took data from interviews we did with practitioners in the field. The notes/videos from the interviews were also posted to the repository!

What is in the repository?

The repository can be found here: https://digitalcommons.usu.edu/ecstatic/

The teaching repository has a multitude of collections ranging from course syllabi, games, problems, lecture materials, and code/models.

Some of the most popular materials on ECSTATIC include lecture notes on multi-objective optimization and nonlinear optimization from Dr. Lina Sela, an Associate Professor at the University of Texas at Austin. Additionally, Dr. Joseph Kasprzyk, Associate Professor University of Colorado at Boulder, has lectures on Introduction to Water Resources Systems and Introduction to Reservoir Analysis. Other submissions include GAMS examples (https://digitalcommons.usu.edu/ecstatic_all/63/), case studies for a guided project on stochastic hydrology (https://digitalcommons.usu.edu/ecstatic_all/57/), and syllabi for courses on environmental systems (https://digitalcommons.usu.edu/ecstatic_all/19/) and water technology & policy (https://digitalcommons.usu.edu/ecstatic_all/66/).

Example Slide Deck Content

There are also free books, video lectures (including a series on Managing Infrastructure with Deep Uncertainty Discussions), and games that can be used to supplement lectures.

Video Lectures from the Managing Infrastructure with Deep Uncertainty Discussions Series

Who can contribute?

Please submit new material at: https://digitalcommons.usu.edu/cgi/ir_submit.cgi?context=ecstatic_all

New submissions are always welcome and are accepted in whatever format suits the material. Submissions can be made by professors, PhD students, practitioners, or any person who feels they have educational materials to share to benefit the Water Systems community. For any questions on the repository, its content, or how to submit an article please reach out to either Nathan Bonham (nathan.bonham@colorado.edu) or Christopher Chini (christopher.m.chini@gmail.com).

David: Submissions on any topic related to water resources would be great, but the ECSTATIC task committee is particularly interested in entire courses from colleagues who may be retired or who are retiring soon. Furthermore, the committee would like to build up the video content in the repository. During the pandemic, many seminars were conducted online and recorded through Zoom. Provided the speakers are fine with having their seminars online, these videos are the best opportunity for people using the site to get context around the methods being used and the problems being solved. In short- content that you can’t get from a textbook or a syllabus.

What is the peer review process like?

New submissions have to pass through a review process. When a new work is submitted via Utah State University’s Digital Commons, a volunteer Chief Editor assigns one or more peer reviewers who are solicited from the EWRS community.

Reviewers use 4 criteria to evaluate submissions:

·       Is the submission topical to ECSTATIC (water resources)?

·       Are the materials organized so someone can download and use them?

·       What are the strengths of the submission?

·       What changes, if any, are needed to post on ECSTATIC?

The entire submission, review, and publishing process occur through the content management system.

What is the reach of ECSTATIC?

ECSTATIC has grown – and continues to grow – to a global audience and usership that surpasses the EWRS community.

ECSTATIC Repository Reach

Since its inception, there have been 26,900 total downloads of material across many different countries (as shown in the map above) and types of institutions (academic, government, etc.).

Downloads over time

The ECSTATIC repo is a great resource, not only to preserve and share knowledge within our community, but also to create a platform for those outside our community to become integrated into it. Please consider sharing your resources and helping the ECSTATIC repository grow!

The Colorado River Basin: Exploring its Past, Present, and Future Management

I. Introduction

Figure 1. Aerial view of the Colorado River passing through the Grand Canyon in Arizona (McBride) [18]

The Colorado River Basin (CRB) covers a drainage area of 246,000 square miles across California, Colorado, Nevada, New Mexico, Utah, Wyoming and Arizona. Studies have estimated that the basin provides $1.4 trillion of economic revenue per year, 16 million jobs across 7 states, water to 40 million people, and irrigation to approximately 5.5 million acres of agricultural land. For states such as New Mexico and Nevada, it represents roughly two-thirds or more of their annual GDP (65% and 87%, respectively). [1]

On August 21, 2021, the U.S. federal government declared its first-ever water cuts in the basin, due to the ongoing megadrought. The basin is estimated to be filled to 35% of its full capacity and has suffered a 20% decrease in inflow in the last century. [2] Rising temperatures caused in part by climate change have dried up the water supply of the basin, with certain areas experiencing more than double the global average temperature. Hotter temperatures have had three notable impacts on the drying of the basin: accelerated evaporation and snowpack melting, drying up soil before runoff reaches reservoirs, and increased wildfires which cause erosion of sediment into the basin. 

The CRB finds itself at a critical juncture; as the population and demand in the area continues to grow, the water supply is only diminishing. Ideally, the basin should provide for municipal, agricultural, tribal, recreational and wildlife needs. Thus, appropriate water management policies will be foundational to the efficacy of water distribution in these critical times. 

II. Brief History

Representatives from the seven Colorado River states negotiated and signed the Colorado River Compact on November 9th, 1922, and a century later, this compact still defines much of the management strategies of the CRB. The compact divided the basin into Upper and Lower sections and provided a framework for the distribution of water [3]. At the time of the signing, it was estimated that the annual flow of the basin was 16.4 million acre-feet per year (maf/y). Accordingly, each portion of the basin was allocated the right to 7.5 maf/y of water. A more accurate estimate of total flow adjusted this to 13.5 maf/y, with fluctuations between 4.4 maf/y and 22 maf/y [3]. State-specific allocations were defined in 1928 as part of the Boulder Canyon Act for the Lower Basin, and in 1948 for the Upper Basin in the Upper Colorado River Basin Compact [4]. Figure 2 displays water distribution on a state basis in million acre-feet per year.

The system of water allocation, which is still in place today, dates back to 1876 when the Colorado Constitution put into place the Doctrine of Prior Appropriation. The core of the doctrine enforces that water rights can only be obtained for beneficial use. These rights can be bought, sold, inherited, and relocated. The doctrine gives owners of the land near the water equal rights of use. This system regulates the uses of surface and tributary groundwater on a basis of priority. The highest priority goes to senior water rights holders: those who have held their rights for the longest time and are typically best positioned in times of drought. Contrastingly, junior water rights holders are those that have obtained their water rights more recently and tend to be those whose supply is curtailed first in times of drought. This system is often described as “first-in-time, first-in-right” [5]. However, a key criticism of this doctrine is that it fails to encompass beneficial uses of water, which are particularly important when water is scarce.

Figure 3 summarizes the key historic moments of the Colorado River Basin from the Colorado Constitution in 1876 to the most recent 2023 negotiations. Some of the most notable events are the 1922 signing of the Colorado River Compact which created the foundation of division and apportionment of the water basin. Additionally, the construction of dams in the early to mid 20th century such as the Hoover Dam, Parker Dam, Glen Canyon, Flaming Gorge, Navajo and Curecanti have played an important role in determining water flow today. The formation of the Central Arizona Project in 1968, provided access to water for both agricultural lands and metropolitan areas such as the Maricopa, Pinal and Pima counties [6]. More recently, the critically low water levels of Lake Powell and the Glen Canyon Dam have raised concerns about their ability to generate hydropower. In early 2023, the seven states that utilize the CRB met in hopes of reaching an agreement on how to deal with the current shortages but the proposal failed.

III. Key Players

Figure 4 highlights prominent key players in the Colorado River Basin who are frequently involved in negotiations and policy making.

IV. Current Situation

In January 2023, the states within the CRB failed to come to an agreement on an updated water consumption plan aimed at curbing the impacts of the megadrought. The proposed plan would require California to cut its water use by roughly 1 maf/y from its 4.4 maf/y allocation. The agreement, aimed at preventing Lake Powell and Mead from falling below the critical level for hydropower, proposed major water cuts for the Southwestern states of California and Arizona and incorporated losses from evaporation into the cutbacks [7]. Although six states agreed on a plan to move forward, California rejected the major water cuts due to concerns related to agriculture and legal water rights status. The proposed cuts would primarily impact regions such as Imperial County, which has senior rights to 3.1 maf/y and an agricultural revenue of $2.3 billion annually [8]. California proposed an alternative plan, which called for a 400,000 acre-foot reduction (for CA specifically), with additional reductions contingent upon the water level of Lake Mead in the future. While both plans are similar, the California plan is founded on waiting until the water level passes a critical threshold, while the plan from the other states is more preventative in nature [7].

Figure 5. Key Facts about the CRB

In more recent news, the Biden Administration just proposed a plan to evenly cut water allocations to California, Arizona and Nevada by approximately one-quarter. An intervention from the federal government on such a scale would be unprecedented in the history of the region. The options considered include: taking no action, making reductions based on senior water rights, or making equal reductions across the three states. Opting for the first option would likely result in a dead pool in Lake Mead and Powell, a term used to describe when the water levels fall too low to continue flowing downstream [12]. The second option would favor California, as it is one of the states which holds the most seniority in the basin, particularly in the Coachella and Imperial Valleys of Southern CA [9]. Consequently, this decision would more severely impact Arizona and Nevada, which are important swing states for the President and have an important tribal presence. The final decision is anticipated to be made in the summer of 2023.  [10]

In many parts of California and Colorado, this winter has been particularly heavy in rain and snow, making people hopeful that the river could be replenished. Scholars estimate that it would require 3 years of average snow with no water consumption to fully restore the Colorado River Basin reservoirs and 7 years under current consumption activity [11]. Unfortunately, the basin’s soil is extremely dry, which means that any excess water that comes from the snowpack is likely to be absorbed by the ground before it has a chance to reach rivers and streams. It is estimated that by the end of 2023 Lake Powell will be at 3,555 feet of elevation (roughly 32% capacity). When Lake Powell reaches 3,490 feet, the Glen Canyon Dam will be unable to produce hydroelectric power. At 3,370 feet, a dead pool will be reached.

V. Paths to a Sustainable Future

As water supply in the CRB continues to diminish, it has become increasingly crucial to find ways to minimize water loss from the system. One of the major contributors is evaporation, which accounts for approximately 1.5 maf/y of loss. This loss is more than Utah’s allocation from the Colorado River. In order to minimize evaporation loss, an ideal reservoir would be very deep and have a small surface area. However, this is not the case for reservoirs like Lake Powell, which loses approximately 0.86 maf/y (more than 6% of CRB annual flow) [13]. This raises the important question of how to best allocate the CRB water, considering that some states experience more evaporation loss than others. According to research from CU Boulder, some possible solutions include covering the surface water with reflective materials, films of organic compounds, and lightweight shades. Additionally, relocating the reservoir water to underground storage areas or aquifers could also serve to reduce evaporation [14].

An alternative approach is cloud seeding. In early March of 2023, the federal government invested $2.4 million in cloud seeding for the CRB. Cloud seeding is a technique used to artificially induce more precipitation by injecting ice nuclei, silver iodide, or other small crystals into subfreezing clouds. This promotes condensation of water around the nuclei and the formation of raindrops, and it is estimated to increase precipitation by 5-15% [15]. The grant will fund new cloud seeding generators that can be operated remotely, as well as aircraft for silver iodide injections. While this is a significant investment, cloud seeding has been practiced for decades in the CRB. Indeed, Colorado, Utah, and Wyoming are each estimated to spend over $1 million annually on cloud seeding, and Utah plans to increase its spending to $14 million next year [16]. While the negative impacts of cloud seeding are still unclear, some scholars believe the potentially harmful chemicals involved could cause silver toxicity. Additionally, the wind can sometimes blow the seeded clouds to a different location [17]. Ultimately, cloud seeding does not address the underlying challenges of climate change or aridification in the region, but it may help alleviate some of the impacts of the drought until a more sustainable alternative is found.

Works Cited

[1] “Economic Importance of the Colorado River.” The Nature Conservancy. The Nature Conservancy. Accessed December 1, 2021. https://www.nature.org/en-us/about-us/where-we-work/priority-landscapes/colorado-river/economic-importance-of-the-colorado-river/.

[2] Kann, Drew. “Climate Change Is Drying up the Colorado River, Putting Millions at Risk of ‘Severe Water Shortages’.” CNN. Cable News Network, February 22, 2020. https://www.cnn.com/2020/02/21/weather/colorado-river-flow-dwindling-warming-temperatures-climate-change/index.html.

[3] Megdal, Sharon B. “Sharing Colorado River Water: History, Public Policy and the Colorado River Compact.” The University of Arizona: Water Resources Research Center, 1 Aug. 1997, https://wrrc.arizona.edu/publication/sharing-colorado-river-water-history-public-policy-and-colorado-river-compact.

[4] Water Education Foundation. (n.d.). Colorado River Compact. Water Education Foundation. Retrieved April 17, 2023, from https://www.watereducation.org/aquapedia-background/colorado-river-compact#:~:text=The%20Lower%20Basin%20states%20were,River%20Basin%20Compact%20of%201948.

[5] Hockaday, S, and K.J Ormerod. “Western Water Law: Understanding the Doctrine of Prior Appropriation: Extension.” University of Nevada, Reno, University of Nevada, Reno, 2020, https://extension.unr.edu/publication.aspx?PubID=3750#:~:text=Senior%20water%20rights%20are%20often,farming%2C%20ranching%20and%20agricultural%20uses.&text=A%20claim%20to%20water%20that%20is%20more%20recent%20than%20senior,municipal%2C%20environmental%20or%20recreational%20uses.

[6] State of the Rockies Project 2011-12 Research Team. “The Colorado River Basin: An Overview.” Colorado College, 2012. 

[7] Partlow, Joshua. “As the Colorado River Dries up, States Can’t Agree on Saving Water.” The Washington Post, The Washington Post, 4 Apr. 2023, https://www.washingtonpost.com/climate-environment/2023/01/31/colorado-river-states-water-cuts-agreement/.

[8] Bland, Alastair. “California, Other States Reach Impasse over Colorado River.” CalMatters, 1 Feb. 2023, https://calmatters.org/environment/2023/01/california-colorado-river-water-2/.

[9] Hager, Alex. “Six States Agree on a Proposal for Colorado River Cutbacks, California Has a Counter.” KUNC, NPR, 1 Feb. 2023, https://www.kunc.org/environment/2023-01-31/six-states-agree-on-a-proposal-for-colorado-river-cutbacks-california-has-a-counter.

[10] Flavelle, Christopher. “Biden Administration Proposes Evenly Cutting Water Allotments from Colorado River.” The New York Times, The New York Times, 11 Apr. 2023, https://www.nytimes.com/2023/04/11/climate/colorado-river-water-cuts-drought.html?campaign_id=190&%3Bemc=edit_ufn_20230411&%3Binstance_id=89950&%3Bnl=from-the-times&%3Bregi_id=108807334&%3Bsegment_id=130155&%3Bte=1&%3Buser_id=ea42cfd845993c7028deab54b22c44cd.

[11] Mullane, Shannon. “Colorado’s Healthy Snowpack Promises to Offer Some Relief for Strained Water Supplies.” The Colorado Sun, The Colorado Sun, 14 Mar. 2023, https://coloradosun.com/2023/03/14/colorado-snowpack-water-supply-relief/.

[12] Tavernise, Sabrina, host. “7 States, 1 River and an Agonizing Choice.” The Daily, The New York Times, 31 Jan. 2023. https://www.nytimes.com/2023/01/31/podcasts/the-daily/colorado-river-water-cuts.html?showTranscript=1

[13] “Lake Powell Reservoir: A Failed Solution.” Glen Canyon Institute, Glen Canyon Institute, https://www.glencanyon.org/lake-powell-reservoir-a-failed-solution/.

[14] Blanken, Peter, et al. “Reservoir Evaporation a Big Challenge for Water Managers in West.” CU Boulder Today, 28 Dec. 2015, https://www.colorado.edu/today/2015/12/28/reservoir-evaporation-big-challenge-water-managers-west.

[15] McDonough, Frank. “What Is Cloud Seeding?” DRI, Desert Research Institute, 19 Sept. 2022, https://www.dri.edu/cloud-seeding-program/what-is-cloud-seeding/.

[16] Peterson, Brittany. “Feds Spend $2.4 Million on Cloud Seeding for Colorado River.” AP NEWS, Associated Press, 17 Mar. 2023, https://apnews.com/article/climate-change-cloud-seeding-colorado-river-f02c216532f698230d575d97a4a8ac7b.

[17] Rinkesh. “Various Pros and Cons of Cloud Seeding.” Conserve Energy Future, Conserve Energy Future, 28 July 2022, https://www.conserve-energy-future.com/pros-cons-of-cloud-seeding.php.


Causality for Dummies

The name of this post speaks for itself – this will be a soft introduction to causality, intended to increase familiarity with the ideas, methods, applications, and limitations of causal theory. Specifically, we will walk through a brief introduction to causality and compare it with correlation. We will also distinguish causal discovery from causal inference, describe the types of datasets commonly encountered, and provide a (very) brief overview of the methods that can be applied to both. This post also includes a list of commonly encountered terms within the literature, as well as links to prior posts written on this topic.

Proceed if you’d like to be less of a causality dummy!

Introducing Causality

Note: The terms “event” and “variable” will be used interchangeably in this post.

Causality has roots and applications in a diverse set of fields: philosophy, economics, mathematics, neuroscience – just to name a few. The earliest formalization of causality can be attributed to Aristotle, while modern causality as we understand it stems from David Hume, an 18th-century philosopher (Kleinberg, 2012). Hume argued that causal relationships can be inferred from observations and are conditional upon the observer’s beliefs and perceptions. Formally, however, causation can be defined as the contribution of an event to the production of other events. Causal links between events can be uni- or bi-directional – that is

  1. Event X causes event Y where the reverse is false, or
  2. Event X causes event Y where the reverse is true.

Such causal links have to first be detected and then quantified to confirm the existence of a causal relationship between two events or variables. The strength of these relationships is typically measured either with a score-based strength measure (where the score has to exceed a specific threshold for a causal link to be “counted”), or with statistical models that test for conditional independence.

Before we get into the details of the methods that enable this quantification, let’s confront the elephant in the room:

Correlation vs Causation

The distinction between correlation and causation is often muddled. Fortunately, there exists a plethora of analogies for why correlation is not causation. My favorite so far is the global warming and pirates analogy, in which increasing global mean temperature was found to be strongly (and negatively) correlated with the steadily decreasing number of pirates. To demonstrate my point:

This plot is trying to tell you something that will save the Earth. (source: Baker, 2020)

…so does this mean that I should halt my PhD and join the dwindling ranks of pirates to save planet Earth?

Well, no (alas). This example demonstrates how relationships between highly correlated variables can be purely coincidental, with no causal links. Instead, there are external factors that may have actually caused the number of pirates to decrease and, independently, the global mean temperature to rise. Dave’s, Rohini’s and Andrew’s blog posts, as well as this website, provide more examples that further demonstrate this point. Conversely, two variables that are causally linked may not be correlated. This can happen when there is a lack of change in the variables being measured, sometimes caused by insufficient or manipulated sampling (Penn, 2018).

Correlation (at least in its common Pearson form) is limited by its assumption that the relationship between two variables is linear. It also does not factor in the time-ordering or time-lags of the variables’ time series. As a result, relying on correlation alone to assess a system may cause you to overlook many of the relationships that exist within it. Correlation is therefore useful for making predictions, where trends in one variable can be used to predict the trends of another. Causation, on the other hand, is useful for making decisions, as it helps develop a better understanding of the cause-and-effect consequences of changes made to system variables.
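To make the pirates example concrete, here is a tiny, self-contained sketch (using made-up numbers, not real temperature or pirate data) showing how two independent series that merely share a time trend end up strongly correlated:

import numpy as np

rng = np.random.default_rng(7)
years = np.arange(1950, 2021)

# Two independent processes that happen to share a time trend
global_temp_anomaly = 0.01 * (years - 1950) + rng.normal(scale=0.05, size=years.size)
pirate_count = 5000 - 60 * (years - 1950) + rng.normal(scale=300, size=years.size)

# Strongly negative Pearson correlation, despite the absence of any causal link
r = np.corrcoef(global_temp_anomaly, pirate_count)[0, 1]
print(f"Pearson correlation: {r:.2f}")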

Now that we’ve distinguished causation from its close (but troublesome) cousin, let’s begin to get into the meat of causality with some common terms you might encounter as you begin exploring the literature.

A quick glossary

The information from this section is largely drawn from Alber (2022), Nogueira et al. (2022), Runge et al. (2019), and Weinberger (2018). It should not be considered exhaustive, by any measure, and should only be used to get your bearings as you begin to explore causality.

Causal faithfulness (“faithfulness”): The assumption that causally connected events are probabilistically dependent on each other.

Causal Markov condition: The assumption that, in a graphical model, Event Y is independent of every other event, conditional on Y’s direct causes.

Causal precedence: The assumption that a cause must precede its effect; Event A can only cause Event B if A happens before B. That is, events in the present cannot have caused events in the past.

Causal sufficiency: The assumption that all possible direct common causes of an event, or changes to a variable, have been observed.

Conditional independence: Two events or variables X and Y are conditionally independent given an additional event or variable Z if, once Z is known, information about X provides no further information about Y (and vice versa).

Confounders: “Interfering” variables that influence both the dependent and independent variables, therefore making it more challenging (confounding) to verify the presence of a causal relationship.

Contemporaneous links: Causal relationships that exist between variables within the same time step, hence “con” (with, at the same time as) and “temporaneous” (of the time). The existence of such links is one instance in which the causal precedence assumption is broken.

Directed acyclic graphs (DAGs): A graphical representation of the causal relationships between a set of variables or events. These relationships are known to be, or assumed to be, true.

Edges: The graphical representation of the links connecting two variables or events in a causal network or graph. These edges may or may not represent causal links.

Granger causality: Event X Granger-causes Event Y if predicting Y based on its own past observations and past observations of X performs better than predicting Y solely on its own past observations.

Markov equivalence class: A set of graphs that represent the same patterns of conditional independence between variables.

Nodes: The graphical representation of individual events or variables in a causal network or graph.

Causality: Discovery, Inference, Data and Metrics

Causality and its related methods are typically used for two purposes: causal discovery and causal inference. Explanations for both are provided below.

Causal Discovery

Also known as causal structural learning (Tibau et al., 2022), the goal of causal discovery is to obtain causal information directly from observed or historical data. Methods used for causal discovery do not assume implicit causal links between the variables within a dataset. Instead, they begin with a “clean slate” and attempt to generate (and then analyze) models that illustrate the inter-variable links inherent to the dataset, thus preserving them. The end goal of causal discovery is to approximate a graph that represents the presence or absence of causal relationships between a set of two or more nodes.

Causal Inference

Causal inference uses (rather than generates) causal graphs, focusing on rigorously testing the truth of the causal relationship between two variables. Unlike causal discovery, it assumes that a causal relationship already exists between two variables. Following this assumption, it tests and quantifies the actual relationships between variables in the available data. It is useful for assessing the impact of one event, or a change in one variable, on another, and it can be applied to studying the possible effects of altering a given system. Here, causal inference should not be confused with sensitivity analysis. The intended use of sensitivity analysis is to map changes in model output to changes in the model’s underlying assumptions, parameterizations, and biases, whereas causal inference is focused on assessing the cause-and-effect relationships between variables or events in a system or model.

Data used in causal discovery and causal inference

There are two forms of data typically encountered in studies that use causal methods:

Comparing cross-sectional data and longitudinal data (source: Scribbr).

Cross-sectional data

Data in this form is non-temporal. Put simply, all variables in a cross-sectional dataset represent a single point in time. It may contain observations of multiple individuals, where each observation represents one individual and each variable contains information on a different aspect of that individual. The assumption of causal precedence does not hold for such datasets, which therefore require additional processing to develop causal links between variables. Such datasets are handled using methods that measure causality via advanced variations of conditional independence tests (more on this later). An example of a cross-sectional dataset is the census data collected by the U.S. Census Bureau once every decade.

Longitudinal data

This form of data is a general category that includes time-series data, which consists of a series of observations of a (usually) single subject across some time period. Such datasets are relatively easy to handle compared to cross-sectional data, as the causal precedence assumption is (often) met. Most causality methods can therefore be used on longitudinal data; common choices include the basic forms of Granger causality, convergent cross mapping (CCM), and fast causal inference (FCI). An example of longitudinal data would be historical rain gauge records of precipitation over time.

Causality Methods

The information from this section largely originates from Runge et al. (2019), Noguiera et al. (2022), and Ombadi et al. (2020).

There are several methods that can be used to discover or infer causality from the aforementioned datasets. In general, these methods identify and extract causal interactions from observed data. Their outcomes facilitate the filtering of relevant drivers (the variables that cause observed outcomes) from the larger set of potential drivers, and they clarify inter-variable relationships that are often muddied by correlations. These methods measure causality in one of two ways:

  1. Score-based methods: Such methods assign a “relevance score” to rank each proposed causal graph based on the likelihood that it accurately represents the conditional (in)dependencies between variables in a dataset. Without additional phases to refine their algorithms, these methods are computationally expensive, as all potential graph configurations have to be ranked.
  2. Constraint-based methods: These methods employ a number of statistical tests to identify “necessary” causal graph edges and their corresponding directions. While less computationally expensive than basic score-based methods, constraint-based causality methods are limited to evaluating causal links for one node (representing one variable) at a time; they therefore cannot evaluate multiple variables and potential multivariate causal relationships simultaneously. Their computational expense also grows with the number of variables in the dataset.

Now that we have a general idea of what causality methods can help us do, let’s dive into a few general classes of causality methods.

Granger Causality

Often cited as one of the earliest mathematical formalizations of causality, Granger causality is a prediction-based statistical hypothesis test of whether one time series is useful in forecasting another. It was introduced by Clive Granger in the 1960s and has been widely used in economics since, and it has more recently found applications in neuroscience and climate science (see Runge et al. 2018, 2019). Granger causality can be used to characterize predictive causal relationships and to measure the influence of system or model drivers on variables’ time series. These relationships and influences can be uni-directional (where X Granger-causes Y, but not Y G.C. X) or bi-directional (X G.C. Y, and Y G.C. X).

A time series demonstration on how variable X G.C. Y (source: Google).

Because it measures predictive causality, Granger causality is limited to systems with independent driving variables. It cannot be used to assess multivariate systems or systems in which conditional dependencies exist, both of which are characteristic of real-world processes. It also requires separability, where the causal variable (the cause) has to be independent of the influenced variable (the effect), and it assumes causal precedence. It is nonetheless a useful preliminary tool for causal discovery.
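As a quick, hands-on illustration, the sketch below applies the grangercausalitytests function from the statsmodels Python package to a synthetic pair of series in which y depends on lagged x; the data, coefficients, and maximum lag are assumptions chosen purely for demonstration:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(42)

# Synthetic example: y depends on x lagged by one step, plus noise
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * x[t - 1] + 0.2 * y[t - 1] + rng.normal(scale=0.5)

# statsmodels tests whether the SECOND column Granger-causes the FIRST
data = pd.DataFrame({"y": y, "x": x})
results = grangercausalitytests(data[["y", "x"]], maxlag=3, verbose=False)

# Small p-values on the F-test indicate that past values of x improve
# predictions of y beyond y's own history (i.e., x Granger-causes y)
for lag, res in results.items():
    p_value = res[0]["ssr_ftest"][1]
    print(f"lag {lag}: p = {p_value:.4f}")

Swapping the column order tests the reverse direction (whether y Granger-causes x), which should come back insignificant for this toy example.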

Nonlinear state-space methods

An alternative to Granger causality is the family of non-linear state-space methods, which includes convergent cross mapping (CCM). These methods assume that variable interactions occur in an underlying deterministic dynamical system, and they attempt to uncover causal relationships based on Takens’ theorem (see Dave’s blog post here for an example) by reconstructing the nonlinear state-space. The key idea is this: if event X can be estimated using time-delayed information from event Y, then X had a causal effect on Y.

Visualization on how nonlinear state-space methods reconstruct the dynamic system and identify causal relationships (source: Time Series Analysis Handbook, Chapter 6)

Convergent Cross Mapping (CCM)

Convergent Cross Mapping (Sugihara et al., 2012; Ye et al., 2015) tests the reliability of a variable Y as an estimator or predictor of variable X, revealing weak non-linear interactions between time series that might otherwise be missed (Delforge et al., 2022). The CCM algorithm involves generating the system manifold M, the X-variable shadow manifold MX, and the Y-variable shadow manifold MY. The algorithm then samples an arbitrary set of nearest-neighbor (NN) points from MX and determines whether they correspond to neighboring points in MY. If X and Y are causally linked, they should share M as a common “attractor” manifold (a system state that can be sustained in the short term). Variable X can then be said to inform variable Y, but not vice versa (the two are unidirectionally linked). CCM can also be used to detect causality due to external forcings, where X and Y do not interact but may both be driven by a common external variable Z. It is therefore best suited for causal discovery.

While CCM does not rely on conditional independence, it assumes the existence of a deterministic (but unknown) underlying system (e.g., a computer program or a physics-based model) that can be represented using M. It therefore does not work well for stochastic time series (e.g., streamflow, coin tosses, bacterial population growth). The predictive ability of CCM is also vulnerable to noise in the data. Furthermore, it requires a long time series to provide a reliable measure of causality between two variables, as a longer series decreases the NN distances on each manifold, improving CCM’s ability to detect causality.
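For readers who want to experiment, below is a compact sketch of the core CCM idea (time-delay embedding plus nearest-neighbor cross-mapping) using only numpy and scikit-learn. The embedding parameters and the coupled logistic-map toy system are assumptions chosen for illustration, and this is not a substitute for established CCM implementations:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def shadow_manifold(series, E=3, tau=1):
    # Time-delay embedding of a 1-D series into an E-dimensional shadow manifold
    n = len(series) - (E - 1) * tau
    return np.column_stack([series[i * tau : i * tau + n] for i in range(E)])

def cross_map_skill(x, y, E=3, tau=1):
    # Correlation between x and its estimate cross-mapped from y's shadow manifold;
    # high skill suggests that x causally influences y (information about x is encoded in y)
    My = shadow_manifold(y, E, tau)
    x_target = x[(E - 1) * tau:]
    nn = NearestNeighbors(n_neighbors=E + 2).fit(My)
    dist, idx = nn.kneighbors(My)
    dist, idx = dist[:, 1:], idx[:, 1:]         # drop the self-match
    w = np.exp(-dist / (dist[:, [0]] + 1e-12))  # exponential distance weighting
    w /= w.sum(axis=1, keepdims=True)
    x_hat = (w * x_target[idx]).sum(axis=1)     # weighted average of mapped neighbors
    return np.corrcoef(x_target, x_hat)[0, 1]

# Toy coupled logistic maps in which x drives y (unidirectional forcing)
n = 1000
x, y = np.empty(n), np.empty(n)
x[0], y[0] = 0.4, 0.2
for t in range(n - 1):
    x[t + 1] = x[t] * (3.8 - 3.8 * x[t])
    y[t + 1] = y[t] * (3.5 - 3.5 * y[t] - 0.1 * x[t])

print("skill estimating x from M_y (does x drive y?):", round(cross_map_skill(x, y), 3))
print("skill estimating y from M_x (does y drive x?):", round(cross_map_skill(y, x), 3))

Because x forces y in this toy system, the skill of cross-mapping x from Y’s shadow manifold should be high, while the reverse direction remains comparatively low.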

Causal network learning algorithms

On the flip side of CCM, causal network learning algorithms assume that the underlying system from which the variables arise is purely stochastic. This class of algorithms adds or removes edges in candidate causal graphs using criteria based on conditional or mutual independence, and it assumes that both the causal Markov and faithfulness conditions hold for all proposed graph structures. These algorithms are therefore best used for causal inference, and they can be applied to cross-sectional data as well as linear and non-linear time series.

These algorithms result in a “best estimate” graph whose edges reflect the conditional (in)dependencies that best match the observed data. They employ two stages: a skeleton discovery phase, in which non-essential links are eliminated, and an orientation phase, in which the directionality of the causal links is finalized. Because of this, these algorithms can be used to reconstruct large-scale, high-dimensional systems with interacting variables. Some algorithms in this class are also capable of identifying the direction of contemporaneous links, and so are not beholden to the causal precedence assumption. However, such methods can only estimate graphs up to a Markov equivalence class, and they require longer time series to provide better estimates of the causal relationships between variables.

General framework of all causal network learning algorithms (source: Runge et al., 2019).

Let’s go over a few examples of causal network learning algorithms:

The PC (Peter-Clark) algorithm

The PC algorithm begins with a fully connected graph in its skeleton phase and iteratively removes edges where conditional independences exist. It then orients the remaining edges in its orientation phase. In its earliest form, the PC algorithm was limited by its assumption of causal sufficiency and its inability to handle contemporaneous dependencies. It also did not scale well to high-dimensional data.

Later variations attempted to overcome these limitations. For example, the momentary conditional independence PC (PCMCI) and PCMCI+ algorithms add a further step to determine causal links between variables in different timesteps and to find lagged and contemporaneous relationships separately, thereby handling contemporaneity. The PC-select variation introduced the ability to apply conditional independence tests to target variables, allowing it to process high-dimensional data. These variations can also eliminate spurious causal links. However, the PC algorithm and its variants still depend on the causal Markov, faithfulness, and sufficiency assumptions. The causal links that they detect are also relative to the feature space (Uereyen et al., 2022), which means that the directionality (or existence) of these links may change if new information is introduced to the system.

Fast causal inference (FCI)

Unlike the PC-based algorithms, FCI does not require that causal sufficiency be met, although it, too, is based on iterative conditional independence tests and begins with a complete graph. Another differentiating feature between PC and FCI is that the latter does not assume the directionality of causal links. Instead of the uni- or bi-directional orientations that the PC algorithm eventually assigns to its causal graph edges, FCI has four possible edge interpretations to account for the possibility of spurious links. Given variables X and Y, FCI’s edge interpretations are as follows:

  1. X causes Y
  2. X causes Y or Y causes X
  3. X causes Y or there are unmeasured confounding variables
  4. X causes Y, Y causes X, there are unmeasured confounding variables, or some combination of both

There are also several FCI variations that allow improved handling of large datasets, high dimensionality, and contemporaneous variables. For example, the Anytime FCI and Adaptive Anytime FCI restrict the maximum number of variables to be considered as drivers, and the time series FCI (tsFCI) uses sliding windows to transform the original, long time series into a set of “independent” subsamples that can be treated as cross-sectional. To use FCI effectively, however, the data should be carefully prepared using Joint Causal Inference (JCI) so that the generated graph includes both variable information and system information, accounting for background knowledge about the system (Mooij et al., 2020).

Structural causal models (SCMs)

Similar to causal network learning algorithms, SCMs assume a purely stochastic underlying system and use DAGs to model the flow of information; they can likewise detect causal graphs to within a Markov equivalence class. Unlike causal network learning algorithms, SCMs structure DAGs as a set of endogenous (Y) and exogenous (X) variables connected by a set of functions (F) that determine the values of Y based on the values of X. Within this context, a node represents a variable x or y in X or Y, while an edge represents a function f in F. By doing so, SCMs enable the discovery of causal directions in cases where the causal direction cannot be inferred with conditional independence-based methods alone. SCMs can also handle a wide range of systems (linear, non-linear, various noise probability distributions). This flexibility comes with one of the main limitations of SCMs: some information on the underlying structure of the system must be known a priori (e.g., the system is assumed to be linear with at most one of the noise terms drawn from a Gaussian distribution). SCMs are best used for causal inference, as causal links between variables have to be assumed during the generation of the SCM.
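To make the “structural equations” idea more tangible, here is a toy SCM written in plain numpy; the assumed graph (Z causes X, and both Z and X cause Y), the linear functional forms, and the coefficients are all invented purely for illustration:

import numpy as np

rng = np.random.default_rng(3)
n = 10_000

def sample_scm(do_x=None):
    # Sample from the SCM; `do_x` applies an intervention that overrides X's equation
    u_z, u_x, u_y = rng.normal(size=(3, n))   # exogenous noise terms
    # Structural equations (the set of functions F)
    z = u_z
    if do_x is None:
        x = 0.8 * z + u_x                     # X is caused by Z plus noise
    else:
        x = np.full(n, do_x)                  # intervention do(X = do_x)
    y = 1.5 * x - 0.5 * z + u_y
    return z, x, y

# Observational distribution vs. the interventional one under do(X = 1)
_, x_obs, y_obs = sample_scm()
_, _, y_do = sample_scm(do_x=1.0)

print("observational E[Y]:", round(y_obs.mean(), 2))
print("interventional E[Y | do(X=1)]:", round(y_do.mean(), 2))   # roughly 1.5

Because the structural equations are explicit, interventions such as do(X = 1) can be simulated by overriding X’s equation, which is exactly the kind of “what happens if we change the system” question that causal inference aims to answer.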

Information-theoretic algorithms

Finally, we have information-theoretic (IT) algorithms, which can be considered an extension of Granger causality methods that allows the verification of nonlinear relationships between system variables; they are therefore best used for causal inference. IT algorithms measure transfer entropy (TE), defined as the amount of shared information between variables X and Y when both are conditioned on an external variable Z. The magnitude of TE reflects the reduction in the Shannon entropy of Y when, given Z, information on X is added to the system. For further information on IT and TE, Andrew’s blog post and Keyvan’s May 2020 and June 2020 posts expand on the theory and application of both concepts.

There are a couple of assumptions that come with the use of IT algorithms. First, like both SCMs and causal network learning algorithms, they assume that the underlying system is purely stochastic. They are also bound to causal precedence and assume that the causal variable X provides all useful information for the prediction of the effect Y, given Z. In addition, IT algorithms benefit from longer time series, which improve estimates of the causal links between variables. On the other hand, they do not make assumptions about the underlying structure of the data and can detect both linear and nonlinear causal relationships.
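As a rough illustration of the transfer entropy idea, the sketch below estimates a simplified, unconditioned (bivariate) TE from binned time series using plain numpy; the bin count, the one-step lag, and the toy data are all assumptions, and practical applications typically use the conditional form and more careful estimators:

import numpy as np

def joint_entropy(*arrays):
    # Shannon entropy (in bits) of the joint distribution of the given discrete arrays
    combined = np.stack(arrays, axis=1)
    _, counts = np.unique(combined, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def transfer_entropy(x, y, bins=4):
    # TE from x to y with one-step lags: how much knowing past x reduces uncertainty
    # about the next value of y, beyond what past y already explains
    xd = np.digitize(x, np.histogram_bin_edges(x, bins)[1:-1])
    yd = np.digitize(y, np.histogram_bin_edges(y, bins)[1:-1])
    y_next, y_now, x_now = yd[1:], yd[:-1], xd[:-1]
    return (joint_entropy(y_next, y_now) - joint_entropy(y_now)
            - joint_entropy(y_next, y_now, x_now) + joint_entropy(y_now, x_now))

# Toy system: y is driven by lagged x, but not vice versa
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * x[t - 1] + 0.3 * rng.normal()

print("TE x -> y:", round(transfer_entropy(x, y), 3))   # clearly positive
print("TE y -> x:", round(transfer_entropy(y, x), 3))   # close to zero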

Prior WaterProgramming blog posts

That was a lot! But if you would like a more detailed dive into causality and/or want to explore some toy problems, there are a number of solid blog posts that focus on the underlying math and concepts central to the approaches used in causal discovery and inference:

  1. Introduction to Granger Causality
  2. Introduction to Convergent Cross Mapping
  3. Detecting Causality using Convergent Cross Mapping: A Python Demo using the Fisheries Game
  4. Causal Inference Using Information-Theoretic Approaches
  5. Information Theory and the Moment-Independent Sensitivity Indices
  6. Milton Friedman’s thermostat and sensitivity analysis of control policies

Summary and key challenges

In this blog post, we introduced causality and compared it to correlation. We listed a glossary of commonly-used terms in causality literature, as well as distinguished causal discovery from causal inference. Next, we explored a number of commonly-used causality methods: Granger causality, CCM, conditional independence-based causal learning network algorithms, SCMs, and information-theoretic algorithms.

From this overview, it can be concluded that methods to discover and infer causal relationships are powerful tools that enable us to identify cause-and-effect links between seemingly unrelated system variables. Improvements to these methods are pivotal for improving climate models, increasing AI explainability, and aiding better, more transparent decision-making. Nevertheless, these methods face challenges (Tibau et al., 2022) that include, but are not limited to:

  1. Handling gridded or spatio-temporally aggregated data
  2. Representing nonlinear processes that may interact across time scales
  3. Handling non-Gaussian variable distributions and data non-stationarity
  4. Handling partial observability where only a subset of system variables is observed, thus challenging the causal sufficiency assumption
  5. Uncertainty: Non-stationarity, noise, internal variability
  6. Dealing with mixed data types (discrete vs. continuous)
  7. Lack of benchmarking approaches due to lack of ground truth data

This brings us to the end of the post – do take a look at the References for a list of key literature and online articles that will be helpful as you begin learning about causality. Thank you for sticking with me and happy exploring!

References

Alber, S. (2022, February 9). Directed Acyclic Graphs (DAGs) and Regression for Causal Inference. UC Davis Health. Davis; California. Retrieved March 14, 2023, from https://health.ucdavis.edu/ctsc/area/Resource-library/documents/directed-acyclic-graphs20220209.pdf

Baker, L. (2020, July 9). Hilarious graphs (and pirates) prove that correlation is not causation. Medium. Retrieved March 14, 2023, from https://towardsdatascience.com/hilarious-graphs-and-pirates-prove-that-correlation-is-not-causation-667838af4159

Delforge, D., de Viron, O., Vanclooster, M., Van Camp, M., & Watlet, A. (2022). Detecting hydrological connectivity using causal inference from time series: Synthetic and real Karstic case studies. Hydrology and Earth System Sciences, 26(8), 2181–2199. https://doi.org/10.5194/hess-26-2181-2022

Gonçalves, B. (2020, September 9). Causal inference - part IV - structural causal models. Medium. Retrieved March 13, 2023, from https://medium.data4sci.com/causal-inference-part-iv-structural-causal-models-df10a83be580

Kleinberg, S. (2012). A Brief History of Causality (Chapter 2) – Causality, Probability, and Time. Cambridge Core. Retrieved March 14, 2023, from https://www.cambridge.org/core/books/abs/causality-probability-and-time/brief-history-of-causality/C87F30B5A6F4F63F0C28C3156B809B9E

Mooij, J. M., Magliacane, S., & Claassen, T. (2020). Joint causal inference from multiple contexts. Journal of Machine Learning Research, 21(99), 1–108. https://doi.org/10.48550/arXiv.1611.10351

Nogueira, A. R., Pugnana, A., Ruggieri, S., Pedreschi, D., & Gama, J. (2022). Methods and tools for causal discovery and causal inference. WIREs Data Mining and Knowledge Discovery, 12(2). https://doi.org/10.1002/widm.1449

Ombadi, M., Nguyen, P., Sorooshian, S., & Hsu, K. (2020). Evaluation of methods for causal discovery in hydrometeorological systems. Water Resources Research, 56(7). https://doi.org/10.1029/2020wr027251

Penn, C. S. (2018, August 25). Can causation exist without correlation? Yes! Christopher S. Penn – Marketing Data Science Keynote Speaker. Retrieved March 14, 2023, from https://www.christopherspenn.com/2018/08/can-causation-exist-without-correlation/

Runge, J. (2018). Causal network reconstruction from time series: From theoretical assumptions to practical estimation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(7), 075310. https://doi.org/10.1063/1.5025050

Runge, J., Bathiany, S., Bollt, E., Camps-Valls, G., Coumou, D., Deyle, E., Glymour, C., Kretschmer, M., Mahecha, M. D., Muñoz-Marí, J., van Nes, E. H., Peters, J., Quax, R., Reichstein, M., Scheffer, M., Schölkopf, B., Spirtes, P., Sugihara, G., Sun, J., Zscheischler, J. (2019). Inferring causation from time series in Earth System Sciences. Nature Communications, 10(1). https://doi.org/10.1038/s41467-019-10105-3

Sugihara, G., May, R., Ye, H., Hsieh, C.-hao, Deyle, E., Fogarty, M., & Munch, S. (2012). Detecting causality in complex ecosystems. Science, 338(6106), 496–500. https://doi.org/10.1126/science.1227079

Tibau, X.-A., Reimers, C., Gerhardus, A., Denzler, J., Eyring, V., & Runge, J. (2022). A spatiotemporal stochastic climate model for benchmarking causal discovery methods for teleconnections. Environmental Data Science, 1. https://doi.org/10.1017/eds.2022.11

Uereyen, S., Bachofer, F., & Kuenzer, C. (2022). A framework for multivariate analysis of land surface dynamics and driving variables—a case study for Indo-Gangetic River basins. Remote Sensing, 14(1), 197. https://doi.org/10.3390/rs14010197

Weinberger, N. (2017). Faithfulness, coordination and causal coincidences. Erkenntnis, 83(2), 113–133. https://doi.org/10.1007/s10670-017-9882-6

Ye, H., Deyle, E. R., Gilarranz, L. J., & Sugihara, G. (2015). Distinguishing time-delayed causal interactions using convergent cross mapping. Scientific Reports, 5(1). https://doi.org/10.1038/srep14750

Structuring a Python Project: Recommendations and a Template Example

Motivation:

The start of a new year is a good (albeit, relatively arbitrary) time to reassess aspects of your workflow.

I, like many people, taught myself Python by jumping into different projects. The consequence of this ad-hoc learning was that I did not learn some of the fundamentals until much later in my project development.

At the end of the Fall ’23 semester, I found myself spending a lot of time cleaning up repositories and modules that I had constructed in the preceding months. I ended up restructuring and reorganizing significant portions of the code base, implementing organizational practices that I had learned after its conception.

This post is intended to save the reader from making the same mistake. Here, I present a recommended structure for new Python projects, and discuss the main components. This is largely targeted at Python users who have not had a formal Python training, or who are working with their first significantly sized project.

I provide an example_python_project, hosted on GitHub, as a demonstration and to serve as a template for beginners.

Content:

  1. Recommended structure
  2. Example project repository
  3. Project components (with example project)
    1. Modules, packages, __init__.py
    2. Executable
    3. README
    4. Documentation
    5. Tests
    6. Requirements
    7. LICENSE
  4. Conclusion

Recommended project structure

I’ll begin by presenting a recommended project structure. Further down, I provide some explanation and justification for this structure.

My example_python_project follows recommendations from Kenneth Reitz, co-author of The Hitchhiker’s Guide to Python (1), while also drawing from a similar demo project, samplemod, by Navdeep Gill (2).

The project folder follows this structure:

example_python_project/ 
├── sample_package/ 
│   ├── subpackage/ 
│   │   ├── __init__.py 
│   │   └── subpackage_module.py 
│   ├── __init__.py 
│   ├── helpers.py 
│   └── module.py 
├── docs/ 
│   └── documentation.md 
├── tests/ 
│   ├── __init__.py 
│   └── basic_test.py 
├── main.py 
├── README.md 
├── requirements.txt 
└── LICENSE

If you are just starting a project, it may seem unnecessarily complex to begin with so much modularity. It may seem easier to open a .py file and start freewheeling. Here, I am trying to highlight several reasons why it is important to take care when initially constructing a Python project. Some of these reasons include:

  1. Maintenance: A well-structured project makes it easier to understand the code, fix bugs, and add new features. This is especially important as the project grows in size and complexity.
  2. Collaboration: When working on a project with multiple developers, a clear structure makes it easier for everyone to understand how the code is organized and how different components interact with each other.
  3. Scalability: A well-structured project allows you to scale up the codebase, adding new features and sub-components, without making the codebase hard to understand or maintain.
  4. Testing: A well-structured project makes it easier to write automated tests for the code. This helps to ensure that changes to the code do not break existing functionality.
  5. Distribution: A well-structured project makes it easier to distribute the code as a package. This allows others to easily install and use the code in their own projects.

Overall, taking the time to structure a Python project when starting can save a lot of time and heartache in the long run, by making the project easier to understand, maintain, and expand.


Example Project Repository

The repository containing this example project is available on GitHub here: example_python_project.

The project follows the recommended project structure above and is designed to use modular functions from the module, helpers, and subpackage_module. It is intended to be a skeleton upon which you can build up your own project.

If you would like to experiment with your own copy of the code, you can fork a copy of the repository, or Download a ZIP version.

Project overview

The project is a silly riddle program with no real usefulness other than demonstrating the structure of the project. The bulk of the work is done in main_module_function(), which first prints a riddle on the screen, then iteratively uses helper_function() and subpackage_function() to try and “solve” the riddle. Both of these functions simply return a random True/False and are repeatedly called until the riddle is solved (when status == True).

Below is a visual representation of how the different functions are interacting. The green-box functions are contained within the main sample_package, while the blue-box function is stored in the subpackage.

The program can then be executed from a command line using the main.py executable:

C:\<your-local-directory>\example_python_project> python main.py

The output will first print out the riddle, then print statements indicating which functions are being used to “solve” the riddle. This is simply a means of demonstrating how the different functions are being activated, and not necessarily a recommended “Best Practice”.

A normal output should resemble the text below, although there may be more or fewer print statements depending upon how many attempts it takes the random generator to produce a “True” solution:

Here is a riddle, maybe `sample_package` can help solve it:

   What runs but has no feet, roars but has no mouth?

Lets see if the helper can solve the riddle.
The helper_function is helping!
The helper could not solve it.
Maybe the subpackage_module can help.
The subpackage_function is being used now.
The subpackage solved it, the answer is "A River"!

Project components

Modules, packages, __init__.py, oh my!

Before going any further, I want to take time to clarify some vocabulary which is helpful for understanding component interactions.

Module:
A module is simply a file ending in .py, which contains functions and or variables.

Package:
A package is a collection of modules (.py files) which relate to one another, and which contains an __init__.py file.

__init__.py
Inclusion of an __init__.py file (pronounced “dunder in-it”) within a folder indicates to Python that the folder is a package. Often, the __init__ module is empty; however, it can be used to import other modules or functions, which will then be stored in the package namespace, making them available for use later.

For example, in my sample_package/__init__.py, I import all contents of the module.py and subpackage_module.py:

# Import the all functions from main and sub modules
from .module import *
from .subpackage.subpackage_module import *

This allows all of the functions stored within the modules to be callable directly from sample_package, rather than specifying the various sub-structures needed to access them. For example, by including from .subpackage.subpackage_module import *, I am able to run:

# IF __init__ imports all content from main and sub modules then you can do this:
import sample_package
sample_package.subpackage_module_function()

Rather than requiring the following fully-nested call, which is necessary when the __init__.py is empty:

# IF __init__ is EMPTY, then you need to do this:
import sample_package
sample_package.subpackage.subpackage_module.subpackage_module_function()

Notably, an __init__.py is not strictly necessary to use modules and functions within a folder… however, customizing the imports in the package’s __init__.py gives you more control over how your project is used. As the project increases in complexity, strategic use of imports within the __init__ can keep your main executable functions cleaner.

Executables

So, you’ve crafted a Python project with a sleek, modular package design. The next step is to set up a single file which will execute the package.

Inclusion of a single executable has the benefit of providing a single entry point for other users who want to run the program without getting lost in the project.

In the example_python_project, this is done with main.py:

# Import the main package
import sample_package

def run():
    solved = sample_package.main_module_function()
    return solved

# Run the function if this is the main file executed
if __name__ == "__main__":
    run()

The program can then be executed from a command line:

C:\<your-local-directory>\example_python_project> python main.py

README

The README.md file is typically someone’s first encounter with your project. This is particularly true if the project is hosted on GitHub, where the README.md is used as the home-page of a repository.

A README.md file should include, at minimum, a brief description of the project, its purpose, and clear instructions on how to use the code.

Often, README files are written in Markdown, which includes simple text-formatting options (you can find a basic Markdown cheat sheet here), although reStructuredText is also common and even plain .txt files may be suitable.

Documentation

Great code requires great documentation. Initializing a new project with a dedicated docs/ folder may help hold you accountable for documenting the code along the way.

For information on how to use Sphinx and reStructuredText to create clean webpage-based documentation, you can see Rohini Gupta’s post on Using Python, Sphinx, and reStructuredText to Create a Book (and Introducing our eBook: Addressing Uncertainty in Multisector Dynamics Research!).

Tests

Bugs aren’t fun. They are even less fun when code that was bug-free yesterday contains bugs today. Implementing automated tests in your project can help verify functionality throughout the development process and catch bugs when they arise.

It is recommended to implement unit tests which verify individual components of the project. These tests should assert that function outputs align with expectations. As you develop your project in a modular way, you can progressively add new tests, then run the full test suite before sharing or pushing the project to others.

A standard Python installation comes with the unittest package, which provides a framework for these tests. I provide an example test below, but a deeper dive into the unittest framework may require a dedicated future post.

In the example_python_project, I include the basic_test.py to verify that the solution generated by main_module_function() is True using the unittest package:

import sample_package
import unittest

# Define a test suite targeting specific functionality
class BasicTestSuite(unittest.TestCase):
    """Basic test cases."""
    def test_that_riddle_is_solved(self):
        solved = sample_package.module.main_module_function()
        self.assertTrue(solved)


if __name__ == '__main__':
    unittest.main()

Running the basic_test module from the command line produces an “OK” if everything runs smoothly; otherwise, it will provide information regarding which tests are failing.

----------------------------------------------------------------------
Ran 1 test in 0.004s
OK
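For reference, assuming the folder structure shown earlier and that the commands are run from the project root, the test module can be invoked directly with:

python -m unittest tests.basic_test

or, to discover every test file matching the *_test.py naming pattern used here:

python -m unittest discover -s tests -p "*_test.py"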

Currently, the example_python_project requires the basic_test module to be executed manually. To learn more about automating this process, you can see Andrew Dirck’s 2020 post: Automate unit testing with Github Actions for research codes.

Requirements

The requirements.txt is a simple text file which lists the dependencies, or necessary packages that are required to run the code.

This can be particularly important if your code requires a specific version of a package, since the package version can be specified in the requirements.txt. Specifying a particular package version (e.g., numpy==1.24.1) can improve the reliability of your code, since different versions of these packages may operate in different ways in the future.

Here is an example of what might be inside a requirements.txt, if the numpy and pandas packages are necessary (note that standard-library modules such as random do not belong here, since they ship with Python and cannot be installed via pip):

numpy==1.24.1
pandas==1.5.3

Users can easily install all the packages listed in requirements.txt using the command:

pip install -r requirements.txt
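Conversely, if you want to capture the packages (and exact versions) already installed in your current environment, one quick, if somewhat blunt, way to generate the file is:

pip freeze > requirements.txt

Just be aware that pip freeze records everything installed in the environment, so you may want to trim the result down to the packages your project actually imports.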

License

I’ll keep this section brief, since I am far from legally qualified to comment much on licensing. However, general advice suggests that if you are sharing code publicly, it is safest to include a license of some sort.

Inclusion of an open-source license allows other users to comfortably use and modify your code for their own purposes, allowing you to contribute to and benefit the broader community. At the same time, it protects the original author from future liabilities associated with the code’s use by others.

The GNU General Public License is one of the most common open-source licenses; if you would like to know more about the different options, you can find some guidance here: https://choosealicense.com/

Conclusions

If you are an experienced Python user, there may not be anything new for you here, but at the least I hope it serves as a reminder to take care in your project design this year.

Additionally, this is likely to be one part in a multi-part Introduction to Python series that I will be writing for future members of our research group. With that in mind, check back here later this spring for the subsequent parts if that interests you.

Best of luck!

References

(1) Reitz, K., & Schlusser, T. (2016). The Hitchhiker’s Guide to Python: Best Practices for Development. O’Reilly Media, Inc.
(2) Navdeep Gill. 2019. samplemod. https://github.com/navdeep-G/samplemod. (2023).

Understanding Information Usage in Machine Learning Models

Deep learning models still largely remain black boxes, where users have little understanding of how and when input variables are being used for prediction. In some applications, simpler and more interpretable models may be appropriate, but there are instances where deep learning models are simply more accurate. The question is, does one have to sacrifice accuracy for interpretability? Ideally, no. If you truly understand how a model works, you are in an optimal position to determine what changes need to be made to make it more accurate. Techniques for better understanding the inner workings of more complex ML models will likely lead to greater acceptance of these models in process-driven fields like hydrology. We should aspire to move away from the practice of blindly tuning hyperparameters to get good validation results, and instead focus more on what these parameters may physically represent (which is certainly not always easy or applicable, otherwise it would be done more often). The goal of this blog post is to outline some concepts from the computer science community that can help users understand information usage in their ML models, and that can be applied in hydrology applications.

An example of a complex ML model

The current state of the art for recurrent-style neural networks is the Long Short-Term Memory (LSTM) network, which has become increasingly popular in hydrology applications. These RNN-style networks contain a cell state that has the ability to learn long-term dependencies, as well as gates to regulate the flow of information into and out of the cell.

A single LSTM cell and its components (Colah, 2020)

The top horizontal line denotes the cell state, Ct, which is continually updated, either removing irrelevant information from subsequent time periods or incorporating new recent information. The first gate of the cell is the forget gate, ft. The forget gate takes the previous hidden state ht-1 and the current input xt and decides what to forget from the previous time step’s cell state Ct-1 through the application of a sigmoid function, which effectively sets values in the cell state to a value between 0 and 1 (i.e., representing completely forget to completely retain). Next, the input gate, it, decides which values to update using a sigmoid function. A tanh layer creates a vector of new cell state candidates, C~t, which is multiplied by the values from the input gate to create an update. The cell state can then be updated, first by forgetting the prescribed values at the forget gate, and then by adding the scaled values, it * C~t. Finally, the output gate decides what will be passed to the next LSTM unit, represented by ht at the bottom of the unit: the previous hidden state and current input are run through a sigmoid function, the cell state is pushed through a tanh function, and the two are multiplied to produce the hidden state that is relayed to the next unit. The LSTM layer can be made up of hundreds of these cells, and multiple layers can be used. Most importantly, the LSTM allows for the preservation and memory of past inputs and outputs, which can help expedite training in a system with slow-changing dynamics.
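For reference, the gate operations described above are commonly written as follows (standard notation, as in Colah, 2020), where \sigma is the sigmoid function, \odot denotes element-wise multiplication, and [h_{t-1}, x_t] is the concatenation of the previous hidden state and the current input:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(C_t)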

Clearly an LSTM is one of the more complex deep learning models. However, it has shown great success in hydrology applications, particularly where it’s important to capture physical processes that occur at different time scales. For example, Kratzert et al. (2018) demonstrate how an LSTM network used for rainfall-runoff across 241 catchments is able to perform comparably to SAC-SMA in a fraction of the computational time. The authors also do a great job of unboxing some of the LSTM behavior, including demonstrating how the cell state captures temperature and snow dynamics. However, very few other studies investigate their ML models to this degree.

Techniques for Interpretability

Model-Agnostic Local Models

Many interpretability techniques have been introduced that utilize simple local methods to explain model predictions for a single sample to better understand the behavior of a more complex model. Local interpretable model-agnostic explanations (LIME) is one such technique introduced by Ribeiro et al. (2016) that utilizes a local surrogate model. LIME generates perturbations of input and output data, weighs the samples according to proximity to the baseline sample, and trains a localized (usually linear) model. The goal is to find a model that will minimize loss (minimize the distance between the prediction and estimation) and also minimize complexity.

Another localized technique utilizes Shapley values. Originating from cooperative game theory, Shapley values assign a payout depending on a player’s contribution to the total payout (Shapley, 1953). The analog to ML is that the Shapley value becomes the marginal contribution of a feature across many coalitions (subsets) of possible feature combinations. Because Shapley values are calculated across many possible coalitions, computing them can be time-consuming.

Model-Specific Local Models

DeepLIFT is a backpropagation-based approach that propagates the contributions of all neurons in the network to every feature of the input. DeepLIFT compares the activation of each neuron to its ‘reference activation’ and assigns contribution scores according to the difference. DeepSHAP is a modification of the DeepLIFT algorithm used to efficiently estimate Shapley values over the input feature space for a given instance. DeepSHAP has been shown to be faster than model-agnostic methods and can explain a series of complex models (e.g., chained or stacked models) in a way that purely model-specific local methods cannot (Chen et al., 2021). Rather than passing multipliers comparing neuron activations, DeepSHAP backpropagates SHAP values.

DeepShap

The authors of DeepShap have created a GitHub repository here that allows you to implement SHAP for any machine learning model and DeepShap for deep learning models implemented using TensorFlow and Keras. The package has some really beautiful figures that you can create with single lines of code. My goal was to see if I could harvest some tools from DeepShap to help make sense of a rainfall runoff LSTM model for an ephemeral subbasin in the Tuolumne River Basin that I had previously coded.

I utilize the following features to predict outflow: precipitation, temperature, DOY (represented as the sine/cosine of the day of year), prior precipitation aggregated over the past three days and over the past two weeks, and then interaction variables. In a prior blog post, I demonstrate why these aggregate variables and interaction variables help to capture different regimes of flow. The results look great, but can we use DeepShap to think more about the parameters of the LSTM?

LSTM Prediction

One parameter of interest is the lag of information that the LSTM is fed at every time step. In my LSTM, I feed the inputs in using a lookback window of 30 days, so at every time step the LSTM is seeing the past 30 days of inputs. I use the DeepExplainer function to calculate the average Shapley values across a set of 500 samples of the input data, and I plot these features for the past 30 days.

# Note: assumes `model` (the trained Keras LSTM) and `X_Train` (shape: [samples, 30, 8])
# are already defined earlier in the workflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import shap
from keras import backend as K  # TF1-era Keras backend, needed for K.get_session()

#Deep Shap Implementation (Training)
#Choose random points from the dataset to train the DeepExplainer
random_ind = np.random.choice(X_Train.shape[0], 1000, replace=False)
data = X_Train[random_ind[0:500]]
e = shap.DeepExplainer((model.layers[0].input, model.layers[-1].output), data, K.get_session())

#Deep Shap Testing
#Create a test set out of another part of the training set
test = X_Train[random_ind[500:1000]]
shap_values = e.shap_values(test)
shap_val = np.array(shap_values)
shap_val = np.reshape(shap_val, (int(shap_val.shape[1]), int(shap_val.shape[2]), int(shap_val.shape[3])))
shap_abs = np.absolute(shap_val)
sum_0 = np.mean(shap_abs, axis=0)  # mean absolute Shapley value for each (lag, feature) pair

#Plot Shapley Values across the 30-day input window
shap_plot = pd.DataFrame(sum_0, columns=["Precipitation","Temperature","DOY","Three Day Precip","Two Week Precip","Precip_Times_Temperature","Temperature_Times_Day","Precip_Times_Temperature_Times_Day"])
shap_plot['days'] = [i - 31 for i in list(range(1, 31))]  # lags -30 to -1, matching the 30-day lookback window
shap_plot.head()
shap_plot.plot.area(x='days', figsize=(10, 6), cmap='rainbow')
plt.title("Deep SHAP - Feature Importance")
plt.savefig('SHAP.png', bbox_inches='tight')
plt.show()
Feature importance for a 30-day lag

The figure above shows at what points in the 30-day window certain information is being utilized, based on the magnitude of the Shapley value associated with that information. We see that current precipitation is the most prominent driver of the next day's outflow, along with the precipitation*temperature interaction variable, which intuitively makes sense.

Let's look back 5 days, though. Many of the interaction variables are still relevant, but temperature and day of the year are now given higher priority than precipitation.

Shapley factors for 5 days prior to the current time step

Lags of these variables may give the algorithm information about seasonality that is more informative than the variables' current values, and precipitation at this lag may simply not be relevant. There is lots of additional work to be done, but hopefully this tutorial illustrates that if you have the tools to look inside your model and a good understanding of the system, it becomes possible to better attribute your model's behavior to a process-based understanding.

References

Chen, H., Lundberg, S. M., & Lee, S. I. (2021). Explaining a Series of Models by Propagating Shapley Values. arXiv preprint arXiv:2105.00108.

Kratzert, F., Klotz, D., Brenner, C., Schulz, K., & Herrnegger, M. (2018). Rainfall–runoff modelling using long short-term memory (LSTM) networks. Hydrology and Earth System Sciences, 22(11), 6005–6022.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144).

Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28), 307–317.

I followed tutorials shown in these blogs as well:

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

https://github.com/danielhkt/deep-forecasting/blob/main/notebook/test_forecasting.ipynb

https://medium.datadriveninvestor.com/time-step-wise-feature-importance-in-deep-learning-using-shap-e1c46a655455

NetCDF Operators

This post is an introduction to the Climate Data Operators (CDO) and NetCDF Operators (NCO), Linux-based command-line tools that allow you to perform various operations on netCDF files. I found these commands to be really nifty when I was working with pre-industrial control runs from a GCM. The output was written at a daily timestep, across 1,200 years, and for the whole globe, so it was absolutely essential that I cut the size of the files down as much as I could before transferring them to my own computer.

The official documentation and installation instructions for NCO can be found here and for CDO here, but if you're working on a supercomputer, the libraries will likely already be installed. I will outline how I used some of these functions for my pre-industrial runs.
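If you do need to install them yourself, one common route (assuming you use conda) is to pull both packages from conda-forge:

conda install -c conda-forge cdo nco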

Concatenation

Some of the NCO commands have size limits of 40 GB, so it's important to use the right order of operations when processing your files, which will differ depending on your ultimate goal. My ultimate goal was to extract the 500-hPa geopotential height anomalies across the whole 1,200-year period for just the Western US. Assuming you have a directory with all the netCDF files, the first step is to concatenate the data, since my run was broken into many smaller files. The easiest way to do this is with the following command, which takes all of the netCDF files in the directory (using the * wildcard) and merges them into a single file called merged_file.nc:

cdo mergetime *.nc merged_file.nc

Return Individual Monthly Files

When calculating anomalies, you first need a mean geopotential height value for each of the 12 months, and then you calculate the daily deviations with respect to those monthly means. Start by splitting the merged file by month with the following command:

cdo splitmon merged_file.nc zg500.mon

This command will return 12 files of the form zg500.mon${MM}.nc, where MM is the two-digit month number (01 through 12).

Return Monthly Means

The next step is to calculate a monthly mean for each of these files. For example, for January use:

cdo timmean zg500.mon01.nc zg500.mean.mon01.nc

Return Anomalies

Now we subtract the monthly mean from each monthly file to return the daily anomalies for each month, which will be of the form zg500.mean.mon${i}.anom.nc. If you want to combine the last two steps into one loop, you can use the code below (note the -w flag on seq, which pads the month number to two digits, e.g. 01, so that it matches the file names produced by splitmon):

for i in $(seq -w 1 12)
do
  cdo timmean zg500.mon${i}.nc zg500.mean.mon${i}.nc
  cdo sub zg500.mon${i}.nc zg500.mean.mon${i}.nc zg500.mean.mon${i}.anom.nc
done

Cut Down to Geographical Area of Interest

Finally, we need to cut the data down to just the Western US. We use ncks (the "netCDF Kitchen Sink") from NCO, which is probably the most versatile of all the functions (hence the name). This command is one that has the 40 GB limit, which is why I had to wait until I had monthly files before I could cut them down geographically. We first specify the variable of interest using the -v flag. In my case, I only had one variable to extract, but you can also extract multiple variables in one command. Then we denote the range of latitude and longitude using the -d flags. It is very important to include the decimal points at the end of each lat/lon bound (even if your bounds are whole numbers); the decimal point tells ncks to interpret the values as coordinates rather than as indices, and the command does not work as intended without it.

for i in $(seq -w 1 12)
do
  ncks -v zg500 -d lon,180.,260. -d lat,30.,60. zg500.mean.mon${i}.anom.nc -o zg500.mean.mon${i}.cut.anom.region.nc
done

Ultimately, you will get 12 files of the form zg500.mean.mon${i}.cut.anom.region.nc. And that's it! You can concatenate the monthly files back together and try to re-sort the data into the correct time sequence. I was unsuccessful at finding a quick way to do this from the command line, but it is possible. I found it much easier to handle this within R: I imported each of the 12 anomaly files, assigned a time vector, concatenated each monthly anomaly matrix into a larger matrix, and then sorted according to date. If your files are small enough by the end of the process, this is likely the easiest way to take care of the re-sorting. Enjoy!