Performing runtime diagnostics using MOEAFramework

Performing runtime diagnostics using MOEAFramework

In this blog post, we will be reviewing how to perform runtime diagnostics using MOEAFramework. This software has been used in prior blog posts by Rohini and Jazmin to perform MOEA diagnostics across multiple MOEA parameterizations. Since then, MOEAFramework has undergone a number of updates and structure changes. This blog post will walk through the updated functionality of running MOEAFramework (version 4.0) via the command line to perform runtime diagnostics across 20 seeds using one set of parameters. We will be using the classic 3-objective DTLZ2 problem optimized using NSGAII, both of which are in-built into MOEAFramework.

Before we begin, some helpful tips and system configuration requirements:

  • Ensure that you have the latest version of Java installed (as of April 2024, this is Java Version 22). The current version of MOEAFramework was compiled using class file version 61.0, which was made available in Java Version 17 (find the complete list of Java versions and their associated class files here). This is the minimum requirement for being able to run MOEAFramework.
  • The following commands are written for a Linux interface. Download a Unix terminal emulator like Cygwin if needed.

Another helpful tip: to see all available flags and their corresponding variables, you can use the following structure:

java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.mainClass.subClass.functionName --help

Replace mainClass, subClass, and functionName with the actual class, subclass, and function names. For example,

java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.SampleGenerator --help

You can also replace --help with -h (if the last three alphabets prove too much to type for your weary digits).

Runtime diagnostics across 20 different seeds for one set of parameters

Generating MOEA parameters and running the optimization

To run NSGAII using one set of parameters, make sure to have a “parameters file” saved as a text file containing the following:

populationSize 10.0 250.999
maxEvaluations 10000 10000
sbx.rate 0.0 1.0
sbx.distributionIndex 0.0 100.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 100.0

For the a full list of parameter files for each of the in-built MOEAFramework algorithms, please see Jazmin’s post here.

In this example, I have called it NSGAII_Params.txt. Note that maxEvaluations is set to 10,000 on both its lower and upper bounds. This is because we want to fix the number of function evaluations completed by NSGAII. Next, in our command line, we run:

java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.SampleGenerator --method latin --numberOfSamples 1 --parameterFile NSGAII_Params.txt --output NSGAII_Latin

The output NSGAII_Latin file should contain a single-line that can be opened as a text file. It should have six tab-delineated values that correspond to the six parameters in the input file that you have generated. Now that you have your MOEA parameter files, let’s begin running some optimizations!

First, make a new folder in your current directory to store your output data. Here, I am simply calling it data.

mkdir data

Next, optimize the DTLZ2 3-objective problem using NSGAII:

for i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.RuntimeEvaluator --parameterFile NSGAII_Params.txt --input NSGAII_Latin --problem DTLZ2_3 --seed $i --frequency 1000 --algorithm NSGAII --output data/NSGAII_DTLZ2_3_$i.data; done

Here’s what’s going down:

  • First, you are performing a runtime evaluation of the optimization of the 3-objective DTLZ2 problem using NSGAII
  • You are obtaining the decision variables and objective values at every 1,000 function evaluations, effectively tracking the progress of NSGAII as it attempts to solve the problem
  • Finally, you are storing the output in the data/ folder
  • You then repeat this for 20 seeds (or for as many as you so desire).

Double check your .data file. It should contain information on your decision variables and objective values at every 1,000 NFEs or so, seperated from the next thousand by a “#”.

Generate the reference set

Next, obtain the only the objective values at every 1,000 NFEs by entering the following into your command line:

for i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.ResultFileMerger --problem DTLZ2_3 --output data/NSGAII_DTLZ2_3_$i.set --epsilon 0.01,0.01,0.01 data/NSGAII_DTLZ2_3_$i.data; done

Notice that we have a new flag here – the --epsilon flag tells MOEAFramework that you only want objective values that are at least 0.01 better than other non-dominated objective values for a given objective. This helps to trim down the size of the final reference set (coming up soon) and remove solutions that are only marginally better (and may not be decision-relevant in the real-world context).

On to generating the reference set – let’s combine all objectives across all seeds using the following command line directive:

for i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.ReferenceSetMerger --output data/NSGAII_DTLZ2_3.ref -epsilon 0.01,0.01,0.01 data/NSGAII_DTLZ2_3_$i.set; done

Your final reference set should now be contained within the NSGAII_DTLZ2_3.ref file in the data/ folder.

Generate the runtime metrics

Finally, let’s generate the runtime metrics. To avoid any mix-ups, let’s create a folder to store these files:

mkdir data_metrics

And finally, generate our metrics!

or i in {1..20}; do java -cp MOEAFramework-4.0-Demo.jar org.moeaframework.analysis.tools.ResultFileEvaluator --problem DTLZ2_3 --epsilon 0.01,0.01,0.01 --input data/NSGAII_DTLZ2_3_$i.data --reference data/NSGAII_DTLZ2_3.ref --output data_metrics/NSGAII_DTLZ2_3_$i.metrics; done

If all goes well, you should see 20 files (one each for each seed) similar in structure to the one below in your data_metrics/ folder:

The header values are the names of each of the MOEA performance metrics that MOEAFramework measures. In this blog post, we will proceed with visualizing the hypervolume over time across all 20 seeds.

Visualizing runtime diagnostics

The following Python code first extracts the metric that you would like to view, and saves the plot as a PNG file in the data_metrics/ folder:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

# define constants 
num_seeds = 20
NFE = 10000 
freq = 1000
num_output = int(NFE/freq)

algorithm = 'NSGAII'
problem = 'DTLZ2_3'
folder_name = 'data_metrics/'
metric_name = 'Hypervolume'
# create matrix of hypervolume runtimes 
hvol_matrix = np.zeros((num_seeds, num_output), dtype=float)
for seed in range(num_seeds):
    runtime_df = pd.read_csv(f'{folder_name}{algorithm}_{problem}_{seed+1}.metrics', delimiter=' ', header=0)
    if metric_name == 'Hypervolume':
        hvol_matrix[seed] = runtime_df['#Hypervolume'].values
    else:
        hvol_matrix[seed] = runtime_df[metric_name].values

# plot the hypervolume over time
fig, ax = plt.subplots(figsize=(10, 6))

ax.fill_between(np.arange(freq, NFE+freq, freq), np.min(hvol_matrix, axis=0), np.max(hvol_matrix, axis=0), color='paleturquoise', alpha=0.6)
ax.plot(np.arange(freq, NFE+freq, freq), np.min(hvol_matrix, axis=0), color='darkslategrey', linewidth=2.0)
ax.plot(np.arange(freq, NFE+freq, freq), np.max(hvol_matrix, axis=0), color='darkslategrey', linewidth=2.0)
ax.plot(np.arange(freq, NFE+freq, freq), np.mean(hvol_matrix, axis=0), color='darkslategrey', linewidth=2.0 ,label='Mean', linestyle='--')

ax.set_xlabel('NFE')
ax.set_xlim([freq, NFE+freq])
ax.set_ylabel(metric_name)
ax.set_ylim([0, 1])
ax.set_title(f'{metric_name} over time')
ax.legend(loc='upper left')

plt.savefig(f'{folder_name}{algorithm}{problem}_{metric_name}.png')

If you correctly implemented the code, you should be able to be view the following figure that shows how the hypervolume attained by the NSGAII algorithm improves steadily over time.

In the figure above, the colored inner region spans the hypervolume attained across all 20 seeds, with the dotted line representing the mean hypervolume over time. The solid upper and lower bounding lines are the maximum and minimum hypervolume achieved every 1,000 NFEs respectively. Note that, in this specific instance, NSGAII only achieves about 50% of the total hypervolume of the overall objective space. This implies that a higher NFE (a longer runtime) is required for NSGAII to further increase the hypervolume achieved. Nonetheless, the rate of hypervolume increase is gradually decreasing, indicating that this particular parameterization of NSGAII is fast approaching its maximum possible hypervolume, which additional NFEs only contributing small improvements to performance. It is also worth noting the narrow range of hypervolume values, especially as the number of NFEs grows larger. This is representative of the reliability of the NGSAII algorithm, demonstrating that is can somewhat reliably reproduce results across multiple different seeds.

Summary

This just about brings us to the end of this blog post! We’ve covered how to perform MOEA runtime diagnostics and plot the results. If you are curious, here are some additional things to explore:

  • Plot different performance metrics against NFE. Please see Joe Kasprzyk’s post here to better understand the plots you generate.
  • Explore different MOEAs that are built into MOEAFramework to see how they perform across multiple seeds.
  • Generate multiple MOEA parameter samples using the in-built MOEAFramework Latin Hypercube Sampler to analyze the sensitivity of a given MOEA to its parameterization.
  • Attempt examining the runtime diagnostics of Borg MOEA using the updated version of MOEAFramework.

On that note, make sure to check back for updates as MOEAFramework is being actively reviewed and improved! You can view the documentation of Version 4.0 here and access its GitHub repository here.

Happy coding!

Ensemble forecasting – theory

‘Does the flap of a butterfly’s wing in Brazil stir up a tornado in Texas?’ These words reflect the so-called ‘butterfly effect’ typically ascribed to Edward Lorenz (MIT), who was a pioneer in the field of numerical weather prediction (NWP). (Interestingly, Lorenz himself first referred to the ‘flap of a seagulls’ wing’; the butterfly swap out was the work of a creative conference session convener. Somewhat coincidentally, the shape of the ‘Lorenz attractor’ is also reminiscent of a butterfly, see Figure 1). Motivated by the very salient challenge in the post-WWII era to improve weather prediction capabilities for strategic purposes, he developed a theoretical framework for the predictability of non-linear dynamical systems in a seminal paper, ‘Deterministic nonperiodic flow’ (Lorenz, 1963), that would come to be known as ‘chaos theory’ or ‘chaotic dynamics’. This theory was foundational to the development of ensemble forecasting (and a whole host of other things), and still underpins our current theoretical understanding of the inherent predictability of complex dynamical systems like the atmosphere (Lorenz, 2006).


Figure 1. Book cover (left panel) and the characteristic ‘butterfly wing’ shape of the Lorenz attractor (right panel) from his seminal 1963 work. (Palmer, 1993)

In actuality, the whole ‘flap of a butterfly’s wing creating a tornado in Texas’ relates specifically to only one aspect of the theoretical framework developed by Lorenz. In this post, I will attempt to give a slightly fuller treatment of this theoretical framework and its development. My hope is that this post will be a theoretical primer for two follow-on practically oriented posts: 1) a discussion of the state-of-the-science of hydrometeorological ensemble forecasting and its application in emerging water resources management practices like forecast informed reservoir operations, and 2) a deeper dive into ensemble verification techniques. While the techniques for (2) are grounded in the theoretical statistical properties of stochastic-dynamic predictive ensembles, they have broad applications to any scenario where one needs to diagnose the performance of an ensemble.

The Lorenz model and ‘chaos’ theory

When most people hear the word ‘chaos’, they tend to think of the dictionary definition: ‘complete disorder and confusion’ (Oxford), which in a scientific context might aptly describe some sort of white noise process. As you will see, this is somewhat of a far cry from the ‘chaos’ described in Lorenz’s theory. The ‘chaos’ terminology was actually coined by a later paper building on Lorenz’s work (Li & Yorke, 1975) and as noted by Wilks (2019): ‘…this label is somewhat unfortunate in that it is not really descriptive of the sensitive-dependence phenomenon’. The sensitive-dependence phenomenon is one of a set of 4 key properties (to be discussed later) that characterize the behaviors of the sort of non-linear, non-periodic deterministic systems that Lorenz argued were most representative of the atmosphere. In contrast to ‘disorder’ and ‘confusion’, these properties, in fact, lend some sense of order to the evolution of these dynamical systems, although the nature of that order is highly dependent on the system state.

A deterministic, non-periodic systems model

First, let’s dive into a few details of Lorenz’s 1963 work, using some figures and descriptions from a later paper (Palmer, 1993) that are easier to follow. While the 1963 paper is quite dense, much of the mathematical logic is dedicated to justifying the use of a particular system of equations that forms the basis of the study. This system of 3 equations and 3 variables (X, Y, Z) describes a non-linear, dissipative model of convection in a fluid of uniform depth, where there is a temperature difference between the upper and lower surfaces. Lorenz derived the set of 3 equations shown in the upper left panel of Figure 2 from earlier work by Rayleigh (1916) on this particular problem. In short, X relates to the intensity of convective motion, Y relates to the temperature difference between ascending and descending currents, and Z relates to the distortion of the vertical temperature profile from linearity; the details of these variables are not actually all that important to the study. What is important is that this system of equations has no general solutions (aside from the steady state solution) and must be numerically integrated in the time dimension to determine the convective flow dynamics. The ‘trajectories’ of these integrations shown in the phase space diagrams in Figure 2 exhibit the sorts of unstable, non-periodic behaviors that Lorenz thought were the most useful analog to atmospheric dynamics, albeit in a much simpler system. (‘Much’ is an understatement here; modern NWP models have a phase space on the order of 109 in contrast to the 3-dimensional phase space of this problem, Wilks, 2019. Of note, the use of phase-space diagrams (i.e. plots where each axis corresponds to a dependent variable in the system of equations) preceded Lorenz, but his use of them is perhaps one of the best-known early instances of this kind of visualization. Other uses of phase space relationships can be found in Rohini’s post or Dave’s post)

Figure 2. a) Lorenz equations and the XY phase space diagram of their integrations. b-c) X variable timeseries of two separate integrations of the Lorenz equations from very similar initial conditions. (Palmer, 1993)

Regime structure and multiple distinct timescales

What ‘behaviors’ then, are we talking about? Referring again to Figure 2a, we see the projection of a long series of model integration trajectories onto the XY-plane of the phase space diagram, where each dot is a timestep in the integration. What is immediately apparent in the form of these integrations is the two lobed ‘butterfly’ shape of the diagram. Each lobe has a central attractor where the trajectories often reside for multiple revolutions, but can then transition to the other attractor when passing near the center of the diagram. These so-called ‘Lorenz attractors’ comprise one of the key properties of chaotic dynamics, namely regime structure, which is tendency of the trajectories to reside around phase space attractors for some period of time. This residence time in a regime is generally quite a bit longer than the timescales of the trajectories’ individual revolutions around an attractor. This attribute is referred to as multiple distinct timescales and is evidenced in Figure 2b-c, where smaller amplitude sections of the timeseries show residence in one or the other regime and large amplitude sections describe transitions between regimes. Often, but not always, the residence in these regimes occurs for multiple revolutions, suggesting that there are shorter timescale evolutions of the system that take place within these regimes, while infrequent, random shifts to the other regimes occur at longer timescales.

Sensitive-dependence and state-dependent variation in predictability

Figure 3. a-c) Different trajectories through the XY phase space dependent on the initial condition state-space (black circle). (Palmer, 1993)

Returning now to the ‘butterfly effect’; what, then, is this whole sensitive-dependence thing mentioned earlier? Figure 3a-c provide a nice graphical representation of this phenomenon. In each panel, a set of nearby initial states are chosen at different location in the phase space diagram and then are followed through their set of trajectories. In 3a, these trajectories neatly transition from one regime to the other, remaining relatively close to each other at the trajectories’ end. In 3b, a set of initial states not so far from 3a are chosen and instead of neatly transitioning to the other regime, they diverge strongly near the center of the diagram, with some trajectories remaining in the original regime, and some transitioning. However, for about half of the timespan, the trajectories remain very close to one another. Finally, an initial state chosen near the center of the diagram (3c) diverges quickly into both regimes, with trajectories ending up at nearly opposite ends of the phase space (black tails at upper right and left of diagram). Figures b-c, in particular, showcase the sensitive-dependence on initial conditions attributes of the system. In other words, from a set of very close-by initial states, the final trajectories in phase space can yield strongly divergent results. Importantly, this divergence in trajectories over some period of time can occur right away (3c), at some intermediate time (3b), or not at all (3a).

This is the basic idea behind the last core property of these systems, state-dependent variation in predictability. Clearly, a forecast initialized from the starting point in 3a could be a little bit uncertain about the exact starting state and still end up in about the right spot for an accurate future prediction at the end of the forecast horizon. At medium ranges, this is also the case for 3b, but the longer range forecast is highly uncertain. For 3c, all forecast ranges are highly uncertain; in other words, the flap of a butterfly’s wing can mean the difference between one or the other trajectory! Importantly, one can imagine in this representation that an average value of 3c’s trajectories (i.e. the ensemble mean) would fall somewhere in the middle of the phase space and be representative of none of the two physically plausible trajectories into the right or left regimes. This is an important point that we’ll return to at the end of this post.

From the Lorenz model to ensemble forecasting

The previous sections have highlighted this idealized dynamical system (Lorenz model) that theoretically has properties of the actual atmosphere. Those 4 properties (the big 4!) were: 1) sensitive-dependence on initial conditions, 2) regime structure, 3) multiple distinct timescales, and 4) state-dependent variation in predictability. In this last section, I will tie these concepts into a theoretical discussion of ensemble forecasting. Notably, in the previous sections, I discussed ‘timescales’ without any reference to the real atmosphere. The dynamics of the Lorenz system prove to be quite illuminating about the actual atmosphere when timescales are attached to the system that are roughly on the order of the evolution of synoptic scale weather systems. If one assumes that a lap around the Lorenz attractor equates to a synoptic timescale of ~5 days, then the Lorenz model can be conceptualized in terms of relatively homogenous synoptic weather progressions occurring within two distinct weather regimes that typically persist on the order of weeks to months (see Figure 3b-c). This theoretical foundation jives nicely with the empirically derived weather-regime based approach (see Rohini’s post or Nasser’s post) where incidentally, 4 or more weather regimes are commonly found (The real atmosphere is quite a bit more complex than the Lorenz model after all). This discussion of the Lorenz model has hopefully given you some intuition that conceptualizing real world weather progressions outside the context of an appropriate regime structure could lead to some very unphysical representations of the climate system.

Ensemble forecasting as a discrete approximation of forecast uncertainty

In terms of weather prediction, though, the big 4 make things really tough. While there are certainly conditions in the atmosphere that can lead to excellent long range predictability (i.e. ‘forecasts of opportunity’, see Figure 3a), the ‘typical’ dynamics of the atmosphere yield the potential for multiple regimes and associated transitions within the short-to-medium range timeframe (1-15 days) where synoptic-scale hydro-meteorological forecasting is theoretically possible. (Note, by ‘synoptic-scale’ here, I am referring to the capability to predict actual weather events, not prevailing conditions. Current science puts the theoretical dynamical limit to predictability at ~2 weeks with current NWP technology achieving ‘usable’ skill out to ~7 days, give or take a few dependent on the region)

Early efforts to bring the Lorenz model into weather prediction sought to develop analytical methods to propagate initial condition uncertainty into a useful probabilistic approximation of the forecast uncertainty at various lead times. This can work for simplified representation of the atmosphere like the Lorenz model, but quickly becomes intractable as the complexity and scale of the governing equations and boundary conditions increases.

Figure 4. Conceptual depiction of ensemble forecasting including sampling of initial condition uncertainty, forecasts at different lead times, regime structure, and ensemble mean comparison to individual ensemble members. Solid arrow represents a ‘control’ member of the ensemble. Histograms represent an approximate empirical distribution of the ensemble forecast. (Wilks, 2019)

Thankfully, rapid advances in computing power led to an alternate approach, ensemble forecasting! Ensemble forecasting is a stochastic-dynamic approach that couples a discrete, probabilistic sampling of the initial condition uncertainty (Figure 4 ‘Initial time’) and propagates each of those initial condition state vectors through a dynamical NWP model. Each of these NWP integrations is a unique trajectory through state space of the dynamical model that yields a discrete approximation of the forecast uncertainty (Figure 4 ‘Intermediate/Final forecast lead time’). This discrete forecast uncertainty distribution theoretically encompasses the full space of potential hydro-meteorologic trajectories and allows a probabilistic representation of forecast outcomes through analysis of the empirical forecast ensemble distribution. These forecast trajectories highlight many of the big 4 properties discussed in previous sections, including regime structure and state-dependent predictability (Figure 4 trajectories are analogous to the Figure 3b trajectories for the Lorenz model). The ensemble mean prediction is an accurate and dynamically consistent prediction at the intermediate lead time, but at the final lead time where distinct regime partitioning has occurred, it is no longer dynamically consistent and delineates a region of low probability in the full ensemble distribution. I will explore properties of ensemble averaging, both good and bad, in a future post.

Lastly, I will note that the ensemble forecasting approach is a type of Monte Carlo procedure. Like other Monte Carlo approaches with complex systems, the methodology for sampling the initial condition uncertainty has a profound effect on the uncertainty quantification contained within the final NWP ensemble output, especially when considering the high degree of spatiotemporal relationships within the observed climatic variables that form the initial state vector. This is a key area of continued research and improvement in ensemble forecasting models.

Final thoughts

I hope that you find this discussion of the theoretical underpinnings of chaotic dynamics and ensemble forecasting to be useful. I have always found these foundational concepts to be particularly fascinating. Moreover, the basic theory has strong connections outside of ensemble forecasting, including the ties to weather regime theory mentioned in this post. I also think these foundational concepts are necessary to understand how much actual ensemble forecasting techniques can diverge from the theoretical stochastic-dynamic framework. This will be the subject of some future posts that will delve into more practical and applied aspects of ensemble forecasting with a focus on water resources management.

Reference

Lorenz, E. N. (1963). Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20, 130–141.

Lorenz, E. N. (2006). Reflections on the Conception , Birth , and Childhood of Numerical Weather Prediction. Annual Review of Earth and Planetary Science, 34, 37–45. https://doi.org/10.1146/annurev.earth.34.083105.102317

Palmer, T. N. (1993). Extended-range atmospheric prediction and the Lorenz model. Bulletin – American Meteorological Society, 74(1), 49–65. https://doi.org/10.1175/1520-0477(1993)074<0049:ERAPAT>2.0.CO;2

Rayleigh, L. (1916). On convective currents in a horizontal layer when the higher temperature is on the underside. Phil. Mag., 32, 529-546.

Wilks, D. S., (2019). Statistical Methods in the Atmospheric Sciences, 4th ed. Cambridge, MA: Elsevier.

Introduction to Bayesian Regression using PyMC

Motivation

Fans of this blog will know that uncertainty is often a focus for our group. When approaching uncertainty, Bayesian methods might be of interest since they explicitly provide uncertainty estimates during the modeling process.

PyMC is the best tool I have come across for Bayesian modeling in Python; this post gives a super brief introduction to this toolkit.

Introduction to PyMC

PyMC, described in their own words:
“… is a probabilistic programming library for Python that allows users to build Bayesian models with a simple Python API and fit them using Markov chain Monte Carlo (MCMC) methods.”

In my opinion, the best part of PyMC is the flexibility and breadth of model design features. The space of different model configurations is massive. It allows you to make models ranging from simple linear regressions (shown here), to more complex hierarchical models, copulas, gaussian processes, and more.

Regardless of your model formulation, PyMC let’s you generate posterior estimates of model parameter distributions. These parameter distributions reflect the uncertainty in the model, and can propagate uncertainty into your final predictions.

The posterior estimates of model parameters are generated using Markov chain Monte Carlo (MCMC) methods. A detailed overview of MCMC is outside the scope of this post (maybe in a later post…).

In the simplest terms, MCMC is a method for estimating posterior parameter distributions for a Bayesian model. It generates a sequence of samples from the parameter space (which can be huge and complex), where the probability of each sample is proportional to its likelihood given the observed data. By collecting enough samples, MCMC generates an approximation of the posterior distribution, providing insights into the probable values of the model parameters along with their uncertainties. This is key when the models are very complex and the posterior cannot be directly defined.

The PyMC example gallery has lots of cool stuff to get you inspired, with examples that go far and beyond the simple linear regression case.


Demonstration:

When writing drafting this post, I wanted to include a demonstration which is (a) simple enough to cover in a brief post, and (b) relatively easy for others to replicate. I settled on the simple linear regression model described below, since this was able to be done using readily retrievable CAMELS data.

The example attempts to predict mean streamflow as a linear function of basin catchment area (both in log space). As you’ll see, it’s not the worst model, but its far from a good model; there is a lot of uncertainty!

CAMELS Data

For a description of the CAMELS dataset, see Addor, Newman, Mizukami and Clark (2017).

I pulled all of the national CAMELS data using the pygeohydro package from HyRiver which I have previously recommended on this blog. This is a convenient single-line code to get all the data:

import pygeohydro as gh

### Load camels data
camels_basins, camels_qobs = gh.get_camels()

The camels_basins variable is a dataframe with the different catchment attributes, and the camels_qobs is a xarray.Dataset. In this case we will only be using the camels_basins data.

The CAMELS data spans the continental US, but I want to focus on a specific region (since hydrologic patterns will be regional). Before going further, I filter the data to keep only sites in the Northeaster US:

# filter by mean long lat of geometry: NE US
camels_basins['mean_long'] = camels_basins.geometry.centroid.x
camels_basins['mean_lat'] = camels_basins.geometry.centroid.y
camels_basins = camels_basins[(camels_basins['mean_long'] > -80) & (camels_basins['mean_long'] < -70)]
camels_basins = camels_basins[(camels_basins['mean_lat'] > 35) & (camels_basins['mean_lat'] < 45)]

I also convert the mean flow data (q_mean) units from mm/day to cubic meters per day:

# convert q_mean from mm/day to m3/s
camels_basins['q_mean_cms'] = camels_basins['q_mean'] * (1e-3) *(camels_basins['area_gages2']*1000**2) * (1/(60*60*24)) 

And this is all the data we need for this crude model!

Bayesian linear model

The simple linear regression model (hello my old friend):

Normally you might assume that there is a single, best value corresponding to each of the model parameters (alpha and beta). This is considered a Frequentist perspective and is a common approach. In these cases, the best parameters can be estimated by minimizing the errors corresponding to a particular set of parameters (see least squares, for example.

However, we could take a different approach and assume that the parameters (intercept and slope) are random variables themselves, and have some corresponding distribution. This would constitute a Bayesian perspective.

Keeping with simplicity in this example, I will assume that the intercept and slope each come from a normal distribution with a mean and variance such that:

When it comes time to make inferences or predictions using our model, we can create a large number of predictions by sampling different parameter values from these distributions. Consequently, we will end up with a distribution of uncertain predictions.

PyMC implementation

I recommend you see the PyMC installation guide to help you get set up.

NOTE: The MCMC sampler used by PyMC is written in C and will be SIGNIFICANTLY faster if you provide have access to GCC compiler and specify the it’s directory using the the following:

import pymc as pm

import os
os.environ["THEANO_FLAGS"] = "gcc__cxxflags=-C:\mingw-w64\mingw64\bin"

You will get a warning if you don’t have this properly set up.

Now, onto the demo!

I start by retrieving our X and Y data from the CAMELS dataset we created above:

# Pull out X and Y of interest
x_ftr= 'area_gages2'
y_ftr = 'q_mean_cms'
xs = camels_basins[x_ftr] 
ys = camels_basins[y_ftr]

# Take log-transform 
xs = np.log(xs)
ys = np.log(ys)

At a glance, we see there is a reasonable linear relationship when working in the log space:

Two of the key features when building a model are:

  • The random variable distribution constructions
  • The deterministic model formulation

There are lots of different distributions available, and each one simply takes a name and set of parameter values as inputs. For example, the normal distribution defining our intercept parameter is:

alpha = pm.Normal('alpha', mu=intercept_prior, sigma=10)

The value of the parameter priors that you specify when construction the model may have a big impact depending on the complexity of your model. For simple models you may get away with having uninformative priors (e.g., setting mu=0), however if you have some initial guesses then that can help with getting reliable convergence.

In this case, I used a simple least squares estimate of the linear regression as the parameter priors:

slope_prior, intercept_prior = np.polyfit(xs.values.flatten(), ys.values.flatten(), 1)

Once we have our random variables defined, then we will need to formulate the deterministic element of our model prediction. This is the functional relationship between the input, parameters, and output. For our linear regression model, this is simply:

y_mu = alpha + beta * xs

In the case of our Bayesian regression, this can be thought of as the mean of the regression outputs. The final estimates are going to be distributed around the y_mu with the uncertainty resulting from the combinations of our different random variables.

Putting it all together now:

### PyMC linear model
with pm.Model() as model:
    
    # Priors
    alpha = pm.Normal('alpha', mu=intercept_prior, sigma=10)
    beta = pm.Normal('beta', mu=slope_prior, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)

    # mean/expected value of the model
    mu = alpha + beta * xs

    # likelihood
    y = pm.Normal('y', mu=mu, sigma=sigma, observed=ys)

    # sample from the posterior
    trace = pm.sample(2000, cores=6)
 

With our model constructed, we can use the pm.sample() function to begin the MCMC sampling process and estimate the posterior distribution of model parameters. Note that this process can be very computationally intensive for complex models! (Definitely make sure you have the GCC set up correctly if you plan on needing to sample complex models.)

Using the sampled parameter values, we can create posterior estimates of the predictions (log mean flow) using the posterior parameter distributions:

## Generate posterior predictive samples
ppc = pm.sample_posterior_predictive(trace, model=model)

Let’s go ahead and plot the range of the posterior distribution, to visualize the uncertainty in the model estimates:

### Plot the posterior predictive interval
fig, ax = plt.subplots(ncols=2, figsize=(8,4))

# log space
az.plot_hdi(xs, ppc['posterior_predictive']['y'], 
            color='cornflowerblue', ax=ax[0], hdi_prob=0.9)
ax[0].scatter(xs, ys, alpha=0.6, s=20, color='k')
ax[0].set_xlabel('Log ' + x_ftr)
ax[0].set_ylabel('Log Mean Flow (m3/s)')

# original dim space
az.plot_hdi(np.exp(xs), np.exp(ppc['posterior_predictive']['y']), 
            color='cornflowerblue', ax=ax[1], hdi_prob=0.9)
ax[1].scatter(np.exp(xs), np.exp(ys), alpha=0.6, s=20, color='k')
ax[1].set_xlabel(x_ftr)
ax[1].set_ylabel('Mean Flow (m3/s)')
plt.suptitle('90% Posterior Prediction Interval', fontsize=14)
plt.show()

And there we have it! The figure on the left shows the data and posterior prediction range in log-space, while the figure on the right is in non-log space.

As mentioned earlier, it’s not the best model (wayyy to much uncertainty in the large-basin mean flow estimates), but at least we have the benefit of knowing the uncertainty distribution since we took the Bayesian approach!

That’s all for now; this post was really meant to bring PyMC to your attention. Maybe you have a use case or will be more likely to consider Bayesian approaches in the future.

If you have other Bayesian/probabilistic programming tools that you like, please do comment below. PyMC is one (good) option, but I’m sure other people have their own favorites for different reasons.


PyMC resources:

References

Addor, N., Newman, A. J., Mizukami, N. and Clark, M. P. The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, doi:10.5194/hess-21-5293-2017, 2017.

Python Profiling with line_profiler

The line_profiler can be used to see the amount of time taken to execute each line in a function of your code. I think this is an important tool that can be used to reduce the runtime of a code. Simple command “pip install line_profiler” shall install the package or “conda install line_profiler” to install into an existing conda environment.

I shall present the usage of this line_profiler tool for a randomly generated data to calculate supply to demand ratio for releases from a reservoir. Demand or target supply is defined for day of the water year. The following code first defines the calculation for day of water year, generates random data for demand and supply, and two functions are defined for different methods of calculation of ratio of supply to demand. Include the line @profile before the function definition line to get the profile of execution of each line in the function.

import pandas as pd
import numpy as np
from line_profiler import profile

#function to caluclate day of water year
def get_dowy(date):
water_year_start = pd.Timestamp(year=date.year, month=10, day=1)
if date < water_year_start:
water_year_start = pd.Timestamp(year=date.year - 1, month=10, day=1)
return (date - water_year_start).days + 1

# Generate random data for demand for each day of water year
np.random.seed(0)
data = {
'Median_Demand': np.random.randint(0, 1000, 367),
}

# Create dataframe
df_demand = pd.DataFrame(data)

## Generate random data for supply from years 2001 to 2010 and also define corresponding day of water year
date_range = pd.date_range(start='2001-10-01', end='2091-09-30', freq='D')
data = {
'dowy': [get_dowy(date) for date in date_range],
'Supply': np.random.uniform(0, 2500, len(date_range))
}
# Create dataframe
df_supply = pd.DataFrame(data, index=date_range)

@profile #define before the function for profiling
def calc_supply_demand_1(df,df_median):
ratio = pd.DataFrame()
medians_dict = df_demand['Median_Demand'].to_dict()
demand = df_supply['dowy'].map(medians_dict)
supply = df_supply['Supply']
ratio = supply.resample('AS-OCT').sum() / demand.resample('AS-OCT').sum()
return ratio

@profile
def calc_supply_demand_2(df,df_median):
ratio = pd.DataFrame()
medians_dict = df_demand['Median_Demand'].to_dict()
demand = pd.Series([df_demand['Median_Demand'][i] for i in df.dowy], index=df.index)
supply = df_supply['Supply']
ratio = supply.resample('AS-OCT').sum() / demand.resample('AS-OCT').sum()
return ratio

ratio1 = calc_supply_demand_1(df_supply, df_demand)
ratio2 = calc_supply_demand_2(df_supply,df_demand)

Running just the code wouldn’t output anything related to line_profiler. To enable profiling, run the script as follows (this sets the environment variable LINE_PROFILE=1)

LINE_PROFILE=1 python Blog_Post.py

The above line generates three output files as profile_output.txt, profile_output_.txt, and profile_output.lprof and stdout is as follows:

Timer unit: 1e-09 s

0.04 seconds - /directory/Blog_Post.py:30 - calc_supply_demand_1
2.43 seconds - /directory/Blog_Post.py:39 - calc_supply_demand_2
Wrote profile results to profile_output.txt
Wrote profile results to profile_output_2024-03-29T192919.txt
Wrote profile results to profile_output.lprof
To view details run:
python -m line_profiler -rtmz profile_output.lprof

On executing the line “python -m line_profiler -rtmz profile_output.lprof”, the following is printed.

Timer unit: 1e-06 s

Total time: 0.0393394 s
File: /directory/Blog_Post.py
Function: calc_supply_demand_1 at line 30

Line # Hits Time Per Hit % Time Line Contents
==============================================================
30 @profile
31 def calc_supply_demand_1(df,df_median):
32 1 2716.4 2716.4 6.9 ratio = pd.DataFrame()
33 1 1365.2 1365.2 3.5 medians_dict = df_demand['Median_Demand'].to_dict()
34 1 3795.6 3795.6 9.6 demand = df_supply['dowy'].map(medians_dict)
35 1 209.7 209.7 0.5 supply = df_supply['Supply']
36 1 31252.0 31252.0 79.4 ratio = supply.resample('AS-OCT').sum() / demand.resample('AS-OCT').sum()
37 1 0.5 0.5 0.0 return ratio

Total time: 2.43446 s
File: /directory/Blog_Post.py
Function: calc_supply_demand_2 at line 39

Line # Hits Time Per Hit % Time Line Contents
==============================================================
39 @profile
40 def calc_supply_demand_2(df,df_median):
41 1 1365.1 1365.1 0.1 ratio = pd.DataFrame()
42 1 697.5 697.5 0.0 medians_dict = df_demand['Median_Demand'].to_dict()
43 1 2411800.5 2e+06 99.1 demand = pd.Series([df_demand['Median_Demand'][i] for i in df.dowy], index=df.index)
44 1 53.9 53.9 0.0 supply = df_supply['Supply']
45 1 20547.0 20547.0 0.8 ratio = supply.resample('AS-OCT').sum() / demand.resample('AS-OCT').sum()
46 1 0.6 0.6 0.0 return ratio

0.04 seconds - /directory/Blog_Post.py:30 - calc_supply_demand_1
2.43 seconds - /directory/Blog_Post.py:39 - calc_supply_demand_2

The result shows line number, number of hits (number of times the line is executed; hits increase when executed in a for loop), total time, time per hit, percentage of time and the line contents. The above result implies that for the first function, 79.4% of time was used to execute ratio, whereas for the second function 99.1% is used in execution of demand. ratio1 and ratio2 are the two exact same outputs where demand is defined in different ways in both the functions. We also see that time taken to execute calc_supply_demand_1 function is 0.04 seconds and calc_supply_demand_2 is 2.43 seconds. Using this I could reduce the runtime by 61 times (2.43/0.04) identifying that demand calculation takes 99.1% of time in calc_supply_demand_2 function using line_profiler. Another method is using cprofile (details are in this blog post). cprofile gives more detailed information.

References:

https://kernprof.readthedocs.io/en/latest

https://researchcomputing.princeton.edu/python-profiling