Running a Python script using Excel macros

Why run a Python script through Excel? Why bother with the middleman when environments such as Spyder and Jupyter Notebook exist?

Something that I have been learning of late is the importance of diversifying methods of presenting one’s work in the spirit of open science, communicability and inclusion. In that line of thinking, running a Python script via an Excel ‘user interface’ addresses two issues:

  1. Excel VBA’s slower reading and writing of data
  2. The steep learning curve associated with learning how to code in Python

In short, executing Python via Excel provides those with sufficient experience in the language with an avenue to efficiently communicate and visualize their data in a way that most people can see and understand. This blog post will demonstrate this point by extracting the mean and standard deviation of the sepal length and width, as well as the petal length and width, of three different iris subspecies from the famed Iris dataset, which can be found here.

Download the dataset, called ‘iris.data’, into a folder or location of your choice. Next, change the file’s extension to ‘.csv’. Let’s start processing this in Python!

1. Writing the Python script

Here, we will write a Python script that generates two files: the means (iris_means.csv) and standard deviations (iris_std.csv) of each iris subspecies’ attributes:

import pandas as pd

# the types of data within the iris dataset
col_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "subspecies"]

iris_data = pd.read_csv('C:/your-preferred-location/iris.csv', 
                        sep=',', header=None, names=col_names, index_col=None)

#%%
iris_setosa = iris_data[iris_data['subspecies']=='Iris-setosa']
iris_setosa=iris_setosa.drop(['subspecies'], axis=1)

iris_versicolor = iris_data[iris_data['subspecies']=='Iris-versicolor']
iris_versicolor=iris_versicolor.drop(['subspecies'], axis=1)

iris_virginica = iris_data[iris_data['subspecies']=='Iris-virginica']
iris_virginica=iris_virginica.drop(['subspecies'], axis=1)

#%%
mean_setosa = iris_setosa.mean(axis=0)
std_setosa = iris_setosa.std(axis=0)

mean_versicolor = iris_versicolor.mean(axis=0)
std_versicolor = iris_versicolor.std(axis=0)

mean_virginica = iris_virginica.mean(axis=0)
std_virginica = iris_virginica.std(axis=0)

subspecies = ['Setosa', 'Versicolor', 'Virginica']

mean_vals = pd.concat([mean_setosa, mean_versicolor, mean_virginica], axis=1)
mean_vals.columns=subspecies

std_vals = pd.concat([std_setosa, std_versicolor, std_virginica], axis=1)
std_vals.columns=subspecies

mean_vals.to_csv('C:/your-preferred-location/iris_means.csv', 
                 sep=',', header=True, index=True)
std_vals.to_csv('C:/your-preferred-location/iris_std.csv', 
                sep=',', header=True, index=True)
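As an aside, pandas can produce the same two summary tables more compactly with groupby. A minimal sketch assuming the same iris_data frame as above (note that the columns will carry the raw labels such as 'Iris-setosa' rather than the shortened names used in the script):

# Compact alternative: group by subspecies, then transpose so that
# subspecies are columns, matching the layout of the files written above
means = iris_data.groupby('subspecies').mean().T
stds = iris_data.groupby('subspecies').std().T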

2. Writing the Excel VBA macro

We will first set up the Excel document that will execute the Excel macro. Create an Excel worksheet file. Mine is named ‘iris_GUI.xlsx’. Next, navigate to the ‘File’ tab and select ‘Options’. Go to ‘Customize Ribbon’ and make sure that ‘Developer’ is checked:

Figure 1: Enabling the Developer tab.

Click ‘OK’. The developer tab should now be visible in your Toolbar Ribbon:

Figure 2: Where the Developer tab is located.

Let’s get to the macro! Under the Developer tab, identify the ‘Macros’ tool on the far-left side of the toolbar. Select it and give your macro a suitable name. I called mine ‘link_python_excel’.

Figure 3: Name and create your macro.

Once this is done, click ‘Create’. Next, you should see a window like this pop up:

Figure 4: This is where you will write your macro.

Within the provided space, first initialize the macro using Sub link_python_excel(). This tells Excel VBA (Excel’s programming language) that you are about to write a macro called ‘link_python_excel’.

Next, declare your shell object, and the paths to your Python executable and Python script as strings. This enables VBA to locate the Python executable and use it to run the script as intended. (Note that VBA requires an explicit type for each variable: writing Dim PythonExe, PythonScript As String would declare PythonExe as a Variant, so each variable gets its own As String below.)

Dim objShell As Object
Dim PythonExe As String, PythonScript As String

You will then want to assign a macro object to its declaration:

Set objShell = VBA.CreateObject("Wscript.shell")

Please do not tamper with the “Wscript.shell” term. This assignment is the portion of the code that enables the macro to issue commands to the Windows shell, thus enabling VBA to execute the Python script. More information on this matter can be found at this website.

Following this, provide the filepath to the Python executable and the Python script:

PythonExe = """C:\Users\lbl59\AppData\Local\Programs\Python\Python39\python.exe"""
PythonScript = "C:\Users\lbl59\Desktop\run_python_in_excel\process_iris_data.py"

Note the use of the triple quotation marks. In VBA, a doubled quotation mark inside a string produces a literal quote character, so “””…””” wraps the path in quotes. This is needed whenever a file path may contain spaces, and is generally good practice for file paths.

Finally, run your Python script and activate your workbook. Note the space concatenated between the executable path and the script path; without it, the two paths would run together into a single invalid command. The activation is necessary if you would like to run the script via a button in Excel, which we shall go through in a bit.

objShell.Run PythonExe & " " & PythonScript
Application.Goto Reference:="link_python_excel"

Finally, don’t forget to end the macro using End Sub.

Overall, your macro should look like this:

Sub link_python_excel()

' link_python_excel Macro
' Declare all variables
Dim objShell As Object
Dim PythonExe As String, PythonScript As String
    
    'Create a new Shell Object
    Set objShell = VBA.CreateObject("Wscript.shell")
        
    'Provide the file path to the Python Exe
    PythonExe = """C:\Users\lbl59\AppData\Local\Programs\Python\Python39\python.exe"""
        
    'Provide the file path to the Python script
    PythonScript = "C:\Users\lbl59\Desktop\run_python_in_excel\process_iris_data.py"
        
    'Run the Python script
    objShell.Run PythonExe & " " & PythonScript
    Application.Goto Reference:="link_python_excel"
    
End Sub

3. Running the Python script from Excel

Save the macro. Note that you will have to save the Excel workbook as a ‘.xlsm’ file to enable macro functionality. Once this is done, navigate to the ‘Developer’ tab and select ‘Insert’ and click on the button icon.

Figure 5: Inserting a button into the Excel workbook.

Draw the button the same way you would a rectangular shape. Rename the button, if you so prefer. In this exercise, the button is labeled ‘Run Code’. Next, right click on the button and select ‘Assign Macro’.

Figure 6: Assign the macro.

Once this is selected, you should see the option to assign the ‘link_python_excel’ macro to the button. Select the macro, and you are done! Clicking the button will run the Python script, and the two output files will be written to the location you specified in the script (here, the same folder as your iris.csv dataset).

Summary

In this post, we walked through the steps of writing a Python script to be run using an Excel macro. First, a Python script was written to process the iris dataset and output two files. Next, the Excel macro was written to execute this script. Finally, the macro was assigned to a button in the Excel sheet, where the Python script can be executed when the button is clicked.

Hope you found this useful!

To parallelize or not to parallelize? On performance of hybrid schemes in Python with the Borg MOEA

Hybrid parallelization is often used when performing noisy simulation-optimization on high-performance computing clusters. One common technique is to divide the available CPUs into a smaller number of “tasks”, each of which uses its subset of CPUs to assess the performance of candidate solutions through parallel evaluation of multiple scenarios. In this post, I will demonstrate how to use Python’s Multiprocessing module to set up a hybrid parallelization scheme with the Borg MOEA, and explore the resulting tradeoff between efficiency and learning rate in order to better understand when one should (and should not) implement such a scheme.

This post will bring together and build upon several previous posts on this blog, including (1) this demonstration by Andrew Dircks for how to use the Borg MOEA’s Python wrapper to optimize multi-objective simulations written in Python, using the Lake Problem as example; (2) this explanation of the Borg MOEA’s hybrid parallelization strategies by David Gold; and (3) this post by Bernardo Trindade on parallelizing Monte Carlo evaluations using Python and C/C++. The present post will give a brief overview of these topics in order to show how they can be brought together, but the reader is referred to the posts above for more detail on any of these topics.

Why hybrid parallelization?

As a motivating example, consider a complex water resources simulation model that takes 10 minutes to run a 30 year simulation. We want to explore the potential for new infrastructure investments to improve multiple objectives such as water supply reliability, cost, etc. We will try to discover high-performing solutions using an MOEA. To account for uncertainty, each candidate investment will be evaluated in 48 different stochastic scenarios.

Given the expense of our simulation model, this will be a major computational challenge. If we evaluate 20,000 candidate solutions (which is quite low compared to other studies), the total run time will be 20000 * 48 * 10 minutes = 6,667 days. Clearly, we cannot afford to run this serially on a laptop. If we have access to a computer cluster, we can parallelize the MOEA search and speed things up drastically. For example, with an XSEDE allocation on TACC Stampede2, I can run jobs with up to 128 Skylake nodes, each of which has 48 CPU, for a total of 6144 CPU per job. With this setup, we could finish the same experiment in ~1.1 days.

With the simplest possible parallelization strategy, we have a single CPU as the “Master” for the Borg MOEA, and 6143 CPU running individual function evaluations (FE). Each FE is allocated a single CPU, which runs 48 stochastic simulations in serial (i.e., there is no hybrid parallelization taking place). At the other extreme, I could set the MOEA to run with an entire node per “task”. This would allocate a full node to the Borg Master process, plus 127 nodes running individual FEs. Each FE would distribute its 48 stochastic simulations across its 48 CPUs in parallel. Thus, we are running fewer concurrent FEs (127 vs. 6143), but each finishes much faster (10 minutes vs. 480 minutes). In between these two extremes, there are intermediate parallelization schemes that allocate 2, 3, 4, 6, 8, 12, 16, or 24 CPU per task (Table 1).

So how to choose a parallelization strategy? The first thing to consider is efficiency. In theory, the scheme without hybrid parallelization (1 CPU per task) is the most efficient, since it allocates only one CPU to the Borg Master process, allowing for 6143 CPU to be simultaneously performing simulations. The 48 CPU/task scheme, on the other hand, allocates an entire node to the Borg Master (even though it only needs a single CPU to perform its duties), so that only 127*48=6096 CPU are running simulations at a given time. Thus, it should be ~1% slower in theory. However, in practice, a variety of factors such as communication or I/O bottlenecks can change this behavior (see below), so it is important to run tests on your particular machine.

If the hybrid parallelization scheme is slower, why should we even consider it? The key point is the increased potential for “learning” within the MOEA loop. Remember that the 1 CPU/task scheme has 6143 FE running concurrently, while the 48 CPU/task scheme has 127 FE. This means that the former will only have approximately 20000/6143=3.3 “generations,” while the latter will have 20000/127=157.5 generations*. Put another way, each FE will take 480 minutes (30% of total 1.1 days) to complete for the 1 CPU/task scheme, compared to 10 minutes (0.6% of total) for the 48 CPU/task scheme. The latter thus provides a more continuous stream of information to the MOEA Master, and therefore many more opportunities for the MOEA to learn what types of solutions are successful through iterative trial and error.

*Note: The Borg MOEA is a steady-state algorithm rather than a generational algorithm, so I use the term “generation” loosely to refer to the number of FE that each task will complete. See Dave Gold’s blog post on the Borg MOEA’s parallelization for more details on generational vs. steady-state algorithms. Note that for this post, I am using the master-worker variant of the Borg MOEA.

Function evaluations (FE) | Total CPU | Tasks/node | CPU/task | Concurrent FE | Generations
20000 | 6144 | 48 | 1 | 6143 | 3.3
20000 | 6144 | 24 | 2 | 3071 | 6.5
20000 | 6144 | 16 | 3 | 2047 | 9.8
20000 | 6144 | 12 | 4 | 1535 | 13.0
20000 | 6144 | 8 | 6 | 1023 | 19.6
20000 | 6144 | 6 | 8 | 767 | 26.1
20000 | 6144 | 4 | 12 | 511 | 39.1
20000 | 6144 | 3 | 16 | 383 | 52.2
20000 | 6144 | 2 | 24 | 255 | 78.4
20000 | 6144 | 1 | 48 | 127 | 157.5
Table 1: Scaling of concurrent function evaluations and generations for different hybrid parallelization schemes, using 128 Skylake nodes (48 CPU each) on Stampede2.
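As a quick check, the scaling in Table 1 can be reproduced with a few lines of Python. This sketch simply encodes the arithmetic described above, with one task reserved for the Borg Master:

# Reproduce the scaling in Table 1: 128 Skylake nodes with 48 CPU each
total_nodes, cpu_per_node, total_fe = 128, 48, 20000

for cpu_per_task in [1, 2, 3, 4, 6, 8, 12, 16, 24, 48]:
    tasks_per_node = cpu_per_node // cpu_per_task
    concurrent_fe = total_nodes * tasks_per_node - 1  # one task is the Borg Master
    generations = total_fe / concurrent_fe
    print(f"{cpu_per_task:>2} CPU/task: {concurrent_fe:>4} concurrent FE, "
          f"{generations:5.1f} generations")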

A simple example: hybrid parallelization of the Lake Problem in Python

Above, I have outlined a purely theoretical tradeoff between efficiency and algorithmic learning. How does this tradeoff play out in practice? What type of hybrid scheme is “best”? I will now explore these questions in more detail using a smaller-scale example. The Lake Problem is a popular example for demonstrating concepts such as non-linear dynamics, deep uncertainty, and adaptive multi-objective control; a search for “Lake Problem” on the Water Programming Blog returns many results (I stopped counting at 30). In this problem, the decision-maker is tasked with designing a regulatory policy that balances the economic benefits of phosphorus with its environmental risks, given the large uncertainty in the complex dynamics that govern lake pollution concentrations over time. In a 2017 article in Environmental Modelling & Software, Julie Quinn demonstrated the merits of Evolutionary Multi-Objective Direct Policy Search (EMODPS) for developing control policies that adaptively update phosphorus releases based on evolving conditions (e.g., current lake concentration).

A Python version of the EMODPS Lake Problem was introduced along with a Python wrapper for the Borg MOEA in this blog post by Andrew Dircks. Additionally, this post by Bernardo Trindade demonstrates how to parallelize Monte Carlo ensembles in Python as well as C/C++. For the present example, I combined aspects of these two posts in order to parallelize the Monte Carlo evaluations within each function evaluation, using the multiprocessing module in Python. In this blog post, I will highlight a few of the important differences relative to the non-parallel version of the problem in Andrew Dircks’ blog post, but the interested reader is referred to these two previous posts for more detail. All code used to produce these results can be found in this GitHub repository.

The lake_mp.py file specifies the analysis that the Borg MOEA should run for each function evaluation – we will evaluate each candidate operating policy using an ensemble of 48 simulations, each of which uses a different 100-year stochastic realization of hydrology. This ensemble will be divided among a smaller number of CPUs (say, 16) assigned to each function evaluation. Thus, we will have to set up our code a bit differently than in the serial version. First, consider the overall “problem” function which is called by the Borg Master. Within this function, we can set up shared arrays for each objective. The Array class from the multiprocessing module allows for thread-safe access to the same memory array from multiple CPUs. Note that this module requires shared memory (i.e., the CPUs must be on the same node). If you want to share memory across multiple nodes, you will need to use the mpi4py module, rather than multiprocessing.

from multiprocessing import Process, Array
# create shared memory arrays which each process can access/write.
discounted_benefit = Array('d', nSamples)
yrs_inertia_met = Array('i',  nSamples)
yrs_Pcrit_met = Array('i', nSamples)
average_annual_P = Array('d', nSamples*nYears)

Now we need to figure out which simulations should be performed by each CPU, and dispatch those simulations using the Process class from the multiprocessing module.

# assign MC runs to different processes
nbase = int(nSamples / nProcesses)
remainder = nSamples - nProcesses * nbase
start = 0
shared_processes = []
for proc in range(nProcesses):
    nTrials = nbase if proc >= remainder else nbase + 1
    stop = start + nTrials
    p = Process(target=dispatch_MC_to_procs, args=(proc, start, stop, discounted_benefit, yrs_inertia_met, yrs_Pcrit_met, average_annual_P, natFlow, b, q, critical_threshold, C, R, newW))

    shared_processes.append(p)
    start = stop

# start processes
for sp in shared_processes:
    sp.start()

# wait for all processes to finish
for sp in shared_processes:
    sp.join()

The “args” given to the Process function contains information needed by the individual CPUs, including the shared memory arrays. Once we have appended the processes for each CPU, we start them, and then wait for all processes to finish (“join”).

The “target” in the Process function tells each CPU which subset of Monte Carlo samples to evaluate, and runs the individual evaluations in a loop:

#Sub problem to dispatch MC samples from individual processes
def dispatch_MC_to_procs(proc, start, stop, discounted_benefit, yrs_inertia_met, yrs_Pcrit_met, average_annual_P, natFlow, b, q, critical_threshold, C, R, newW):
    lake_state = np.zeros([nYears + 1])
    Y = np.zeros([nYears])
    for s in range(start, stop):
        LakeProblem_singleMC(s, discounted_benefit, yrs_inertia_met, yrs_Pcrit_met, average_annual_P, lake_state, natFlow[s, :], Y, b, q, critical_threshold, C, R, newW)

Within the loop, we call LakeProblem_singleMC to run each Monte Carlo evaluation. After running the simulation, this function can store its sub-objectives in the shared arrays:

# fill in results for MC trial in shared memory arrays, based on index of this MC sample (s)
average_annual_P[(s*nYears):((s+1)*nYears)] = lake_state[1:]
discounted_benefit[s] = db
yrs_inertia_met[s] = yim
yrs_Pcrit_met[s] = yPm

Finally, back in the main “problem” function, we can evaluate the objectives by aggregating over the shared memory arrays:

# Calculate minimization objectives (defined in comments at beginning of file)
objs[0] = -1 * np.mean(discounted_benefit)  #average economic benefit
objs[1] = np.max(np.mean(np.reshape(average_annual_P, (nSamples, nYears)), axis=0))  #minimize the max average annual P concentration
objs[2] = -1 * np.sum(yrs_inertia_met) / ((nYears - 1) * nSamples)  #average percent of transitions meeting inertia threshold
objs[3] = -1 * np.sum(yrs_Pcrit_met) / (nYears * nSamples)  #average reliability
constrs[0] = max(0.0, reliability_threshold - (-1 * objs[3]))

time.sleep(slp)

In the last line, I add a half-second sleep after each function evaluation in order to help this example better mimic the behavior of the much slower motivating problem at the beginning of this post. Without the sleep, the Lake Problem is much faster than larger simulation codes, and therefore could exhibit additional I/O and/or master-worker interaction bottlenecks.

Again, the interested reader is referred to the GitHub repository, as well as Andrew Dircks’ original Lake Problem blog post, for more details on how these code snippets fit into the broader program.

This experiment was run on TheCube cluster at Cornell University, using 12 nodes of 16 CPU each. I tested 5 different hybrid parallelization schemes, from 1 to 16 CPU per task (Table 2). Each formulation was run for 5 random seeds to account for the stochastic dynamics of evolutionary algorithms. For this experiment, I will show results through 800 function evaluations. Although this is much smaller than standard practice, the short experiment allows us to mimic the behavior of the much larger experiment introduced at the beginning of this post (compare the number of generations in Tables 1 & 2).

Function evaluations (FE) | Total CPU | Tasks/node | CPU/task | Concurrent FE | Generations
800 | 192 | 16 | 1 | 191 | 4.2
800 | 192 | 8 | 2 | 95 | 8.4
800 | 192 | 4 | 4 | 47 | 17.0
800 | 192 | 2 | 8 | 23 | 34.8
800 | 192 | 1 | 16 | 11 | 72.7
Table 2: Scaling of concurrent function evaluations and generations for different hybrid parallelization schemes, using 12 nodes (16 CPU each) on TheCube.

How do different parallelization schemes compare?

Figure 1 shows the performance of the 1 and 16 CPU/task schemes, in terms of the hypervolume of the approximate Pareto set. The hybrid parallelization scheme (16 CPU/task) is found to outperform the serial scheme (1 CPU/task): across the five seeds, the parallelized scheme generally improves more quickly and reaches a higher overall hypervolume after 800 FE. The intermediate parallelization schemes (2, 4, & 8 CPU/task) perform similarly to the 16 CPU/task scheme (see the GitHub repository for figures).

Figure 1: Hypervolume vs. number of function evaluations for parallelization schemes with 1 and 16 CPU per function evaluation.

Figure 2 shows the number of completed function evaluations over time, which helps illuminate the source of this advantage. The serial formulation shows a staircase pattern: large chunks of time with no progress, followed by short bursts with many FE finishing all at once. Overall, the FE arrive in 5 major bursts (with the last smaller than the rest), at approximately 25, 50, 75, 100, and 125 seconds. As we move to 2, 4, 8, and 16 CPU/task, the number of “stairs” is seen to double each time, so that the 16 CPU/task scheme almost approximates a straight line. These patterns are consistent with the generational pattern shown in Table 2.

Figure 2: Runtime vs. number of function evaluations for parallelization schemes with 1, 2, 4, 8, and 16 CPU per function evaluation.

So why does this matter? Elitist evolutionary algorithms use the relative performance of past evaluations to influence the random generation of new candidate solutions. Because a hybrid parallelization strategy evaluates fewer solutions concurrently and accumulates more “generations” of information, it has more ability to learn over the course of the optimization. For example, with the serial scheme, the first generation of 191 candidate solutions is randomly generated with no prior information. The 16 CPU/task scheme, on the other hand, only produces 11 solutions in the first generation, and has had ~17 generations’ worth of learning by the time it reaches 191 FE.

Additionally, the Borg MOEA has a variety of other auto-adaptive features that can improve the performance of the search process over the course of a single run (see Hadka & Reed, 2013). For example, Borg has six different recombination operators that can be used to mate and mutate previously evaluated solutions in order to generate new candidates for evaluation. The MOEA adaptively updates these selection probabilities in favor of the operators that have been most successful at producing high-performing offspring. Hybrid parallelization allows these probabilities to be adapted earlier and more regularly, due to the more continuous accumulation of information (Figure 3). This can improve the performance of the MOEA.

Figure 3: Operator probabilities vs number of function evaluations (NFE) for parallelization schemes with 1 and 16 CPU per function evaluation. The subplots represent the auto-adaptive selection probabilities for the six recombination operators used by the Borg MOEA: differential evolution (DE), parent-centric crossover (PCX), simulated binary crossover (SBX), simplex crossover (SPX), uniform mutation (UM), and unimodal normal distribution crossover (UNDX).

What happens in longer runs?

At this point, it may seem that hybrid parallelization is always beneficial, since it can increase the algorithm’s adaptability (Figure 3) and improve performance (Figure 1) for a fairly negligible computational cost (Figure 2). However, the experiments in Tables 1 & 2 are very constrained in terms of the number of generations that are run. In most contexts, we will have more function evaluations and/or fewer nodes, so that the ratio of nodes to FE is much smaller and thus the number of generations is much larger. In these situations, do the advantages of hybrid parallelization hold up?

Figure 4 shows the results of a longer experiment, which is identical to the one above except that it uses 10,000 FE rather than 800 FE. Over the longer experiment, the advantage of the parallelized scheme (16 CPU/task) is found to diminish, with random seed variability becoming dominant. With more generations, the differences between the two schemes seem to fade.

Figure 4: Same as Figure 1, but for a longer experiment of 10,000 function evaluations.

Figure 5 shows the evolution of the recombination operator selection probabilities over the longer experiment. The serial scheme still lags in terms of adaptation speed, but by the end of the experiment, the two formulations have very similar behavior.

Figure 5: Same as Figure 3, but for a longer experiment of 10,000 function evaluations.

Lastly, Figure 6 shows the runtime as a function of FEs for the two formulations. The parallelized scheme takes approximately 15% longer than the serial scheme. We can compare this to the theoretical slowdown based on the Borg Master overhead: the 16 CPU/task scheme runs 11 concurrent FE, each using 16 CPU, for a total of 176 CPU, while the serial version runs 191 concurrent FE, each using 1 CPU. 191 is 9% larger than 176. The difference between 15% and 9% indicates additional bottlenecks, such as writing to the shared array. This overhead may vary based on the characteristics of the cluster and the computational experiment, so testing should be performed prior to running any large experiment.

Figure 6: Same as Figure 2, but for a longer experiment of 10,000 function evaluations.

Given all this information, what to do? Should you use hybrid parallelization for your function evaluations? The answer, unsatisfyingly, is “it depends.” The most important factor to consider is the number of generations in your optimization, calculated as the total number of FE divided by the number of concurrent FE. If this number is relatively high (say, >10-20), then you may be better off sticking with serial FEs in order to improve efficiency. If the number of generations is small, it may be worthwhile to introduce hybrid parallelization in order to stimulate adaptive learning within the MOEA. Where to draw the line between these two poles will likely vary based on a number of factors, including the number of nodes you are using, the computational architecture of the cluster, the runtime of each function evaluation, and the difficulty of the optimization problem, so testing is recommended before committing to a scheme in any given scenario.
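To make that rule of thumb concrete, here is a small sketch. This is my own heuristic encoding of the discussion above; the threshold is a judgment call, not a hard rule:

# Heuristic: hybrid parallelization is most attractive when the optimization
# is "generation-starved", i.e., when total FE / concurrent FE is small
def n_generations(total_fe, concurrent_fe):
    return total_fe / concurrent_fe

def consider_hybrid(total_fe, concurrent_fe, threshold=15):
    """Return True if hybrid parallelization may be worth testing."""
    return n_generations(total_fe, concurrent_fe) < threshold

print(consider_hybrid(20000, 6143))  # True: ~3.3 generations -> hybrid may help
print(consider_hybrid(20000, 500))   # False: ~40 generations -> serial is fine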

Clustering basics and a demonstration in clustering infrastructure pathways

Introduction

When approaching a new dataset, the complexity may be daunting. If patterns in the data are not readily apparent, one potential first step is to cluster the data and search for groupings. Clustering data, or cluster analysis, can reveal relationships between observations and provide insight into the data.

Figure: Graphical representation of how clustering can be used to identify patterns in a data set.

Motivation

The outcome of a cluster analysis can be highly dependent upon the clustering algorithm being used, the clustering criteria, and the specified number of clusters to be found. Anticipating these influences and how to manipulate them will set you up for success in your data analysis.

In this post, I introduce two fundamental clustering algorithms: hierarchical clustering and K-means clustering. I will then share some important considerations, such as how to select a suitable number of clusters. Finally, I will demonstrate the power of clustering by applying these methods to an infrastructure pathways dataset, clustering infrastructure pathways generated for 1,000 different states of the world into sets of ‘light’, ‘moderate’, and ‘heavy’ infrastructure development.

Let’s begin!

Methods

Hierarchical Clustering

Hierarchical clustering can be performed “bottom-up”, through an agglomerative approach, which starts by placing each observation in its own cluster and then successively links the observations into increasingly larger clusters. Alternatively, the clustering can take a “top-down”, or divisive, approach which begins with all observations within a single cluster and partitions the set into smaller clusters until each observation exists within its own cluster.

Figure: A graphical representation of the hierarchical clustering process. Red circles denote clusters. If read from left to right, then the figure is representative of the agglomerative approach. If read from right to left, then the figure is representative of the divisive approach.

Specifying how clusters are preferentially assigned will influence the final shape and contents of the clusters. When performing hierarchical clustering, the user must specify the linkage criterion, which determines how the distance between clusters is measured, and consequently which clusters will be grouped during the next iteration. Three common linkage criteria are:

Linkage Type | Description | Formulation
Complete (max) | the maximum distance between observations in one group and observations in another group | max{ d(a, b) : a in A, b in B }
Single (min) | the minimum distance between observations in one group and observations in another group | min{ d(a, b) : a in A, b in B }
Average | the average distance between observations in one group and observations in another group | average{ d(a, b) : a in A, b in B }

Additionally, when calculating the linkage criteria, different methods of measuring the distance can be used. For example, when working with correlated data, it may be preferable to use the Mahalanobis distance over the regular Euclidean norm.

The results of the clustering analysis can be viewed through a dendrogram, which displays a tree-like representation of the cluster structure. For more detail on dendrograms, and a step-by-step tutorial for creating color-coded dendrograms in R, see Rohini’s (2018) post here.
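To make this concrete, here is a minimal, self-contained sketch of agglomerative clustering and its dendrogram using scipy's hierarchy tools (the toy data and names here are mine, purely for illustration):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data: two well-separated groups of 2-D observations
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

Z = linkage(X, method='complete')  # complete (max) linkage criterion
dendrogram(Z)
plt.xlabel('Observation')
plt.ylabel('Linkage distance')
plt.show()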

K-Means Clustering

The K-means algorithm follows three simple steps:

  1. Selection of k centroids within the sample space.

In the traditional naïve K-means algorithm, centroids do not correspond to specific observations within the space. They can be randomly selected, or chosen through more informed methods (see, for example, K-means++).

  2. Assignment of each observation to the cluster with the nearest centroid.

  3. Re-definition of each cluster centroid as the mean of the data points assigned to it.

Steps 2 and 3 of the algorithm are repeated until the centroids begin to stabilize, such that subsequent iterations produce centroids located within some small, tolerable distance from one another.
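The three steps above translate almost directly into code. Below is a minimal sketch of the naïve K-means algorithm in NumPy (all names are mine, for illustration), assuming a data matrix X with one observation per row; the initialization here simply draws k observations at random, one common choice:

import numpy as np

def kmeans(X, k, tol=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k initial centroids (here, k random observations)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each observation to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 3: re-define each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Stop once the centroids stabilize within the tolerable distance
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids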

Figure: A graphical representation of the k-means clustering process, with 3 clusters. X’s denote the centroids for that respective cluster. The dashed arrows denote the shift in the location of the centroid during the next iteration. Points are colored according to their assigned cluster during that iteration.

The centroids toward which the algorithm converges may represent a locally optimal clustering solution, and can be highly dependent upon the initial centroid selection. Often, this sensitivity is handled by performing the clustering process several times and choosing the set of clusters that yields the smallest within-cluster sum of squares.

Selecting the number of clusters

In some contexts, it is possible to know how many clusters are required prior to the analysis. For example, when working with Fisher’s famous iris flower data set, three clusters are needed because there are three types of flowers observed within the data.

In an exploratory context, however, the appropriate number of clusters may not be readily apparent. One method of determining the preferred number of clusters is to test multiple values and evaluate the separation between the resulting clusters. If the number of clusters is well chosen, the clusters will be well spaced and the intra-cluster distances will be small.

The Silhouette Coefficient, or silhouette score, is one method of measuring the goodness of fit for a set of clusters, and takes into consideration both the spacing between clusters (inter-cluster distance) and the distance between points within a cluster (intra-cluster distance). For a single observation, the silhouette score is defined as:

s = (b − a) / max(a, b)

where:

a = average intra-cluster distance (the average distance from the observation to the other points in its own cluster),

b = average inter-cluster distance (the average distance from the observation to the points in the nearest neighboring cluster).

The silhouette score ranges from [-1, 1]. A value near +1 suggests that the clusters are far from one another, with small distances within each cluster, and thus that the chosen number of clusters is well-suited to the data. A value near -1, however, suggests that the clusters are poorly describing the patterns in the data.
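As a quick sanity check on this interpretation, here is a toy example (the data and names are mine) using scikit-learn's silhouette_score, the same function applied to the pathways data later in this post. Two tight, well-separated clusters should score near +1:

import numpy as np
from sklearn.metrics import silhouette_score

# Two compact, well-separated 1-D clusters
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(silhouette_score(X, labels))  # ~0.97: a well-chosen clustering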

Example: Clustering Infrastructure Pathways

In water supply planning contexts, infrastructure pathways describe the sequencing of infrastructure investment decisions over time (e.g., construction of new reservoirs, expansions to water treatment capacity), in response to changes in the system. Adaptive infrastructure pathways use a risk-of-failure (ROF) based set of policy rules to trigger new infrastructure development. Adaptive infrastructure pathways have been shown to effectively coordinate water infrastructure decisions in complex multi-actor systems (Zeff et al., 2016). While the benefits of this adaptive strategy are clear, ROF-based pathways also bring new challenges for decision makers: rather than specifying a single sequence of infrastructure investments, an adaptive rule system will generate a unique sequence of investments for every future state of the world. To understand how a rule system behaves, decision makers must have tools to visualize an ensemble of infrastructure pathways. This is where cluster analysis comes in. Using clustering, we can group similar infrastructure sequences together, allowing decision makers to summarize how a candidate infrastructure investment policy will behave across many future states of the world.

Here, I analyze a set of infrastructure pathways generated for a hypothetical water utility with four infrastructure options:

  • small water treatment plant expansion
  • large water treatment plant expansion
  • construction of a small new run of river intake
  • construction of a large new run of river intake

The utility is examining a candidate policy and has generated pathways for 1,000 different states of the world, or realizations, each characterized by unique hydrologic and population dynamics. Given the 1,000 different realizations, each with a unique infrastructure pathway, I am interested in clustering the pathways according to the timing of the infrastructure construction decisions to understand how the policy behaves.

The dataset used for this demonstration is courtesy of David Gold, who is also responsible for providing significant portions of the following code and assistance along the way. Thanks David!

Follow along with this demonstration by downloading the .txt data file HERE!

Let’s begin by loading in the data.

import pandas as pd

# Load in data
pathways_df = pd.read_csv('./ClusterData.txt', sep = '\t', header = None)

# Convert to a NumPy array for use with scikit-learn below
cluster_input = pathways_df.values

The data contains information describing the timing of infrastructure construction for 1,000 different simulated realizations. Each column represents a different infrastructure option, and each row is 1 of the 1,000 different simulated realizations. The contents of each cell denote a standardized value corresponding to the timing of the infrastructure construction within each realization, ranging from [0, 1]. Infrastructure options containing the value 1 were not built during the 45-year simulation period in that specific realization.

Realization | Infra. Opt. #1 | Infra. Opt. #2 | … | Infra. Opt. #M
1 | W1,1 | W1,2 | … | W1,M
2 | W2,1 | W2,2 | … | W2,M
⋮ | ⋮ | ⋮ | | ⋮
N | WN,1 | WN,2 | … | WN,M
Table: Example output data from the infrastructure pathways optimization process; within each realization (state of the world, SOW), infrastructure construction decisions are made during different weeks within the 45-year simulation period, considering a subset of infrastructure options for each utility. Wi,j denotes a standardized value corresponding to the timing within realization i at which infrastructure option j is built.
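If you would rather not download the file, a hypothetical stand-in with the same shape (the values and seed are mine, purely for illustration) lets you follow the rest of the code:

import numpy as np
import pandas as pd

# Stand-in for ClusterData.txt: 1,000 realizations x 4 infrastructure options,
# with standardized build times in [0, 1]
rng = np.random.default_rng(42)
pathways_df = pd.DataFrame(rng.uniform(0.0, 1.0, size=(1000, 4)))
cluster_input = pathways_df.values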

With the data available, we can begin clustering!

Here, I take advantage of scikit-learn’s sklearn.cluster module, which has both hierarchical and K-means clustering capabilities.

For the purposes of this demonstration, I am only going to show the process of producing the hierarchical clusters; the curious reader may choose to explore alternative clustering methods available in the scikit-learn toolbox.

### Perform hierarchical clustering with 3 clusters ###

# Import the necessary tools
from sklearn.cluster import AgglomerativeClustering

# Begin by assuming 3 clusters
num_clusters = 3

# Initialize the hierarchical (agglomerative) clustering
hierarchical = AgglomerativeClustering(n_clusters = num_clusters)

# Cluster the data, producing an array that specifies the cluster of each pathway
hierarchical_labels = hierarchical.fit_predict(cluster_input)

In order to interpret the significance of these clusters, we want to find a way to visualize differences in cluster behavior through the infrastructure pathways.

To do this, I am going to consider the median timing of each infrastructure option within each cluster, across all 1,000 realizations. Additionally, I want to sort the clusters according to a qualitative characteristic. In this case, I define pathways with early infrastructure investments as “heavy”, and pathways with infrastructure investment later in the period as “light”. The “moderate” infrastructure pathways are contained within the cluster between these two.

The following script calculates the median of the clustered pathways, and sorts them according to their infrastructure timing.

### Interpreting clusters via pathways ###

# Group realizations by pathway
# Calculate the median week each infra. opt. is constructed in each cluster
import numpy as np
cluster_pathways = []
cluster_medians = []

# Loop through clusters 
for i in range(0,num_clusters):
    
    # Select realizations contained within i-th cluster
    current_cluster = cluster_input[hierarchical_labels ==i, :]
    
    # Multiply by 2344 weeks to convert from [0,1] to [0,2344] weeks
    current_cluster = current_cluster*2344
    
    cluster_pathways.append(current_cluster)
    
    # Find the median build week for each infrastructure option in this cluster
    current_medians = np.zeros(len(current_cluster[0,:]))
    
    # Loop through infrastructure options and calculate medians
    for j in range(0, len(current_cluster[0,:])):
        current_medians[j] = np.median(current_cluster[:,j])
        
    cluster_medians.append(current_medians)

# Sort clusters by average of median build week 
# to get light, moderate, and heavy infrastructure clusters
cluster_means = np.mean(cluster_medians, axis = 1)

sorted_indices = np.argsort(cluster_means)

# Re-order cluster medians according to sorted mean build week
cluster_medians = np.vstack((cluster_medians[sorted_indices[2]], 
                            cluster_medians[sorted_indices[1]],
                            cluster_medians[sorted_indices[0]]))

With the median timing of each infrastructure option specified for each cluster, I can begin the process of visualizing the behavior. The function below plots a single infrastructure pathway:

import numpy as np
import matplotlib.pyplot as plt

def plot_single_pathway(cluster_medians, cluster_pathways, inf_options_idx,
                        c, inf_names, y_offset, ylabeling, xlabeling, 
                        plot_legend, ax):
    
    """
    Makes a plot of an infrastructure pathway.
    
    PARAMETERS:
        cluster_medians: an array with median weeks each option is built
        cluster_pathways: an array with every pathway in the cluster
        inf_options: an array with numbers representing each option (y-axis vals)
        should start at zero to represent the "baseline"
        c: color to plot pathway
        y_offset: distance to offset each pathway from 0 (easier to visualize stacked paths)
        ylabeling: labeling for infrastructure options
        xlabeling: labeling x axis
        plot_legend: (bool) to turn on or off
        ax: axes object to plot pathway
    """
    
    # Extract all non-baseline infrastructure options
    inf_options_idx_no_baseline=inf_options_idx[1:]
    sorted_inf = np.argsort(cluster_medians)
    
    # Plot Pathways
    # create arrays to plot the pathway lines. To ensure pathways have corners 
    # we need an array to have length 2*num_inf_options
    pathway_x = np.zeros(len(cluster_medians)*2+2)
    pathway_y = np.zeros(len(cluster_medians)*2+2)
    
    # To have corners, each inf. opt. must be located
    # in both the cell it is triggered, and adjacent cell
    
    cluster_medians = np.rint(cluster_medians/52.1)  # convert build weeks to years (~52.1 weeks per year)
    for i in range(0,len(cluster_medians)):
        for j in [1,2]:
            pathway_x[i*2+j] = cluster_medians[sorted_inf[i]]
            pathway_y[i*2+j+1] = inf_options_idx_no_baseline[sorted_inf[i]]

    # Set the end of the pathway at year 45
    pathway_x[-1] = 45

    # plot the pathway line    
    ax.plot(pathway_x, pathway_y + y_offset, color=c, linewidth=5, 
            alpha = .9, zorder=1)

    # Formatting plot elements 
    ax.set_xlim([0,44])
    inf_name_labels = inf_names
    inf_options_idx = np.arange(len(inf_name_labels))
    ax.set_yticks(inf_options_idx)
    ax.set_xlabel('Year')
    
    if ylabeling:
        ax.set_yticklabels(inf_name_labels)
    else:
        ax.set_yticklabels([])
    
    ax.set_ylim([-0.5,len(cluster_medians)+.5])


The above script can be used to plot the median pathway for a single cluster, but it would be much more worthwhile to compare the median pathways of the different clusters against one another. To do this, we can define another function which will use the plot_single_pathway() iteratively to plot each of the median cluster pathways on the same figure.

import numpy as np
import matplotlib.pyplot as plt
from plot_single_pathway import plot_single_pathway

def plot_many_paths(clust_meds, clust_pathways, n_clusters, cluster_colors, fig, 
                    gspec, fig_col, ylabeling, plot_legend):
    
    
    y_offsets = [-.15, 0, .15]
    
    # Specific information for Utility
    inf_options_idx = np.array([0, 1,2, 3, 4])
    utility_inf = ['Baseline', 'Small intake expans.', 'Large intake expans.',
                    'Small WTP expans.', 
                    'Large WTP expans.']
    
    ax = fig.add_subplot(gspec[0,0])
    
    for i in np.arange(n_clusters):
        plot_single_pathway(clust_meds[i], clust_pathways[i], inf_options_idx, cluster_colors[i],
                            utility_inf, y_offsets[i], ylabeling, False, True, ax)
    
    if plot_legend and (n_clusters == 3):
        ax.legend(['Light inf.', 'Moderate inf.', 'Heavy inf.'], 
                  loc='upper left')
    elif plot_legend and (n_clusters == 2):
        ax.legend(['Light inf.', 'Heavy inf.'], loc='upper left')
        
    ax.tick_params(axis = "y", which = "both", left = False, right = False)
    
    plt.show()
    return fig

Time for the reveal! Execution of this final portion of the script will plot each of the three clusters, and color-code them according to classifications as “heavy”, “moderate”, or “light” infrastructure.

### Plotting ###
from plot_many_paths import plot_many_paths

fig = plt.figure(figsize = (5,5))
gspec = fig.add_gridspec(ncols=1, nrows=1)

# Colorscale
heavy_inf_color = '#132a13'
mid_inf_color = '#4f772d'
light_inf_color = '#90a955'
cluster_colors = [light_inf_color, mid_inf_color, heavy_inf_color]

plot_many_paths(cluster_medians, cluster_pathways, 3, cluster_colors, 
                fig, gspec, 0, True, True)

Figure: Median infrastructure pathways for each cluster.

Voilà!

The figure above shows a clear contrast in the behavior of the light and heavy infrastructure pathways. The heavy infrastructure pathway is characterized not only by early infrastructure construction, but also by more frequent construction throughout the simulation period; in comparison, the median of the light infrastructure cluster requires only a single infrastructure expansion within the same simulation period.

From the plot above, it is not apparent that three clusters are necessarily appropriate for this data set. Although the heavy and light infrastructure clusters are clearly demarcated from one another, the moderate infrastructure cluster has characteristics of both: a late first-expansion similar to the light infrastructure cluster, but a high frequency of construction throughout the period similar to the heavy infrastructure. Should the moderate infrastructure really be a separate cluster?

Let’s take advantage of the knowledge we developed earlier and consider the silhouette scores corresponding to different numbers of clusters! Fortunately, scikit-learn has a suite of tools that make silhouette analysis very easy.

In the following script, I evaluate the silhouette scores associated with clustering into 2, 3, and 4 clusters.

### Performing silhouette analysis ###
from sklearn.metrics import silhouette_score

num_clust = [2, 3, 4]

# Set up hierarchical clustering with different numbers of clusters
ac2 = AgglomerativeClustering(n_clusters = num_clust[0])
ac3 = AgglomerativeClustering(n_clusters = num_clust[1])
ac4 = AgglomerativeClustering(n_clusters = num_clust[2])

# Extract the silhouette score for each hierarchical grouping
silhouette_scores = []
silhouette_scores.append(silhouette_score(cluster_input, ac2.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac3.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac4.fit_predict(cluster_input)))
                                                  
plt.bar(num_clust, silhouette_scores)
plt.xlabel('Number of clusters', fontsize = 14)
plt.ylabel('Silhouette Score', fontsize = 14)
plt.xticks(num_clust)
plt.title("Silhouette analysis")
plt.show()

Figure: Silhouette analysis for the hierarchical clustering of infrastructure pathways.

Interestingly, the silhouette score was highest when the clustering was performed using only two clusters, suggesting that the moderate infrastructure class may not be sufficiently distinct to warrant a separate cluster.

Another method of visualizing the cluster structure is through the use of dendrograms. Below, I show dendrograms color-coded using two clusters (left) and three clusters (right).

Figure: Dendrograms showing the cluster structure of infrastructure pathways for a utility, using 2 (left) and 3 (right) clusters.

The dendrogram containing only two clusters (left) reveals that the moderate and light clusters can be combined into a single cluster, because the moderate cluster is less different from the light infrastructure cluster than it is from the heavy infrastructure cluster. It is important to note, however, that dendrograms are not a reliable method of determining the number of clusters on their own. In fact, given that the difference in silhouette scores between the 2-cluster and 3-cluster groupings is relatively modest (~0.05), there is not sufficiently strong justification to choose 2 clusters over 3.

For the sake of comparison, here are the 2-cluster and 3-cluster median pathway plots:

Conclusion

I hope you’ve gained some insight into clustering, infrastructure pathways, or (preferably) both in this post.

Analysis of infrastructure pathways via statistical clustering presents a unique application of a very common statistical tool. In this analysis, I clustered the pathways with respect to the timing of the infrastructure expansions; however, there is potential for a much more in-depth analysis of how the hydrologic or urban context of each realization shapes these infrastructure pathways… possibly justifying another blog post at a later date. Stay tuned!

Thanks again to David Gold for providing the data and the foundational Python code.

References

Zeff, H. B., Herman, J. D., Reed, P. M., & Characklis, G. W. (2016). Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways. Water Resources Research, 52(9), 7327-7346.

Efficient Storage and Querying of Geospatial Data with Parquet and DuckDB

This post is dedicated to Chris Vernon and Travis Thurber from PNNL who taught me how to use these tools!

Lately, I’ve been generating synthetic weather scenarios for basins in California. The weather scenarios are created at a daily time step and are informed by tree-ring products that span 600 years. We want to develop multiple ensembles of these scenarios that are representative of plausible future climate changes that the region could experience. It quickly became clear that I would need to store the data efficiently and be able to parse through them quickly to generate plots and compute metrics.

Below, I wanted to discuss two points that have really helped my workflow: (1) focusing on file structure and compression and (2) changing the way that I explore the resulting scenarios.

File choice and compression

To illustrate the importance of considering how you choose to save your data, I grab the historical trace of daily precipitation over the last 50 years for the Tuolumne River Basin and load it into R as a dataframe. Below I show the code to take this dataframe and save it in various formats.

#Traditional RDS
system.time(saveRDS(prcp_tuolumne,"E:/blog/prcp_tuolumne.rds"))

#Compressed RDS
system.time(saveRDS(prcp_tuolumne,"E:/blog/prcp_tuolumne_compressed.rds",compress = 'xz'))

#Text 
system.time(write.table(prcp_tuolumne,"E:/blog/prcp_tuolumne.txt"))

#CSV
system.time(write.csv(prcp_tuolumne,"E:/blog/prcp_tuolumne.csv"))

#Parquet
library(arrow)
system.time(write_parquet(prcp_tuolumne,"E:/blog/prcp_tuolumne.parquet",compression = "gzip"))

#HDF5 
library(rhdf5)
system.time(h5write(prcp_tuolumne,"E:/blog/prcp_tuolumne.h5","df"))


#NetCDF
library(ncdf4)
# path and file name, set dname
ncpath <- "E:/blog/"
ncname <- "prcp_tuolumne"  
ncfname <- paste(ncpath, ncname, ".nc", sep="")
dname <- "prcp"  # note: tmp means temperature (not temporary)
timedim <- ncdim_def("time","days since 1950-11-1",as.numeric(dates))
# define variables
fillvalue <- 1e32
dlname <- "observed precipitation"
prcp_def <- ncvar_def("prcp","mm",list(timedim),fillvalue,dlname,prec = "double")
# create netCDF file and put arrays
ncout <- nc_create(ncfname,list(prcp_def),force_v4=TRUE)
# put variables
system.time(ncvar_put(ncout,prcp_def,as.matrix(prcp_tuolumne[ ,1])))

Here, I summarize the resulting size and write times for each format.

File Type | Size (KB) | Writing Time (seconds)
RDS | 9,065 | 2.13
Compressed RDS | 4,012 | 17.73
HDF5 | 5,204 | 323.03
Text | 26,851 | 12.69
CSV | 27,024 | 12.14
netCDF | 141,360 | 3.12
Parquet | 8,396 | 0.98

Note that the original file size was on the smaller side, so this exercise may seem trivial, but when you are creating many ensembles of both precipitation and temperature, across hundreds of years, across many basins, and across many climate change scenarios, file sizes add up and can potentially limit the scope of the experiments you can conduct if you don’t have enough file storage. Of these file types, one of my recent favorites has been Apache Parquet. Parquet is an open-source column-oriented format that is specifically designed for efficient storage and accessibility of data. Whereas the compressed RDS file and HDF5 beat Parquet in terms of size, it takes much longer to write those files and subsequently read them back in. Another advantage of Parquet is that it is recognized by common coding languages (R, Matlab, Python), which allows for a more seamless workflow between models written in different languages. If you have a little more storage to work with, Parquet is a good choice to balance the size and writing-time tradeoffs.
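To illustrate that cross-language convenience, the Parquet file written from R above can be read directly into Python. A minimal sketch, assuming pandas with a Parquet engine such as pyarrow installed:

import pandas as pd

# Read the same Parquet file that was written by the R code above
prcp_tuolumne = pd.read_parquet("E:/blog/prcp_tuolumne.parquet")
print(prcp_tuolumne.head())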

Querying data with SQL and DuckDB

Once your data are in an efficient format like Parquet, the next order of business is to make sure that you can easily sort through them and, as Chris would say, “talk to your data and let it talk back”. One way to communicate with your data and ask questions in a more semantically meaningful way is to use Structured Query Language (SQL). There are many methods of querying in R, but a useful data management tool that I have enjoyed interacting with is DuckDB, a relational database management system. DuckDB has an R API and allows you to use SQL syntax to query data efficiently. It should be noted that you can perform similar aggregation and subsetting using base R functions, but DuckDB allows you to perform these tasks faster and across multiple files that you may have in a directory.

Let’s use a case where you have developed 30 traces of daily basin-averaged synthetic precipitation for the Tuolumne based on the historical period, and each trace is stored in a respective Parquet file (“ensemble1.parquetgzip”, “ensemble2.parquetgzip”, … “ensemble30.parquetgzip”). Each ensemble looks something like this:

Example of the parquet file structure

Now let’s state a question that I want to ask to the data and see how to structure these questions using SQL and DuckDB.

“Hey DuckDB, can you create a dataframe for me that provides an annual average across all my synthetic traces? Let’s plot that with respect to the observed annual average and show the bounds of the synthetic traces.”

First we need to open a connection to DuckDB. Then we are going to have to create annual averages from the daily historical and synthetic traces and then create max/min bounds to show the additional variability we are getting from the generator.

In order to answer this question, DuckDB will query all our parquet files simultaneously (!) and return a dataframe with annual averages, and we can do some clever sub-querying to get the max/min of those averages. I really enjoy that I don’t have to read the files into my R workspace. R reads in files and stores them in memory, so those of you who have worked with larger datasets might have found that you are limited in how many datasets you can have open at once, which can be frustrating! This is not an issue with DuckDB.

#Historical trace 
historical=data.frame(dates.year,dates.month,dates.day,rowMeans(prcp_sacsma))
historical_yearly=aggregate(historical,by=list(dates.year),FUN=mean)

# open connection to DuckDB (requires the duckdb and DBI packages)
library(duckdb)
library(DBI)
con <- dbConnect(duckdb::duckdb())

# query to find annual average across all synthetic traces
df=dbGetQuery(con, 
              "SELECT year,
           AVG(precipitation) AS yearly_average
           FROM 'E:/NHMM/blog/*.parquet'
           GROUP BY year
           ORDER BY year")

#For the max/min, we need to find the average first and then return the max/min values in something like a nested approach 

#First create a dataframe with all files 
all_files=dbGetQuery(con,"SELECT *
                FROM 'E:/NHMM/blog/*.parquet'")

# register the dataframe as a DuckDB table, and give it a name
# (duckdb_register works on R data.frames; duckdb_register_arrow expects Arrow objects)
duckdb::duckdb_register(con, "all_files_table", all_files)


annual_average=dbGetQuery(con, 
               "SELECT year,
           AVG(precipitation) AS yearly_average
           FROM all_files_table
           GROUP BY year,sample
           ORDER BY year")

# register the dataframe as a DuckDB table, and give it a name
duckdb::duckdb_register(con, "annual_table", annual_average)

#query to find max
max=dbGetQuery(con, 
               "SELECT year,
           MAX(yearly_average) AS max_yearly_average
           FROM annual_table
           GROUP BY year
           ORDER BY year")

#query to find min
min=dbGetQuery(con, 
               "SELECT year,
           min(yearly_average) AS min_yearly_average
           FROM annual_table
           GROUP BY year
           ORDER BY year")
#Plot the results!
library(ggplot2)
library(ggridges)  # provides theme_ridges() used below

ggplot(df, aes(x = year, y = yearly_average)) +
  geom_ribbon(aes(ymin = min$min_yearly_average,
                  ymax = max$max_yearly_average),
              alpha = 0.2,fill="blue") +
  geom_line()+geom_line(color="blue",size=2)+ geom_line(data=basin.wide.yearly[3:64,],aes(x=year,y=precipitation), color="Black",size=2)+ ggtitle("Annual Average Precipitation") +
  xlab("Year") + ylab("Annual Average Precipitation (mm)")+theme_ridges()

Here’s the resulting figure:

Synthetic generation in blue compared to the historical trace in black

“Hey DuckDB, from all of the synthetic traces that I have generated, how many instances are there where we produce daily precipitation above 10 mm, specifically in the years 1980 or 2000?”

In order to answer this question, DuckDB will once again query all our parquet files simultaneously and return a dataframe with all the instances of daily precipitation greater than 10 mm, restricted to the years 1980 or 2000. This is how you would write this question in “code”.

library(arrow)
library(duckdb)
library(fs)
library(tidyverse)
library(DBI)
library(glue)

# open connection to DuckDB
con <- dbConnect(duckdb::duckdb())

df=dbGetQuery(con,"SELECT *
                FROM 'E:/blog/*.parquet'
                WHERE precipitation > 10
                AND (year = 1980 OR year = 2000)
                order BY year")

If we look at the size of the resulting dataframe, it looks like we have generated 1335 instances of daily precipitation greater than 10 mm in the years 1980 or 2000.
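If you would rather have DuckDB do the counting for you, here is a quick sketch of the same query rewritten with COUNT(*) (same assumed path as above):

# count the instances directly instead of checking the dataframe's size
count = dbGetQuery(con, "SELECT COUNT(*) AS n
                         FROM 'E:/blog/*.parquet'
                         WHERE precipitation > 10
                         AND (year = 1980 OR year = 2000)")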

“Hey DuckDB, I want to think about meteorological drought. In how many instances am I half a standard deviation below the long-term average monthly precipitation?”

To try this out, let’s just look at one synthetic trace. First, we need to find the long-term average across the trace; then let’s plot what a drought might look like.

# query to grab monthly averages from the first synthetic trace
scenario1 = dbGetQuery(con,
              "SELECT month, year,
                      AVG(precipitation) AS monthly_avg
               FROM 'E:/NHMM/blog/ensemble1.parquet'
               GROUP BY year, month")

# query to find the long-term mean
long_term_average = dbGetQuery(con,
              "SELECT AVG(precipitation) AS monthly_avg
               FROM 'E:/NHMM/blog/ensemble1.parquet'")

# query to find the standard deviation
stdev = dbGetQuery(con,
              "SELECT STDDEV(precipitation) AS monthly_stdev
               FROM 'E:/NHMM/blog/ensemble1.parquet'")

We can then plot the synthetic trace (blue), mean (black), and a 1/2 standard deviation (dashed line) below the mean.
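In case you want to recreate a similar figure, here is a minimal sketch using the dataframes from the queries above; the decimal-year axis and plot aesthetics are my own assumptions, not the original figure's:

# drought threshold: long-term mean minus half a standard deviation
threshold = long_term_average$monthly_avg - 0.5 * stdev$monthly_stdev

ggplot(scenario1, aes(x = year + (month - 1) / 12, y = monthly_avg)) +
  geom_line(color = "blue") +                               # synthetic trace
  geom_hline(yintercept = long_term_average$monthly_avg) +  # long-term mean
  geom_hline(yintercept = threshold, linetype = "dashed") + # drought threshold
  xlab("Year") + ylab("Monthly Average Precipitation (mm)")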

Now we can count the number of months in which this synthetic trace is classified as being in a drought. Based on the stats that we calculated in the last block, I hard-coded the drought threshold, but in theory you can nest many functions with more complicated sub-queries (see the sketch below the results).

# register the monthly averages as a DuckDB table, and give it a name
duckdb::duckdb_register(con, "scenario_1", scenario1)


instances = dbGetQuery(con,
                     "SELECT month, year
                      FROM scenario_1
                      WHERE monthly_avg < 3.125291
                      GROUP BY year, month
                      ORDER BY year, month")

It looks like we are in a drought in 115 of 325 months, or about 35% of the time, in this particular synthetic trace. Yikes! But it makes for a good way to assess the vulnerability of the system to plausible future meteorological drought conditions.
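To avoid hard-coding the threshold, here is a sketch of the same count with the threshold computed inline as a nested scalar sub-query (assuming the same ensemble1.parquet path as above):

# drought threshold (mean minus half a standard deviation of daily
# precipitation) computed in a sub-query rather than hard-coded
instances_nested = dbGetQuery(con,
                     "SELECT month, year
                      FROM scenario_1
                      WHERE monthly_avg < (SELECT AVG(precipitation) - 0.5 * STDDEV(precipitation)
                                           FROM 'E:/NHMM/blog/ensemble1.parquet')
                      ORDER BY year, month")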

I hope that these approaches can be as useful to you as they have been for me!

MORDM VII: Optimality, robustness, and reevaluation under deep uncertainty

In the previous MORDM post, we visualized the reference set of performance objectives for the North Carolina Research Triangle and conducted a preliminary multi-criterion robustness analysis using two criteria: (1) regional reliability should be at least 98%, and (2) regional restriction frequency should be no more than 20%. Using these metrics, we found that Pareto-optimality does not guarantee satisfactory robustness: not all portfolios within the reference set satisfied the robustness criteria.

In this post, we will explain the differences between optimality and robustness, and justify the importance of robust optimization over sole reliance on a set of optimal solutions (aka an optimal portfolio). To demonstrate the differences better, we will also reevaluate the Pareto-optimal set of solutions under a more challenging set of states of the world (SOWs), a method first used in Herman et al. (2013, 2014) and Zeff et al. (2014). The formal term for this method, Deeply Uncertain (DU) Re-evaluation, was coined in a 2019 paper by Trindade et al.

Optimality vs robustness

The descriptions of optimality are many. From a purely technical perspective, a Pareto-optimal set is a set of decision variables or solutions that maps to a Pareto front: a set of performance objectives in which no single objective can be improved without degrading performance in another. For the purposes of this blog post, we shall use the definition of optimality laid out by Beyer and Sendhoff in their 2007 paper:

The global optimal design…depends on the…(objective) functions and constraints…however, these functions always represent models and/or approximations of the real world.

Beyer and Sendhoff (2007)

In other words, a Pareto reference set is only optimal within the bounds of the model it was generated from. This makes sense; models are only approximations of the real world. It is difficult and computationally expensive to bound the degree of certainty with which the model optimum maps to the true optimum, due to uncertainties driven by human action, natural variability, and incomplete knowledge. Optimization is static in relation to reality – the set of solutions found does not change with time and only accounts for the conditions within the model itself. Any deviation from this set of solutions, or unaccounted-for differences between the actual system and the model, may result in failure (Herman et al., 2015; Read et al., 2014).

This is why searching the set of optimal solutions for robust solutions is important. Herman et al. (2015) quote an earlier work by Matalas and Fiering (1977) that defines robustness as the insensitivity of a system’s portfolio to uncertainty. Within the MORDM context, robustness was found to be best defined using the multi-criterion satisficing robustness measure (Herman et al., 2015), which refers to the ability of a solution to meet one or more requirements (or criteria) set by the decision-makers when evaluated under a set of challenging scenarios. More information on alternative robustness measures can be found here.
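For concreteness, here is a minimal Python sketch of the satisficing measure under the two criteria from the previous post; the toy numbers and variable names are assumptions for illustration, not output from the WaterPaths workflow:

import numpy as np

# hypothetical re-evaluation results for one portfolio, one value per SOW
reliability = np.array([0.99, 0.985, 0.97, 0.995, 0.982])
restriction_freq = np.array([0.10, 0.15, 0.25, 0.05, 0.18])

# a SOW counts only if ALL criteria are met in that SOW
meets_criteria = (reliability >= 0.98) & (restriction_freq <= 0.20)

# satisficing robustness: fraction of SOWs in which all criteria are met
robustness = meets_criteria.mean()
print(robustness)  # 0.8 for this toy example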

In this blog post, we will begin to explore this concept of robustness by conducting DU Re-evaluation, where we will perform the following steps:

Generate a set of ROF tables from a more challenging set of SOWs

Recall that we previously stored our Pareto-optimal solution set in a .csv file named ‘NC_refset.csv’ (find the original Git Repository here). Now, we will write a quick Python script (called rof_tables_reeval.py in the Git Repository) that uses MPI4PY to parallelize and speed up the ROF table generation, along with a bash script to submit the job. More information on parallelization using MPI4PY can be found in this handy blog post by Dave Gold.

First, create a Python virtual environment within the folder where all your source code is kept, and activate it. I called mine python_venv:

python3 -m venv python_venv
source python_venv/bin/activate

Next, install the numpy and mpi4py libraries:

pip install numpy mpi4py

Then write the Python script as follows:

# -*- coding: utf-8 -*-
"""
Created on Tues March 1 2022 16:16
@author: Lillian Bei Jia Lau
"""

from mpi4py import MPI
import numpy as np
import subprocess, sys, time
import os

# 5 nodes, 20 RDMs per node
# 10 tasks per node
# 2 RDMs per task
comm = MPI.COMM_WORLD
rank = comm.Get_rank() # ranks range from 0 to 49
print('rank = ', rank)

N_RDMs_needed = 100
N_REALIZATIONS = 100
N_RDM_PER_NODE = 20
N_TASKS_PER_NODE = 10
N_RDM_PER_TASK = 2 # each task handles two RDMs
N_TASKS = 50 # ranks range from 0 to 49

DATA_DIR = "/scratch/lbl59/blog/WaterPaths/"
SOLS_FILE_NAME = "NC_dvs_all_noheader.csv"
N_SOLS = 1

OMP_NUM_THREADS = 32

for i in range(N_RDM_PER_TASK):
    current_RDM = rank + (N_TASKS * i)

    command_gen_tables = "./waterpaths -T {} -t 2344 -r {} -d {} -C 1 -O rof_tables_reeval/rdm_{} -e 0 \
            -U TestFiles/rdm_utilities_test_problem_reeval.csv \
            -W TestFiles/rdm_water_sources_test_problem_reeval.csv \
            -P TestFiles/rdm_dmp_test_problem_reeval.csv \
            -s {} -f 0 -l {} -R {}\
            -p false".format(OMP_NUM_THREADS, N_REALIZATIONS, DATA_DIR, current_RDM, SOLS_FILE_NAME, N_SOLS, current_RDM)

    print(command_gen_tables)
    os.system(command_gen_tables)

comm.Barrier()

Before proceeding, a quick explanation of what all this means:

  • Line 12: We are parallelizing this job across 5 nodes on Cornell’s THECUBE (The Cube) computing cluster.
  • Lines 19-28: On each node, we are submitting 10 tasks to each of the 5 nodes requested. Each task, in turn, handles 2 robust decision-making (RDM) multiplier files that scale a hydroclimatic realization up or down to make a state of the world more (or less) challenging. In this submission, we are creating 100 different variations of the hydroclimatic scenario using the 100 RDM files, and running them across only one solution.
  • Line 16 and 31: The ‘rank’ is the order in which the tasks are submitted. Since there are 10 tasks on each of the 5 nodes, a total of 50 tasks are submitted. Note and understand how current_RDM is calculated (see the short sketch after this list).
  • Lines 32 to 37: This is the command that you are going to submit to The Cube. Note the -C and -O flags; a value of 1 for the -C flag tells WaterPaths to generate ROF tables, and the -O flag tells WaterPaths to output each ROF table file into a folder within rof_tables_reeval/rdm_{} for each RDM. Feel free to change the filenames as you see fit.
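To make the current_RDM mapping concrete, here is a small standalone sketch (plain Python, no MPI required) of which RDMs a few example ranks would handle:

# rank r handles RDM r on the first loop iteration and RDM r + 50 on the second
N_TASKS = 50
N_RDM_PER_TASK = 2

for rank in [0, 1, 49]:  # a few example ranks
    rdms = [rank + (N_TASKS * i) for i in range(N_RDM_PER_TASK)]
    print(rank, rdms)
# prints 0 [0, 50], 1 [1, 51], 49 [49, 99]; together the 50 ranks cover all 100 RDMs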

To accompany this script, first create the following folders: output, out_reeval, and rof_tables_reeval. The output folder will contain the simulation results from running the 1 solution across the 100 hydroclimatic realizations, the out_reeval folder will store any output or error messages such as script runtime, and the rof_tables_reeval folder will hold the generated ROF tables.
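From the WaterPaths directory, one way to create all three at once:

mkdir -p output out_reeval rof_tables_reeval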

Then, write the following bash submission script:

#!/bin/bash
#SBATCH -n 50 -N 5 -p normal
#SBATCH --job-name=rof_tables_reeval
#SBATCH --output=out_reeval/rof_tables_reeval.out
#SBATCH --error=out_reeval/rof_tables_reeval.err
#SBATCH --time=200:00:00
#SBATCH --mail-user=lbl59@cornell.edu
#SBATCH --mail-type=all

export OMP_NUM_THREADS=32

module load openmpi3/3.1.4
module load py3-mpi4py
module load py3-numpy/1.15.3

START="$(date +%s)"

mpirun python3 rof_tables_reeval.py

DURATION=$(( $(date +%s) - START ))

echo ${DURATION}

You can find the bash script under the filename rof_table_gen_reeval.sh. Finally, submit the script using the following line:

sbatch ./rof_table_gen_reeval.sh

The run should take roughly 5 hours. We’re good for some time!

Re-evaluate your solutions (and possibly your life choices, while you’re at it)

Once the ROF tables are generated, it’s time to get a little hands-on with the underlying WaterPaths code. Navigate to the PaperTestProblem.cpp file using:
cd /yourfilepath/WaterPaths/src/Problem/

Follow the next steps carefully.

  1. Delete PaperTestProblem.cpp and replace it with the file PaperTestProblem-reeval.cpp, which can be found in the main Git Repository.
  2. Rename PaperTestProblem-reeval.cpp to PaperTestProblem.cpp – it will be the new PaperTestProblem file, able to individually read each RDM scenario’s ROF tables.
  3. Re-make WaterPaths by calling make clean and then make gcc on the command line (see the sketch below). This will ensure that WaterPaths has no problems running the new PaperTestProblem.cpp file.
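Put together, the sequence might look like this (a sketch; the path follows the placeholder used above):

cd /yourfilepath/WaterPaths/src/Problem/
rm PaperTestProblem.cpp
cp PaperTestProblem-reeval.cpp PaperTestProblem.cpp   # the copied file takes the original's name
cd /yourfilepath/WaterPaths/
make clean
make gcc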

Next, write the following Python script (called run_du_reeval.py in the Git repository):

# -*- coding: utf-8 -*-
"""
Created on Tues March 1 2022 16:16

@author: Lillian Bei Jia Lau
"""

from mpi4py import MPI
import numpy as np
import subprocess, sys, time
import os

N_RDMs_needed = 100  # N_TASKS_PER_NODE * N_RDM_PER_TASK * number of nodes
N_REALIZATIONS = 100
N_RDM_PER_NODE = 20
N_TASKS_PER_NODE = 10
N_RDM_PER_TASK = 2 # each task handles two RDMs
N_TASKS = 50

comm = MPI.COMM_WORLD
rank = comm.Get_rank() # ranks range from 0 to 49
print('rank = ', rank)

DATA_DIR = "/scratch/lbl59/blog/WaterPaths/"
SOLS_FILE_NAME = "NC_dvs_all_noheader.csv"
N_SOLS = 69
OMP_NUM_THREADS = 32

for i in range(N_RDM_PER_TASK):
    current_RDM = rank + (N_TASKS * i)

    command_run_rdm = "./waterpaths -T {} -t 2344 -r {} -d {} -C -1 -O rof_tables_reeval/rdm_{} -e 0 \
            -U TestFiles/rdm_utilities_test_problem_reeval.csv \
            -W TestFiles/rdm_water_sources_test_problem_reeval.csv \
            -P TestFiles/rdm_dmp_test_problem_reeval.csv \
            -s {} -R {} -f 0 -l 69\
            -p false".format(OMP_NUM_THREADS, N_REALIZATIONS, DATA_DIR, \
                    current_RDM , SOLS_FILE_NAME, current_RDM)

    print(command_run_rdm)
    os.system(command_run_rdm)

comm.Barrier()

Note the change in the -C flag; its value is now -1, telling WaterPaths that it should import the ROF table values from the folder indicated by the -O flag. The resulting objective values for each RDM will be saved in the output folder we previously made.

The accompanying bash script, named du_reeval.sh is as follows:

#!/bin/bash
#SBATCH -n 50 -N 5 -p normal
#SBATCH --job-name=mordm_training_du_reeval
#SBATCH --output=out_reeval/mordm_training_du_reeval.out
#SBATCH --error=out_reeval/mordm_training_du_reeval.err
#SBATCH --time=200:00:00
#SBATCH --mail-user=lbl59@cornell.edu
#SBATCH --mail-type=all

export OMP_NUM_THREADS=32
module load openmpi3/3.1.4
module load py3-mpi4py
module load py3-numpy/1.15.3

START="$(date +%s)"

mpirun python3 run_du_reeval.py

DURATION=$(( $(date +%s) - START ))

echo ${DURATION}

This run should take approximately three to four days. After that, you will have 100 files, each containing the 69 objective value sets that result from running the 69 solutions across the 100 deeply uncertain states of the world.

Summary

In this post, we defined optimality and robustness. We demonstrated how to run a DU Re-evaluation across 100 challenging SOWs to observe how these ‘optimal’ solutions perform under more extreme scenarios. This shows that optimality is bounded by current model states, and any deviation from the expected circumstances as defined by the model may lead to degraded performance.

In the next blog post, we will be visualizing these changes in performance using a combination of sensitivity analysis, scenario discovery, and tradeoff analysis.

References

Beyer, H. and Sendhoff, B., 2007. Robust optimization – A comprehensive survey. Computer Methods in Applied Mechanics and Engineering, 196(33-34), pp.3190-3218.

Herman, J., Reed, P., Zeff, H. and Characklis, G., 2015. How Should Robustness Be Defined for Water Systems Planning under Change?. Journal of Water Resources Planning and Management, 141(10), p.04015012.

Herman, J., Zeff, H., Reed, P. and Characklis, G., 2014. Beyond optimality: Multistakeholder robustness tradeoffs for regional water portfolio planning under deep uncertainty. Water Resources Research, 50(10), pp.7692-7713.

Matalas, N. C., and Fiering, M. B. (1977). “Water-resource systems planning.” Climate, climatic change, and water supply, studies in geophysics, National Academy of Sciences, Washington, DC, 99–110.

Read, L., Madani, K. and Inanloo, B., 2014. Optimality versus stability in water resource allocation. Journal of Environmental Management, 133, pp.343-354.
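
Trindade, B., Reed, P. and Characklis, G., 2019. Deeply uncertain pathways: Integrated multi-city regional water supply risk, robustness and adaptation. Advances in Water Resources, 134, p.103442.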

Zeff, H., Kasprzyk, J., Herman, J., Reed, P. and Characklis, G., 2014. Navigating financial and supply reliability tradeoffs in regional drought management portfolios. Water Resources Research, 50(6), pp.4906-4923.