Welcome to our blog!

Welcome to Water Programming! This blog is by Pat Reed’s group at Cornell, who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software:  Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email dfg42 “at” cornell.edu. You’ll need a WordPress.com account.

Containerizing your code for HPC (Docker + Singularity)

If you’ve ever run into software dependency issues when trying to download and run a computer model on a new machine, you’re in good company. Dependency issues are a common headache for modelers, whether you are trying to run someone else’s code or share your own code with collaborators. Problems can arise due to differences in operating system (e.g., Windows vs OSX vs Ubuntu), compilers (e.g., Gnu vs Intel), software versions (e.g., Python 3.3 vs 3.10), or other factors. These challenges are exacerbated on high performance computing (HPC) clusters, which often have highly specialized architectures and software environments that are beyond our control. For example, I regularly develop code on my laptop, test parallel jobs on TheCube cluster at Cornell, and finally run much larger production jobs on the Stampede2 cluster at TACC. Ensuring compatibility across these three environments can be a challenge, especially for models involving complex software interdependencies and/or MPI parallel communication. Thankfully, containers offer an elegant solution to many of these challenges.

Containers such as Docker allow you to package an application in a self-contained, ready-to-run package. The container includes its own operating system and all libraries and dependencies needed to run your code. The application then runs inside this custom container where it is isolated from the individual machine that you are running it on, making it much simpler to build portable applications. This has several benefits for researchers by allowing you to spend more time on scientific research and model building and less time on installing finicky software dependencies. Additionally, it can encourage “open science” and collaboration by streamlining the process of sharing and borrowing code.

However, HPC clusters present some unique challenges to the “build once, run anywhere” mantra of containerization. First, Docker containers generally require root permissions that are disallowed on shared supercomputers. This challenge can be overcome by using a second type of container called Singularity. Second, when running parallel applications with MPI, or running applications on GPUs, the container must be built in a way that is consistent with the hardware and software setup of the cluster. This challenge is harder to address, and does reduce the portability of the container somewhat, but as we shall see it can still be overcome with careful container design.

This blogpost will demonstrate how to build containers to run parallelized code on HPC clusters. We will build two different containers: one which can be run on two different TACC clusters, Stampede2 and Frontera, and one which can be run on two different Cornell Center for Advanced Computing (CAC) clusters, TheCube and Hopper. Both of these can also be run on a personal laptop. If you want to follow along with this post and/or containerize your own code, the first thing you will need to do is to download and install Docker Desktop and set up an account on DockerHub. You can find instructions for doing that, as well as a lot more great information on containerization and HPC, in this documentation from the Texas Advanced Computing Center (TACC). You can also find more information on Docker in earlier Water Programming Blog posts by Lillian Lau, Rohini Gupta, and Charles Rouge.

Containerizing a Cython reservoir model with Docker

All code related to this blogpost can be found in the “WPB_containerization” directory of my “cython_tutorial” GitHub repository, here. If you want to follow along, you can clone this repository and navigate in your terminal to the WPB_containerization directory. This directory contains a simple reservoir simulation model built using Cython, software that translates Python-like code into typed, compiled C code, which can significantly improve execution time. Cython is beyond the scope of this blog post (though I may cover it in a future post, so stay tuned!), but the interested reader can find an overview in the base directory of this GitHub repository, and references therein. For this post, suffice it to say that Cython introduces non-trivial software dependencies and build complexities beyond a more standard Python application, which makes it a good candidate for containerization. Additionally, this directory contains scripts for running parallel simulations across multiple nodes on an HPC cluster using MPI, which introduces additional containerization challenges.

The first container we will build is for the TACC clusters Stampede2 and Frontera. The main step involved in building a Docker container is writing the Dockerfile (see the file “Dockerfile_tacc”), which outlines the steps needed to build the container:

FROM tacc/tacc-ubuntu18-impi19.0.7-common 

RUN apt-get update && apt-get upgrade -y && apt-get install -y libjpeg-dev python3-pip 

RUN pip3 install numpy pandas matplotlib cython mpi4py 

ADD . /code
WORKDIR /code

RUN python3 setup.py build_ext --inplace 

RUN chmod +rx /code/run_reservoir_sim.py 

ENV PATH "/code:$PATH"

RUN useradd -u 8877 <user>

The first line tells Docker which “base image” to start from. There are a wide variety of base images to choose from, ranging from standard operating systems (e.g., Ubuntu 18.04) to smaller specialized images for software languages like Python. Due to the complexity of TACC’s hardware and software infrastructure, they provide special base images that are set up to efficiently interface with the clusters’ particular processors, compilers, communication networks, etc. The base image imported here is designed to operate on both Stampede2 and Frontera, and it should generally work on your local computer as well, which helps streamline application development.

The second line updates the Ubuntu operating system and installs Python3 and another necessary library. Third, we install the Python libraries needed for our application, including Cython and mpi4py. We then add the contents of the current working directory into the container, in a new directory called /code, and make this the working directory inside the container. Next, we run setup.py, which cythonizes and compiles our application. We then make the file “run_reservoir_sim.py” executable, so that we can run it directly from the container, and add the /code directory to the container’s path. Lastly, we change the user from root to <user> (fill in your username on your computer) to avoid permissions issues on some machines.
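For context, the setup.py invoked above is a standard Cython build script. A minimal sketch of what such a script typically looks like is shown below; the repository’s actual setup.py may differ, and the file name “reservoir_model.pyx” is purely illustrative.

# setup.py -- minimal sketch of a Cython build script. The repository's actual
# setup.py may differ, and "reservoir_model.pyx" is an illustrative file name.
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "reservoir_model.pyx",
        compiler_directives={"language_level": "3"},
    )
)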

To build the container, we run the command:

docker build -f Dockerfile_tacc -t <username>/cython_reservoir:0.1 .

where <username> should be replaced with your DockerHub username. This command will run each line in our Dockerfile to build the container image and tag it as version 0.1 of a repository called “cython_reservoir” under your username. You can then push the image to DockerHub (e.g., docker push <username>/cython_reservoir:0.1), which creates the corresponding repository on DockerHub.

To run the container locally, we run the command:

docker run --rm -v <working_directory>/results:/code/results <username>/cython_reservoir:0.1 mpirun -np 4 run_reservoir_sim.py

This command has several parts. First, “docker run” means to create and start a container from the image (this is like instantiating an object from a class). “--rm” means that the container is removed after running (just the instantiated container, not the image itself). Next, the “-v” flag creates a bind mount linking “<working_directory>/results” on your machine to “/code/results” inside the container, where <working_directory> should be replaced with the full path of your working directory. This allows us to access the simulation results even after the container is removed. Lastly, “mpirun -np 4 run_reservoir_sim.py” tells the container to run the script across four processors – this will run four independent simulations and store each result as a csv file in the results directory.
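For reference, the general pattern of such a driver script is that each MPI rank runs its own independent simulation and writes its own output file. Below is a minimal sketch of that pattern; the actual run_reservoir_sim.py in the repository differs, and the reservoir model here is a synthetic stand-in.

#!/usr/bin/env python3
# Minimal sketch of the MPI driver pattern: each rank runs one independent
# simulation and writes its own CSV into the results/ directory. The shebang
# matters here because the Dockerfile makes this file directly executable.
from mpi4py import MPI
import numpy as np

rank = MPI.COMM_WORLD.Get_rank()

def run_simulation(seed, n_steps=100):
    # synthetic stand-in for the compiled Cython reservoir model
    rng = np.random.default_rng(seed)
    inflows = rng.gamma(2.0, 5.0, n_steps)
    storage = np.minimum(np.cumsum(inflows - 8.0).clip(min=0.0), 100.0)
    return storage

np.savetxt(f"results/sim_{rank}.csv", run_simulation(seed=rank), delimiter=",")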

Running on TACC clusters with Singularity

Now that we have verified that this container works on a local laptop, let’s move to Stampede2, a supercomputer owned by TACC (with allocations managed through NSF XSEDE). Unfortunately, Docker isn’t allowed on TACC’s clusters due to permission issues. However, another containerization software called Singularity is allowed, and has the ability to run containers created through Docker. After logging into the cluster through the terminal, we need to load the Singularity module:

module load tacc-singularity

We also need to create a directory called “results”. TACC doesn’t allow Singularity to be run directly from the login node, so we need to request an interactive session:

idev -m 30

When the interactive session on a compute node has been granted, we can “pull” our container from DockerHub:

singularity pull docker://ahamilton144/cython_reservoir:0.1

Singularity creates and downloads a single file, cython_reservoir_0.1.sif, as a Singularity representation of the Docker container. Once this has downloaded, you can exit your interactive session. We will run the job itself non-interactively by submitting it to the SLURM scheduler, just like you would with a non-containerized job:

sbatch submit_reservoir_stampede2.sh

This submission script, in addition to typical SLURM directives, contains the line

mpirun singularity run cython_reservoir_0.1.sif run_reservoir_sim.py

And that’s it! This command will run the simulation across all processors assigned to the job, and store the results in the results folder. These same commands and container can also be used on the Frontera cluster (also owned by TACC), using the “submit_reservoir_frontera.sh” script. As you can see, this makes it straightforward to develop and run code across your local laptop and two different TACC clusters, without having to worry about software dependency issues.

Customizing containers to run on non-TACC clusters

Unfortunately, it’s not quite so simple to port containers to other HPC systems if your application involves MPI parallel processing. This is because it is necessary for the MPI configuration inside the container to be interoperable with the MPI configuration in the cluster’s environment. TACC’s customized base images take care of this complexity for us, but may not port to alternative systems, such as CAC’s TheCube and Hopper. In this case, we need to provide explicit instructions in the Dockerfile for how to set up MPI:

### start with ubuntu base image 
FROM ubuntu:18.04 

### install basics, python3, and modules needed for application 
RUN apt-get update && apt-get upgrade -y && apt-get install -y build-essential zlib1g-dev libjpeg-dev python3-pip openssh-server 
RUN pip3 install Pillow numpy pandas matplotlib cython 

### install openMPI version 4.0.5, consistent with Hopper & TheCube 
RUN wget 'https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.5.tar.gz' -O openmpi-4.0.5.tar.gz 
RUN tar -xzf openmpi-4.0.5.tar.gz openmpi-4.0.5; cd openmpi-4.0.5; ./configure --prefix=/usr/local; make all install 
RUN ldconfig 

### install mpi4py now that openmpi is installed 
RUN pip3 install mpi4py 

### add all code from current directory into “code” directory within container, and set as working directory 
ADD . /code
WORKDIR /code
ENV PATH "/code:$PATH"

### compile cython for this particular application 
RUN python3 setup.py build_ext --inplace 

### set python file as executable so it can be run by docker/singularity 
RUN chmod +rx /code/run_reservoir_sim.py 

### change username from root 
RUN useradd -u 8877 <username>

For this container, we start with a base Ubuntu image rather than TACC’s specialized image. We then install several important libraries and Python modules. The third block of commands downloads and installs OpenMPI version 4.0.5, which is the version installed on Hopper. TheCube has an older version, 3.1.4, but I found version 4.0.5 to be backwards compatible and to work on both CAC clusters. The rest of the commands are equivalent to the first Dockerfile. We can now build this second container and push it to Dockerhub as a separate version, 0.2.

docker build -f Dockerfile -t <username>/cython_reservoir:0.2 .

We can run this container on our local machine using the same command from above, but substituting 0.2 for 0.1. Similarly, we can log onto TheCube or Hopper and execute the same commands from above, again substituting 0.2 for 0.1. Note that the singularity module on CAC clusters is “singularity” rather than “tacc-singularity”. “submit_reservoir_cube.sh” and “submit_reservoir_hopper.sh” can be used to submit to the SLURM scheduler on these systems.

Thus, while MPI does compromise container portability somewhat, it is relatively straightforward to incorporate separate builds for different HPC systems using custom Dockerfiles. With the Dockerfile in place for each system, a streamlined workflow can be developed that allows for easy project development and deployment across multiple HPC systems.

Many thanks to TACC staff for teaching me all about containers at a recent virtual workshop, and to CAC staff who helped install and troubleshoot Singularity on their system.

Continuous Deployment with GitHub Actions (or, What Gives Life to a Living eBook?)


Last month we introduced our free eBook Addressing Uncertainty in MultiSector Dynamics Research. Did you know that, to date, we have published 74 revisions? Without an automated release and deployment process, this would have been very tedious! But, with the help of GitHub Actions, every small edit, minor clarification, or major improvement can go live within moments of merging the code. Read on to learn how the eBook team leverages GitHub Actions for CI/CD (Continuous Integration / Continuous Delivery).

GitHub Workflow

A reliable CI/CD strategy depends on a robust code review process⁠—continuously delivering bugs and typos will not impress anyone! There are many acceptable workflows; the best one to use will depend on team composition and codebase. In our case, a feature branching workflow suffices:

Feature Branching Workflow

The feature branching workflow consists of a main code branch representing published assets. New features or bug fixes require authors to create a new branch from main, implement their changes, and then submit a pull request back into main. A pull request is a formal code review process that gives other authors a chance to provide feedback on the proposed changes. Once consensus has been reached, the feature branch is merged into the main branch, kicking off the CI/CD process.

Automation

While the code review process is designed to catch content and conceptual errors, subtle process and system based errors can often slip through the cracks. Thus the first step in the CI/CD process should be running a suite of automated tests that span a range of systems, behaviors, and any known pain points. The depth and breadth of these tests should be sufficient to ensure an adequate degree of publication readiness without being overly burdensome to maintain. The test suite can grow over time!

For the eBook, our tests simply ensure that the Python dependencies install correctly on Linux, Mac, and Windows containers, that the supporting code can be imported on these systems without error, and that the HTML and PDF versions of the publication generate successfully. If any tests fail, the publication process is cancelled and the authors are notified with details of the failure.
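For illustration, a minimal “smoke test” of the kind described above might look like the sketch below, runnable with pytest. The package name msdbook is an assumption for illustration; the eBook’s actual test suite lives in its repository.

# test_imports.py -- a minimal sketch of an import/smoke test of the kind
# described above. The module name "msdbook" is assumed for illustration.
import importlib

def test_supporting_code_imports():
    # the publication's supporting Python package should import without error
    assert importlib.import_module("msdbook") is not None

def test_core_dependencies_import():
    # core scientific dependencies should be installed and importable
    for name in ("numpy", "pandas", "matplotlib"):
        assert importlib.import_module(name) is not None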

This test, release, and publication process is orchestrated by GitHub Actions. Read on to learn more!

GitHub Actions

GitHub Actions are available to any project hosted in GitHub—totally free for public repositories, and free to a limited extent for private repositories. An Action can be defined as a unit of work performed using a virtual machine in response to an event. In GitHub, Actions are defined using YAML files placed into the .github/workflows directory of a repository. YAML (YAML Ain’t Markup Language) is a concise, human-readable syntax for conveying semi-structured data. The minimal content of a GitHub Action includes a name, one or more event triggers, and one or more jobs. A job consists of a name, one or more virtual machine targets, and one or more steps. For example, the eBook test Action looks like this:

.github/workflows/01_test.yml
name: Test
on:
  push:
    branches: [ main ]
jobs:
  test:
      runs-on: ${{ matrix.os }}
      strategy:
        matrix:
          os: [ubuntu-latest, macos-latest, windows-latest]
      steps:
        - uses: actions/checkout@v2
        - name: Set up Python
          uses: actions/setup-python@master
          with:
            python-version: 3.9
        - name: Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -r requirements.txt
        - name: Run tests
          run: |
            pip install pytest
            pytest

There is a lot going on here, so let’s take it step by step!

name: Test
on:
  push:
    branches: [ main ]

This snippet gives our Action a name, and specifies that it should trigger on updates to the main branch of the repository.

jobs:
  test:
      runs-on: ${{ matrix.os }}
      strategy:
        matrix:
          os: [ ubuntu-latest, macos-latest, windows-latest ]

This snippet defines a job within our “Test” Action named “test”, and then uses special syntax to declare that the job should be run on three different virtual machines: the latest Ubuntu Linux, macOS, and Windows containers. Running the tests on multiple operating systems helps catch bugs with system-specific dependencies.

      steps:
        - uses: actions/checkout@v2
        - name: Set up Python
          uses: actions/setup-python@master
          with:
            python-version: 3.9
        - name: Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -r requirements.txt
        - name: Run tests
          run: |
            pip install pytest
            pytest

This snippet outlines the actual units of work within the job; each “-” separates a unique task. The uses syntax is special in that it allows one to leverage tasks written by others hosted in the GitHub Actions Marketplace. The actions/checkout@v2 task clones the repository onto the virtual machine, and the actions/setup-python@master task installs and configures the specified Python version. The final two steps use the run directive to invoke custom code, in this case installing Python dependencies and running the Python test suites.

Deployment

Once the tests successfully pass, it’s time to publish! Since the eBook is essentially a web app, GitHub Pages is the perfect deployment platform. Pages hosts the content of a branch as a website, and is free for public repositories.

If you followed along with the previous eBook post, you learned about the Python, Sphinx, and reStructuredText workflow for compiling the eBook content into a polished product. Let’s create a GitHub Action to compile the eBook and deploy it to GitHub Pages! Here’s the full YAML file:

.github/workflows/02_deploy.yml
name: Deploy
on:
  push:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install latex dependencies
        run: sudo apt-get update -y && sudo apt-get install -y texlive latexmk texlive-latex-recommended texlive-latex-extra texlive-fonts-recommended ghostscript
      - name: Update pip and install python dependencies
        working-directory: 'docs/'
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Build html and pdf ebook
        working-directory: 'docs/'
        run: |
          make html latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
        continue-on-error: true
      - name: Get current datetime in ISO format
        id: date
        run: echo "::set-output name=date::$(date -u +'%Y-%m-%d')"
      - name: Create GitHub release
        id: gh_release
        uses: softprops/action-gh-release@v1
        with:
          files: docs/build/latex/addressinguncertaintyinmultisectordynamicsresearch.pdf
          tag_name: ${{ steps.date.outputs.date }}v${{ github.run_number }}
      - name: Commit the compiled files
        run: |
          cd docs/build/html
          git init
          git add -A
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git commit -m 'deploy' -a || true
      - name: Push changes to gh-pages
        uses: ad-m/github-push-action@master
        with:
          branch: gh-pages
          directory: docs/build/html
          force: true
          github_token: ${{ secrets.GITHUB_TOKEN }}

A lot to unpack here! Let’s take it step by step. As before, we start by naming the Action, triggering it on updates to the main branch, declaring that it should run only on an ubuntu-latest virtual machine, checking out the code, and setting up Python. Then we get into the new job steps:

      - name: Install latex dependencies
        run: sudo apt-get update -y && sudo apt-get install -y texlive latexmk texlive-latex-recommended texlive-latex-extra texlive-fonts-recommended ghostscript

This step installs all the operating system dependencies needed to support the Latex syntax and compilation to PDF. There was some trial and error involved in getting this right, but once correct it should be pretty stable.

      - name: Build html and pdf ebook
        working-directory: 'docs/'
        run: |
          make html latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
        continue-on-error: true

This step runs the Sphinx makefile to compile the HTML and PDF versions of the eBook. The verbosity and repetitiveness of these commands works around some unusual oddities of the Latex and PDF compilation. --keep-going LATEXMKOPTS="-interaction=nonstopmode" prevents the command from waiting for user input. || true and the repeated make latexpdf lines allow the PDF engine to fully resolve all the references in the reStructuredText files; otherwise the PDF file would be incomplete and garbled (this one stumped us for a while!).

      - name: Get current datetime in ISO format
        id: date
        run: echo "::set-output name=date::$(date -u +'%Y-%m-%d')"
      - name: Create GitHub release
        id: gh_release
        uses: softprops/action-gh-release@v1
        with:
          files: docs/build/latex/addressinguncertaintyinmultisectordynamicsresearch.pdf
          tag_name: ${{ steps.date.outputs.date }}v${{ github.run_number }}

To make it easier to chronologically place our eBook releases, we wanted to include a date stamp in our version tags. The first step above assigns the date to a variable. The second step above creates and tags an official GitHub release (using the date and an auto-incrementing run number), and includes the PDF version of the eBook as an asset attached to the release.

      - name: Commit the compiled files
        run: |
          cd docs/build/html
          git init
          git add -A
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git commit -m 'deploy' -a || true
      - name: Push changes to gh-pages
        uses: ad-m/github-push-action@master
        with:
          branch: gh-pages
          directory: docs/build/html
          force: true
          github_token: ${{ secrets.GITHUB_TOKEN }}

These two steps cause the GitHub Actions user to commit the compiled HTML files and force push them to the gh-pages branch of our repository, using a secret token. This is a common “hack” to enable publishing only the desired web assets and not the entire repository. Never force push to other shared code branches!

Action Status

Check the status of the CI/CD pipeline using the Actions tab of the GitHub repository. Successful Actions show a check mark, in progress Actions show a spinner, and failed Actions show an X. Clicking into a particular Action will show more details, log messages, and error traces if relevant.

The GitHub Actions Tab

Slack Integration

To take our workflow to the next level (and to avoid the need to read even more email 😅 ), we added the GitHub app to our eBook Slack channel. This adds a bot that can subscribe to GitHub repositories and report on activity; for instance: new issues, new pull requests, and new releases. We can then discuss and iterate inline in Slack, without having to jump to other apps or sites.

GitHub Slack Integration

To add the GitHub bot to a channel, right click on the channel name, select “Open channel details”, and navigate to the “Integrations” tab. From here, you can choose “Add apps” and search for GitHub. Once added, type a bot command such as /github subscribe [repository name] to start receiving notifications in the channel. The bot can also be used to open and close issues!

Conclusion

Using the GitHub Actions workflow to automate the testing and publication of our eBook enabled our team to focus more on the content and quality of each release rather than stumbling over the intricacies of the publication process. CI/CD has been a powerful tool in software communities, but it can greatly benefit researchers and academics as well. We hope the lessons presented above will speed you along in your own workflows!

MORDM VIII: Characterizing the effects of deep uncertainty

In the previous post, we defined robustness using a satisficing metric where (1) reliability should be at least 98%, (2) restriction frequency should be no more than 10%, and (3) the worst-case cost of drought mitigation actions should be no more than 10% of annual net volumetric income. To calculate the robustness of each set of actions (portfolio) against challenging states of the world (SOWs), we once again re-simulated the portfolios to discover how they fail.

In this penultimate post, we will perform a simple sensitivity analysis of the average performance of all sixty-nine portfolios of actions to understand which uncertainties control the performance of each utility (Raleigh, Durham and Cary) and the region as a whole across all uncertain SOWs.

Calculating average performance across 100 DU SOWs

First, we create a new folder to hold the output of the next few post-processing steps. Navigate to the WaterPaths/ folder and create a folder called post_processing. Now, let’s calculate the average performance of each of the sixty-nine initial portfolios across the 100 DU SOWs that we previously simulated them over. The code for this can be found in the gather_objs.py file within the post_processing_code folder and should look like this:

# -*- coding: utf-8 -*-
"""
Created on Mon April 26 2022 11:12

@author: Lillian Bei Jia Lau
Organizes output objectives by mean across RDMs and across solutions

"""
import numpy as np

obj_names = ['REL_C', 'RF_C', 'INF_NPC_C', 'PFC_C', 'WCC_C', \
        'REL_D', 'RF_D', 'INF_NPC_D', 'PFC_D', 'WCC_D', \
        'REL_R', 'RF_R', 'INF_NPC_R', 'PFC_R', 'WCC_R', \
        'REL_reg', 'RF_reg', 'INF_NPC_reg', 'PFC_reg', 'WCC_reg']

'''
Performs regional minimax
'''
def minimax(N_SOLNS, objs):
    for i in range(N_SOLNS):
        for j in range(5):
            if j == 0:
                objs[i,15] = np.min([objs[i,0],objs[i,5], objs[i,10]])
            else:
                objs[i, (j+15)] = np.max([objs[i,j],objs[i,j+5], objs[i,j+10]])
    return objs

'''
Calculates the mean performance across all SOWs for each solution
'''
def mean_performance_across_rdms(objs_by_rdm_dir, N_RDMS, N_SOLNS):
    objs_matrix = np.zeros((N_SOLNS,20,N_RDMS), dtype='float')
    objs_means = np.zeros((N_SOLNS,20), dtype='float')

    for i in range(N_RDMS):
        filepathname = objs_by_rdm_dir + str(i) + '_sols0_to_' + str(N_SOLNS) + '.csv'
        objs_file = np.loadtxt(filepathname, delimiter=",")
        objs_matrix[:,:15,i] = objs_file

        objs_file_wRegional = minimax(N_SOLNS, objs_matrix[:,:,i])

        objs_matrix[:,:,i] = objs_file_wRegional

        array_has_nan = np.isnan(np.mean(objs_matrix[:,3,i]))
        if(array_has_nan == True):
            print('NaN found at RDM ', str(i))

    for n in range(N_SOLNS):
        for n_objs in range(20):
            objs_means[n,n_objs] = np.mean(objs_matrix[n,n_objs,:])

    return objs_means

'''
Calculates the mean performance across all solutions for each SOW
'''
def mean_performance_across_solns(objs_by_rdm_dir, N_RDMS, N_SOLNS):
    objs_matrix = np.zeros((N_SOLNS,20,N_RDMS), dtype='float')
    objs_means = np.zeros((N_RDMS,20), dtype='float')

    for i in range(N_RDMS):
        filepathname = objs_by_rdm_dir + str(i) + '_sols0_to_' + str(N_SOLNS) + '.csv'
        objs_file = np.loadtxt(filepathname, delimiter=",")
        objs_matrix[:,:15,i] = objs_file
        objs_file_wRegional = minimax(N_SOLNS, objs_matrix[:,:,i])

        objs_matrix[:,:,i] = objs_file_wRegional

        array_has_nan = np.isnan(np.mean(objs_matrix[:,3,i]))
        if(array_has_nan == True):
            print('NaN found at RDM ', str(i))

    for n in range(N_RDMS):
        for n_objs in range(20):
            objs_means[n,n_objs] = np.mean(objs_matrix[:,n_objs,n])

    return objs_means

# change number of solutions available depending on the number of solutions
# that you identified
N_SOLNS = 69
N_RDMS = 100

# change the filepaths
objs_by_rdm_dir = '/yourFilePath/WaterPaths/output/Objectives_RDM'
objs_og_dir = '/yourFilePath/WaterPaths/'

fileoutpath = '/yourFilePath/WaterPaths/post_processing/'

fileoutpath_byRDMs = fileoutpath + 'meanObjs_acrossRDM.csv'
fileoutpath_bySoln = fileoutpath + 'meanObjs_acrossSoln.csv'

# should have shape (N_SOLNS, 20)
objs_byRDM = mean_performance_across_rdms(objs_by_rdm_dir, N_RDMS, N_SOLNS)
# should have shape (N_RDMS, 20)
objs_bySoln = mean_performance_across_solns(objs_by_rdm_dir, N_RDMS, N_SOLNS)

np.savetxt(fileoutpath_byRDMs, objs_byRDM, delimiter=",")
np.savetxt(fileoutpath_bySoln, objs_bySoln, delimiter=",")

This will output two .csv files: meanObjs_acrossRDM.csv contains the average performance of each of the sixty-nine portfolios, averaged over the 100 DU SOWs, while meanObjs_acrossSoln.csv contains the average performance within each DU SOW, averaged over all sixty-nine portfolios. Take some time to understand this difference, as it will be important when performing sensitivity analysis and scenario discovery.
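To make the distinction concrete, a quick check of the two output files shows their different shapes (paths below are relative to the WaterPaths/ folder; adjust them to your setup):

import numpy as np

# each row is one of the 69 portfolios, averaged over the 100 DU SOWs
objs_by_portfolio = np.loadtxt('post_processing/meanObjs_acrossRDM.csv', delimiter=',')
print(objs_by_portfolio.shape)   # expected: (69, 20)

# each row is one of the 100 DU SOWs, averaged over the 69 portfolios
objs_by_sow = np.loadtxt('post_processing/meanObjs_acrossSoln.csv', delimiter=',')
print(objs_by_sow.shape)         # expected: (100, 20)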

Calculate the robustness of each portfolio to deep uncertainty

Now, let’s identify how each of these solutions perform under a set of more challenging SOWs. Within post_processing_code/, identify the file called calc_robustness_across_rdms.py. It should look like this:

# -*- coding: utf-8 -*-
"""
Created on Mon April 26 2022 11:12

@author: Lillian Bei Jia Lau

Calculates the fraction of RDMs over which each perturbed version of the solution meets all four satisficing criteria
"""
import numpy as np
import pandas as pd

obj_names = ['REL_C', 'RF_C', 'INF_NPC_C', 'PFC_C', 'WCC_C', \
        'REL_D', 'RF_D', 'INF_NPC_D', 'PFC_D', 'WCC_D', \
        'REL_R', 'RF_R', 'INF_NPC_R', 'PFC_R', 'WCC_R', \
        'REL_reg', 'RF_reg', 'INF_NPC_reg', 'PFC_reg', 'WCC_reg']

utilities = ['Cary', 'Durham', 'Raleigh', 'Regional']

'''
Performs regional minimax
'''
def minimax(N_SOLNS, objs):
    for i in range(N_SOLNS):
        for j in range(5):
            if j == 0:
                objs[i,15] = np.min([objs[i,0],objs[i,5], objs[i,10]])
            else:
                objs[i, (j+15)] = np.max([objs[i,j],objs[i,j+5], objs[i,j+10]])
    return objs

'''
For each rdm, identify if the perturbed solution version x satisfies the satisficing criteria
'''
def satisficing(df_objs):
    for i in range(4):
        df_objs['satisficing_C'] = (df_objs['REL_C'] >= 0.98).astype(int) *\
                                    (df_objs['WCC_C'] <= 0.10).astype(int) *\
                                    (df_objs['RF_C'] <= 0.10).astype(int)

        df_objs['satisficing_D'] = (df_objs['REL_D'] >= 0.98).astype(int) *\
                                    (df_objs['WCC_D'] <= 0.10).astype(int) *\
                                    (df_objs['RF_D'] <= 0.10).astype(int)

        df_objs['satisficing_R'] = (df_objs['REL_R'] >= 0.98).astype(int) *\
                                    (df_objs['WCC_R'] <= 0.10).astype(int) *\
                                    (df_objs['RF_R'] <= 0.10).astype(int)

    df_objs['satisficing_reg'] = np.max(df_objs.iloc[:, 20:23])
    return df_objs

def calc_robustness(objs_by_rdm_dir, N_RDMS, N_SOLNS):

    # matrix structure: (N_SOLNS, N_OBJS, N_RDMS)
    objs_matrix = np.zeros((N_SOLNS,20,N_RDMS), dtype='float')

    satisficing_matrix = np.zeros((N_SOLNS,4,N_RDMS), dtype='float')
    solution_robustness = np.zeros((N_SOLNS,4), dtype='float')

    for i in range(N_RDMS):
        # get one perturbed instance's behavior over all RDMs
        filepathname = objs_by_rdm_dir + str(i) + '_sols0_to_' + str(N_SOLNS) + '.csv'

        objs_file = np.loadtxt(filepathname, delimiter=",")

        objs_matrix[:,:15,i] = objs_file

        objs_file_wRegional = minimax(N_SOLNS, objs_matrix[:,:,i])

        objs_matrix[:,:,i] = objs_file_wRegional

        # NaN check
        array_has_nan = np.isnan(np.mean(objs_matrix[:,3,i]))
        if(array_has_nan == True):
            print('NaN found at RDM ', str(i))

    # for the perturbed instances
    for r in range(N_RDMS):

        df_curr_rdm = pd.DataFrame(objs_matrix[:, :, r], columns = obj_names)

        df_curr_rdm = satisficing(df_curr_rdm)
        satisficing_matrix[:N_SOLNS,:,r] = df_curr_rdm.iloc[:,20:24]

    for n in range(N_SOLNS):
        solution_robustness[n,0] = np.sum(satisficing_matrix[n,0,:])/N_RDMS
        solution_robustness[n,1] = np.sum(satisficing_matrix[n,1,:])/N_RDMS
        solution_robustness[n,2] = np.sum(satisficing_matrix[n,2,:])/N_RDMS

    solution_robustness[:,3] = np.min(solution_robustness[:,:3], axis=1)

    return solution_robustness

'''
Change number of solutions available depending on the number of solutions
that you identified and the number of SOWs that you are evaluating them over.
'''
N_RDMS = 100
N_SOLNS = 69

objs_by_rdm_dir = '/scratch/lbl59/blog/WaterPaths/output/Objectives_RDM'

fileoutpath_robustness = '/scratch/lbl59/blog/WaterPaths/post_processing/' + \
    'robustness_' + str(N_RDMS) + '_SOWs.csv'

robustness = calc_robustness(objs_by_rdm_dir, N_RDMS, N_SOLNS)

np.savetxt(fileoutpath_robustness, robustness, delimiter=",")

When you run this script from the terminal, you should have a .csv file called 'robustness_100_SOWs.csv' appear in your post_processing/ folder. Now, let’s get onto some sensitivity analysis!

Delta moment-independent sensitivity analysis

The Delta moment-independent (DMI) method (Borgonovo, 2007) is a sensitivity analysis method that compares the entire probability distributions of the input and output parameters to estimate the sensitivity of the output to a specific input parameter. It is one of many global sensitivity analysis methods; global sensitivity analysis is one of the two main categories of sensitivity analysis, and it enables the assessment of the degree to which uncertainty in model inputs maps to uncertainty in model outputs. For the purposes of this test case, the DMI method is preferable as it does not rely on any one statistical moment (variance, mean, or skew) to describe the dependence of the model output on its input parameters. It is also time-sensitive, reflecting the current state of knowledge within the system, which philosophically pairs well with our use of the ROF triggers. More information on alternative global sensitivity methods can be found here.
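As a minimal, self-contained illustration of the method call itself (using synthetic data unrelated to the test case), SALib’s delta module can be used as follows:

import numpy as np
from SALib.analyze import delta

# synthetic example: 500 samples of 3 inputs, output driven mostly by x0
rng = np.random.default_rng(7)
X = rng.uniform(size=(500, 3))
Y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

problem = {
    'num_vars': 3,
    'names': ['x0', 'x1', 'x2'],
    'bounds': [[0.0, 1.0]] * 3
}

# returns a dict with 'delta' (moment-independent indices) and 'S1' (first-order Sobol indices)
Si = delta.analyze(problem, X, Y, num_resamples=10, conf_level=0.95, print_to_console=True)
print(Si['delta'])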

Within the context of our test case, we will be using the DMI method to identify uncertainties in our decision variables that most strongly influence our performance over the 100 DU SOWs. To perform DMI sensitivity analysis, first navigate to the post_processing/ folder. Within the folder, create two sub-folders called delta_output_DV/ and delta_output_DUF/. This is where all your DMI output will be stored. Next, locate the delta_sensitivity.py file within the post_processing_code/ folder. The code should look similar to the script provided below:

import sys
from SALib.analyze import delta
from SALib.util import read_param_file
import numpy as np
import pandas as pd

'''
Finds the upper and lower bounds of input parameters
'''
def find_bounds(input_file):
    bounds = np.zeros((input_file.shape[1],2), dtype=float)
    for i in range(input_file.shape[1]):
        bounds[i,0] = min(input_file[:,i])
        bounds[i,1] = max(input_file[:,i])

    return bounds
'''
Performs delta moment-independent sensitivity analysis
Source: https://github.com/SALib/SALib/tree/main/examples/delta
'''
def delta_sensitivity(dec_vars, measured_outcomes, names, mo_names, bounds, rdm, mode):
    X = dec_vars
    Y = measured_outcomes

    problem = {
        'num_vars': int(dec_vars.shape[1]),
        'names': names,
        'bounds': bounds
    }

    for i in range(measured_outcomes.shape[1]):
        mo_label = mo_names[i]
        if i == 2 and mode == 'objs':
          break
        else:
          filename = '../post_processing/delta_output_' + rdm + '/S1_' + mo_label + '.csv'
          S1 = delta.analyze(problem, X, Y[mo_label].values, num_resamples=10, conf_level=0.95, print_to_console=False)
        numpy_S1 = np.array(S1["S1"])
        fileout = pd.DataFrame([names, numpy_S1], index = None, columns = None)
        fileout.to_csv(filename, sep=",")

'''
0 - Name all file headers and compSol to be analyzed
'''
obj_names = ['REL_C', 'RF_C', 'INF_NPC_C', 'PFC_C', 'WCC_C', \
        'REL_D', 'RF_D', 'INF_NPC_D', 'PFC_D', 'WCC_D', \
        'REL_R', 'RF_R', 'INF_NPC_R', 'PFC_R', 'WCC_R', \
        'REL_reg', 'RF_reg', 'INF_NPC_reg', 'PFC_reg', 'WCC_reg']

dv_names = ['RT_C', 'RT_D', 'RT_R', 'TT_D', 'TT_R', 'LMA_C', 'LMA_D', 'LMA_R',\
            'RC_C', 'RC_D', 'RC_R', 'IT_C', 'IT_D', 'IT_R', 'IP_C', 'IP_D', \
            'IP_R', 'INF_C', 'INF_D', 'INF_R']

rdm_headers_dmp = ['Cary restr. eff', 'Durham restr. eff', 'Raleigh restr. eff']
rdm_headers_utilities = ['Demand growth\nmultiplier', 'Bond term\nmultiplier', \
                        'Bond interest\nrate multiplier', 'Infrastructure interest\nrate multiplier']
rdm_headers_ws = ['Streamflow amp', 'SCR PT', 'SCR CT', 'NRR PT', 'NRR CT', 'CR Low PT', 'CR Low CT',\
                  'CR High PT', 'CR High CT', 'WR1 PT', 'WR1 CT', 'WR2 PT', 'WR2 CT',\
                  'DR PT', 'DR CT', 'FR PT', 'FR CT']

duf_names = ['Cary restr. eff', 'Durham restr. eff', 'Raleigh restr. eff', 'Demand growth\nmultiplier',\
             'Bond term\nmultiplier', 'Bond interest\nrate multiplier', 'Infrastructure interest\nrate multiplier',\
             'Streamflow amp\nmultiplier', 'SCR PT\nmultiplier', 'SCR CT\nmultiplier', 'NRR PT\nmultiplier',\
             'NRR CT\nmultiplier', 'CR Low PT\nmultiplier', 'CR Low CT\nmultiplier', 'CR High PT\nmultiplier',\
             'CR High CT\nmultiplier', 'WR1 PT\nmultiplier', 'WR1 CT\nmultiplier', 'WR2 PT\nmultiplier',\
             'WR2 CT\nmultiplier', 'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier',\
             'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier']

utilities = ['Cary', 'Durham', 'Raleigh', 'Regional']

N_RDMS = 100
N_SOLNS = 69

'''
1 - Load DU factor files and DV files
'''
# change to your own filepath
rdm_factors_directory = '/yourFilePath/WaterPaths/TestFiles/'
rdm_dmp_filename = rdm_factors_directory + 'rdm_dmp_test_problem_reeval.csv'
rdm_utilities_filename = rdm_factors_directory + 'rdm_utilities_test_problem_reeval.csv'
rdm_watersources_filename = rdm_factors_directory + 'rdm_water_sources_test_problem_reeval.csv'

rdm_dmp = pd.read_csv(rdm_dmp_filename, sep=",", names=rdm_headers_dmp)
rdm_utilities = pd.read_csv(rdm_utilities_filename, sep=",", names=rdm_headers_utilities)
rdm_ws_all = np.loadtxt(rdm_watersources_filename, delimiter=",")
rdm_ws = pd.DataFrame(rdm_ws_all[:,:17], columns=rdm_headers_ws)

dufs = pd.concat([rdm_dmp, rdm_utilities, rdm_ws], axis=1, ignore_index=True)
duf_numpy = dufs.to_numpy()

# change to your own filepath
dv_directory = '/yourFilePath/WaterPaths/'
dv_filename = dv_directory + 'NC_dvs_all_noheader.csv'
dvs = np.loadtxt(dv_filename, delimiter=",")

'''
2 - Get bounds for DU factors and DVs
'''
duf_bounds = find_bounds(duf_numpy)
dv_bounds = find_bounds(dvs)

'''
3 - Load robustness file and objectives file
'''
# change to your own filepath
main_dir = '/yourFilePath/WaterPaths/post_processing/'

robustness_filename = main_dir + 'robustness_100_SOWs.csv'
robustness_arr = np.loadtxt(robustness_filename, delimiter=",")
robustness_df = pd.DataFrame(robustness_arr, columns=utilities)

objs_mean_rdm_filename = main_dir + 'meanObjs_acrossRDM.csv'
objs_mean_rdm_arr = np.loadtxt(objs_mean_rdm_filename, delimiter=",")
objs_mean_rdm_df = pd.DataFrame(objs_mean_rdm_arr, columns=obj_names)

objs_mean_soln_filename = main_dir + 'meanObjs_acrossSoln.csv'
objs_mean_soln_arr = np.loadtxt(objs_mean_soln_filename, delimiter=",")
objs_mean_soln_df = pd.DataFrame(objs_mean_soln_arr, columns=obj_names)

# to change  depending on whether DV or DUF is being analyzed
dec_vars = dvs
measured_outcomes = objs_mean_rdm_df
names = dv_names
mo_names = obj_names
bounds = dv_bounds
rdm = 'DV'
mode = 'objs'
###

delta_sensitivity(dec_vars, measured_outcomes, names, mo_names, bounds, rdm, mode)

The code above identifies the sensitivity of the average values of all sixty-nine sets of performance objectives over all 100 deeply uncertain SOWs to the decision variables. This is why we use the meanObjs_acrossRDM.csv file – this file contains sixty-nine sets of mean performance objective values, each of which maps back to its corresponding portfolio of actions.

To identify the sensitivity of performance objectives to the DU factors, change lines 121 to 127 to the following:

# to change  depending on whether DV or DUF is being analyzed
dec_vars = duf_numpy[:100,:]
measured_outcomes = objs_mean_soln_df
names = duf_names
mo_names = obj_names
bounds = duf_bounds[:100,:]
rdm = 'DUF'
mode = 'objs'
###

The code above identifies the sensitivity of the average values of all twenty performance objectives over each of the sixty-nine different portfolios to the set of deeply uncertain hydroclimatic and demand scenarios. This is why we use the meanObjs_acrossSoln.csv file – this file contains one hundred sets of mean performance objective values, each of which maps back to its corresponding DU SOW.

Great job so far! Now let’s visualize the sensitivity of our output to our input parameters using heatmaps. Before being able to visualize each utility’s performance sensitivity, we must first organize the sensitivity indices of the decision variables into a file containing the indices over all objectives for each utility. The gather_delta.py script does this. Simply change the value of mode on line 11 to 'DUF' to gather the indices for the DU factors.

"""
Created on Tue April 26 2022 16:12

@author: Lillian Bei Jia Lau

Gathers the delta sensitivity indices into files per utility
"""
import numpy as np
import pandas as pd

mode = 'DV'
main_dir = '/yourFilePath/WaterPaths/post_processing/delta_output_' + mode + '/'
utilities = ['_C', '_D', '_R', '_reg']
objs = ['REL', 'RF', 'INF_NPC', 'PFC', 'WCC']
utilities_full = ['Cary', 'Durham', 'Raleigh', 'Regional']

dv_names = ['RT_C', 'RT_D', 'RT_R', 'TT_D', 'TT_R', 'LMA_C', 'LMA_D', 'LMA_R',\
            'RC_C', 'RC_D', 'RC_R', 'IT_C', 'IT_D', 'IT_R', 'IP_C', 'IP_D', \
            'IP_R', 'INF_C', 'INF_D', 'INF_R']

duf_names = ['Cary restr. eff', 'Durham restr. eff', 'Raleigh restr. eff', 'Demand growth\nmultiplier',\
             'Bond term\nmultiplier', 'Bond interest\nrate multiplier', 'Infrastructure interest\nrate multiplier',\
             'Streamflow amp\nmultiplier', 'SCR PT\nmultiplier', 'SCR CT\nmultiplier', 'NRR PT\nmultiplier',\
             'NRR CT\nmultiplier', 'CR Low PT\nmultiplier', 'CR Low CT\nmultiplier', 'CR High PT\nmultiplier',\
             'CR High CT\nmultiplier', 'WR1 PT\nmultiplier', 'WR1 CT\nmultiplier', 'WR2 PT\nmultiplier',\
             'WR2 CT\nmultiplier', 'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier',\
             'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier']

s1_dv_cary = np.zeros((len(objs), len(dv_names)), dtype=float)
s1_dv_durham = np.zeros((len(objs), len(dv_names)), dtype=float)
s1_dv_raleigh = np.zeros((len(objs), len(dv_names)), dtype=float)
s1_dv_regional = np.zeros((len(objs), len(dv_names)), dtype=float)

s1_dv_dict = {
    '_C': s1_dv_cary,
    '_D': s1_dv_durham,
    '_R': s1_dv_raleigh,
    '_reg': s1_dv_regional
}

s1_duf_cary = np.zeros((len(objs), len(duf_names)), dtype=float)
s1_duf_durham = np.zeros((len(objs), len(duf_names)), dtype=float)
s1_duf_raleigh = np.zeros((len(objs), len(duf_names)), dtype=float)
s1_duf_regional = np.zeros((len(objs), len(duf_names)), dtype=float)

s1_duf_dict = {
    '_C': s1_duf_cary,
    '_D': s1_duf_durham,
    '_R': s1_duf_raleigh,
    '_reg': s1_duf_regional
}

for i in range(len(utilities)):
    s1_util = []
    hdrs = []
    if mode == 'DV':
        s1_util = s1_dv_dict[utilities[i]]
        hdrs = dv_names
    elif mode == 'DUF':
        s1_util = s1_duf_dict[utilities[i]]
        hdrs = duf_names

    for j in range(len(objs)):
        curr_file = main_dir + 'S1_' + objs[j] + utilities[i] + '.csv'
        s1_util[j, :] = pd.read_csv(curr_file, sep=',', skiprows=2, header=None).iloc[0,1:]

    s1_util_df = pd.DataFrame(s1_util, columns=hdrs)
    out_filepath = main_dir + utilities_full[i] + '.csv'

    s1_util_df.to_csv(out_filepath, sep=',', index=False)

Now, let’s plot our heatmaps! The code to do so can be found in sensitivity_heatmap.py, and should look similar to the code provided below:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import AxesGrid

sns.set_theme()

# change depending on compromise solution and whether its sensitivity to DUF or DVs
mode = 'DUF'
rot = 90    # if DV use 0; if DUF use 45
main_dir = '/YourFilePath/WaterPaths/post_processing/delta_output_' + mode + '/'
c_filepath = main_dir + 'Cary.csv'
d_filepath = main_dir + 'Durham.csv'
r_filepath = main_dir + 'Raleigh.csv'
reg_filepath = main_dir + 'Regional.csv'

cary = pd.read_csv(c_filepath, index_col=False, header=0)
durham = pd.read_csv(d_filepath, index_col=False, header=0)
raleigh = pd.read_csv(r_filepath, index_col=False, header=0)
regional = pd.read_csv(reg_filepath, index_col=False, header=0)

savefig_dir = '/YourFilePath/WaterPaths/post_processing/'
savefig_name = savefig_dir + 'dmi_heatmap_' + mode + '.svg'

grid_kws = {"height_ratios": (0.20, 0.20, 0.20, 0.20, .02), "hspace": 0.5}
f, (ax1, ax2, ax3, ax4, cbar_ax) = plt.subplots(5, figsize=(15, 20), gridspec_kw=grid_kws)
plt.subplots_adjust(top = 0.95, bottom = 0.05,
            hspace = 0, wspace = 0.05)

y_objs=['REL', 'RF', 'INPC', 'PFC', 'WCC']

x_dvs=['$RT_{C}$', '$RT_{D}$', '$RT_{R}$', '$TT_{D}$', '$TT_{R}$', '$LM_{C}$', '$LM_{D}$', '$LM_{R}$',\
                '$RC_{C}$', '$RC_{D}$', '$RC_{R}$', '$IT_{C}$', '$IT_{D}$', '$IT_{R}$', '$IP_{C}$', \
                '$IP_{D}$', '$IP_{R}$','$INF_{C}$', '$INF_{D}$', '$INF_{R}$']
x_dufs = ['Cary\nrestr. eff', 'Durham\nrestr. eff', 'Raleigh\nrestr. eff', 'Dem. growth\nmultiplier',\
             'Bond term\nmultiplier', 'Bond interest\nrate multiplier', 'Infra. interest\nrate multiplier',\
             'Streamflow amp\nmultiplier', 'SCR PT\nmultiplier', 'SCR CT\nmultiplier', 'NRR PT\nmultiplier',\
             'NRR CT\nmultiplier', 'CR Low PT\nmultiplier', 'CR Low CT\nmultiplier', 'CR High PT\nmultiplier',\
             'CR High CT\nmultiplier', 'WR1 PT\nmultiplier', 'WR1 CT\nmultiplier', 'WR2 PT\nmultiplier',\
             'WR2 CT\nmultiplier', 'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier',\
             'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier']

x_labs = []
if mode == 'DV':
    x_labs = x_dvs
elif mode == 'DUF':
    x_labs = x_dufs

plt.rc('xtick', labelsize=1)
plt.rc('ytick', labelsize=3)
plt.rc('axes', labelsize=5)
plt.rc('axes', titlesize=14)

#ax1 = fig.add_subplot(411)
ax1.set_title("Cary")
sns.heatmap(cary, linewidths=.05, cmap="YlOrBr", xticklabels=[],
    yticklabels=y_objs, ax=ax1, cbar=False)
ax1.set_yticklabels(y_objs, rotation=0)

#ax2 = fig.add_subplot(412)
ax2.set_title("Durham")
sns.heatmap(durham, linewidths=.05, cmap="YlOrBr", xticklabels=[],
    yticklabels=y_objs, ax=ax2, cbar=False)
ax2.set_yticklabels(y_objs, rotation=0)

#ax3 = fig.add_subplot(413)
ax3.set_title("Raleigh")
sns.heatmap(raleigh, linewidths=.05, cmap="YlOrBr", xticklabels=[],
    yticklabels=y_objs, ax=ax3, cbar=False)
ax3.set_yticklabels(y_objs, rotation=0)

#ax4 = fig.add_subplot(414, fontsize=10)
ax4.set_title("Regional")
ax4 = sns.heatmap(regional, linewidths=.05, cmap="YlOrBr", xticklabels=x_labs,
    yticklabels=y_objs, ax=ax4, cbar=True, cbar_ax=cbar_ax,
    cbar_kws={'orientation': 'horizontal'})     # change depending on whether analyzing DUF or DV
ax4.set_xticklabels(x_labs, rotation=rot, fontsize=10)
ax4.set_yticklabels(y_objs, rotation=0)

plt.savefig(savefig_name)

Running this for the sensitivity to decision variables and DU factors will generate the following images:

Sensitivity of performance objectives to decision variables.

In the figure above, the color of each box represents the sensitivity of a performance objective (y-axis) to a specific decision variable (x-axis). It is interesting to note that the restriction triggers (RT) of all utilities strongly influence each of their individual and regional reliability and restriction frequency. This indicates the potential for regional conflict, as possible errors in operating one utility’s restriction trigger may adversely affect other utilities’ reliabilities and their ability to maintain full control over their own use of water-use restrictions. Furthermore, Raleigh’s performance is sensitive to more decision variables than its two counterparts, with its worst-case cost (WCC) being affected most by Cary’s infrastructure investments. This observation highlights the importance of careful cooperation between a region’s member utilities to ensure that all partners play their part in maintaining both their own and their counterparts’ performance.

Sensitivity of performance objectives to DU factors.

In this next figure, we observe that uncertainty in demand growth is the only DU factor that significantly drives changes in individual and regional performance. This finding can thus help utilities to focus on demand management programs, or formulate operations and management policies that enable them to more quickly adapt to changes in consumer and industrial demand growth.

Overall, in this post, we have performed a simple sensitivity analysis to identify uncertainties in the decision variables and DU factors that control regional and individual performance. All the code for processing the output data can be found in this GitHub repository here. In the next post, we will end the MORDM blogpost series by performing scenario discovery to map regions of success and failure as defined by our robustness metrics.

References

Borgonovo, E. (2007). A new uncertainty importance measure. Reliability Engineering & System Safety, 92(6), 771-784. doi: 10.1016/j.ress.2006.04.015

Herman, J. D., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2015). How should robustness be defined for water systems planning under change? Journal of Water Resources Planning and Management, 141(10), 04015012. doi:10.1061/(asce)wr.1943-5452.0000509

Reed, P.M., Hadjimichael, A., Malek, K., Karimi, T., Vernon, C.R., Srikrishnan, V., Gupta, R.S., Gold, D.F., Lee, B., Keller, K., Thurber, T.B, & Rice, J.S. (2022). Addressing Uncertainty in Multisector Dynamics Research [Book]. Zenodo. https://doi.org/10.5281/zenodo.6110623

A step-by-step tutorial for scenario discovery with gradient boosted trees

Our recently published eBook, Addressing Uncertainty in Multisector Dynamics Research, provides several interactive tutorials for hands on training in model diagnostics and uncertainty characterization. The purpose of this post is to expand upon these trainings by providing a tutorial demonstrating gradient boosted trees for scenario discovery. I’ll first provide some brief background on scenario discovery and gradient boosted trees, then demonstrate a Python implementation on a water supply planning problem. All code here is written in Python, but the workflow is model agnostic, and can be paired with simulation models in any language. I’ve included my code within the text below, but all code and data for this post can also be found in this git repository.

Scenario discovery with gradient boosted trees

In water resources planning and management, decision makers are often faced with uncertainty about how their system will change in the future. Traditionally, planners have confronted this uncertainty by developing prespecified narrative scenarios, which reduce the multitude of possible future conditions into a small subset of important future states of the world (a prominent example is the ‘scenario matrix framework’ used to evaluate climate change (O’Neill et al., 2014)). While this approach provides intuitive appeal, it may increase system vulnerability if future conditions do not evolve as decision makers expect (for a detailed critique of scenario based planning see Reed et al., 2022). This vulnerability is especially apparent for systems facing deep uncertainty, where decision makers do not know or cannot agree upon the probability density functions of key system inputs (Kwakkel et al., 2016).

Scenario discovery (Groves and Lempert, 2007) is an exploratory modeling centered approach that seeks to discover consequential future scenarios using computational experiments rather than relying on prespecified information. To perform scenario discovery, decision makers first identify a set of relevant uncertainties and their plausible ranges. Next, an ensemble of these uncertainties is developed by sampling across parameter ranges. Candidate policies are then evaluated across this ensemble and machine learning or data mining algorithms are used to examine which combinations of uncertainties generate vulnerability in the system. These combinations can then be used to develop narrative scenarios to inform implementation and monitoring efforts or new policy development.

A core element of the scenario discovery process is the algorithm used to classify future states of the world. Popular algorithms include PRIM, CART, and logistic regression. Recently, gradient boosted trees have been applied as an alternative classification algorithm. Gradient boosted trees have advantages over other scenario discovery algorithms because they can easily capture nonlinear and non-differentiable boundaries in the uncertainty space (which are particularly prevalent in water supply planning problems that have discrete capacity expansion options), are highly resistant to overfitting, and provide a clear means of ranking the importance of uncertain factors (Trindade et al., 2020). For a comprehensive overview of gradient boosted trees, see Bernardo’s post here.
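As a preview of what this classification step can look like in code, the sketch below fits scikit-learn’s GradientBoostingClassifier to synthetic data and ranks factor importance; the data and labels here are made up for illustration, and the actual Sedento Valley workflow is developed step by step in the rest of this post.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# synthetic stand-ins: 1000 sampled SOWs of 18 DU factors, and a boolean
# success/failure label for each SOW (here, failures driven by one factor)
rng = np.random.default_rng(1)
du_factors = rng.uniform(size=(1000, 18))
success = (du_factors[:, 3] < 0.7).astype(int)

# fit the boosted tree classifier to the success/failure boundary
gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbc.fit(du_factors, success)

# rank the DU factors by their importance to the classification
ranking = np.argsort(gbc.feature_importances_)[::-1]
print(ranking[:5])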

Test case: the Sedento Valley

To demonstrate gradient boosted trees for scenario discovery, we’ll use the Sedento Valley water supply planning test case (Trindade et al., 2020). In the Sedento Valley, three water utilities seek to discover cooperative water supply management and infrastructure investment portfolios to meet several conflicting objectives in a system facing deep uncertainty. In this post, we’ll investigate how these deep uncertainties (which include demand growth, the efficacy of water use restrictions, financial variables, and parameters governing infrastructure permitting and construction time) impact a utility’s ability to maintain three performance criteria: keeping reliability > 98%, restriction frequency < 20%, and worst case cost less than 10% of annual revenue. For simplicity, we’ll focus on one regional water utility named Watertown.

Step 1: create a sample of deeply uncertain states of the world

To start the scenario discovery process, we generate an ensemble of deep uncertainties that represent future states of the world (SOWs). Here, we’ll use Latin Hypercube Sampling with an implementation I found in the Surrogate Modeling Toolbox.

import numpy as np
from smt.sampling_methods import LHS

'''
This script will generate 1000 Latin Hypercube Samples (LHS)
of deeply uncertain system parameters for the Sedento Valley
'''


# create an array storing the ranges of deeply uncertain parameters
DU_factor_limits = np.array([
    [0.9, 1.1], # Watertown restriction efficacy 
    [0.9, 1.1], # Dryville restriction efficacy
    [0.9, 1.1], # Fallsland restriction efficacy
    [0.5, 2.0], # Demand growth rate multiplier
    [1.0, 1.2], # Bond term
    [0.6, 1.0], # Bond interest rate
    [0.6, 1.4], # Discount rate
    [0.75, 1.5], # New River Reservoir permitting time
    [1.0, 1.2], # New River Reservoir construction time
    [0.75, 1.5], # College Rock Reservoir (low) permitting time
    [1.0, 1.2], # College Rock Reservoir (low) construction time
    [0.75, 1.5], # College Rock Reservoir (high) permitting time
    [1.0, 1.2], # College Rock Reservoir (high) construction time
    [0.75, 1.5], # Water Reuse permitting time
    [1.0, 1.2], # Water Reuse construction time
    [0.8, 1.2], # Inflow amplitude
    [0.2, 0.5], # Inflow frequency
    [-1.57, 1.57]]) # Inflow phase

# Use the smt package to set up the LHS sampling
sampling = LHS(xlimits=DU_factor_limits)

# We will create 1000 samples
num = 1000

# Create the actual sample
x = sampling(num)

# save to a csv file
np.savetxt('DU_factors.csv', x, delimiter=',')

Step 2: Evaluate performance across SOWs

Next, we’ll evaluate the performance of our policy across the LHS sample of DU factors. For the Sedento Valley test case, we use WaterPaths, an open-source simulation system for integrated water supply portfolio management and infrastructure investment planning (for more see Trindade et al., 2020). This step is not included in the git repository because it requires high-performance computing resources for this system, but the results can be found in the “Model_output.csv” file. For simulation details, see Gold et al., 2022.

Step 3: Convert model output into a boolean array for classification

To perform classification, we need to convert the results of our simulations to a binary array classifying each SOW as a “success” or “failure” based on whether the policy met the performance criteria under the SOW. First, we define a small function to determine if an SOW meets a set of criteria, then we apply this function to our results. We also load the DU factor LHS sample.

# First, define a function to check whether performance criteria are met
def check_criteria(objectives, crit_objs, crit_vals):
    """
    Determines if an objective meets a given set of criteria for a set of SOWs

    inputs:
        objectives: np array of all objectives across a set of SOWs
        crit_objs: the column index of the objective in question
        crit_vals: an array containing [min, max] of the values 
    
    returns:
        meets_criteria: an numpy array containing the SOWs that meet both min and max criteria

    """
    
    # check max and min criteria for each objective
    meet_low = objectives[:, crit_objs] >= crit_vals[0]
    meet_high = objectives[:, crit_objs] <= crit_vals[1]

    # check if max and min criteria are met at the same time
    meets_criteria = np.hstack((meet_low, meet_high)).all(axis=1)

    return meets_criteria


##### Load data and pre-process #####

# load objectives and create input array of boolean values for SD input
Reeval_objectives = np.loadtxt('Model_output.csv', skiprows=1, delimiter=',')
REL = check_criteria(Reeval_objectives, [0], [.979, 1])
RF = check_criteria(Reeval_objectives, [1], [0, 0.10])
WCC = check_criteria(Reeval_objectives, [2], [0, 0.10])
SD_input = np.vstack((REL, RF, WCC)).all(axis=0)  # a SOW is a "success" only if all three criteria are met


# load DU factors
DU_factors = np.loadtxt('DU_factors.csv', delimiter=',')  # saved without a header row in Step 1, so no rows are skipped
DU_names = ['Watertown Rest. Eff.', 'Dryville Rest. Eff.', 'Fallsland Rest. Eff.',
            'Demand Growth Rate', 'Bond Term', 'Bond Interest',
            'Discount Rate', 'NRR Perm', 'NRR Const', 'CRR L Perm',
            'CRR L Const.',	'CRR H Perm.', 'CRR H Const.', 'WR1 Perm.',
             'WR1 Const.', 'Inflows A', 'Inflows m','Inflows p']

Step 4: Fit a boosted trees classifier

After we’ve formatted the data, we’re ready to perform boosted trees classification. There are several packages for boosted trees in Python; here we’ll use the implementation from scikit-learn. We’ll use an ensemble of 200 trees with a maximum depth of 3 and a learning rate of 0.1. These parameters need to be tuned for the individual problem; I found this nice post that goes into detail on parameter tuning.

##### Boosted Tree Classification #####

from sklearn.ensemble import GradientBoostingClassifier

# create a gradient boosted classifier object
gbc = GradientBoostingClassifier(n_estimators=200,
                                 learning_rate=0.1,
                                 max_depth=3)

# fit the classifier
gbc.fit(DU_factors, SD_input)
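
The original analysis fixes these hyperparameters by hand; as a hedged alternative, they could be tuned by cross-validation with scikit-learn’s GridSearchCV. The sketch below reuses the DU_factors and SD_input arrays defined earlier, and the parameter grid is purely illustrative:

from sklearn.model_selection import GridSearchCV

# illustrative hyperparameter grid (not the values used in the original analysis)
param_grid = {'n_estimators': [100, 200, 400],
              'learning_rate': [0.05, 0.1, 0.2],
              'max_depth': [2, 3, 4]}

# 5-fold cross-validated grid search over the DU factors and success/failure labels
tuner = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
tuner.fit(DU_factors, SD_input)

print(tuner.best_params_)          # best hyperparameter combination found
best_gbc = tuner.best_estimator_   # classifier refit with those parameters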

Step 5: Examine which DU factors have the most impact on performance criteria

Now we’re ready to examine the results of our classification. First, we’ll examine how important each DU factor is to the classification results generated by boosted trees. To rank the importance of each DU factor, we examine the percentage decrease in impurity of the ensemble of trees that is associated with each factor. In plain English, this is a measure of how helpful each DU factor is to correctly classifying SOWs. This information is generated during the fit of the classifier above and is easily accessible as an attribute of our scikit-learn classifier.

For our example, one deep uncertainty, demand growth rate, clearly stands out as the most influential, as shown in the figure below. A second, the restriction efficacy for Watertown (the utility we’re focusing on), also stands out as a higher level of importance. All other DU factors have little impact on the classification in this case.

##### Factor Ranking #####

from copy import deepcopy
import matplotlib.pyplot as plt

# Extract the feature importances
feature_importances = deepcopy(gbc.feature_importances_)

# rank the feature importances and plot
importances_sorted_idx = np.argsort(feature_importances)
sorted_names = [DU_names[i] for i in importances_sorted_idx]

fig = plt.figure(figsize=(8,8))
ax = fig.gca()
ax.barh(np.arange(len(feature_importances)), feature_importances[importances_sorted_idx])
ax.set_yticks(np.arange(len(feature_importances)))
ax.set_yticklabels(sorted_names)
ax.set_xlim([0,1])
ax.set_xlabel('Feature Importance')
plt.tight_layout()

Step 6: Create factor maps

Finally, we visualize the results of our classification through factor mapping. In the plot below, we show the uncertainty space projected onto the two most influential factors, demand growth rate and restriction efficacy. Each point represents a sampled SOW; red points represent SOWs that resulted in failure, while white points represent SOWs that resulted in success. The color in the background shows the predicted regions of success and failure from the boosted trees classification.

Here we observe that high levels of demand growth are the primary source of vulnerability for the water utility. When restriction efficacy is lower than our estimate (multiplier < 1), the utility faces failures at demand growth levels of about 1.7 times the estimated values. When restriction effectiveness is at or above estimates, the acceptable scaling of demand growth rises to about 1.8.

Taken as a whole, these results provide valuable insights for decision makers. From our original 18 deep uncertainties, we have discovered that two are critical for the success of our water supply management policy. Further, we have defined thresholds within the uncertainty space that delineate scenarios that lead to failure. We can use this information to inform monitoring efforts for the water supply policy, or to inform a new problem formulation that tailors actions to mitigate these vulnerabilities.

##### Factor Mapping #####

# Select the top two factors discovered above
selected_factors = DU_factors[:, [3,0]]

# Fit a classifier using only these two factors
gbc_2_factors = GradientBoostingClassifier(n_estimators=200,
                                 learning_rate=0.1,
                                 max_depth=3)
gbc_2_factors.fit(selected_factors, SD_input)

# plot prediction contours
x_data = selected_factors[:,0]
y_data = selected_factors[:,1]

x_min, x_max = (x_data.min(), x_data.max())
y_min, y_max = (y_data.min(), y_data.max())

# create a grid to makes predictions on
xx, yy = np.meshgrid(np.arange(x_min, x_max * 1.001, (x_max - x_min) / 100),
                        np.arange(y_min, y_max * 1.001, (y_max - y_min) / 100))
                        
dummy_points = list(zip(xx.ravel(), yy.ravel()))

z = gbc_2_factors.predict_proba(dummy_points)[:, 1]
z[z < 0] = 0.
z = z.reshape(xx.shape)

# plot the factor map        
fig = plt.figure(figsize=(10,8))
ax = fig.gca()
ax.contourf(xx, yy, z, [0, 0.5, 1.], cmap='RdBu',
                alpha=.6, vmin=0.0, vmax=1)
ax.scatter(selected_factors[:,0], selected_factors[:,1],\
            c=SD_input, cmap='Reds_r', edgecolor='grey', 
            alpha=.6, s= 100, linewidth=.5)
ax.set_xlim([.5, 2])
ax.set_ylim([.9,1.1])
ax.set_xlabel('Demand Growth Multiplier')
ax.set_ylabel('Restriction Eff. Multiplier')

References

Gold, D. F., Reed, P. M., Gorelick, D. E., & Characklis, G. W. (2022). Power and Pathways: Exploring Robustness, Cooperative Stability, and Power Relationships in Regional Infrastructure Investment and Water Supply Management Portfolio Pathways. Earth’s Future, 10(2), e2021EF002472.

Groves, D. G., & Lempert, R. J. (2007). A new analytic method for finding policy-relevant scenarios. Global Environmental Change, 17(1), 73-85.

Kwakkel, J. H., Walker, W. E., & Haasnoot, M. (2016). Coping with the wickedness of public policy problems: approaches for decision making under deep uncertainty. Journal of Water Resources Planning and Management, 142(3), 01816001.

O’Neill, B. C., Kriegler, E., Riahi, K., Ebi, K. L., Hallegatte, S., Carter, T. R., … & van Vuuren, D. P. (2014). A new scenario framework for climate change research: the concept of shared socioeconomic pathways. Climatic change, 122(3), 387-400.

Reed, P.M., Hadjimichael, A., Malek, K., Karimi, T., Vernon, C.R., Srikrishnan, V., Gupta, R.S., Gold, D.F., Lee, B., Keller, K., Thurber, T.B., & Rice, J.S. (2022). Addressing Uncertainty in Multisector Dynamics Research [Book]. Zenodo. https://doi.org/10.5281/zenodo.6110623

Trindade, B. C., Gold, D. F., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2020). Water pathways: An open source stochastic simulation system for integrated water supply portfolio management and infrastructure investment planning. Environmental Modelling & Software, 132, 104772.

Checkpointing and restoring runs with the Borg MOEA

Introduction

Simulation-optimization is an increasingly popular tool in many engineering fields. As the name implies, simulation-optimization requires two major components: a simulation model of the system we want to optimize (e.g., a water supply system) and an algorithm for iteratively tuning the decision levers within the system to optimize the simulated performance. For example, Multi-Objective Evolutionary Algorithms (MOEAs), such as the Borg MOEA, have been successfully employed across a variety of challenging stochastic multi-objective simulation-optimization problems in water resources and other fields. However, simulation-optimization can require millions of iterative simulation trials in order to adequately explore the decision and uncertainty spaces, especially under multi-objective framings. Given the ever-increasing size and complexity of simulation models, this can become a significant computational challenge.

Running the MOEA in parallel on high-performance computing clusters can greatly alleviate this challenge, but not eliminate it. For example, in my last blog post on hybrid parallelization of the Borg MOEA, I gave the example of optimizing infrastructure investments using a simulation model that takes 10 minutes to run a 30-year realization. Let’s assume we run 48 Monte Carlo realizations per function evaluation (FE) to account for natural hydrologic variability. Even using the maximum job size of 128 nodes on the Stampede2 supercomputer, with 48 CPU each, we will only be able to complete ~20,000 FE per 24 hours of wall clock. To complete 100,000 FE, a relatively low target compared to other similarly-complex studies, the MOEA will need to run for ~5 days. This raises a major problem: many computing clusters have strict wall clock limits. On Stampede2, for example, each submission can run for a maximum of 48 hours. As a result, we would be limited to ~40,000 FE for a single Stampede2 run, which may be insufficient based on the difficulty of the problem.

Fortunately, the Borg MOEA’s “checkpointing” and “restoring” functionalities allow us to circumvent this issue. Checkpointing is the process of saving “snapshots” of the MOEA’s internal state at regular intervals, and restoring is the act of reading a snapshot into memory at the beginning of a run and continuing the optimization from that starting point. This allows us to effectively split a 5-day optimization into three sequential submissions, in order to comply with a 2-day wall clock limit. This functionality can also be helpful for iteratively assessing convergence, since we don’t necessarily know a priori how many FE will need to be run.

In this blog post, I will demonstrate the use of checkpointing and restoring with the Borg MOEA using a less computationally-demanding example: the Python-based DPS Lake Problem. However, the methods and results described here should be applicable to much larger experiments such as the one described above. See my last blog post and references therein for more details on the Python implementation of the DPS Lake Problem.

Setup

For this post, I ran the following short proof of concept: first I ran five random seeds of the Borg MOEA for 5 minutes each. Each seed was assigned 2 nodes (16 CPU each) on the Cube cluster at Cornell. Each seed completed approximately 6100 FE in this first round. Next, I ran a second round in which each seed began from the 6100-FE checkpoint and ran for another 5 minutes. All code for this post, along with instructions for downloading the correct branch of the Borg MOEA, can be found in this GitHub repository.

Setting up checkpointing only takes a few extra lines of code in the file “borg_lake_msmp.py”. First, we need to define the baseline file name used to store the checkpoint files:

newCheckpointFileBase = 'results/checkpoints/seed' + str(seed) 

Within the simulation, checkpoints will be made at regular intervals. The checkpoint file names will automatically have the FE appended; for example, the checkpoint for seed 5 after 6100 FE is written to “results/checkpoints/seed5_nfe6100.checkpoint”. The write frequency for both checkpoints and runtime metrics can be changed based on the “runtimeFrequency” parameter in “submit_msmp.sh”.

We can also provide a previous checkpoint to read in at the beginning of the run:

if maxEvalsPrevious > 0:
    oldCheckpointFile = 'results/checkpoints/seed' + str(seed) + '_nfe' + str(maxEvalsPrevious) + '.checkpoint'

where “maxEvalsPrevious” is the number of FE for the prior checkpoint that we want to read in. This parameter is input in “submit_msmp.sh”. The previous checkpoint file must be placed in the “checkpoints” folder prior to runtime.

We provide both checkpoint file names to the Borg MOEA, along with other important parameters such as the maximum number of evaluations.

result = borg.solveMPI(maxEvaluations=maxEvals, runtime=runtimeFile, frequency=runtimeFreq, newCheckpointFileBase=newCheckpointFileBase, oldCheckpointFile=oldCheckpointFile, evaluationFile=evaluationFile)

The Python wrapper for the Borg MOEA will use this command to initialize and run the MOEA across all available nodes. For more details on the implementation of the experiment, see the GitHub repository above.

The checkpoint file

So what information is saved in the checkpoint file? This file includes all internal state variables needed to initialize a new instance of the Borg MOEA that begins where the previous run left off. As an example, I will highlight select information from “seed5_nfe6100.checkpoint”, the last checkpoint file (at 6100 FE) for seed 5 after the first round.

First, the file lists basic problem formulation information such as the number of objectives, the decision variable bounds, and the epsilon dominance parameters:

Version: 108000
Number of Variables: 6
Number of Objectives: 4
Number of Constraints: 1
Lower Bounds: -2 0 0 -2 0 0
Upper Bounds: 2 2 1 2 2 1
Epsilons: 0.01 0.01 0.0001 0.0001
Number of Operators: 6

Next, it lists current values for several parameters which evolve over the course of the optimization, such as the selection probabilities for the six recombination operators, the number of times the algorithm has triggered a “restart” due to stalled progress, and the number of solutions in the population:

Operator Selection Probabilities: 0.93103448275862066 0.0086206896551724137 0.017241379310344827 0.025862068965517241 0.0086206896551724137 0.0086206896551724137
Number of Evaluations: 6100
Tournament Size: 3
Window Size: 200
...
Number of Restarts: 0
...
Population Size: 196
Population Capacity: 196

The file then lists each of the 196 solutions currently stored in the population:

Population:
 0.52276605276517329 1.1277550312248441 0.57174902357263913 -0.053873968914707165 1.5491763609435476 0.68045375364276195 -0.2810269062318374 0.18479149938317199 -1 -1 0 0
...
 0.27892075855585108 0.45130064043648516 0.77550656566943632 0.31799415686665755 1.0881848609123497 0.60273027430733062 -0.25193459430454329 0.16022123986158276 -0.98989898989898994 -1 0 0

Each line in this section represents a different solution; the numbers represent the 6 decision variables, 4 objectives, and 1 constraint for this problem, as well as the index of the operator used to create each solution.

Similarly, the file lists the current archive, which is the subset of 64 solutions from the larger population that are non-dominated:

Archive Size: 64
Archive:
 0.65953806401594717 1.3878689423124047 0.5978484500419472 0.15387554544946802 1.1931240566680981 0.62731909408989983 -0.32254774259834024 0.21682247784367689 -1 -1 0 0
...
 0.28214763078942073 0.44992540083980864 0.77495402511037126 0.1548097106329358 1.0881858398758053 0.6850068814635788 -0.2456560390454271 0.15760082516545762 -0.98989898989898994 -1 0 0

The checkpointing file then lists several parameters associated with the most recently used recombination operators:

Number of Improvements: 194
Recency List Size: 50
Recency List Position: 47
Recency List: 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

We see that after 6100 FE, the algorithm is almost exclusively using the operator with index 0; this is consistent with operator 0’s 93% selection probability listed earlier in the file.

Finally, the checkpoint file closes with the current state of the random number generator:

RNG State Length: 624
RNG State: 2103772999 2815404811 3893540187 ...
RNG Index: 200
Next Gaussian: -0.13978937466755278
Have Next Gaussian: 0

Results

So does it work? Figure 1 shows the convergence of the five seeds according to six different measures of quality for the evolving archive. Performance for all seeds is found to be continuous across the two rounds, suggesting that the checkpointing and restoring is working properly. Additionally, Figure 2 demonstrates the continuity of the auto-adaptive selection probabilities for the Borg MOEA’s different recombination operators.

Figure 1: Performance of 5 random seeds across 2 rounds, quantified by six different measures of approximate Pareto set quality.
Figure 2: Adaptive selection probabilities for the Borg MOEA’s six recombination operators.

The checkpointing process thus allows for continuity of learning across rounds: not only learning about which solutions are most successful (i.e., continuity of the population and archive), but also learning about which strategies for generating new candidate solutions are most successful (i.e., continuity of the recombination operator histories and probabilities). This capability allows for seamlessly sequencing separate “blocks” of function evaluations in situations where wall clock limits constrain individual submissions, or where the convergence behavior of the problem is not well understood a priori.

Using Python, Sphinx, and reStructuredText to Create a Book (and Introducing our eBook: Addressing Uncertainty in Multisector Dynamics Research!)

We are pleased to announce that we have published our free eBook on Addressing Uncertainty in Multisector Dynamics Research, a primer on uncertainty characterization methods for research on sustainability, climate, and human-natural systems: https://lnkd.in/d6AMPnM3

This eBook is the product of years of collaboration across several teams that brings in perspectives from natural sciences, decision making under (deep) uncertainty, software engineering, statistics, and risk analysis. The breakdown of the book is as follows (summarized by Antonia Hadjimichael):

Chapter 1 uses the IM3 project as a living lab to encapsulate the challenges that emerge in bridging disciplines to make consequential model-based insights while acknowledging the tremendous array of uncertainties that shape them.

Chapter 2 helps the reader to better understand the importance of using diagnostic modeling and the diverse disciplinary perspectives that exist on how best to pursue consequential model-based insights.

Chapter 3 is a technical tools-focused primer for readers on the key elements of uncertainty characterization that includes ensemble-based design of experiments, quantitative methods for computing global sensitivities, and a summary of existing software packages.

Chapter 4 narrates for readers how and why the tools from the previous chapter can be applied in a range of tasks from diagnosing model performance to formal exploratory modeling methods for making consequential model-based discoveries.

The supplemental appendices provide a glossary, a brief summary of uncertainty quantification tools, and a suite of Jupyter notebook tutorials that provide hands-on training tied to the contents of Chapters 3 and 4.

A central theme of this eBook is that it is a living document that will be actively maintained through its GitHub repository. If you work on exploratory modeling, sustainability, management of human-natural systems, complex systems or adjacent areas we’d love your feedback on how the document could be improved! You can open an issue here.

Over the next few months, we will be posting more about the process of making a book and extending out some of the Jupyter Notebook coding exercises to get readers more comfortable with adjusting the source code for their own use. 

In this post, Chris Vernon and I will be discussing how you can create a similar style of eBook or documentation using Python and Sphinx. Subsequent posts will contain more information on how the book is hosted and managed in GitHub. Please note that this is a brief overview of some of the aspects of the eBook, but for full details, the source code is available here. You can find the source files for this tutorial in this repo and download the index.html file to click through the HTML page interactively.

Introduction to Sphinx

Sphinx is a tool written in Python that can facilitate the creation of a variety of documentation. Sphinx will translate a set of reStructuredText (reST) files that are linked in a hierarchy defined in an index file. First download and install Sphinx as follows (Note: the following pip commands assume that your version of pip is pointing to the instance of Python you want to use):

$pip install -U sphinx

To quickly set up a directory, execute the following from your terminal or command line while in your desired parent directory:

$sphinx-quickstart

This will create a build and source directory along with a makefile (a Windows Batch File: make.bat), and within the source directory, you will find an index.rst file and a file called conf.py. The index.rst file is termed the root documentation and houses the table of contents which organizes the structure of the document.

Changing the Theme

The default Sphinx theme is “alabaster”, which has a minimalist look. There are many other built-in themes, and examples of documentation that use these themes can be found here. We are using an external theme for the book (found here) that has a sleek design and more of a “book feel”. To download this theme, simply execute the following in your terminal or command line:

$pip install sphinx-book-theme

Then change the theme which is located in conf.py.

html_theme = 'sphinx_book_theme'

Adding Content

To add content to our book, we need to write our text files using reStructuredText, which is the default plaintext markup language used by Sphinx. The syntax is similar to Latex and easy to pick up as you get used to it. We will first add front matter so that when the HTML page is accessed, the front of the book is not empty. We place the front matter in index.rst as follows using a “directive”. Directives begin with an explicit markup start (two periods and a space), followed by the directive type and two colons (collectively, the “directive marker”).  Here we use an epigraph directive to render a classic block quote.

Creating front matter

Let’s build our documentation to see what our book looks like so far. I use a terminal called Cygwin since I am working on a Windows computer and run the following:

$ sphinx-build source/ build/

I can then open the resulting index.html located in the build directory by dragging and dropping it into my favorite browser or simply double clicking the file. 

eBook homepage

Now let’s add more sections to my book. I’m going to create a preface, a single section of material, and a bibliography. I first create a text file in my source directory that I call preface.rst. I add text to this file as shown below.

Preface Content

Then I add a section to the index.rst file that tells Sphinx where the preface should be placed in the book through the toctree directive (“table of contents tree”). We customize the table of contents to contain multiple sections and thus adjust the default toctree using the raw directive.

Updated Table of Contents

Next we create a section of content. In this section, I will demonstrate paragraph text, citations, images, notes, and equations, which cover the gamut of what you’ll likely want to include in various parts of your book. Let’s first install the necessary extensions for the bibliography as follows:

$pip install sphinxcontrib-bibtex

We then must add the extension to conf.py. I also place the name of the file that contains my references.

Adding bibliography extensions and file names to conf.py
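
For reference, the conf.py additions shown above might look roughly like the following minimal sketch (assuming the BibTeX entries live in a file named refs.bib in the source directory):

# conf.py
extensions = [
    'sphinxcontrib.bibtex',   # enables the bibliography and citation directives
]

# BibTeX file(s) containing the references (required by sphinxcontrib-bibtex version 2.0 and later)
bibtex_bibfiles = ['refs.bib']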

Below is an example of what refs.bib looks like along with the Bibliography file.

Bibliography .rst file and associated BibTeX

R.Bibliography.rst and refs.bib must be placed in the source directory, and R.Bibliography.rst should be referenced in the index.rst file as follows.

Updated Table of Contents

Now let’s start writing our section and then place the reference to it in our index.rst file as shown above. In this section, note the first line that denotes the reference to the section that is used in the index.rst file. We show how to place a figure in lines 14-20 and the syntax for in-text citations is shown in the figure caption.   

If we render the webpage (e.g., sphinx-build source/ build/) it looks like this:

Updated webpage

The bibliography has the one reference so far:

Bibliography as seen on the webpage

The last two components that I will demonstrate are equations and notes, which may come in handy for scientific communications. In order to write equations, we use the MathJax Sphinx extension which allows for the best display of math in HTML.

Updated extensions
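
In conf.py, this is just a matter of adding Sphinx’s built-in MathJax extension to the extensions list; a minimal sketch:

# conf.py
extensions = [
    'sphinxcontrib.bibtex',
    'sphinx.ext.mathjax',   # render equations as MathJax in the HTML output
]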

In our same section, we add additional text to show the equation and note functionality.

Adding equations and notes

We can then render the webpage one last time.

Updated webpage

Well, there you go: after completing this tutorial, you will have the foundations of a book in place. Feel free to use these files as a base for your own project! Enjoy!

Running a Python script using Excel macros

Why run a Python script through Excel? Why bother with the middleman when environments such as Spyder and Jupyter Notebook exist?

Something that I have been learning of late is the importance of diversifying methods of presenting one’s work in the spirit of open science, communicability and inclusion. In that line of thinking, running a Python script via an Excel ‘user interface’ addresses two issues:

  1. Excel VBA’s slower reading and writing of data
  2. The steep learning curve associated with learning how to code in Python

In short, executing Python via Excel provides those with sufficient experience in the language with an avenue to efficiently communicate and visualize their data in a way that most people can see and understand. This blog post will demonstrate this point by extracting the mean and standard deviation of the sepal length and width, as well as the petal length and width, of three different iris subspecies available from the famed Iris dataset, which can be found here.

Download the dataset, called ‘iris.data’ into a folder or location of your choice. Next, change the extension of the file to ‘.csv’. Let’s start processing this in Python!

1. Writing the Python script

Here, we will be writing a Python script to generate two files: the means (iris_means.csv) and standard deviations (iris_std.csv) of each iris subspecies’ attributes. We will first write the Python script to do so:

import pandas as pd

# the types of data within the iris dataset
col_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "subspecies"]

iris_data = pd.read_csv('C:/your-preferred-location/iris.csv', 
                        sep=',', header=None, names=col_names, index_col=None)

#%%
iris_setosa = iris_data[iris_data['subspecies']=='Iris-setosa']
iris_setosa=iris_setosa.drop(['subspecies'], axis=1)

iris_versicolor = iris_data[iris_data['subspecies']=='Iris-versicolor']
iris_versicolor=iris_versicolor.drop(['subspecies'], axis=1)

iris_virginica = iris_data[iris_data['subspecies']=='Iris-virginica']
iris_virginica=iris_virginica.drop(['subspecies'], axis=1)

#%%
mean_setosa = iris_setosa.mean(axis=0)
std_setosa = iris_setosa.std(axis=0)

mean_versicolor = iris_versicolor.mean(axis=0)
std_versicolor = iris_versicolor.std(axis=0)

mean_virginica = iris_virginica.mean(axis=0)
std_virginica = iris_virginica.std(axis=0)

subspecies = ['Setosa', 'Versicolor', 'Virginica']

mean_vals = pd.concat([mean_setosa, mean_versicolor, mean_virginica], axis=1)
mean_vals.columns=subspecies

std_vals = pd.concat([std_setosa, std_versicolor, std_virginica], axis=1)
std_vals.columns=subspecies

mean_vals.to_csv('C:/Users/your-preferred-location/iris_means.csv', 
                 sep=',', header=True, index=True)
std_vals.to_csv('C:/Users/your-preferred-location/iris_std.csv', 
                sep=',', header=True, index=True)

2. Write the Excel VBA macro

We will first set up the Excel document that will execute the Excel macro. Create an Excel worksheet file. Mine is named ‘iris_GUI.xlsx’. Next, navigate to the ‘File’ tab and select ‘Options’. Go to ‘Customize Ribbon’ and make sure that ‘Developer’ is checked:

Figure 1: Enabling the Developer tab.

Click ‘OK’. The developer tab should now be visible in your Toolbar Ribbon:

Figure 2: Where the Developer tab is located.

Let’s get to the macro! Under the Developer tab, identify the ‘Macros’ tool on the far-left side of the toolbar. Select it and give your macro a suitable name. I called mine ‘link_python_excel’.

Figure 3: Name and create your macro.

Once this is done, click ‘Create’. Next, you should see a window like this pop up:

Figure 4: This is where you will write your macro.

Within the provided space, first initialize the macro using Sub link_python_excel(). This tells Excel VBA (Excel’s programming language) that you are about to write a macro called ‘link_python_excel’.

Next, declare your macro as an object, and your Python executable and Python script as strings. This is to enable VBA to locate the Python executable and use it to run the script as intended.

Dim objShell As Object
Dim PythonExe, PythonScript As String

You will then want to assign a macro object to its declaration:

Set objShell = VBA.CreateObject("Wscript.shell")

Please do not tamper with the “Wscript.shell” term. This assignment is the portion of the code that enables the macro to interact with Windows PowerShell, thus enabling VBA to execute the Python script. More information on this matter can be found at this website.

Following this, provide the filepath to the Python executable and the Python script:

PythonExe = """C:\Users\lbl59\AppData\Local\Programs\Python\Python39\python.exe"""
PythonScript = "C:\Users\lbl59\Desktop\run_python_in_excel\process_iris_data.py"

Note the use of the triple quotation marks. This method of assigning a string in VBA is used when the string potentially contains spaces. It is generally considered good practice to use “””…””” for file paths.

Finally, run your Python script and activate your workbook. The activation is necessary if you would like to run the script via a button in Excel, which we shall be going through in a bit.

objShell.Run PythonExe & " " & PythonScript
Application.Goto Reference:="link_python_excel"

Finally, don’t forget to end the macro using End Sub.

Overall, your script should look as such:

Sub link_python_excel()

' link_python_excel Macro
' Declare all variables
Dim objShell As Object
Dim PythonExe, PythonScript As String
    
    'Create a new Shell Object
    Set objShell = VBA.CreateObject("Wscript.shell")
        
    'Provide the file path to the Python Exe
    PythonExe = """C:\Users\lbl59\AppData\Local\Programs\Python\Python39\python.exe"""
        
    'Provide the file path to the Python script
    PythonScript = "C:\Users\lbl59\Desktop\run_python_in_excel\process_iris_data.py"
        
    'Run the Python script
    objShell.Run PythonExe & " " & PythonScript
    Application.Goto Reference:="link_python_excel"
    
End Sub

3. Run the Python script for Excel

Save the macro. Note that you will have to save the Excel workbook as a ‘.xlsm’ file to enable macro functionality. Once this is done, navigate to the ‘Developer’ tab and select ‘Insert’ and click on the button icon.

Figure 5: Inserting a button into the Excel workbook.

Draw the button the same way you would a rectangular shape. Rename the button, if you so prefer. In this exercise, the button is labeled ‘Run Code’. Next, right click on the button and select ‘Assign Macro’.

Figure 6: Assign the macro.

Once this is selected, you should be able to see the option to add the ‘link_python_excel’ macro to the button. Select the macro, and you are done! The two output files should appear in the output locations specified in your Python script.

Summary

In this post, we walked through the steps of writing a Python script to be run using an Excel macro. First, a Python script was written to process the iris dataset and output two files. Next, the Excel macro was written to execute this script. Finally, the macro was assigned to a button in the Excel sheet, where the Python script can be executed when the button is clicked.

Hope you found this useful!

To parallelize or not to parallelize? On performance of hybrid schemes in Python with the Borg MOEA

Hybrid parallelization is often used when performing noisy simulation-optimization on high-performance computing clusters. One common technique is to divide the available CPUs into a smaller number of “tasks”, each of which uses its subset of CPUs to assess the performance of candidate solutions through parallel evaluation of multiple scenarios. In this post, I will demonstrate how to use Python’s Multiprocessing module to set up a hybrid parallelization scheme with the Borg MOEA, and explore the resulting tradeoff between efficiency and learning rate in order to better understand when one should (and should not) implement such a scheme.

This post will bring together and build upon several previous posts on this blog, including (1) this demonstration by Andrew Dircks for how to use the Borg MOEA’s Python wrapper to optimize multi-objective simulations written in Python, using the Lake Problem as example; (2) this explanation of the Borg MOEA’s hybrid parallelization strategies by David Gold; and (3) this post by Bernardo Trindade on parallelizing Monte Carlo evaluations using Python and C/C++. The present post will give a brief overview of these topics in order to show how they can be brought together, but the reader is referred to the posts above for more detail on any of these topics.

Why hybrid parallelization?

As a motivating example, consider a complex water resources simulation model that takes 10 minutes to run a 30 year simulation. We want to explore the potential for new infrastructure investments to improve multiple objectives such as water supply reliability, cost, etc. We will try to discover high-performing solutions using an MOEA. To account for uncertainty, each candidate investment will be evaluated in 48 different stochastic scenarios.

Given the expense of our simulation model, this will be a major computational challenge. If we evaluate 20,000 candidate solutions (which is quite low compared to other studies), the total run time will be 20000 * 48 * 10 minutes = 6,667 days. Clearly, we cannot afford to run this serially on a laptop. If we have access to a computer cluster, we can parallelize the MOEA search and speed things up drastically. For example, with an XSEDE allocation on TACC Stampede2, I can run jobs with up to 128 Skylake nodes, each of which has 48 CPU, for a total of 6144 CPU per job. With this setup, we could finish the same experiment in ~1.1 days.

With the simplest possible parallelization strategy, we have a single CPU as the “Master” for the Borg MOEA, and 6143 CPU running individual function evaluations (FE). Each FE is allocated a single CPU, which runs 48 stochastic simulations in serial (i.e., there is no hybrid parallelization taking place). At the other extreme, I could set the MOEA to run with an entire node per “task”. This would allocate a full node to the Borg Master process, plus 127 nodes running individual FEs. Each FE would distribute its 48 stochastic simulations across its 48 CPUs in parallel. Thus, we are running fewer concurrent FEs (127 vs. 6143), but each finishes much faster (10 minutes vs. 480 minutes). In between these two extremes, there are intermediate parallelization schemes that allocate 2, 3, 4, 6, 8, 12, 16, or 24 CPU per task (Table 1).

So how to choose a parallelization strategy? The first thing to consider is efficiency. In theory, the scheme without hybrid parallelization (1 CPU per task) is the most efficient, since it allocates only one CPU to the Borg Master process, allowing for 6143 CPU to be simultaneously performing simulations. The 48 CPU/task scheme, on the other hand, allocates an entire node to the Borg Master (even though it only needs a single CPU to perform its duties), so that only 127*48=6096 CPU are running simulations at a given time. Thus, it should be ~1% slower in theory. However, in practice, a variety of factors such as communication or I/O bottlenecks can change this behavior (see below), so it is important to run tests on your particular machine.

If the hybrid parallelization scheme is slower, why should we even consider it? The key point is the increased potential for “learning” within the MOEA loop. Remember that the 1 CPU/task scheme has 6143 FE running concurrently, while the 48 CPU/task scheme has 127 FE. This means that the former will only have approximately 20000/6143=3.3 “generations,” while the latter will have 20000/127=157.5 generations*. Put another way, each FE will take 480 minutes (30% of total 1.1 days) to complete for the 1 CPU/task scheme, compared to 10 minutes (0.6% of total) for the 48 CPU/task scheme. The latter thus provides a more continuous stream of information to the MOEA Master, and therefore many more opportunities for the MOEA to learn what types of solutions are successful through iterative trial and error.

*Note: The Borg MOEA is a steady-state algorithm rather than a generational algorithm, so I use the term “generation” loosely to refer to the number of FE that each task will complete. See Dave Gold’s blog post on the Borg MOEA’s parallelization for more details on generational vs. steady-state algorithms. Note that for this post, I am using the master-worker variant of the Borg MOEA.

Function evaluations (FE) | Total CPU | Tasks/node | CPU/task | Concurrent FE | Generations
20000 | 6144 | 48 | 1  | 6143 | 3.3
20000 | 6144 | 24 | 2  | 3071 | 6.5
20000 | 6144 | 16 | 3  | 2047 | 9.8
20000 | 6144 | 12 | 4  | 1535 | 13.0
20000 | 6144 | 8  | 6  | 1023 | 19.6
20000 | 6144 | 6  | 8  | 767  | 26.1
20000 | 6144 | 4  | 12 | 511  | 39.1
20000 | 6144 | 3  | 16 | 383  | 52.2
20000 | 6144 | 2  | 24 | 255  | 78.4
20000 | 6144 | 1  | 48 | 127  | 157.5
Table 1: Scaling of concurrent function evaluations and generations for different hybrid parallelization schemes, using 128 Skylake nodes (48 CPU each) on Stampede2.
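
The Concurrent FE and Generations columns in Table 1 follow directly from the job geometry described above: one task is reserved for the Borg Master, and the number of generations is simply the total FE divided by the number of concurrent FE. A quick back-of-the-envelope sketch of that arithmetic:

# reproduce the Concurrent FE and Generations columns of Table 1
nodes, cpu_per_node, total_fe = 128, 48, 20000

for cpu_per_task in [1, 2, 3, 4, 6, 8, 12, 16, 24, 48]:
    tasks_per_node = cpu_per_node // cpu_per_task
    concurrent_fe = nodes * tasks_per_node - 1   # one task is reserved for the Borg Master
    generations = total_fe / concurrent_fe
    print(cpu_per_task, concurrent_fe, round(generations, 1))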

A simple example: hybrid parallelization of the Lake Problem in Python

Above, I have outlined a purely theoretical tradeoff between efficiency and algorithmic learning. How does this tradeoff play out in practice? What type of hybrid scheme is “best”? I will now explore these questions in more detail using a smaller-scale example. The Lake Problem is a popular example for demonstrating concepts such as non-linear dynamics, deep uncertainty, and adaptive multi-objective control; a search for “Lake Problem” on the Water Programming Blog returns many results (I stopped counting at 30). In this problem, the decision-maker is tasked with designing a regulatory policy that balances the economic benefits of phosphorus with its environmental risks, given the large uncertainty in the complex dynamics that govern lake pollution concentrations over time. In a 2017 article in Environmental Modelling & Software, Julie Quinn demonstrated the merits of Evolutionary Multi-Objective Direct Policy Search (EMODPS) for developing control policies that adaptively update phosphorus releases based on evolving conditions (e.g., current lake concentration).

A Python version of the EMODPS Lake Problem was introduced along with a Python wrapper for the Borg MOEA in this blog post by Andrew Dircks. Additionally, this post by Bernardo Trindade demonstrates how to parallelize Monte Carlo ensembles in Python as well as C/C++. For the present example, I combined aspects of these two posts, in order to parallelize the Monte Carlo evaluations within each function evaluation, using the Multiprocessing module in Python. For this blog post, I will highlight a few of the important differences relative to the non-parallel version of the problem in Andrew Dircks’ blog post, but the interested reader is referred to these previous two posts for more detail. All code used to produce these results can be found in this GitHub repository.

The lake_mp.py file specifies the analysis that the Borg MOEA should run for each function evaluation – we will evaluate each candidate operating policy using an ensemble of 48 simulations, each of which uses a different 100-year stochastic realization of hydrology. This ensemble will be divided among a smaller number of CPUs (say, 16) assigned to each function evaluation. Thus, we will have to set up our code a bit differently than in the serial version. First, consider the overall “problem” function which is called by the Borg Master. Within this function, we can set up shared arrays for each objective. The Array class from the multiprocessing module allows for thread-safe access to the same memory array from multiple CPUs. Note that this module requires shared memory (i.e., the CPUs must be on the same node). If you want to share memory across multiple nodes, you will need to use the mpi4py module, rather than multiprocessing.

from multiprocessing import Process, Array
# create shared memory arrays which each process can access/write.
discounted_benefit = Array('d', nSamples)
yrs_inertia_met = Array('i',  nSamples)
yrs_Pcrit_met = Array('i', nSamples)
average_annual_P = Array('d', nSamples*nYears)

Now we need to figure out which simulations should be performed by each CPU, and dispatch those simulations using the Process class from the multiprocessing module.

# assign MC runs to different processes
nbase = int(nSamples / nProcesses)
remainder = nSamples - nProcesses * nbase
start = 0
shared_processes = []
for proc in range(nProcesses):
    nTrials = nbase if proc >= remainder else nbase + 1
    stop = start + nTrials
    p = Process(target=dispatch_MC_to_procs, args=(proc, start, stop, discounted_benefit, yrs_inertia_met, yrs_Pcrit_met, average_annual_P, natFlow, b, q, critical_threshold, C, R, newW))

    shared_processes.append(p)
    start = stop

# start processes
for sp in shared_processes:
    sp.start()

# wait for all processes to finish
for sp in shared_processes:
    sp.join()

The “args” given to the Process function contains information needed by the individual CPUs, including the shared memory arrays. Once we have appended the processes for each CPU, we start them, and then wait for all processes to finish (“join”).

The “target” in the Process function tells each CPU which subset of Monte Carlo samples to evaluate, and runs the individual evaluations in a loop:

#Sub problem to dispatch MC samples from individual processes
def dispatch_MC_to_procs(proc, start, stop, discounted_benefit, yrs_inertia_met, yrs_Pcrit_met, average_annual_P, natFlow, b, q, critical_threshold, C, R, newW):
    lake_state = np.zeros([nYears + 1])
    Y = np.zeros([nYears])
    for s in range(start, stop):
        LakeProblem_singleMC(s, discounted_benefit, yrs_inertia_met, yrs_Pcrit_met, average_annual_P, lake_state, natFlow[s, :], Y, b, q, critical_threshold, C, R, newW)

Within the loop, we call LakeProblem_singleMC to run each Monte Carlo evaluation. After running the simulation, this function can store its sub-objectives in the shared arrays:

# fill in results for MC trial in shared memory arrays, based on index of this MC sample (s)
average_annual_P[(s*nYears):((s+1)*nYears)] = lake_state[1:]
discounted_benefit[s] = db
yrs_inertia_met[s] = yim
yrs_Pcrit_met[s] = yPm

Finally, back in the main “problem” function, we can evaluate the objectives by aggregating over the shared memory arrays:

# Calculate minimization objectives (defined in comments at beginning of file)
objs[0] = -1 * np.mean(discounted_benefit)  #average economic benefit
objs[1] = np.max(np.mean(np.reshape(average_annual_P, (nSamples, nYears)), axis=0))  #minimize the max average annual P concentration
objs[2] = -1 * np.sum(yrs_inertia_met) / ((nYears - 1) * nSamples)  #average percent of transitions meeting inertia threshold
objs[3] = -1 * np.sum(yrs_Pcrit_met) / (nYears * nSamples)  #average reliability
constrs[0] = max(0.0, reliability_threshold - (-1 * objs[3]))

time.sleep(slp)

In the last line, I add a half-second sleep after each function evaluation in order to help this example better mimic the behavior of the much slower motivating problem at the beginning of this post. Without the sleep, the Lake Problem is much faster than larger simulation codes, and therefore could exhibit additional I/O and/or master-worker interaction bottlenecks.

Again, the interested reader is referred to the GitHub repository, as well as Andrew Dircks’ original Lake Problem blog post, for more details on how these code snippets fit into the broader program.

This experiment is run on the Cube cluster at Cornell University, using 12 nodes of 16 CPU each. I tested 5 different hybrid parallelization schemes, from 1 to 16 CPU per task (Table 2). Each formulation is run for 5 seeds to account for the random dynamics of evolutionary algorithms. For this experiment, I will show results through 800 function evaluations. Although this is much smaller than the standard practice, this short experiment allows us to mimic the behavior of the much larger experiment introduced at the beginning of this post (compare the number of Generations in Tables 1 & 2).

Function evaluations (FE) | Total CPU | Tasks/node | CPU/task | Concurrent FE | Generations
800 | 192 | 16 | 1  | 191 | 4.2
800 | 192 | 8  | 2  | 95  | 8.4
800 | 192 | 4  | 4  | 47  | 17.0
800 | 192 | 2  | 8  | 23  | 34.8
800 | 192 | 1  | 16 | 11  | 72.7
Table 2: Scaling of concurrent function evaluations and generations for different hybrid parallelization schemes, using 12 nodes (16 CPU each) on TheCube.

How do different parallelization schemes compare?

Figure 1 shows the performance of the 1 and 16 CPU/task schemes, in terms of hypervolume of the approximate Pareto set. The hybrid parallelization scheme (16 CPU/task) is found to outperform the serial scheme (1 CPU/task). In general across the five seeds, the parallelized scheme is found to improve more quickly and to reach a higher overall hypervolume after 800 FE. The intermediate parallelization schemes (2, 4, and 8 CPU/task) are found to perform similarly to the 16 CPU/task scheme (see the GitHub repository for figures).

Figure 1: Hypervolume vs. number of function evaluations for parallelization schemes with 1 and 16 CPU per function evaluation.

Figure 2 shows the number of completed function evaluations over time, which helps illuminate the source of this advantage. The serial formulation shows a staircase pattern: large chunks of time with no progress, followed by short bursts with many FE finishing all at once. Overall, the FE arrive in 5 major bursts (with the last smaller than the rest), at approximately 25, 50, 75, 100, and 125 seconds. As we move to 2, 4, 8, and 16 CPU/task, the number of “stairs” is seen to double each time, so that the 16 CPU/task scheme almost approximates a straight line. These patterns are consistent with the generational pattern shown in Table 2.

Figure 2: Runtime vs. number of function evaluations for parallelization schemes with 1, 2, 4, 8, and 16 CPU per function evaluation.

So why does this matter? Elitist evolutionary algorithms use the relative performance of past evaluations to influence the random generation of new candidate solutions. Because a hybrid parallelization strategy evaluates fewer solutions concurrently and has more “generations” of information, there will be more ability to learn over the course of the optimization. For example, with the serial scheme, the first generation of 191 candidate solutions is randomly generated with no prior information. The 16 CPU/task scheme, on the other hand, only produces 11 solutions in the first generation, and has had ~17 generations worth of learning by the time it reaches 191 FE.

Additionally, the Borg MOEA has a variety of other auto-adaptive features that can improve the performance of the search process over the course of a single run (see Hadka & Reed, 2013). For example, Borg has six different recombination operators that can be used to mate and mutate previously evaluated solutions in order to generate new candidates for evaluation. The MOEA adaptively updates these selection probabilities in favor of the operators which have been most successful at producing successful offspring. The hybrid parallelization allows these probabilities to be adapted earlier and more regularly due to the more continuous accumulation of information (Figure 3). This can improve the performance of the MOEA.

Figure 3: Operator probabilities vs number of function evaluations (NFE) for parallelization schemes with 1 and 16 CPU per function evaluation. The subplots represent the auto-adaptive selection probabilities for the six recombination operators used by the Borg MOEA: differential evolution (DE), parent-centric crossover (PCX), simulated binary crossover (SBX), simplex crossover (SPX), uniform mutation (UM), and unimodal normal distribution crossover (UNDX).

What happens in longer runs?

At this point, it may seem that hybrid parallelization is always beneficial, since it can increase the algorithm’s adaptability (Figure 3) and improve performance (Figure 1) for a fairly negligible computational cost (Figure 2). However, the experiments in Tables 1 & 2 are very constrained in terms of the number of generations that are run. In most contexts, we will have more function evaluations and/or fewer nodes, so that the ratio of nodes to FE is much smaller and thus the number of generations is much larger. In these situations, do the advantages of hybrid parallelization hold up?

Figure 4 shows the results of a longer experiment, which is identical to above except for its use of 10,000 FE rather than 800 FE. Over the longer experiment, the advantage of the parallelized scheme (16 CPU/FE) is found to diminish, with random seed variability becoming dominant. With more generations, the differences between the two schemes seems to fade.

Figure 4: Same as Figure 1, but for a longer experiment of 10,000 function evaluations.

Figure 5 shows the evolution of the recombination operator selection probabilities over the longer experiment. The serial scheme still lags in terms of adaptation speed, but by the end of the experiment, the two formulations have very similar behavior.

Figure 5: Same as Figure 3, but for a longer experiment of 10,000 function evaluations.

Lastly, Figure 6 shows the runtime as a function of FEs for the two formulations. The parallelized scheme takes approximately 15% longer than the serial scheme. We can compare this to the theoretical slowdown based on the Borg Master overhead: the 16 CPU/task scheme runs 11 concurrent FE, each using 16 CPU, for a total of 176 CPU, while the serial version runs 191 concurrent FE, each using 1 CPU. 191 is 9% larger than 176. The difference between 15% and 9% indicates additional bottlenecks, such as writing to the shared array. This overhead may vary based on the characteristics of the cluster and the computational experiment, so testing should be performed prior to running any large experiment.

Figure 6: Same as Figure 2, but for a longer experiment of 10,000 function evaluations.

Given all this information, what to do? Should you use hybrid parallelization for your function evaluations? The answer, unsatisfyingly, is “it depends.” The most important factor to consider is the number of generations for your optimization, calculated as the total number of FE divided by the number of concurrent FE. If this number is relatively high (say, >10-20), then you may be better off sticking with serial FEs in order to improve efficiency. If the number of generations is small, it may be worthwhile to introduce hybrid parallelization in order to stimulate adaptive learning within the MOEA. Where to draw the line between these two poles will likely vary based on a number of factors, including the number of nodes you are using, the computational architecture of the cluster, the runtime of each function evaluation, and the difficulty of the optimization problem, so testing is recommended to decide what to do in any given scenario.

Clustering basics and a demonstration in clustering infrastructure pathways

Introduction

When approaching a new dataset, the complexity may be daunting. If patterns in the data are not readily apparent, one potential first-step would be to cluster the data and search for groupings. Clustering data, or cluster analysis, can reveal relationships between the observations and provide insight into the data.

Figure: Graphical representation of how clustering can be used to identify patterns in a data set.

Motivation

The outcome of a cluster analysis can be highly dependent upon the clustering algorithm being used, the clustering criteria, and the specified number of clusters to be found. Anticipating these influences and how to manipulate them will set you up for success in your data analysis.

In this post, I introduce two fundamental clustering methods: hierarchical clustering and K-Means clustering. Then I will share some important considerations, such as how to select a suitable number of clusters. Finally, I will demonstrate the power of clustering by applying these methods to an infrastructure pathways dataset, clustering pathways generated for 1,000 different states of the world into sets of ‘light’, ‘moderate’, and ‘heavy’ infrastructure development.

Let’s begin!

Methods

Hierarchical Clustering

Hierarchical clustering can be performed “bottom-up”, through an agglomerative approach, which starts by placing each observation in its own cluster and then successively links the observations into increasingly larger clusters. Alternatively, the clustering can take a “top-down”, or divisive, approach which begins with all observations within a single cluster and partitions the set into smaller clusters until each observation exists within its own cluster.

Figure: A graphical representation of the hierarchical clustering process. Red circles denote clusters. If read from left to right, then the figure is representative of the agglomerative approach. If read from right to left, then the figure is representative of the divisive approach.

Specifying how clusters are preferentially assigned will influence the final shape and contents of the clusters. When performing hierarchical clustering, the user must specify the linkage criterion, which determines how the distance between clusters is measured, and consequently which clusters will be grouped during the next iteration. Three common linkage criteria are:

Linkage Type | Description | Formulation
Complete (max) | The maximum distance between observations in one group and observations in another group. | max{ d(a, b) : a in A, b in B }
Single (min) | The minimum distance between observations in one group and observations in another group. | min{ d(a, b) : a in A, b in B }
Average | The average distance between observations in one group and observations in another group. | average{ d(a, b) : a in A, b in B }

Additionally, when calculating the linkage criteria, different methods of measuring the distance can be used. For example, when working with correlated data, it may be preferable to use the Mahalanobis distance over the regular Euclidean norm.
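As a rough illustration of these choices, here is a minimal sketch on toy data (not the pathways dataset used later) showing how the linkage criterion and distance metric can be specified in scikit-learn’s AgglomerativeClustering; for a less common distance such as Mahalanobis, one option is to precompute the pairwise distance matrix with scipy and pass it in as a precomputed metric. Note that older versions of scikit-learn name the metric argument ‘affinity’.

# A minimal sketch on toy data showing how the linkage criterion and distance
# metric can be specified (older scikit-learn versions call 'metric' 'affinity')
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(seed=1)
X = rng.random((100, 4))  # 100 toy observations with 4 features

# Average linkage with a Manhattan (L1) distance
model = AgglomerativeClustering(n_clusters=3, linkage='average', metric='manhattan')
labels = model.fit_predict(X)

# For a Mahalanobis distance, precompute the pairwise distance matrix with scipy
# and tell the clustering algorithm that the distances are precomputed
D = squareform(pdist(X, metric='mahalanobis'))
model_mahal = AgglomerativeClustering(n_clusters=3, linkage='complete', metric='precomputed')
labels_mahal = model_mahal.fit_predict(D)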

The results of the clustering analysis can be viewed through a dendrogram, which displays a tree-like representation of the cluster structure. For more detail on dendrograms, and a step-by-step tutorial for creating color-coded dendrograms in R, see Rohini’s (2018) post here.

K-Means Clustering

The K-means algorithm follows three simple steps.

  1. Selection of k centroids within the sample space.

In the traditional naïve K-Means algorithm, centroids do not need to correspond to specific observations within the space. They can be randomly selected, or chosen through more informed methods (see, for example, K-Means++).

2. Assignment of each observation to the cluster with the nearest centroid.

3. Re-define each cluster centroid as the mean of the data points assigned to that centroid.

Steps 2 and 3 of the algorithm are repeated until the centroids begin to stabilize, such that subsequent iterations produce centroids located within some small, tolerable distance from one another.

Figure: A graphical representation of the k-means clustering process, with 3 clusters. X’s denote the centroids for that respective cluster. The dashed arrows denote the shift in the location of the centroid during the next iteration. Points are colored according to their assigned cluster during that iteration.

The centroids toward which the algorithm converges may represent a locally optimal clustering solution and can be highly dependent upon the initial centroid selection. Often, this sensitivity is handled by performing the clustering process several times and choosing the set of clusters that yields the smallest within-cluster sum of squares.
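For reference, here is a minimal sketch of how this might look with scikit-learn’s KMeans on placeholder data (the data and parameter values below are only for illustration): the n_init argument repeats the clustering from different starting centroids and keeps the run with the smallest within-cluster sum of squares, which is reported by the inertia_ attribute.

# A minimal K-Means sketch on toy data (placeholder values, not the pathways data)
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=42)
X = rng.random((1000, 4))  # 1,000 toy observations with 4 features

# n_init repeats the clustering with different initial centroids and keeps the
# run with the smallest within-cluster sum of squares; 'k-means++' spreads out
# the initial centroid guesses
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroid locations
print(kmeans.inertia_)          # within-cluster sum of squares of the best run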

Selecting the number of clusters

In some contexts, it is possible to know how many clusters are required prior to the analysis. For example, when working with Fisher’s famous iris flower data set, three clusters are needed because there are three types of flowers observed within the data.

In an exploratory context, however, the appropriate number of clusters may not be readily apparent. One method of determining the preferred number of clusters is to test multiple values and evaluate the separation between the resulting clusters. If the number of clusters is well chosen, then the clusters will be well separated from one another and the distances within each cluster will be small.

The Silhouette Coefficient, or silhouette score, is one method of measuring the goodness of fit for a set of clusters. It takes into consideration both the spacing between clusters (inter-cluster distance) and the distance between points within a cluster (intra-cluster distance). The silhouette score is defined as:

s = (b − a) / max(a, b)

Where:

a = average intra-cluster distance,

b = average inter-cluster distance.

The silhouette score ranges from [-1, 1]. A value near +1 suggests that the clusters are far from one another and that distances within each cluster are small, and thus that the chosen number of clusters is well-suited to the data. A value near -1, however, suggests that the clusters poorly describe the patterns in the data.

Example: Clustering Infrastructure Pathways

In water supply planning contexts, infrastructure pathways describe the sequencing of infrastructure investment decisions over time, in response to changes in the system (construction of new reservoirs, expansions to water treatment capacity, etc.). Adaptive infrastructure pathways use a risk-of-failure (ROF) based set of policy rules to trigger new infrastructure development. Adaptive infrastructure pathways have been shown to effectively coordinate water infrastructure decisions in complex multi-actor systems (Zeff et al., 2016). While the benefits of this adaptive strategy are clear, ROF-based pathways also bring new challenges for decision makers: rather than specifying a single sequence of infrastructure investments, an adaptive rule system will generate a unique sequence of investments for every future state of the world. To understand how a rule system behaves, decision makers must have tools to visualize an ensemble of infrastructure pathways. This is where cluster analysis comes in. Using clustering, we can group similar infrastructure sequences together, allowing decision makers to summarize how a candidate infrastructure investment policy will behave across many future states of the world.

Here, I analyze a set of infrastructure pathways generated for a hypothetical water utility with four infrastructure options:

  • small water treatment plant expansion
  • large water treatment plant expansion
  • construction of a small new run of river intake
  • construction of a large new run of river intake

The utility is examining a candidate policy and has generated pathways for 1,000 different states of the world, or realizations, each characterized by unique hydrologic and population dynamics. Given the 1,000 realizations, each with a unique infrastructure pathway, I am interested in clustering the pathways according to the timing of the infrastructure construction decisions to understand how the policy behaves.

The dataset used for this demonstration is courtesy of David Gold, who is also responsible for providing significant portions of the following code and assistance along the way. Thanks David!

Follow along with this demonstration by downloading the .txt data file HERE!

Let’s begin by loading in the data.

import pandas as pd

# Load in data
pathways_df = pd.read_csv('./ClusterData.txt', sep = '\t', header = None)

# Convert the dataframe to a numpy array for use with scikit-learn
cluster_input = pathways_df.values

The data contains information describing the timing of infrastructure construction for 1,000 different simulated realizations. Each column represents a different infrastructure option, and each row is one of the 1,000 simulated realizations. The contents of each cell denote a standardized value, ranging from 0 to 1, corresponding to the timing of the infrastructure construction within each realization. Infrastructure options containing the value 1 were not built during the 45-year simulation period in that specific realization.

Realization | Infra. Opt. #1 | Infra. Opt. #2 | ... | Infra. Opt. #M
1 | W1,1 | W1,2 | ... | W1,M
2 | W2,1 | W2,2 | ... | W2,M
N | WN,1 | WN,2 | ... | WN,M

Table: Example output data from the infrastructure pathways optimization process; within each realization (state of the world (SOW)), infrastructure construction decisions are made during different weeks within the 45-year simulation period, considering a subset of infrastructure options for each utility. Wi,j denotes a standardized value corresponding to the timing within realization i that infrastructure option j is built.
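Before clustering, it can be worth a quick sanity check that the loaded data match this description. Here is a minimal sketch, using the pathways_df and cluster_input objects defined above:

# Quick sanity check of the loaded data: 1,000 rows (realizations), one column
# per infrastructure option, standardized build timings between 0 and 1
print(pathways_df.shape)
print(cluster_input.min(), cluster_input.max())

# A value of 1 means the option was never built in that realization; count how
# often that occurs for each infrastructure option
print((pathways_df == 1).sum())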

With the data available, we can begin clustering!

Here, I take advantage of scikit-learn’s sklearn.cluster module, which has both hierarchical and K-Means clustering capabilities.

For the purposes of this demonstration, I am going to only show the process of producing the hierarchical clusters… the curious reader may choose to explore alternative clustering methods available in the scikit-learn toolbox.

### Perform hierarchical clustering with 3 clusters ###

# Import the necessary tools
from sklearn.cluster import AgglomerativeClustering

# Begin by assuming 3 clusters
num_clusters = 3

# Initialize the hierarchical (agglomerative) clustering
hierarchical = AgglomerativeClustering(n_clusters = num_clusters)

# Cluster the data and produce an array specifying the cluster of each pathway
hierarchical_labels = hierarchical.fit_predict(cluster_input)

In order to interpret the significance of these clusters, we want to find a way to visualize differences in cluster behavior through the infrastructure pathways.

To do this, I am going to consider the median timing of each infrastructure option within each cluster, across all 1,000 realizations. Additionally, I want to sort the clusters according to a qualitative characteristic. In this case, I define pathways with early infrastructure investments as “heavy”, and pathways with infrastructure investments later in the period as “light”. The “moderate” infrastructure pathways are contained within the cluster between these two.

The following script calculates the median of the clustered pathways, and sorts them according to their infrastructure timing.

### Interpreting clusters via pathways ###

# Group realizations by pathway
# Calculate the median week each infra. opt. is constructed in each cluster
import numpy as np
cluster_pathways = []
cluster_medians = []

# Loop through clusters 
for i in range(0,num_clusters):
    
    # Select realizations contained within i-th cluster
    current_cluster = cluster_input[hierarchical_labels ==i, :]
    
    # Multiply by 2344 weeks to convert from [0,1] to [0,2344] weeks
    current_cluster = current_cluster*2344
    
    cluster_pathways.append(current_cluster)
    
    # Find the median build week of each infrastructure option in this cluster
    current_medians = np.zeros(len(current_cluster[0,:]))
    
    # Loop through infrastructure options and calculate medians
    for j in range(0, len(current_cluster[0,:])):
        current_medians[j] = np.median(current_cluster[:,j])
        
    cluster_medians.append(current_medians)

# Sort clusters by average of median build week 
# to get light, moderate, and heavy infrastructure clusters
cluster_means = np.mean(cluster_medians, axis = 1)

sorted_indices = np.argsort(cluster_means)

# Re-order cluster medians according to sorted mean build week
cluster_medians = np.vstack((cluster_medians[sorted_indices[2]], 
                            cluster_medians[sorted_indices[1]],
                            cluster_medians[sorted_indices[0]]))

With the median timing of each infrastructure option specified for each cluster, I can begin the process of visualizing the behavior. The function below plots a single infrastructure pathway:

import numpy as np
import matplotlib.pyplot as plt

def plot_single_pathway(cluster_medians, cluster_pathways, inf_options_idx,
                        c, inf_names, y_offset, ylabeling, xlabeling, 
                        plot_legend, ax):
    
    """
    Makes a plot of an infrastructure pathway.
    
    PARAMETERS:
        cluster_medians: an array with median weeks each option is built
        cluster_pathways: an array with every pathway in the cluster
        inf_options_idx: an array with numbers representing each option (y-axis values);
        should start at zero to represent the "baseline"
        c: color to plot the pathway
        inf_names: a list of infrastructure option names used as y-axis labels
        y_offset: distance to offset each pathway from 0 (easier to visualize stacked paths)
        ylabeling: labeling for infrastructure options
        xlabeling: labeling x axis
        plot_legend: (bool) to turn on or off
        ax: axes object to plot pathway
    """
    
    # Extract all non-baseline infrastructure options
    inf_options_idx_no_baseline=inf_options_idx[1:]
    sorted_inf = np.argsort(cluster_medians)
    
    # Plot Pathways
    # create arrays to plot the pathway lines. To ensure pathways have corners 
    # we need an array to have length 2*num_inf_options
    pathway_x = np.zeros(len(cluster_medians)*2+2)
    pathway_y = np.zeros(len(cluster_medians)*2+2)
    
    # To have corners, each inf. opt. must be located
    # in both the cell it is triggered, and adjacent cell
    
    # Convert median build weeks to years on the x-axis (~52 weeks per year)
    cluster_medians = np.rint(cluster_medians/52)
    for i in range(0,len(cluster_medians)):
        for j in [1,2]:
            pathway_x[i*2+j] = cluster_medians[sorted_inf[i]]
            pathway_y[i*2+j+1] = inf_options_idx_no_baseline[sorted_inf[i]]

    # Set the end of the pathway at year 45
    pathway_x[-1] = 45

    # plot the pathway line    
    ax.plot(pathway_x, pathway_y + y_offset, color=c, linewidth=5, 
            alpha = .9, zorder=1)

    # Formatting plot elements 
    ax.set_xlim([0,44])
    inf_name_labels = inf_names
    inf_options_idx = np.arange(len(inf_name_labels))
    ax.set_yticks(inf_options_idx)
    ax.set_xlabel('Year')
    
    if ylabeling:
        ax.set_yticklabels(inf_name_labels)
    else:
        ax.set_yticklabels([])
    
    ax.set_ylim([-0.5,len(cluster_medians)+.5])


The above script can be used to plot the median pathway for a single cluster, but it would be much more worthwhile to compare the median pathways of the different clusters against one another. To do this, we can define another function that uses plot_single_pathway() iteratively to plot the median pathway of each cluster on the same figure.

import numpy as np
import matplotlib.pyplot as plt
from plot_single_pathway import plot_single_pathway

def plot_many_paths(clust_meds, clust_pathways, n_clusters, cluster_colors, fig, 
                    gspec, fig_col, ylabeling, plot_legend):
    
    
    y_offsets = [-.15, 0, .15]
    
    # Specific information for Utility
    inf_options_idx = np.array([0, 1,2, 3, 4])
    utility_inf = ['Baseline', 'Small intake expans.', 'Large intake expans.',
                    'Small WTP expans.', 
                    'Large WTP expans.']
    
    ax = fig.add_subplot(gspec[0,0])
    
    for i in np.arange(n_clusters):
        plot_single_pathway(clust_meds[i], clust_pathways[i], inf_options_idx, cluster_colors[i],
                            utility_inf, y_offsets[i], ylabeling, False, True, ax)
    
    if plot_legend and (n_clusters == 3):
        ax.legend(['Light inf.', 'Moderate inf.', 'Heavy inf.'], 
                  loc='upper left')
    elif plot_legend and (n_clusters == 2):
        ax.legend(['Light inf.', 'Heavy inf.'], loc='upper left')
        
    ax.tick_params(axis = "y", which = "both", left = False, right = False)
    
    plt.show()
    return fig

Time for the reveal! Execution of this final portion of the script will plot each of the three clusters, and color-code them according to classifications as “heavy”, “moderate”, or “light” infrastructure.

### Plotting ###
import matplotlib.pyplot as plt
from plot_many_paths import plot_many_paths

fig = plt.figure(figsize = (5,5))
gspec = fig.add_gridspec(ncols=1, nrows=1)

# Colorscale
heavy_inf_color = '#132a13'
mid_inf_color = '#4f772d'
light_inf_color = '#90a955'
cluster_colors = [light_inf_color, mid_inf_color, heavy_inf_color]

plot_many_paths(cluster_medians, cluster_pathways, 3, cluster_colors, 
                fig, gspec, 0, True, True)
Figure: Median infrastructure pathways for clustered pathways.

Voilà!

The figure above shows a clear contrast between the behavior of the light and heavy infrastructure pathways: the heavy infrastructure pathway is characterized not only by early infrastructure construction, but also by more frequent construction throughout the simulation period. In comparison, the median of the light infrastructure cluster requires only a single infrastructure expansion within the same simulation period.

From the plot above, it is not apparent that three clusters are necessarily appropriate for this data set. Although the heavy and light infrastructure clusters are clearly demarcated from one another, the moderate infrastructure cluster has characteristics of both: a late first-expansion similar to the light infrastructure cluster, but a high frequency of construction throughout the period similar to the heavy infrastructure. Should the moderate infrastructure really be a separate cluster?

Let’s take advantage of the knowledge we developed earlier, and consider the silhouette scores corresponding to different numbers of clusters! Fortunately, scikit-learn has a suite of tools that make silhouette analysis very easy.

In the following script, I evaluate the silhouette score associated with clustering for 2, 3, and 4 different clusters.

### Performing silhouette analysis ###
from sklearn.metrics import silhouette_score

num_clust = [2, 3, 4]

# Set up hierarchical clustering with different numbers of clusters
ac2 = AgglomerativeClustering(n_clusters = num_clust[0])
ac3 = AgglomerativeClustering(n_clusters = num_clust[1])
ac4 = AgglomerativeClustering(n_clusters = num_clust[2])

# Extract the silhouette score for each hierarchical grouping
silhouette_scores = []
silhouette_scores.append(silhouette_score(cluster_input, ac2.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac3.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac4.fit_predict(cluster_input)))
                                                  
plt.bar(num_clust, silhouette_scores)
plt.xlabel('Number of clusters', fontsize = 14)
plt.ylabel('Silhouette Score', fontsize = 14)
plt.xticks(num_clust)
plt.title("Silhouette analysis")
plt.show()
Figure: Silhouette analysis for the hierarchical clustering of infrastructure pathways.

Interestingly, the silhouette score was highest when the clustering was performed using only two clusters, suggesting that the moderate infrastructure class may not have been sufficiently unique to warrant a separate cluster.

Another method of visualizing the cluster structure is through the use of dendrograms. Below, I show dendrograms color-coded with only two clusters (left) and using three clusters (right).

Figure: Dendrograms showing the cluster structure of infrastructure pathways for a utility, using 2 (left) and 3 (right) clusters.

The dendrogram containing only two clusters (left) reveals that the moderate and light clusters can be combined into a single cluster, because the moderate cluster is less different from the light infrastructure cluster than it is from the heavy infrastructure cluster. However, it is important to note that dendrograms alone are not a reliable method of determining the number of clusters. In fact, given that the difference in silhouette scores between the 2-cluster and 3-cluster groupings is relatively modest (~0.05), there is not sufficiently strong justification to choose 2 clusters over 3 clusters.
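If you would like to generate similar dendrograms yourself, here is a minimal sketch using scipy’s hierarchy tools, assuming cluster_input is the array defined earlier. The Ward linkage (scikit-learn’s default for AgglomerativeClustering) and the color_threshold values below are illustrative choices; the thresholds would need to be tuned to highlight exactly two or three clusters for this dataset.

# A minimal sketch for plotting dendrograms of the clustered pathways with scipy
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Build the agglomerative linkage tree (Ward linkage, scikit-learn's default)
Z = linkage(cluster_input, method='ward')

# color_threshold controls the height at which the tree is cut into colored
# clusters; lower thresholds produce more colored clusters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
dendrogram(Z, ax=ax1, no_labels=True, color_threshold=0.7*max(Z[:, 2]))
ax1.set_title('Higher cut (fewer clusters)')
dendrogram(Z, ax=ax2, no_labels=True, color_threshold=0.4*max(Z[:, 2]))
ax2.set_title('Lower cut (more clusters)')
plt.show()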

For the sake of comparison, here are the 2-cluster and 3-cluster median pathway plots:

Conclusion

I hope you’ve gained some insight into clustering, infrastructure pathways (or preferably both) in this post.

Analysis of infrastructure pathways via statistical clustering presents a unique application of a very common statistical tool. In this analysis I clustered the pathways with respect to the timing of the infrastructure expansions; however, there is potential for a much more in-depth analysis of how the hydrologic or urban contexts of each realization shape these infrastructure pathways… possibly justifying another blog post at a later date. Stay tuned!

Thanks again to David Gold, for providing the data and the foundational python code.

References

Zeff, H. B., Herman, J. D., Reed, P. M., & Characklis, G. W. (2016). Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways. Water Resources Research, 52(9), 7327-7346.

Efficient Storage and Querying of Geospatial Data with Parquet and DuckDB

This post is dedicated to Chris Vernon and Travis Thurber from PNNL who taught me how to use these tools!

Lately, I’ve been generating synthetic weather scenarios for basins in California. The weather scenarios are created at a daily time step and are informed by tree-ring products which span 600 years. We want to develop multiple ensembles of these scenarios that are representative of plausible future climate changes that the region could experience. It became clear to me that I would need to store these data efficiently and be able to parse through them quickly to generate plots and develop metrics.

Below, I wanted to discuss two points that have really helped my workflow: (1) focusing on file structure and compression and (2) changing the way that I explore the resulting scenarios.

File choice and compression

To illustrate the importance of considering how you choose to save your data, I grab the historical trace of daily precipitation over the last 50 years for the Tuolumne River Basin and load it into R as a dataframe. Below I show the code to take this dataframe and save it in various formats.

#Traditional RDS
system.time(saveRDS(prcp_tuolumne,"E:/blog/prcp_tuolumne.rds"))

#Compressed RDS
system.time(saveRDS(prcp_tuolumne,"E:/blog/prcp_tuolumne_compressed.rds",compress = 'xz'))

#Text 
system.time(write.table(prcp_tuolumne,"E:/blog/prcp_tuolumne.txt"))

#CSV
system.time(write.csv(prcp_tuolumne,"E:/blog/prcp_tuolumne.csv"))

#Parquet
library(arrow)
system.time(write_parquet(prcp_tuolumne,"E:/blog/prcp_tuolumne.parquet",compression = "gzip"))

#HDF5 
library(rhdf5)
system.time(h5write(prcp_tuolumne,"E:/blog/prcp_tuolumne.h5","df"))


#NetCDF
library(ncdf4)
# path and file name, set dname
ncpath <- "E:/blog/"
ncname <- "prcp_tuolumne"  
ncfname <- paste(ncpath, ncname, ".nc", sep="")
dname <- "prcp"  # name of the precipitation variable
timedim <- ncdim_def("time","days since 1950-11-1",as.numeric(dates))
# define variables
fillvalue <- 1e32
dlname <- "observed precipitation"
prcp_def <- ncvar_def(dname,"mm",list(timedim),fillvalue,dlname,prec = "double")
# create netCDF file and put arrays
ncout <- nc_create(ncfname,list(prcp_def),force_v4=TRUE)
# put variables
system.time(ncvar_put(ncout,prcp_def,as.matrix(prcp_tuolumne[ ,1])))

Here, I summarize the resulting size and write times for each format.

File Type | Size (Kb) | Writing Time (seconds)
RDS | 9,065 | 2.13
Compressed RDS | 4,012 | 17.73
HDF5 | 5,204 | 323.03
Text | 26,851 | 12.69
CSV | 27,024 | 12.14
netCDF | 141,360 | 3.12
Parquet | 8,396 | 0.98

Note that the original file size was on the smaller side, so this exercise may seem trivial, but when you are creating many ensembles of both precipitation and temperature, across hundreds of years, across many basins, and across many climate change scenarios, these files will add up and can potentially limit the scope of the experiments you can conduct if you don’t have enough file storage. Of these file types, one of my recent favorites has been Apache Parquet. Parquet is an open-source column-oriented format that is specifically designed for efficient storage and accessibility of data. Whereas the compressed RDS file and HDF5 beat Parquet in terms of size, it takes much longer to write to these files and subsequently read them back in. Another advantage of Parquet is that it is recognized by common coding languages (R, MATLAB, Python), which allows for a more seamless workflow between models written in different languages. If you have a little more storage to work with, Parquet is a good choice for balancing the tradeoff between size and writing time.

Querying data with SQL and DuckDB

Once your data are in an efficient format like Parquet, the next order of business is to make sure that you can easily sort through it and, as Chris would say, “talk to your data and let it talk back”. One way to communicate with your data and ask questions in a more semantically meaningful way is to use Structured Query Language (SQL). There are many methods of querying in R, but a useful data management tool that I have enjoyed interacting with is DuckDB which is a relational database management system. DuckDB has an R API and allows you to use the SQL syntax to query data in an efficient way. It should be noted that you can perform similar aggregation and subsetting using other R base functions, but DuckDB allows you to perform these tasks faster and across multiple files that you may have in a directory.

Let’s use a case where you have developed 30 traces of daily basin-averaged synthetic precipitation for the Tuolumne based on the historical period, and each trace is stored in a respective parquet file (“ensemble1.parquetgzip”, “ensemble2.parquetgzip”, … “ensemble30.parquetgzip”). Each ensemble looks something like this:

Example of the parquet file structure

Now let’s state a question that I want to ask to the data and see how to structure these questions using SQL and DuckDB.

“Hey DuckDB, can you create a dataframe for me that provides an annual average across all my synthetic traces? Let’s plot that with respect to the observed annual average and show the bounds of the synthetic traces.”

First we need to open a connection to DuckDB. Then we are going to have to create annual averages from the daily historical and synthetic traces and then create max/min bounds to show the additional variability we are getting from the generator.

In order to answer this question, DuckDB will query all our parquet files simultaneously (!) and return a dataframe with annual averages, and we can do some clever sub-querying to get the max/min of those averages. I really enjoy that I don’t have to read the files into my R workspace. R holds the datasets it reads in memory, so those of you who have worked with larger datasets might have found that you are limited in how many you can have open at once, which can be frustrating! This is not an issue with DuckDB.

#Historical trace 
historical=data.frame(dates.year,dates.month,dates.day,rowMeans(prcp_sacsma))
historical_yearly=aggregate(historical,by=list(dates.year),FUN=mean)

# load the database packages and open a connection to DuckDB
library(DBI)
library(duckdb)
con <- dbConnect(duckdb::duckdb())

# query to find annual average across all synthetic traces
df=dbGetQuery(con, 
              "SELECT year,
           AVG(precipitation) AS yearly_average
           FROM 'E:/NHMM/blog/*.parquet'
           GROUP BY year
           ORDER BY year")

#For the max/min, we need to find the average first and then return the max/min values in something like a nested approach 

#First create a dataframe with all files 
all_files=dbGetQuery(con,"SELECT *
                FROM 'E:/NHMM/blog/*.parquet'")

# register the dataframe as a DuckDB table, and give it a name
duckdb::duckdb_register(con, "all_files_table", all_files)


annual_average=dbGetQuery(con, 
               "SELECT year,
           AVG(precipitation) AS yearly_average
           FROM all_files_table
           GROUP BY year,sample
           ORDER BY year")

# register the dataframe as a DuckDB table, and give it a name
duckdb::duckdb_register(con, "annual_table", annual_average)

#query to find max
max=dbGetQuery(con, 
               "SELECT year,
           MAX(yearly_average) AS max_yearly_average
           FROM annual_table
           GROUP BY year
           ORDER BY year")

#query to find min
min=dbGetQuery(con, 
               "SELECT year,
           min(yearly_average) AS min_yearly_average
           FROM annual_table
           GROUP BY year
           ORDER BY year")
#Plot the results!
library(ggplot2)
library(ggridges)  # provides theme_ridges()

ggplot(df, aes(x = year, y = yearly_average)) +
  geom_ribbon(aes(ymin = min$min_yearly_average,
                  ymax = max$max_yearly_average),
              alpha = 0.2,fill="blue") +
  geom_line()+geom_line(color="blue",size=2)+ geom_line(data=basin.wide.yearly[3:64,],aes(x=year,y=precipitation), color="Black",size=2)+ ggtitle("Annual Average Precipitation") +
  xlab("Year") + ylab("Annual Average Precipitation (mm)")+theme_ridges()

Here’s the resulting figure:

Synthetic generation in blue compared to the historical trace in black

“Hey DuckDB, from all of the synthetic traces that I have generated, how many instances are there where we are producing daily precipitation that is above 10 mm for specifically the years 1980 or 2000?”

In order to answer this question, DuckDB will once again query all our parquet files simultaneously and return a dataframe with all the instances of >10 mm and only for the years 1980 or 2000. This is how you would write this question in “code”.

library(arrow)
library(duckdb)
library(fs)
library(tidyverse)
library(DBI)
library(glue)

# open connection to DuckDB
con <- dbConnect(duckdb::duckdb())

df=dbGetQuery(con,"SELECT *
                FROM 'E:/blog/*.parquet'
                WHERE precipitation > 10
                AND (year = 1980 OR year = 2000)
                order BY year")

If we look at the size of the resulting dataframe, it looks like we have generated 1,335 instances of daily precipitation greater than 10 mm in the years 1980 or 2000:

“Hey DuckDB, I want to think about meteorological drought. In how many instances am I half a standard deviation below the long-term average monthly precipitation?”

To try this out, let’s just look at one synthetic trace. First we need to find the long term average across the trace and let’s plot what a drought might look like.

# query to grab the first synthetic trace
scenario1=dbGetQuery(con, 
              "SELECT month,year,
           AVG(precipitation) AS monthly_avg
           FROM 'E:/NHMM/blog/ensemble1.parquet'
           GROUP by year, month")

# query to find the long-term mean
long_term_average=dbGetQuery(con, 
              "SELECT AVG(precipitation) AS monthly_avg
           FROM 'E:/NHMM/blog/ensemble1.parquet'") 

# query to find the standard deviation
stdev=dbGetQuery(con, 
              "SELECT STDDEV(precipitation) AS monthly_stdev
           FROM 'E:/NHMM/blog/ensemble1.parquet'")

We can then plot the synthetic trace (blue), mean (black), and a 1/2 standard deviation (dashed line) below the mean.

Now we can count the number of months in which this synthetic trace is classified as being in a drought. Based on the stats that we calculated in the last block, I hard-coded the drought threshold, but in theory you can nest these calculations with more complicated sub-queries.

# register the dataframe as a DuckDB table, and give it a name
duckdb::duckdb_register(con, "scenario_1", scenario1)


instances=dbGetQuery(con, 
                     "SELECT month,year
           FROM scenario_1
           WHERE monthly_avg < 3.125291
           GROUP by year, month
           ORDER by year,month")

It looks like 115/325 months, or about 35% of the time, this particular synthetic trace is in a drought. Yikes! But it makes for a good way to assess the vulnerability of the system to plausible future meteorological drought conditions.

I hope that these approaches can be as useful to you as they have been for me!