Continuous Deployment with GitHub Actions (or, What Gives Life to a Living eBook?)

Continuous Deployment with GitHub Actions (or, What Gives Life to a Living eBook?)

Last month we introduced our free eBook Addressing Uncertainty in MultiSector Dynamics Research. Did you know that, to date, we have published 74 revisions? Without an automated release and deployment process, this would have been very tedious! But, with the help of GitHub Actions, every small edit, minor clarification, or major improvement can go live within moments of merging the code. Read on to learn how the eBook team leverages GitHub Actions for CI/CD (Continuous Integration / Continuous Delivery).

GitHub Workflow

A reliable CI/CD strategy depends on a robust code review process⁠—continuously delivering bugs and typos will not impress anyone! There are many acceptable workflows; the best one to use will depend on team composition and codebase. In our case, a feature branching workflow suffices:

Feature Branching Workflow

The feature branching workflow consists of a main code branch representing published assets. New features or bug fixes require authors to create a new branch from main, implement their changes, and then submit a pull request back into main. A pull request is a formal code review process that gives other authors a chance to provide feedback on the proposed changes. Once consensus has been reached, the feature branch is merged into the main branch, kicking off the CI/CD process.

Automation

While the code review process is designed to catch content and conceptual errors, subtle process and system based errors can often slip through the cracks. Thus the first step in the CI/CD process should be running a suite of automated tests that span a range of systems, behaviors, and any known pain points. The depth and breadth of these tests should be sufficient to ensure an adequate degree of publication readiness without being overly burdensome to maintain. The test suite can grow over time!

For the eBook, our tests simply ensure that the Python dependencies install correctly on Linux, Mac, and Windows containers, that the supporting code can be imported on these systems without error, and that the HTML and PDF versions of the publication generate successfully. If any tests fail, the publication process is cancelled and the authors are notified with details of the failure.

This test, release, and publication process is orchestrated by GitHub Actions. Read on to learn more!

GitHub Actions

GitHub Actions are available to any project hosted in GitHub—totally free for public repositories, and free to a limited extent for private repositories. An Action can be defined as a unit of work performed using a virtual machine in response to an event. In GitHub, Actions are defined using YAML files places into the .github/workflows directory of a repository. YAML (YAML Ain’t Markup Language) is a concise, human-readable syntax for conveying semi-structured data. The minimal content of a GitHub Action includes a name, one or more event triggers, and one or more jobs. A job consists of a name, one or more virtual machine targets, and one or more steps. For example, the eBook test Action looks like this:

.github/workflows/01_test.yml
name: Test
on:
  push:
    branches: [ main ]
jobs:
  test:
      runs-on: ${{ matrix.os }}
      strategy:
        matrix:
          os: [ubuntu-latest, macos-latest, windows-latest]
      steps:
        - uses: actions/checkout@v2
        - name: Set up Python
          uses: actions/setup-python@master
          with:
            python-version: 3.9
        - name: Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -r requirements.txt
        - name: Run tests
          run: |
            pip install pytest
            pytest

There is a lot going on here, so let’s take it step by step!

name: Test
on:
  push:
    branches: [ main ]

This snippet gives our Action a name, and specifies that it should trigger on updates to the main branch of the repository.

jobs:
  test:
      runs-on: ${{ matrix.os }}
      strategy:
        matrix:
          os: [ ubuntu-latest, macos-latest, windows-latest ]

This snippet defines a job within our “Test” Action named “test”, and then uses special syntax to declare that the job should be run on three different virtual machines: the latest Ubuntu Linux, macOS, and Windows containers. Running the tests on multiple operating systems helps catch bugs with system-specific dependencies.

      steps:
        - uses: actions/checkout@v2
        - name: Set up Python
          uses: actions/setup-python@master
          with:
            python-version: 3.9
        - name: Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -r requirements.txt
        - name: Run tests
          run: |
            pip install pytest
            pytest

This snippet outlines the actual units of work within the job; each “-” separates a unique task. The uses syntax is special in that it allows one to leverage tasks written by others hosted in the GitHub Actions Marketplace. The actions/checkout@v2 task clones the repository onto the virtual machine, and the actions/setup-python@master task installs and configures the specified Python version. The final two steps use the run directive to invoke custom code, in this case installing Python dependencies and running the Python test suites.

Deployment

Once the tests successfully pass, it’s time to publish! Since the eBook is essentially a web app, GitHub Pages is the perfect deployment platform. Pages hosts the content of a branch as a website, and is free for public repositories.

If you followed along with the previous eBook post, you learned about the Python, Sphinx, and Restructured Text workflow for compiling the eBook content into a polished product. Let’s create a GitHub Action to compile the eBook and deploy it to GitHub Pages! Here’s the full YAML file:

.github/workflows/02_deploy.yml
name: Deploy
on:
  push:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install latex dependencies
        run: sudo apt-get update -y && sudo apt-get install -y texlive latexmk texlive-latex-recommended texlive-latex-extra texlive-fonts-recommended ghostscript
      - name: Update pip and install python dependencies
        working-directory: 'docs/'
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Build html and pdf ebook
        working-directory: 'docs/'
        run: |
          make html latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
        continue-on-error: true
      - name: Get current datetime in ISO format
        id: date
        run: echo "::set-output name=date::$(date -u +'%Y-%m-%d')"
      - name: Create GitHub release
        id: gh_release
        uses: softprops/action-gh-release@v1
        with:
          files: docs/build/latex/addressinguncertaintyinmultisectordynamicsresearch.pdf
          tag_name: ${{ steps.date.outputs.date }}v${{ github.run_number }}
      - name: Commit the compiled files
        run: |
          cd docs/build/html
          git init
          git add -A
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git commit -m 'deploy' -a || true
      - name: Push changes to gh-pages
        uses: ad-m/github-push-action@master
        with:
          branch: gh-pages
          directory: docs/build/html
          force: true
          github_token: ${{ secrets.GITHUB_TOKEN }}

A lot to unpack here! Let’s take it step by step. As before, we start by naming the Action, triggering it on updates to the main branch, declaring that it should run only on an ubuntu-latest virtual machine, checking out out the code, and setting up Python. Then we get into the new job steps:

      - name: Install latex dependencies
        run: sudo apt-get update -y && sudo apt-get install -y texlive latexmk texlive-latex-recommended texlive-latex-extra texlive-fonts-recommended ghostscript

This step installs all the operating system dependencies needed to support the Latex syntax and compilation to PDF. There was some trial and error involved in getting this right, but once correct it should be pretty stable.

      - name: Build html and pdf ebook
        working-directory: 'docs/'
        run: |
          make html latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
        continue-on-error: true

This step runs the Sphinx makefile to compile the HTML and PDF versions of the eBook. The verbosity and repetitiveness of these commands works around some unusual oddities of the Latex and PDF compilation. --keep-going LATEXMKOPTS="-interaction=nonstopmode" prevents the command from waiting for user input. || true and the repeated make latexpdf lines allow the PDF engine to fully resolve all the references in the restructured text files; otherwise the PDF file would be incomplete and garbled (this one stumped us for awhile!).

      - name: Get current datetime in ISO format
        id: date
        run: echo "::set-output name=date::$(date -u +'%Y-%m-%d')"
      - name: Create GitHub release
        id: gh_release
        uses: softprops/action-gh-release@v1
        with:
          files: docs/build/latex/addressinguncertaintyinmultisectordynamicsresearch.pdf
          tag_name: ${{ steps.date.outputs.date }}v${{ github.run_number }}

To make it easier to chronologically place our eBook releases, we wanted to include a date stamp in our version tags. The first step above assigns the date to a variable. The second step above creates and tags an official GitHub release (using the date and an auto-incrementing run number), and includes the PDF version of the eBook as an asset attached to the release.

      - name: Commit the compiled files
        run: |
          cd docs/build/html
          git init
          git add -A
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git commit -m 'deploy' -a || true
      - name: Push changes to gh-pages
        uses: ad-m/github-push-action@master
        with:
          branch: gh-pages
          directory: docs/build/html
          force: true
          github_token: ${{ secrets.GITHUB_TOKEN }}

These two steps cause the GitHub Actions user to commit the compiled HTML files and force push them to the gh-pages branch of our repository, using a secret token. This is a common “hack” to enable publishing only the desired web assets and not the entire repository. Never force push to other shared code branches!

Action Status

Check the status of the CI/CD pipeline using the Actions tab of the GitHub repository. Successful Actions show a check mark, in progress Actions show a spinner, and failed Actions show an X. Clicking into a particular Action will show more details, log messages, and error traces if relevant.

The GitHub Actions Tab

Slack Integration

To take our workflow to the next level (and to avoid the need to read even more email 😅 ), we added the GitHub app to our eBook Slack channel. This adds a bot that can subscribe to GitHub repositories and report on activity; for instance: new issues, new pull requests, and new releases. We can then discuss and iterate inline in Slack, without having to jump to other apps or sites.

GitHub Slack Integration

To add the GitHub bot to a channel, right click on the channel name, select “Open channel details”, and navigate to the “Integrations” tab. From here, you can choose “Add apps” and search for GitHub. Once added, type a bot command such as /github subscribe [repository name] to start receiving notifications in the channel. The bot can also be used to open and close issues!

Conclusion

Using the GitHub Actions workflow to automate the testing and publication of our eBook enabled our team to focus more on the content and quality of release rather than stumbling over the intricacies of the publication process. CI/CD has been a powerful tool in software communities, but can greatly benefit researchers and academics as well. We hope our learnings presented above will speed you along in your own workflows!

MORDM VIII: Characterizing the effects of deep uncertainty

In the previous post, we defined robustness using the satisficing metric where (1) reliability should be at least 98%, (2) restriction frequency should be not more than 10% and (3) worst-case cost of drought mitigation action should not be more than 10% of annual net volumetric income. To calculate the robustness of these set of actions (portfolios) against the effects of challenging states of the world (SOWs) on the initial set of actions, we once again re-simulated them to discover how they fail.

In this penultimate post, we will perform simple sensitivity analysis across the average performance of all sixty-nine portfolios of actions to understand which uncertainties control the performance of each utility (Raleigh, Durham and Cary) and the regions across all uncertain SOWs.

Calculating average performance across 100 DU SOWs

First, we create a new folder to hold the output of the next few post-processing steps. Navigate to the WaterPaths/ folder and create a folder called post_processing. Now, let’s calculate the average performance of each of the sixty-nine initial portfolios across the 100 DU SOWs that we previously simulated them over. The code for this can be found in the post_processing_code folder under gather_objs.py file and should look like this:

# -*- coding: utf-8 -*-
"""
Created on Mon April 26 2022 11:12

@author: Lillian Bei Jia Lau
Organizes output objectives by mean across RDMs and across solutions

"""
import numpy as np

obj_names = ['REL_C', 'RF_C', 'INF_NPC_C', 'PFC_C', 'WCC_C', \
        'REL_D', 'RF_D', 'INF_NPC_D', 'PFC_D', 'WCC_D', \
        'REL_R', 'RF_R', 'INF_NPC_R', 'PFC_R', 'WCC_R', \
        'REL_reg', 'RF_reg', 'INF_NPC_reg', 'PFC_reg', 'WCC_reg']

'''
Performs regional minimax
'''
def minimax(N_SOLNS, objs):
    for i in range(N_SOLNS):
        for j in range(5):
            if j == 0:
                objs[i,15] = np.min([objs[i,0],objs[i,5], objs[i,10]])
            else:
                objs[i, (j+15)] = np.max([objs[i,j],objs[i,j+5], objs[i,j+10]])
    return objs

'''
Calculates the mean performance acorss all SOWs
'''
def mean_performance_across_rdms(objs_by_rdm_dir, N_RDMS, N_SOLNS):
    objs_matrix = np.zeros((N_SOLNS,20,N_RDMS), dtype='float')
    objs_means = np.zeros((N_SOLNS,20), dtype='float')

    for i in range(N_RDMS):
        filepathname = objs_by_rdm_dir + str(i) + '_sols0_to_' + str(N_SOLNS) + '.csv'
        objs_file = np.loadtxt(filepathname, delimiter=",")
        objs_matrix[:,:15,i] = objs_file

        objs_file_wRegional = minimax(N_SOLNS, objs_matrix[:,:,i])

        objs_matrix[:,:,i] = objs_file_wRegional

        array_has_nan = np.isnan(np.mean(objs_matrix[:,3,i]))
        if(array_has_nan == True):
            print('NaN found at RDM ', str(i))

    for n in range(N_SOLNS):
        for n_objs in range(20):
            objs_means[n,n_objs] = np.mean(objs_matrix[n,n_objs,:])

    return objs_means

'''
Calculates the mean performance acorss all SOWs
'''
def mean_performance_across_solns(objs_by_rdm_dir, N_RDMS, N_SOLNS):
    objs_matrix = np.zeros((N_SOLNS,20,N_RDMS), dtype='float')
    objs_means = np.zeros((N_RDMS,20), dtype='float')

    for i in range(N_RDMS):
        filepathname = objs_by_rdm_dir + str(i) + '_sols0_to_' + str(N_SOLNS) + '.csv'
        objs_file = np.loadtxt(filepathname, delimiter=",")
        objs_matrix[:,:15,i] = objs_file
        objs_file_wRegional = minimax(N_SOLNS, objs_matrix[:,:,i])

        objs_matrix[:,:,i] = objs_file_wRegional

        array_has_nan = np.isnan(np.mean(objs_matrix[:,3,i]))
        if(array_has_nan == True):
            print('NaN found at RDM ', str(i))

    for n in range(N_RDMS):
        for n_objs in range(20):
            objs_means[n,n_objs] = np.mean(objs_matrix[:,n_objs,n])

    return objs_means

# change number of solutions available depending on the number of solutions
# that you identified
N_SOLNS = 69
N_RDMS = 100

# change the filepaths
objs_by_rdm_dir = '/yourFilePath/WaterPaths/output/Objectives_RDM'
objs_og_dir = '/yourFilePath/WaterPaths/'

fileoutpath = '/yourFilePath/WaterPaths/post_processing/'

fileoutpath_byRDMs = fileoutpath + 'meanObjs_acrossRDM.csv'
fileoutpath_bySoln = fileoutpath + 'meanObjs_acrossSoln.csv'

# should have shape (N_SOLNS, 20)
objs_byRDM = mean_performance_across_rdms(objs_by_rdm_dir, N_RDMS, N_SOLNS)
# should have shape (N_RDMS, 20)
objs_bySoln = mean_performance_across_solns(objs_by_rdm_dir, N_RDMS, N_SOLNS)

np.savetxt(fileoutpath_byRDMs, objs_byRDM, delimiter=",")
np.savetxt(fileoutpath_bySoln, objs_bySoln, delimiter=",")

This will output two .csv files: meanObjs_acrossRDM.csv contains the average performance of each of the sixty-nine objectives evaluated over 100 DU SOWs, while meanObjs_acrossSoln.csv contains the average performance of all solutions within one SOW. Take some time to understand this difference, as this will be important when performing sensitivity analysis and scenario discovery.

Calculate the robustness of each portfolio to deep uncertainty

Now, let’s identify how each of these solutions perform under a set of more challenging SOWs. Within post_processing_code/, identify the file called calc_robustness_across_rdms.py. It should look like this:

# -*- coding: utf-8 -*-
"""
Created on Mon April 26 2022 11:12

@author: Lillian Bei Jia Lau

Calculates the fraction of RDMs over which each perturbed version of the solution meets all four satisficing criteria
"""
import numpy as np
import pandas as pd

obj_names = ['REL_C', 'RF_C', 'INF_NPC_C', 'PFC_C', 'WCC_C', \
        'REL_D', 'RF_D', 'INF_NPC_D', 'PFC_D', 'WCC_D', \
        'REL_R', 'RF_R', 'INF_NPC_R', 'PFC_R', 'WCC_R', \
        'REL_reg', 'RF_reg', 'INF_NPC_reg', 'PFC_reg', 'WCC_reg']

utilities = ['Cary', 'Durham', 'Raleigh', 'Regional']

'''
Performs regional minimax
'''
def minimax(N_SOLNS, objs):
    for i in range(N_SOLNS):
        for j in range(5):
            if j == 0:
                objs[i,15] = np.min([objs[i,0],objs[i,5], objs[i,10]])
            else:
                objs[i, (j+15)] = np.max([objs[i,j],objs[i,j+5], objs[i,j+10]])
    return objs

'''
For each rdm, identify if the perturbed solution version x satisfies the satisficing criteria
'''
def satisficing(df_objs):
    for i in range(4):
        df_objs['satisficing_C'] = (df_objs['REL_C'] >= 0.98).astype(int) *\
                                    (df_objs['WCC_C'] <= 0.10).astype(int) *\
                                    (df_objs['RF_C'] <= 0.10).astype(int)

        df_objs['satisficing_D'] = (df_objs['REL_D'] >= 0.98).astype(int) *\
                                    (df_objs['WCC_D'] <= 0.10).astype(int) *\
                                    (df_objs['RF_D'] <= 0.10).astype(int)

        df_objs['satisficing_R'] = (df_objs['REL_R'] >= 0.98).astype(int) *\
                                    (df_objs['WCC_R'] <= 0.10).astype(int) *\
                                    (df_objs['RF_R'] <= 0.10).astype(int)

    df_objs['satisficing_reg'] = np.max(df_objs.iloc[:, 20:23])
    return df_objs

def calc_robustness(objs_by_rdm_dir, N_RDMS, N_SOLNS):

    # matrix structure: (N_SOLNS, N_OBJS, N_RDMS)
    objs_matrix = np.zeros((N_SOLNS,20,N_RDMS), dtype='float')

    satisficing_matrix = np.zeros((N_SOLNS,4,N_RDMS), dtype='float')
    solution_robustness = np.zeros((N_SOLNS,4), dtype='float')

    for i in range(N_RDMS):
        # get one perturbed instance's behavior over all RDMs
        filepathname = objs_by_rdm_dir + str(i) + '_sols0_to_' + str(N_SOLNS) + '.csv'

        objs_file = np.loadtxt(filepathname, delimiter=",")

        objs_matrix[:,:15,i] = objs_file

        objs_file_wRegional = minimax(N_SOLNS, objs_matrix[:,:,i])

        objs_matrix[:,:,i] = objs_file_wRegional

        # NaN check
        array_has_nan = np.isnan(np.mean(objs_matrix[:,3,i]))
        if(array_has_nan == True):
            print('NaN found at RDM ', str(i))

    # for the perturbed instances
    for r in range(N_RDMS):

        df_curr_rdm = pd.DataFrame(objs_matrix[:, :, r], columns = obj_names)

        df_curr_rdm = satisficing(df_curr_rdm)
        satisficing_matrix[:N_SOLNS,:,r] = df_curr_rdm.iloc[:,20:24]

    for n in range(N_SOLNS):
        solution_robustness[n,0] = np.sum(satisficing_matrix[n,0,:])/N_RDMS
        solution_robustness[n,1] = np.sum(satisficing_matrix[n,1,:])/N_RDMS
        solution_robustness[n,2] = np.sum(satisficing_matrix[n,2,:])/N_RDMS

    solution_robustness[:,3] = np.min(solution_robustness[:,:3], axis=1)

    return solution_robustness

'''
Change number of solutions available depending on the number of solutions
that you identified and the number of SOWs that you are evaluating them over.
'''
N_RDMS = 100
N_SOLNS = 69

objs_by_rdm_dir = '/scratch/lbl59/blog/WaterPaths/output/Objectives_RDM'

fileoutpath_robustness = '/scratch/lbl59/blog/WaterPaths/post_processing/' + \
    'robustness_' + str(N_RDMS) + '_SOWs.csv'

robustness = calc_robustness(objs_by_rdm_dir, N_RDMS, N_SOLNS)

np.savetxt(fileoutpath_robustness, robustness, delimiter=",")

When you run this script from the terminal, you should have a .csv file called ‘robustness_100_SOWs.csv‘ appear in your post_processing/ folder. Now, let’s get onto some sensitivity analysis!

Delta moment-independent sensitivity analysis

The Delta moment-independent (DMI) method (Borgonovo, 2007) is sensitivity analysis method that compares the entire probability distribution of both input and output parameters to estimate the sensitivity of the output to a specific input parameter. It is one of many global sensitivity analysis methods, which in itself is one of two main categories of sensitivity analysis that enables the assessment of the degree to which uncertainty in model inputs map to the degree of uncertainty in model output. For purposes of this test case, the DMI is preferable as it does not rely on any one statistical moment (variance, mean and skew) to describe the dependence of model output to its input parameters. It is also time-sensitive, reflecting the current state of knowledge within the system, which philosophically pairs well with our use of the ROF triggers. More information on alternative global sensitivity methods can be found here.

Within the context of our test case, we will be using the DMI method to identify uncertainties in our decision variables that most strongly influence our performance over the 100 DU SOWs. To perform DMI sensitivity analysis, first navigate to the post_processing/ folder. Within the folder, create two sub-folders called delta_output_DV/ and delta_output_DUF/. This is where all your DMI output will be stored. Next, locate the delta_sensitivity.py file within the post_processing_code/ folder. The code should look similar to the script provided below:

import sys
from SALib.analyze import delta
from SALib.util import read_param_file
import numpy as np
import pandas as pd

'''
Finds the upper and lower bounds of input parameters
'''
def find_bounds(input_file):
    bounds = np.zeros((input_file.shape[1],2), dtype=float)
    for i in range(input_file.shape[1]):
        bounds[i,0] = min(input_file[:,i])
        bounds[i,1] = max(input_file[:,i])

    return bounds
'''
Performs delta moment-independent sensitivity analysis
Source: https://github.com/SALib/SALib/tree/main/examples/delta
'''
def delta_sensitivity(dec_vars, measured_outcomes, names, mo_names, bounds, rdm, mode):
    X = dec_vars
    Y = measured_outcomes

    problem = {
        'num_vars': int(dec_vars.shape[1]),
        'names': names,
        'bounds': bounds
    }

    for i in range(measured_outcomes.shape[1]):
        mo_label = mo_names[i]
        if i == 2 and mode == 'objs':
          break
        else:
          filename = '../post_processing/delta_output_' + rdm + '/S1_' + mo_label + '.csv'
          S1 = delta.analyze(problem, X, Y[mo_label].values, num_resamples=10, conf_level=0.95, print_to_console=False)
        numpy_S1 = np.array(S1["S1"])
        fileout = pd.DataFrame([names, numpy_S1], index = None, columns = None)
        fileout.to_csv(filename, sep=",")

'''
0 - Name all file headers and compSol to be analyzed
'''
obj_names = ['REL_C', 'RF_C', 'INF_NPC_C', 'PFC_C', 'WCC_C', \
        'REL_D', 'RF_D', 'INF_NPC_D', 'PFC_D', 'WCC_D', \
        'REL_R', 'RF_R', 'INF_NPC_R', 'PFC_R', 'WCC_R', \
        'REL_reg', 'RF_reg', 'INF_NPC_reg', 'PFC_reg', 'WCC_reg']

dv_names = ['RT_C', 'RT_D', 'RT_R', 'TT_D', 'TT_R', 'LMA_C', 'LMA_D', 'LMA_R',\
            'RC_C', 'RC_D', 'RC_R', 'IT_C', 'IT_D', 'IT_R', 'IP_C', 'IP_D', \
            'IP_R', 'INF_C', 'INF_D', 'INF_R']

rdm_headers_dmp = ['Cary restr. eff', 'Durham restr. eff', 'Raleigh restr. eff']
rdm_headers_utilities = ['Demand growth\nmultiplier', 'Bond term\nmultiplier', \
                        'Bond interest\nrate multiplier', 'Infrastructure interest\nrate multiplier']
rdm_headers_ws = ['Streamflow amp', 'SCR PT', 'SCR CT', 'NRR PT', 'NRR CT', 'CR Low PT', 'CR Low CT',\
                  'CR High PT', 'CR High CT', 'WR1 PT', 'WR1 CT', 'WR2 PT', 'WR2 CT',\
                  'DR PT', 'DR CT', 'FR PT', 'FR CT']

duf_names = ['Cary restr. eff', 'Durham restr. eff', 'Raleigh restr. eff', 'Demand growth\nmultiplier',\
             'Bond term\nmultiplier', 'Bond interest\nrate multiplier', 'Infrastructure interest\nrate multiplier',\
             'Streamflow amp\nmultiplier', 'SCR PT\nmultiplier', 'SCR CT\nmultiplier', 'NRR PT\nmultiplier',\
             'NRR CT\nmultiplier', 'CR Low PT\nmultiplier', 'CR Low CT\nmultiplier', 'CR High PT\nmultiplier',\
             'CR High CT\nmultiplier', 'WR1 PT\nmultiplier', 'WR1 CT\nmultiplier', 'WR2 PT\nmultiplier',\
             'WR2 CT\nmultiplier', 'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier',\
             'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier']

utilities = ['Cary', 'Durham', 'Raleigh', 'Regional']

N_RDMS = 100
N_SOLNS = 69

'''
1 - Load DU factor files and DV files
'''
# change to your own filepath
rdm_factors_directory = '/yourFilePath/WaterPaths/TestFiles/'
rdm_dmp_filename = rdm_factors_directory + 'rdm_dmp_test_problem_reeval.csv'
rdm_utilities_filename = rdm_factors_directory + 'rdm_utilities_test_problem_reeval.csv'
rdm_watersources_filename = rdm_factors_directory + 'rdm_water_sources_test_problem_reeval.csv'

rdm_dmp = pd.read_csv(rdm_dmp_filename, sep=",", names=rdm_headers_dmp)
rdm_utilities = pd.read_csv(rdm_utilities_filename, sep=",", names=rdm_headers_utilities)
rdm_ws_all = np.loadtxt(rdm_watersources_filename, delimiter=",")
rdm_ws = pd.DataFrame(rdm_ws_all[:,:17], columns=rdm_headers_ws)

dufs = pd.concat([rdm_dmp, rdm_utilities, rdm_ws], axis=1, ignore_index=True)
duf_numpy = dufs.to_numpy()

# change to your own filepath
dv_directory = '/yourFilePath/WaterPaths/'
dv_filename = dv_directory + 'NC_dvs_all_noheader.csv'
dvs = np.loadtxt(dv_filename, delimiter=",")

'''
2 - Get bounds for DU factors and DVs
'''
duf_bounds = find_bounds(duf_numpy)
dv_bounds = find_bounds(dvs)

'''
3 - Load robustness file and objectives file
'''
# change to your own filepath
main_dir = '/yourFilePath/WaterPaths/post_processing/'

robustness_filename = main_dir + 'robustness_100_SOWs.csv'
robustness_arr = np.loadtxt(robustness_filename, delimiter=",")
robustness_df = pd.DataFrame(robustness_arr, columns=utilities)

objs_mean_rdm_filename = main_dir + 'meanObjs_acrossRDM.csv'
objs_mean_rdm_arr = np.loadtxt(objs_mean_rdm_filename, delimiter=",")
objs_mean_rdm_df = pd.DataFrame(objs_mean_rdm_arr, columns=obj_names)

objs_mean_soln_filename = main_dir + 'meanObjs_acrossSoln.csv'
objs_mean_soln_arr = np.loadtxt(objs_mean_soln_filename, delimiter=",")
objs_mean_soln_df = pd.DataFrame(objs_mean_soln_arr, columns=obj_names)

# to change  depending on whether DV or DUF is being analyzed
dec_vars = dvs
measured_outcomes = objs_mean_rdm_df
names = dv_names
mo_names = obj_names
bounds = dv_bounds
rdm = 'DV'
mode = 'objs'
###

delta_sensitivity(dec_vars, measured_outcomes, names, mo_names, bounds, rdm, mode)

The code above identifies the sensitivity of the average values of all sixty-nine performance objective sets over all 100 deeply-uncertain SOWs to the decision variables. This is why we use the meanObjs_acrossRDM.csv file – this file contains sixty-nine mean values of the performance objectives, where each set of performance objectives inversely maps to their corresponding portfolio of actions.

To identify the sensitivity of performance objectives to the DU factors, change lines 121 to 127 to the following:

# to change  depending on whether DV or DUF is being analyzed
dec_vars = duf_numpy[:100,:]
measured_outcomes = objs_mean_soln_df
names = duf_names
mo_names = obj_names
bounds = duf_bounds[:100,:]
rdm = 'DUF'
mode = 'objs'
###

The code above identifies the sensitivity of the average values of all twenty performance objectives over each of the sixty-nine different portfolios to the set of deeply uncertain hydroclimatic and demand scenarios. This is why we use the meanObjs_acrossSoln.csv file – this file contains one hundred mean values of the performance objectives, where each set of performance objectives inversely maps to their corresponding DU SOW.

Great job so far! Now let’s visualize the sensitivity of our output to our input parameters using heatmaps. Before being able to visualize each utility’s performance sensitivity, we must first organize the sensitivity indices of the decision variables into a file containing the indices over all objectives for each utility. The gather_delta.py script does this. Simply change the value of mode on line 11 to ‘DUF‘ to gather the indices for the DU factors.

"""
Created on Tue April 26 2022 16:12

@author: Lillian Bei Jia Lau

Gathers the delta sensitivity indices into files per utility
"""
import numpy as np
import pandas as pd

mode = 'DV'
main_dir = '/yourFilePath/WaterPaths/post_processing/delta_output_' + mode + '/'
utilities = ['_C', '_D', '_R', '_reg']
objs = ['REL', 'RF', 'INF_NPC', 'PFC', 'WCC']
utilities_full = ['Cary', 'Durham', 'Raleigh', 'Regional']

dv_names = ['RT_C', 'RT_D', 'RT_R', 'TT_D', 'TT_R', 'LMA_C', 'LMA_D', 'LMA_R',\
            'RC_C', 'RC_D', 'RC_R', 'IT_C', 'IT_D', 'IT_R', 'IP_C', 'IP_D', \
            'IP_R', 'INF_C', 'INF_D', 'INF_R']

duf_names = ['Cary restr. eff', 'Durham restr. eff', 'Raleigh restr. eff', 'Demand growth\nmultiplier',\
             'Bond term\nmultiplier', 'Bond interest\nrate multiplier', 'Infrastructure interest\nrate multiplier',\
             'Streamflow amp\nmultiplier', 'SCR PT\nmultiplier', 'SCR CT\nmultiplier', 'NRR PT\nmultiplier',\
             'NRR CT\nmultiplier', 'CR Low PT\nmultiplier', 'CR Low CT\nmultiplier', 'CR High PT\nmultiplier',\
             'CR High CT\nmultiplier', 'WR1 PT\nmultiplier', 'WR1 CT\nmultiplier', 'WR2 PT\nmultiplier',\
             'WR2 CT\nmultiplier', 'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier',\
             'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier']

s1_dv_cary = np.zeros((len(objs), len(dv_names)), dtype=float)
s1_dv_durham = np.zeros((len(objs), len(dv_names)), dtype=float)
s1_dv_raleigh = np.zeros((len(objs), len(dv_names)), dtype=float)
s1_dv_regional = np.zeros((len(objs), len(dv_names)), dtype=float)

s1_dv_dict = {
    '_C': s1_dv_cary,
    '_D': s1_dv_durham,
    '_R': s1_dv_raleigh,
    '_reg': s1_dv_regional
}

s1_duf_cary = np.zeros((len(objs), len(duf_names)), dtype=float)
s1_duf_durham = np.zeros((len(objs), len(duf_names)), dtype=float)
s1_duf_raleigh = np.zeros((len(objs), len(duf_names)), dtype=float)
s1_duf_regional = np.zeros((len(objs), len(duf_names)), dtype=float)

s1_duf_dict = {
    '_C': s1_duf_cary,
    '_D': s1_duf_durham,
    '_R': s1_duf_raleigh,
    '_reg': s1_duf_regional
}

for i in range(len(utilities)):
    s1_util = []
    hdrs = []
    if mode == 'DV':
        s1_util = s1_dv_dict[utilities[i]]
        hdrs = dv_names
    elif mode == 'DUF':
        s1_util = s1_duf_dict[utilities[i]]
        hdrs = duf_names

    for j in range(len(objs)):
        curr_file = main_dir + 'S1_' + objs[j] + utilities[i] + '.csv'
        s1_util[j, :] = pd.read_csv(curr_file, sep=',', skiprows=2, header=None).iloc[0,1:]

    s1_util_df = pd.DataFrame(s1_util, columns=hdrs)
    out_filepath = main_dir + utilities_full[i] + '.csv'

    s1_util_df.to_csv(out_filepath, sep=',', index=False)

Now, let’s plot our heatmaps! The code to do so can be found in sensitivity_heatmap.py, and should look similar to the code provided below:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import AxesGrid

sns.set_theme()

# change depending on compromise solution and whether its sensitivity to DUF or DVs
mode = 'DUF'
rot = 90    # if DV use 0; if DUF use 45
main_dir = '/YourFilePath/WaterPaths/post_processing/delta_output_' + mode + '/'
c_filepath = main_dir + 'Cary.csv'
d_filepath = main_dir + 'Durham.csv'
r_filepath = main_dir + 'Raleigh.csv'
reg_filepath = main_dir + 'Regional.csv'

cary = pd.read_csv(c_filepath, index_col=False, header=0)
durham = pd.read_csv(d_filepath, index_col=False, header=0)
raleigh = pd.read_csv(r_filepath, index_col=False, header=0)
regional = pd.read_csv(reg_filepath, index_col=False, header=0)

savefig_dir = '/YourFilePath/WaterPaths/post_processing/'
savefig_name = savefig_dir + 'dmi_heatmap_' + mode + '.svg'

grid_kws = {"height_ratios": (0.20, 0.20, 0.20, 0.20, .02), "hspace": 0.5}
f, (ax1, ax2, ax3, ax4, cbar_ax) = plt.subplots(5, figsize=(15, 20), gridspec_kw=grid_kws)
plt.subplots_adjust(top = 0.95, bottom = 0.05,
            hspace = 0, wspace = 0.05)

y_objs=['REL', 'RF', 'INPC', 'PFC', 'WCC']

x_dvs=['$RT_{C}$', '$RT_{D}$', '$RT_{R}$', '$TT_{D}$', '$TT_{R}$', '$LM_{C}$', '$LM_{D}$', '$LM_{R}$',\
                '$RC_{C}$', '$RC_{D}$', '$RC_{R}$', '$IT_{C}$', '$IT_{D}$', '$IT_{R}$', '$IP_{C}$', \
                '$IP_{D}$', '$IP_{R}$','$INF_{C}$', '$INF_{D}$', '$INF_{R}$']
x_dufs = ['Cary\nrestr. eff', 'Durham\nrestr. eff', 'Raleigh\nrestr. eff', 'Dem. growth\nmultiplier',\
             'Bond term\nmultiplier', 'Bond interest\nrate multiplier', 'Infra. interest\nrate multiplier',\
             'Streamflow amp\nmultiplier', 'SCR PT\nmultiplier', 'SCR CT\nmultiplier', 'NRR PT\nmultiplier',\
             'NRR CT\nmultiplier', 'CR Low PT\nmultiplier', 'CR Low CT\nmultiplier', 'CR High PT\nmultiplier',\
             'CR High CT\nmultiplier', 'WR1 PT\nmultiplier', 'WR1 CT\nmultiplier', 'WR2 PT\nmultiplier',\
             'WR2 CT\nmultiplier', 'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier',\
             'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier']

x_labs = []
if mode == 'DV':
    x_labs = x_dvs
elif mode == 'DUF':
    x_labs = x_dufs

plt.rc('xtick', labelsize=1)
plt.rc('ytick', labelsize=3)
plt.rc('axes', labelsize=5)
plt.rc('axes', titlesize=14)

#ax1 = fig.add_subplot(411)
ax1.set_title("Cary")
sns.heatmap(cary, linewidths=.05, cmap="YlOrBr", xticklabels=[],
    yticklabels=y_objs, ax=ax1, cbar=False)
ax1.set_yticklabels(y_objs, rotation=0)

#ax2 = fig.add_subplot(412)
ax2.set_title("Durham")
sns.heatmap(durham, linewidths=.05, cmap="YlOrBr", xticklabels=[],
    yticklabels=y_objs, ax=ax2, cbar=False)
ax2.set_yticklabels(y_objs, rotation=0)

#ax3 = fig.add_subplot(413)
ax3.set_title("Raleigh")
sns.heatmap(raleigh, linewidths=.05, cmap="YlOrBr", xticklabels=[],
    yticklabels=y_objs, ax=ax3, cbar=False)
ax3.set_yticklabels(y_objs, rotation=0)

#ax4 = fig.add_subplot(414, fontsize=10)
ax4.set_title("Regional")
ax4 = sns.heatmap(regional, linewidths=.05, cmap="YlOrBr", xticklabels=x_labs,
    yticklabels=y_objs, ax=ax4, cbar=True, cbar_ax=cbar_ax,
    cbar_kws={'orientation': 'horizontal'})     # change depending on whether analyzing DUF or DV
ax4.set_xticklabels(x_labs, rotation=rot, fontsize=10)
ax4.set_yticklabels(y_objs, rotation=0)

plt.savefig(savefig_name)

Running this for the sensitivity to decision variables and DU factors will generate the following images:

Sensitivity of performance objectives to decision variables.

In the figure above, the color of each box represents the sensitivity of a performance objective (y-axis) to a specific decision variable (x-axis). It is interesting to note that the restriction triggers (RT) of all utilities strongly influence each of their individual and regional reliability and restriction frequency. This indicates the potential for regional conflict, as possible errors is operating one utility’s restriction trigger may adversely affect other utilities’ reliabilities and ability to maintain full control over their own use of water-use restrictions. Furthermore, Raleigh’s performance is sensitive to more decision variables than its remaining two counterparts, with it’s worst-case cost (WCC) being affected most by Cary’s infrastructure investments. This observation highlights the importance of careful cooperation between a region’s member utilities to ensure that all partners play their part in maintaining both their own and their counterparts’ performance.

Sensitivity of performance objectives to DU factors.

In this next figure, we observe that uncertainty in demand growth is the only DU factor that significantly drives changes in individual and regional performance. This finding can thus help utilities to focus on demand management programs, or formulate operations and management policies that enable them to more quickly adapt to changes in consumer and industrial demand growth.

Overall, in this post, we have performed a simple sensitivity analysis to identify uncertainties in the decision variables and DU factors that control regional and individual performance. All the code for processing the output data can be found in this GitHub repository here. In the next post, we will end the MORDM blogpost series by performing scenario discovery to map regions of success and failure as defined by our robustness metrics.

References

Borgonovo, E. (2007). A new uncertainty importance measure. Reliability Engineering &Amp; System Safety, 92(6), 771-784. doi: 10.1016/j.ress.2006.04.015

Herman, J. D., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2015). How should robustness be defined for water systems planning under change? Journal of Water Resources Planning and Management, 141(10), 04015012. doi:10.1061/(asce)wr.1943-5452.0000509

Reed, P.M., Hadjimichael, A., Malek, K., Karimi, T., Vernon, C.R., Srikrishnan, V., Gupta, R.S., Gold, D.F., Lee, B., Keller, K., Thurber, T.B, & Rice, J.S. (2022). Addressing Uncertainty in Multisector Dynamics Research [Book]. Zenodo. https://doi.org/10.5281/zenodo.6110623

Using Python, Sphinx, and reStructuredText to Create a Book (and Introducing our eBook: Addressing Uncertainty in Multisector Dynamics Research!)

We are pleased to announce that we have published our free eBook on Addressing Uncertainty in Multisector Dynamics Research, a primer on uncertainty characterization methods for research on sustainability, climate, and human-natural systems: https://lnkd.in/d6AMPnM3

This eBook is the product of years of collaboration across several teams that brings in perspectives from natural sciences, decision making under (deep) uncertainty, software engineering, statistics, and risk analysis. The breakdown of the book is as follows (summarized by Antonia Hadjimichael):

Chapter 1 uses the IM3 project as a living lab to encapsulate the challenges that emerge in bridging disciplines to make consequential model-based insights while acknowledging the tremendous array of uncertainties that shape them.

Chapter 2 helps the reader to better understand the importance of using diagnostic modeling and the diverse disciplinary perspectives that exist on how best to pursue consequential model-based insights.

Chapter 3 is a technical tools-focused primer for readers on the key elements of uncertainty characterization that includes ensemble-based design of experiments, quantitative methods for computing global sensitivities, and a summary of existing software packages.

Chapter 4 narrates for readers how and why the tools from the previous chapter can be applied in a range of tasks from diagnosing model performance to formal exploratory modeling methods for making consequential model-based discoveries.

The supplemental appendices provide a glossary, a brief summary of uncertainty quantification tools, and a suite of Jupyter notebook tutorials that provide hands-on training tied to the contents of Chapters 3 and 4.

A central theme of this eBook is that it is a living document that will be actively maintained through its GitHub repository. If you work on exploratory modeling, sustainability, management of human-natural systems, complex systems or adjacent areas we’d love your feedback on how the document could be improved! You can open an issue here.

Over the next few months, we will be posting more about the process of making a book and extending out some of the Jupyter Notebook coding exercises to get readers more comfortable with adjusting the source code for their own use. 

In this post, Chris Vernon and I will be discussing how you can create a similar style of eBook or documentation using Python and Sphinx. Subsequent posts will contain more information on how the book is hosted and managed in GitHub. Please note that this is a brief overview of some of the aspects of the eBook, but for full details, the source code is available here. You can find the sourcefiles for this tutorial in this repo and download the index.html file to click through the HTML page interactively.

Introduction to Sphinx

Sphinx is a tool written in Python that can facilitate the creation of a variety of documentation. Sphinx will translate a set of reStructuredText (reST) files that are linked in a hierarchy defined in an index file. First download and install Sphinx as follows (Note: the following pip commands assume that your version of pip is pointing to the instance of Python you want to use):

$pip install -U sphinx

To quickly set up a directory, execute the following from your terminal or command line while in your desired parent directory:

$sphinx-quickstart

This will create a build and source directory along with a makefile (a Windows Batch File: make.bat), and within the source directory, you will find an index.rst file and a file called conf.py. The index.rst file is termed the root documentation and houses the table of contents which organizes the structure of the document.

Changing the Theme

The default Sphinx theme is “alabaster” which has minimalist look. There are many other built-in themes and examples of documentation that use these themes can be found here. We are using an external theme for the book (found here) that has a sleek design and more of a “book feel”. To download this theme, simply execute the following in your terminal or command line:

$pip install sphinx-book-theme

Then change the theme which is located in conf.py.

html_theme = ‘sphinx_book_theme’

Adding Content

To add content to our book, we need to write our text files using reStructuredText, which is the default plaintext markup language used by Sphinx. The syntax is similar to Latex and easy to pick up as you get used to it. We will first add front matter so that when the HTML page is accessed, the front of the book is not empty. We place the front matter in index.rst as follows using a “directive”. Directives begin with an explicit markup start (two periods and a space), followed by the directive type and two colons (collectively, the “directive marker”).  Here we use an epigraph directive to render a classic block quote.

Creating front matter

Let’s build our documentation to see what our book looks like so far. I use a terminal called Cygwin since I am working on a Windows computer and run the following:

$ sphinx-build source/ build/

I can then open the resulting index.html located in the build directory by dragging and dropping it into my favorite browser or simply double clicking the file. 

eBook homepage

Now let’s add more sections to my book. I’m going to create a preface, a single section of material, and a bibliography. I first create a text file in my source directory that I call preface.rst. I add text to this file as shown below.

Preface Content

Then I add a section to the index.rst file that tells Sphinx where the preface should be placed in the book through the toctree directive (“table of contents tree”). We customize the table of contents to contain multiple sections and thus adjust the default toctree using the raw directive.

Updated Table of Contents

Next we create a section of content. In this section, I will demonstrate paragraph text, citations, images, notes, and equations, which cover the gamut of what you’ll likely want to include in various parts of your book. Let’s first install the necessary extensions for the bibliography as follows:

$pip install sphinxcontrib-bibtex

We then must add the extension into config.py. I also place the name of the file that contains my references.

Adding bibliography extensions and file names to config.py

Below is an example of what refs.bib looks like along with the Bibliography file.

Biblography .rst file and associated Bibtex

R.Bibliography.rst and refs.bib must written in the source directory and R.Bibliography.rst should be referenced in the index.rst file as follows.

Updated Table of Contents

Now let’s start writing our section and then place the reference to it in our index.rst file as shown above. In this section, note the first line that denotes the reference to the section that is used in the index.rst file. We show how to place a figure in lines 14-20 and the syntax for in-text citations is shown in the figure caption.   

If we render the webpage (e.g., sphinx-build source/ build/) it looks like this:

Updated webpage

The bibliography has the one reference so far:

Bibliography as seen on the webpage

The last two components that I will demonstrate are equations and notes, which may come in handy for scientific communications. In order to write equations, we use the MathJax Sphinx extension which allows for the best display of math in HTML.

Updated extensions

In our same section, we add additional text to show the equation and note functionality.

Adding equations and notes

We can then render the webpage one last time.

Updated webpage

Well there you go, after completing this tutorial, you will create the foundations of a book. You can feel free to use these files as a base for your own project! Enjoy!   

Running a Python script using Excel macros

Why run a Python script through Excel? Why bother with the middleman when environments such as Spyder and Jupyter Notebook exists?

Something that I have been learning of late is the importance of diversifying methods of presenting one’s work in the spirit of open science, communicability and inclusion. In that line of thinking, running a Python script via an Excel ‘user interface’ addresses two issues:

  1. Excel VBA’s slower reading and writing of data
  2. The steep learning curve associated with learning how to code in Python

In short, executing Python via Excel provides those with sufficient experience in the language with an avenue to efficiently communicate and visualize their data in a way that most people can see and understand. This blog post will demonstrate this point by extracting the mean and standard deviation of the sepal length and width, as well as the petal length and width, of three different iris subspecies available from the famed Iris dataset, that can be found here.

Download the dataset, called ‘iris.data’ into a folder or location of your choice. Next, change the extension of the file to ‘.csv’. Let’s start processing this in Python!

1. Writing the Python script

Here, we will be writing a Python script to generate two files: the means (iris_means.csv) and standard deviations (iris_std.csv) of each iris subspecies’ attributes. We will first write the Python script to do so:

import pandas as pd

# the types of data within the iris dataset
col_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "subspecies"]

iris_data = pd.read_csv('C:/your-preferred-location/iris.csv', 
                        sep=',', header=None, names=col_names, index_col=None)

#%%
iris_setosa = iris_data[iris_data['subspecies']=='Iris-setosa']
iris_setosa=iris_setosa.drop(['subspecies'], axis=1)

iris_versicolor = iris_data[iris_data['subspecies']=='Iris-versicolor']
iris_versicolor=iris_versicolor.drop(['subspecies'], axis=1)

iris_virginica = iris_data[iris_data['subspecies']=='Iris-virginica']
iris_virginica=iris_virginica.drop(['subspecies'], axis=1)

#%%
mean_setosa = iris_setosa.mean(axis=0)
std_setosa = iris_setosa.std(axis=0)

mean_versicolor = iris_versicolor.mean(axis=0)
std_versicolor = iris_versicolor.std(axis=0)

mean_virginica = iris_virginica.mean(axis=0)
std_virginica = iris_virginica.std(axis=0)

subspecies = ['Setosa', 'Versicolor', 'Virginica']

mean_vals = pd.concat([mean_setosa, mean_versicolor, mean_virginica], axis=1)
mean_vals.columns=subspecies

std_vals = pd.concat([std_setosa, std_versicolor, std_virginica], axis=1)
std_vals.columns=subspecies

mean_vals.to_csv('C:/Users/your-preferred-location/iris_means.csv', 
                 sep=',', header=True, index=True)
std_vals.to_csv('C:/Users/your-preferred-location/iris_std.csv', 
                sep=',', header=True, index=True)

2. Write the Excel VBA macro

We will first set up the Excel document that will execute the Excel macro. Create an Excel worksheet file. Mine is named ‘iris_GUI.xlsx’. Next, navigate to the ‘File’ tab and select ‘Options’. Go to ‘Customize Ribbon’ and make sure that ‘Developer’ is checked:

Figure 1: Enabling the Developer tab.

Click ‘OK’. The developer tab should now be visible in your Toolbar Ribbon:

Figure 2: Where the Developer tab is located.

Let’s get to the macro! Under the Developer tab, identify the ‘Macros’ tool on the far-left side of the toolbar. Select it and give your macro a suitable name. I called mine ‘link_python_excel’.

Figure 3: Name and create your macro.

Once this is done, click ‘Create’. Next, you should see a window like this pop up:

Figure 4: This is where you will write your macro.

Within the provided space, first initialize the macro using Sub link_python_excel(). This tells Excel VBA (Excel’s programming language) that you are about to write a macro called ‘link_python_excel’.

Next, declare your macro as an object, and your Python executable and Python script as strings. This is to enable VBA to locate the Python executable and use it to run the script as intended.

Dim objShell As Object
Dim PythonExe, PythonScript As String

You will then want to assign a macro object to its declaration:

Set objShell = VBA.CreateObject("Wscript.shell")

Please do not tamper with the “Wscript.shell” term. This assignment is the portion of the code that enables the macro to interact with Windows PowerShell, thus enabling VBA to execute the Python script. More information on this matter can be found at this website.

Following this, provide the filepath to the Python executable and the Python script:

PythonExe = """C:\Users\lbl59\AppData\Local\Programs\Python\Python39\python.exe"""
PythonScript = "C:\Users\lbl59\Desktop\run_python_in_excel\process_iris_data.py"

Note the use of the triple quotation marks. This method of assigning a string in VBA is used when the string potentially contains spaces. It is generally considered good practice to use “””…””” for file paths.

Finally, run your Python script and activate your workbook. The activation is necessary if you would like to run the script via a button in Excel, which we shall be going through in a bit.

objShell.Run PythonExe & PythonScript
Application.Goto Reference:="link_python_excel"

Finally, don’t forget to end the macro using End Sub.

Overall, your script should look as such:

Sub link_python_excel()

' link_python_excel Macro
' Declare all variables
Dim objShell As Object
Dim PythonExe, PythonScript As String
    
    'Create a new Shell Object
    Set objShell = VBA.CreateObject("Wscript.shell")
        
    'Provide the file path to the Python Exe
    PythonExe = """C:\Users\lbl59\AppData\Local\Programs\Python\Python39\python.exe"""
        
    'Provide the file path to the Python script
    PythonScript = "C:\Users\lbl59\Desktop\run_python_in_excel\process_iris_data.py"
        
    'Run the Python script
    objShell.Run PythonExe & PythonScript
    Application.Goto Reference:="link_python_excel"
    
End Sub

3. Run the Python script for Excel

Save the macro. Note that you will have to save the Excel workbook as a ‘.xlsm’ file to enable macro functionality. Once this is done, navigate to the ‘Developer’ tab and select ‘Insert’ and click on the button icon.

Figure 5: Inserting a button into the Excel workbook.

Draw the button the same way you would a rectangular shape. Rename the button, if you so prefer. In this exercise, the button is labeled ‘Run Code’. Next, right click on the button and select ‘Assign Macro’.

Figure 6: Assign the macro.

Once this is selected, you should be able to see the option to add the ‘link_python_excel’ macro to the button. Select the macro, and you are done! The two output files should have been output into the same location where you stored your iris.csv dataset.

Summary

In this post, we walked through the steps of writing a Python script to be run using an Excel macro. First, a Python script was written to process the iris dataset and output two files. Next, the Excel macro was written to execute this script. Finally, the macro was assigned to a button in the Excel sheet, where the Python script can be executed when the button is clicked.

Hope you found this useful!

Clustering basics and a demonstration in clustering infrastructure pathways

Introduction

When approaching a new dataset, the complexity may be daunting. If patterns in the data are not readily apparent, one potential first-step would be to cluster the data and search for groupings. Clustering data, or cluster analysis, can reveal relationships between the observations and provide insight into the data.

Figure: Graphical representation of how clustering can be used to identify patterns in a data set.

Motivation

The outcome of a cluster analysis can be highly dependent upon the clustering algorithm being used, the clustering criteria, and the specified number of clusters to be found. Anticipating these influences and how to manipulate them will set you up for success in your data analysis.

In this post, I introduce some fundamental clustering algorithms, the hierarchical and K-Means clustering algorithms. Then I will share some important considerations, such as how to select a suitable number of clusters. Finally, I will demonstrate the power of clustering by applying the methods to an infrastructure pathways dataset, to cluster infrastructure pathways generated for 1,000 different states of the world into different sets of ‘light’, ‘moderate’, and ‘heavy’ infrastructure development.

Let’s begin!

Methods

Hierarchical Clustering

Hierarchical clustering can be performed “bottom-up”, through an agglomerative approach, which starts by placing each observation in its own cluster and then successively links the observations into increasingly larger clusters. Alternatively, the clustering can take a “top-down”, or divisive, approach which begins with all observations within a single cluster and partitions the set into smaller clusters until each observation exists within its own cluster.

Figure: A graphical representation of the hierarchical clustering process. Red circles denote clusters. If read from left to right, then the figure is representative of the agglomerative approach. If read from right to left, then the figure is representative of the divisive approach.

Specifying how clusters are preferentially assigned will influence the final shape and contents of the clusters. When performing hierarchical clustering, the user must specify the linkage criteria, which determines how the distance between clusters is measured, and consequently which clusters will be group during the next iteration. Three common linkage criteria are:

Linkage TypeDescriptionFormulation
Complete (max)the maximum distance between a group
and observations in another group.
max{ d(a, b) : a in A, b in B}
Single (min)the minimum distance between a group
and observations in another group.
max{ d(a, b) : a in A, b in B}
Averagethe average distance between a group and
observations in another group.
average{ d(a, b) : a in A, b in B}

Additionally, when calculating the linkage criteria, different methods of measuring the distance can be used. For example, when working with correlated data, it may be preferable to use the Mahalanobis distance over the regular Euclidean norm.

The results of the clustering analysis can be viewed through a dendrogram, which displays a tree-like representation of the cluster structure. For more detail on dendrograms, and a step-by-step tutorial for creating color-coded dendrograms in R, see Rohini’s (2018) post here.

K-Means Clustering

The K-means algorithm follows three simple steps.

  1. Selection of k centroids within the sample space.

In the traditional naïve K-Means algorithms, centroids do not correspond to specific observations within the space. They can be randomly selected, or chosen through more informed methods (see, for example, K-Means++)

2. Assignment of each observation to the cluster with the nearest centroid.

3. Re-define each cluster centroid as the mean of the data points assigned to that centroid.

Steps 2 and 3 of the algorithm are repeated until the centroids begin to stabilize, such that subsequent iterations produce centroids located within some small, tolerable distance from one another.

Figure: A graphical representation of the k-means clustering process, with 3 clusters. X’s denote the centroids for that respective cluster. The dashed arrows denote the shift in the location of the centroid during the next iteration. Points are colored according to their assigned cluster during that iteration.

The centroids toward which the algorithm converges may be a locally optimal clustering solution, and can be highly dependent upon the initial centroid selection. Often, this sensitivity is handled by performing the clustering processes several times, and choosing the set of clusters which yield the smallest within-cluster sum of squares.

Selecting the number of clusters

In some contexts, it is possible to know how many clusters are required prior to the analysis. For example, when working with Fisher’s famous iris flower data set, three clusters are needed because there are three types of flowers observed within the data.

In an exploratory context, however, the appropriate number of clusters may not be readily apparent. One method of determining the preferred number of clusters is by testing multiple different values and evaluating the separation distances between the clusters. If the number of clusters is well-chosen, then the clusters will be both well-spaced and the intra-cluster distance will be small.

The Silhouette Coefficient, or silhouette score, is one method of measuring the goodness of fit for a set of clusters, and takes into consideration both the spacing between clusters (inter-cluster distance) and the distance between points within a cluster (intra-cluster spacing). The Silhouette scores is defined as:

Where:

a = average intra-cluster distance,

b = average inter-cluster distance.

The silhouette score ranges from [-1, 1]. A value near +1 suggests that the clusters are far from one another, with a small distances within the cluster, and thus the chosen number of clusters is well-suited for the data. A value near -1, however, suggests that the clusters are poorly describing the patterns in the data.

Example: Clustering Infrastructure Pathways

In water supply planning contexts, infrastructure pathways describe the sequencing of infrastructure investment decisions throughout time, in response to changes in system (construction of new reservoirs, expansions to water treatment capacity etc.). Adaptive infrastructure pathways use a risk-of-failure (ROF) based set of policy rules to trigger new infrastructure development. Adaptive infrastructure pathways have been shown to effectively coordinate water infrastructure decisions in complex multi-actor systems (Zeff et al., 2016). While the benefits of this adaptive strategy are clear, ROF based pathways also bring new challenges for decision makers – rather than specifying an single sequence of infrastructure investments, an adaptive rule system will generate a unique sequence of investments for every future state of the world. To understand how a rule system behaves, decision makers must have tools to visualize an ensemble of infrastructure pathways. This is where cluster analysis comes in. Using clustering, we can group similar infrastructure sequences together, allowing decision makers to summarize how a candidate infrastructure investment policy will behave across many future states of the world.

Here, I analyze a set of infrastructure pathways generated for a hypothetical water utility with four infrastructure options:

  • small water treatment plant expansion
  • large water treatment plant expansion
  • construction of a small new run of river intake
  • construction of a large new run of river intake

The utility is examining a candidate policy and has generated pathways for 1,000 different states of the world, or realizations, which are each characterized by unique hydrologic and population dynamics. Given the 1,000 different realizations, each with a unique infrastructure pathway, I am interested in clustering the pathways according the the timing of the infrastructure construction decisions to understand how the policy behaves.

The dataset used for this demonstration is courtesy of David Gold, who is also responsible for providing significant portions of the following code and assistance along the way. Thanks David!

Follow along with this demonstration by downloading .txt data file HERE!

Let’s begin by loading in the data.

import pandas as pd

# Load in data
pathways_df = pd.read_csv('./ClusterData.txt', sep = '\t', header = None)

The data contains information describing the timing of infrastructure construction for 1,000 different simulated realizations. Each column represents a different infrastructure option, and each row is 1 of the 1,000 different simulated realizations. The contents of each cell denote a standardized value corresponding to the timing of the infrastructure construction within each realization, ranging from [0, 1]. Infrastructure options containing the value 1 were not built during the 45-year simulation period, in that specific realization.

RealizationInfra. Opt. #1Infra. Opt. #2Infra. Opt. #M
1W1,1W1,2W1,M
2W2,1W2,2W2,M
NWN,2WN,2WN,M
Table: Example output data from the infrastructure pathways optimization process; within each realization (state of the world (SOW)) infrastructure construction decisions are made during different weeks within the simulation period (45-years), considering a subset of infrastructure options for each utility. Wi,j denotes a standardized value corresponding to the timing within realization i that infrastructure option j is built.

With the data available, we can begin clustering!

Here, I take advantage of the scikit-learn.cluster module, which has both hierarchical and K-Means clustering capabilities.

For the purposes of this demonstration, I am going to only show the process of producing the hierarchical clusters… the curious reader may choose to explore alternative clustering methods available in the scikit-learn toolbox.

### Perform hierarchical clustering with 3 clusters ###

# Import the necessary tools
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import pairwise_distances_argmin

# Begin by assuming 3 clusters
num_clusters = 3

# Initialize the hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters = 3)

# Cluster the data
hierarchical.fit(cluster_input)  

# Produce a array which specifies the cluster of each pathway
hierarchical_labels = hierarchical.fit_predict(cluster_input)

In order to interpret the significance of these clusters, we want find a way to visualize differences in cluster behavior through the infrastructure pathways.

To do this, I am going to consider the median timing of each infrastructure option within each cluster, across all 1,000 realizations. Additionally, I want to specify sort the clusters according to some qualitative characteristic. In this case, I define pathways with early infrastructure investments as “heavy”, and pathways within infrastructure investment later in the period as “light”. The “moderate” infrastructure pathways are contained within the cluster between these these.

The following script calculates the median of the clustered pathways, and sorts them according to their infrastructure timing.

### Interpreting clusters via pathways ###

# Group realizations by pathway
# Calculate the median week each infra. opt. is constructed in each cluster
import numpy as np
cluster_pathways = []
cluster_medians = []

# Loop through clusters 
for i in range(0,num_clusters):
    
    # Select realizations contained within i-th cluster
    current_cluster = cluster_input[hierarchical_labels ==i, :]
    
    # Multiply by 2344 weeks to convert from [0,1] to [0,2344] weeks
    current_cluster = current_cluster*2344
    
    cluster_pathways.append(current_cluster)
    
    # Find the median infrastructure build week in each realization
    current_medians = np.zeros(len(current_cluster[0,:]))
    
    # Loop through infrastructure options and calculate medians
    for j in range(0, len(current_cluster[0,:])):
        current_medians[j] = np.median(current_cluster[:,j])
        
    cluster_medians.append(current_medians)

# Sort clusters by average of median build week 
# to get light, moderate, and heavy infrastructure clusters
cluster_means = np.mean(cluster_medians, axis = 1)

sorted_indices = np.argsort(cluster_means)

# Re-order cluster medians according to sorted mean build week
cluster_medians = np.vstack((cluster_medians[sorted_indices[2]], 
                            cluster_medians[sorted_indices[1]],
                            cluster_medians[sorted_indices[0]]))

With the median timing of each infrastructure option specified for each cluster, I can begin the process of visualizing the behavior. The function below plots a single infrastructure pathway:

import numpy as np
import matplotlib.pyplot as plt

def plot_single_pathway(cluster_medians, cluster_pathways, inf_options_idx,
                        c, inf_names, y_offset, ylabeling, xlabeling, 
                        plot_legend, ax):
    
    """
    Makes a plot of an infrastructure pathway.
    
    PARAMETERS:
        cluster_medians: an array with median weeks each option is built
        cluster_pathways: an array with every pathway in the cluster
        inf_options: an array with numbers representing each option (y-axis vals)
        should start at zero to represent the "baseline"
        c: color to plot pathway
        y_offset: distance to offset each pathway from 0 (easier to visualize stacked paths)
        ylabeling: labeling for infrastructure options
        xlabeling: labeling x axis
        plot_legend: (bool) to turn on or off
        ax: axes object to plot pathway
    """
    
    # Extract all non-baseline infrastructure options
    inf_options_idx_no_baseline=inf_options_idx[1:]
    sorted_inf = np.argsort(cluster_medians)
    
    # Plot Pathways
    # create arrays to plot the pathway lines. To ensure pathways have corners 
    # we need an array to have length 2*num_inf_options
    pathway_x = np.zeros(len(cluster_medians)*2+2)
    pathway_y = np.zeros(len(cluster_medians)*2+2)
    
    # To have corners, each inf. opt. must be located
    # in both the cell it is triggered, and adjacent cell
    
    cluster_medians = np.rint(cluster_medians/45)
    for i in range(0,len(cluster_medians)):
        for j in [1,2]:
            pathway_x[i*2+j] = cluster_medians[sorted_inf[i]]
            pathway_y[i*2+j+1] = inf_options_idx_no_baseline[sorted_inf[i]]

    # Set the end of the pathway at year 45
    pathway_x[-1] = 45

    # plot the pathway line    
    ax.plot(pathway_x, pathway_y + y_offset, color=c, linewidth=5, 
            alpha = .9, zorder=1)

    # Formatting plot elements 
    ax.set_xlim([0,44])
    inf_name_lables = inf_names
    inf_options_idx = np.arange(len(inf_name_lables))
    ax.set_yticks(inf_options_idx)
    ax.set_xlabel('Week')
    
    if ylabeling:
        ax.set_yticklabels(inf_name_lables)
    else:
        ax.set_yticklabels([])
    
    ax.set_ylim([-0.5,len(cluster_medians)+.5])

    plt.show()

The above script can be used to plot the median pathway for a single cluster, but it would be much more worthwhile to compare the median pathways of the different clusters against one another. To do this, we can define another function which will use the plot_single_pathway() iteratively to plot each of the median cluster pathways on the same figure.

import numpy as np
import matplotlib.pyplot as plt
from plot_single_pathway import plot_single_pathway

def plot_many_paths(clust_meds, clust_pathways, n_clusters, cluster_colors, fig, 
                    gspec, fig_col, ylabeling, plot_legend):
    
    
    y_offsets = [-.15, 0, .15]
    
    # Specific information for Utility
    inf_options_idx = np.array([0, 1,2, 3, 4])
    utility_inf = ['Baseline', 'Small intake expans.', 'Large intake expans.',
                    'Small WTP expans.', 
                    'Large WTP expans.']
    
    ax = fig.add_subplot(gspec[0,0])
    
    for i in np.arange(n_clusters):
        plot_single_pathway(clust_meds[i], clust_pathways[i], inf_options_idx, cluster_colors[i],
                            utility_inf, y_offsets[i], ylabeling, False, True, ax)
        plt.show()
    
    if plot_legend and (n_clusters == 3):
        ax.legend(['Light inf.', 'Moderat inf.', 'Heavy inf.'], 
                  loc='upper left')
    elif plot_legend and (n_clusters == 2):
        ax.legend(['Light inf.', 'Heavy inf.'], loc='upper left')
        
    ax.tick_params(axis = "y", which = "both", left = False, right = False)
    
    plt.show()
    return fig

Time for the reveal! Execution of this final portion of the script will plot each of the three clusters, and color-code them according to classifications as “heavy”, “moderate”, or “light” infrastructure.

### Plotting ###
from plot_many_paths import plot_many_paths

fig = plt.figure(figsize = (5,5))
gspec = fig.add_gridspec(ncols=1, nrows=1)

# Colorscale
heavy_inf_color = '#132a13'
mid_inf_color = '#4f772d'
light_inf_color = '#90a955'
cluster_colors = [light_inf_color, mid_inf_color, heavy_inf_color]

plot_many_paths(cluster_medians, cluster_pathways, 3, cluster_colors, 
                fig, gspec, 0, True, True)
Figure: Median infrastructure pathways for clustered pathways.

Voilà!

The figure above shows a clear contrast in the behavior of the light and heavy infrastructure pathways; the heavy infrastructure pathway is not only characterized by early infrastructure construction, but also by more frequent construction throughout the simulation period, in comparison to the median of the light infrastructure cluster which only requires a single infrastructure expansion within the same simulation period.

From the plot above, it is not apparent that three clusters are necessarily appropriate for this data set. Although the heavy and light infrastructure clusters are clearly demarcated from one another, the moderate infrastructure cluster has characteristics of both: a late first-expansion similar to the light infrastructure cluster, but a high frequency of construction throughout the period similar to the heavy infrastructure. Should the moderate infrastructure really be a separate cluster?

Let’s take advantage of the knowledge we developed earlier, and consider the silhouette scores corresponding to different numbers of clusters! Fortunately, scikit-learn has a suite of tools that made the silhouette analysis very easy.

In the following script, I evaluate the silhouette score associated with clustering for 2, 3, and 4 different clusters.

### Performing silhouette analysis ###
from sklearn.metrics import silhouette_score

num_clust = [2, 3, 4]

# Set up hierarchical clustering with different numbers of clusters
ac2 = AgglomerativeClustering(n_clusters = num_clust[0])
ac3 = AgglomerativeClustering(n_clusters = num_clust[1])
ac4 = AgglomerativeClustering(n_clusters = num_clust[2])

# Extract the silhouette score for each hierarchical grouping
silhouette_scores = []
silhouette_scores.append(silhouette_score(cluster_input, ac2.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac3.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac4.fit_predict(cluster_input)))
                                                  
plt.bar(num_clust, silhouette_scores)
plt.xlabel('Number of clusters', fontsize = 14)
plt.ylabel('Silhouette Score', fontsize = 14)
plt.xticks(num_clust)
plt.title("Silhouette analysis")
plt.show()
Figure: Silhouette analysis for the hierarchical clustering of infrastructure pathways.

Interestingly, the silhouette score was highest when the clustering was performed using only two clusters, suggesting that the moderate infrastructure class may not have been sufficiently unique to warrant a separate cluster.

Another method of visualizing the cluster structure is through the use of dendrograms. Below, I show dendrograms color-coded with only two clusters (left) and using three clusters (right).

Figure: Dendrograms showing the cluster structure of infrastructure pathways for a utility, using 2 (left) and 3 (right) clusters.

The dendrogram containing only two clusters (left) reveals that the moderate and light clusters can be combined into a single cluster, because the moderate cluster is less-different from the light infrastructure cluster than it is from the heavy infrastructure cluster. Although, it is important to note that dendrograms are not a reliable method of determining the number of clusters on their own. In fact, given that the difference in silhouette scores between the 2-cluster and 3-cluster hierarchies is relatively modest (~0.05) there is not sufficiently strong justification to choose the 2 clusters over 3 clusters.

For the sake of comparison, here are the 2-cluster and 3-cluster median pathway plots:

Conclusion

I hope you’ve gained some insight into clustering, infrastructure pathways (or preferably both) in this post.

Analysis of infrastructure pathways via statistical clustering presents a unique application of a very common statistical tool. In this analysis I clustered the pathways with respect to the timing of the infrastructure expansions, however there is potential for a much more in-depth analysis of how the hydrologic or urban contexts of each realization shape these infrastructure pathways… possibly justifying another blog post at a later date. Stay tuned!

Thanks again to David Gold, for providing the data and the foundational python code.

References

Zeff, H. B., Herman, J. D., Reed, P. M., & Characklis, G. W. (2016). Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways. Water Resources Research52(9), 7327-7346.

MORDM VII: Optimality, robustness, and reevaluation under deep uncertainty

In the previous MORDM post, we visualized the reference set of performance objectives for the North Carolina Research Triangle and conducted a preliminary multi-criterion robustness analysis using two criteria: (1) regional reliability should be at least 98%, and (2) regional restriction frequency should be not more than 20%. Using these metrics, we found that Pareto-optimality does not guarantee satisfactory robustness, a statement that is justified by showing that not all portfolios within the reference set satisfied the robustness criteria.

In this post, we will explain the differences between optimality and robustness, and justify the importance of robust optimization instead of sole reliance on a set of optimal solutions (aka an optimal portfolio). To demonstrate the differences better, we will also reevaluate the Pareto optimal set of solutions under a more challenging set of states of the world (SOWs), a method first used in Herman et al (2013, 2014) and Zeff et al (2014). The formal term for this method, Deeply Uncertain (DU) Re-evaluation was coined in a 2019 paper by Trindade et al.

Optimality vs robustness

The descriptions of optimality are many. From a purely technical perspective, a Pareto optimal set is a set of decision variables or solutions that maps to a Pareto front, or a set of performance objectives where improving one objective cannot be improved without causing performance degradation in another. For the purposes of this blog post, we shall use the definition of optimality as laid out by Beyer and Sendoff in their 2007 paper:

The global optimal design…depends on the…(objective) functions and constraints…however, these functions always represent models and/or approximations of the real world.

Beyer and Sendhoff (2007)

In other words, a Pareto reference set is only optimal within the bounds of the model it was generated from. This makes sense; models are only approximations of the real world. It is difficult and computationally expensive to have bounds on the degree of certainty to which the model optimum maps to the true optimum due to uncertainties driven by human action, natural variability, and incomplete knowledge. Optimization is static in relation to reality – the set of solutions found do not change with time, and only account for the conditions within the model itself. Any deviation from this set of solutions or unaccounted differences between the actual system and model may result in failure (Herman et al, 2015; Read et al 2014).

This is why searching the set of optimal solutions for robust solutions is important. Herman et al (2015) quotes an earlier works by Matalas and Fiering (1977) that defines robustness as the insensitivity a system’s portfolio to uncertainty. Within the MORDM context, robustness was found to be best defined using the multi-criterion satisficing robustness measure (Herman et al, 2015), which refers to the ability of a solution to meet one or more requirements (or criteria) set by the decision-makers when evaluated under a set of challenging scenarios. More information on alternative robustness measures can be found here.

In this blog post, we will begin to explore this concept of robustness by conducting DU Re-evaluation, where we will perform the following steps:

Generate a set of ROF tables from a more challenging set of SOWs

Recall that we previously stored our Pareto-optimal solution set in a .csv file names ‘NC_refset.csv’ (find the original Git Repository here). Now, we will write a quick Python script (called rof_tables_reeval.py in the Git Repository) using MPI4PY that will parallelize and speed up the ROF table generation and the bash script to submit the job. More information on parallelization using MPI4PY can be found in this handy blog post by Dave Gold.

First, create a Python virtual environment within the folder where all your sourcecode is kept and activate the virtual environment. I called mine python3_venv:

python3 -m venv python_venv
source python_venv/bin/activate

Next, install the numpy and mpi4py libraries:

pip install numpy mpi4py

Then write the Python script as follows:

# -*- coding: utf-8 -*-
"""
Created on Tues March 1 2022 16:16
@author: Lillian Bei Jia Lau
"""

from mpi4py import MPI
import numpy as np
import subprocess, sys, time
import os

# 5 nodes, 50 RDMs per node
# 16 tasks per node
# 5 RDMs per task
comm = MPI.COMM_WORLD
rank = comm.Get_rank() # up to 20 processes
print('rank = ', rank)

N_RDMs_needed = 100
N_REALIZATIONS = 100
N_RDM_PER_NODE = 20
N_TASKS_PER_NODE = 10 
N_RDM_PER_TASK = 2 # each task handles two RDMs
N_TASKS = 50 # rank ranges from 0 to 50

DATA_DIR = "/scratch/lbl59/blog/WaterPaths/"
SOLS_FILE_NAME = "NC_dvs_all_noheader.csv"
N_SOLS = 1

OMP_NUM_THREADS = 32

for i in range(N_RDM_PER_TASK):
    current_RDM = rank + (N_TASKS * i)

    command_gen_tables = "./waterpaths -T {} -t 2344 -r {} -d {} -C 1 -O rof_tables_reeval/rdm_{} -e 0 \
            -U TestFiles/rdm_utilities_test_problem_reeval.csv \
            -W TestFiles/rdm_water_sources_test_problem_reeval.csv \
            -P TestFiles/rdm_dmp_test_problem_reeval.csv \
            -s {} -f 0 -l {} -R {}\
            -p false".format(OMP_NUM_THREADS, N_REALIZATIONS, DATA_DIR, current_RDM, SOLS_FILE_NAME, N_SOLS, current_RDM)

    print(command_gen_tables)
    os.system(command_gen_tables)

comm.Barrier()

Before proceeding, a quick explanation on what all this means:

  • Line 12: We are parallelizing this job across 5 nodes on Cornell’s THECUBE (The Cube) computing cluster.
  • Lines 19-28: On each node, we are submitting 10 tasks to each of the 5 nodes requested. Each task, in turn, is handling 2 robust decision-making (RDM) multiplier files that scale up or scale down a hydroclimatic realization make a state of the world more (or less) challenging. In this submission, we are creating 400 different variations of each hydroclimatic scenario using the 400 RDM files, and running it across only one solution
  • Line 16 and 31: The ‘rank’ is the order of the tasks in which there are submitted. Since there are 10 tasks over 5 nodes, there will be a total of 50 tasks being submitted. Note and understand how the current_RDM is calculated.
  • Lines 32 to 37: This is the command that you are going to submit to The Cube. Note the -C and -O flags; a value of 1 for the -C flag tells WaterPaths to generate ROF tables, and the -O tells WaterPaths to output each ROF table file into a folder within rof_tables_reeval/rdm_{}for each RDM. Feel free to change the filenames as you see fit.

To accompany this script, first create the following folders: output, out_reeval, and rof_tables_reeval. The output folder will contain the simulation results from running the 1 solution across the 100 hydroclimatic realizations. The out_reeval folder will store any output or error messages such as script runtime.

Then, write the following bash submission script:

#!/bin/bash
#SBATCH -n 50 -N 5 -p normal
#SBATCH --job-name=rof_tables_reeval
#SBATCH --output=out_reeval/rof_tables_reeval.out
#SBATCH --error=out_reeval/rof_tables_reeval.err
#SBATCH --time=200:00:00
#SBATCH --mail-user=lbl59@cornell.edu
#SBATCH --mail-type=all

export OMP_NUM_THREADS=32

module load openmpi3/3.1.4
module spider py3-mpi4py
module spider py3-numpy/1.15.3

START="$(date +%s)"

mpirun python3 rof_tables_reeval.py

DURATION=$[ $(date +%s) - ${START} ]

echo ${DURATION}

You can find the bash script under the filename rof_table_gen_reeval.sh. Finally, submit the script using the following line:

sbatch ./rof_table_gen_reeval.sh

The run should take roughly 5 hours. We’re good for some time!

Re-evaluate your solutions (and possibly your life choices, while you’re at it)

Once the ROF tables are generated, it’s time to get a little hands-on with the underlying WaterPaths code. Navigate to the following PaperTestProblem.cpp file using:
cd /yourfilepath/WaterPaths/src/Problem/

Follow the next steps carefully.

  1. Delete PaperTestProblem.cpp and replace it with the file PaperTestProblem-reeval.cpp. It can be found in the the main Git Repository.
  2. Rename the latter file PaperTestProblem.cpp – it will be the new PaperTestProblem file that will be able to individually read each RDM scenario’s ROF tables.
  3. Re-make WaterPaths by calling make clean and then make gcc in the command line. This will ensure that WaterPaths has no problems running the new PaperTestProblem.cpp file.

Next, write the following Python script (called run_du_reeval.py in the Git repository):

# -*- coding: utf-8 -*-
"""
Created on Tues March 1 2022 16:16

@author: Lillian Bei Jia Lau
"""

from mpi4py import MPI
import numpy as np
import subprocess, sys, time
import os

N_RDMs_needed = 100  # N_TASKS_PER_NODE * N_RDM_PER_TASK * num nodes
N_REALIZATIONS = 100
N_RDM_PER_NODE = 20
N_TASKS_PER_NODE = 10 # rank ranges from 0 to 15
N_RDM_PER_TASK = 2 # each task handles five RDMs
N_TASKS = 50

comm = MPI.COMM_WORLD
rank = comm.Get_rank() # up to 20 processes
print('rank = ', rank)

DATA_DIR = "/scratch/lbl59/blog/WaterPaths/"
SOLS_FILE_NAME = "NC_dvs_all_noheader.csv"
N_SOLS = 69
OMP_NUM_THREADS = 32

for i in range(N_RDM_PER_TASK):
    current_RDM = rank + (N_TASKS * i)

    command_run_rdm = "./waterpaths -T {} -t 2344 -r {} -d {} -C -1 -O rof_tables_reeval/rdm_{} -e 0 \
            -U TestFiles/rdm_utilities_test_problem_reeval.csv \
            -W TestFiles/rdm_water_sources_test_problem_reeval.csv \
            -P TestFiles/rdm_dmp_test_problem_reeval.csv \
            -s {} -R {} -f 0 -l 69\
            -p false".format(OMP_NUM_THREADS, N_REALIZATIONS, DATA_DIR, \
                    current_RDM , SOLS_FILE_NAME, current_RDM)

    print(command_run_rdm)
    os.system(command_run_rdm)

comm.Barrier()

Note the change in the -C flag; its value is now -1, telling WaterPaths that it should import the ROF table values from the folder indicated by the -O flag. The resulting objective values for each RDM will be saved in the output folder we previously made.

The accompanying bash script, named du_reeval.sh is as follows:

#!/bin/bash
#SBATCH -n 50 -N 5 -p normal
#SBATCH --job-name=mordm_training_du_reeval
#SBATCH --output=out_reeval/mordm_training_du_reeval.out
#SBATCH --error=out_reeval/mordm_training_du_reeval.err
#SBATCH --time=200:00:00
#SBATCH --mail-user=lbl59@cornell.edu
#SBATCH --mail-type=all

export OMP_NUM_THREADS=32
module load openmpi3/3.1.4
module spider py3-numpy/1.15.3

START="$(date +%s)"

mpirun python3 run_du_reeval.py

DURATION=$[ $(date +%s) - ${START} ]

echo ${DURATION}

This run should take approximately three to four days. After that, you will have 1000 files containing 69 objective value sets resulting from running the 69 solutions across 1000 deeply-uncertain states of the world.

Summary

In this post, we defined optimality and robustness. We demonstrated how to run a DU re-evaluation across 100 challenging SOWs to observe how these ‘optimal’ solutions perform in more extreme scenarios. This is done to show that optimality is bound by current model states, and any deviation from the expected circumstances as defined by the model may lead to degradations in performance.

In the next blog post, we will be visualizing these changes in performance using a combination of sensitivity analysis, scenario discovery, and tradeoff analysis.

References

Beyer, H. and Sendhoff, B., 2007. Robust optimization – A comprehensive survey. Computer Methods in Applied Mechanics and Engineering, 196(33-34), pp.3190-3218.

Herman, J., Reed, P., Zeff, H. and Characklis, G., 2015. How Should Robustness Be Defined for Water Systems Planning under Change?. Journal of Water Resources Planning and Management, 141(10), p.04015012.

Herman, J., Zeff, H., Reed, P. and Characklis, G., 2014. Beyond optimality: Multistakeholder robustness tradeoffs for regional water portfolio planning under deep uncertainty. Water Resources Research, 50(10), pp.7692-7713.

Matalas, N. C., and Fiering, M. B. (1977). “Water-resource systems planning.” Climate, climatic change, and water supply, studies in geophysics, National Academy of Sciences, Washington, DC, 99–110.

Read, L., Madani, K. and Inanloo, B., 2014. Optimality versus stability in water resource allocation. Journal of Environmental Management, 133, pp.343-354.

Zeff, H., Kasprzyk, J., Herman, J., Reed, P. and Characklis, G., 2014. Navigating financial and supply reliability tradeoffs in regional drought management portfolios. Water Resources Research, 50(6), pp.4906-4923.

A non-intimidating introduction to parallel computing with Numba

This blog post is adapted from material I learned during the 2021 San Diego Supercomputer Center (SDSC) Summer Institute. This was an introductory boot camp to high-performance computing (HPC), and one of the modules taught the application of Numba for in-line parallelization and speeding up of Python code.

What is Numba?

According to its official web page, Numba is a just-in-time (JIT) compiler that translates subsets of Python and NumPy code into fast machine code, enabling it to run at speeds approaching that of C or Fortran. This is becuase JIT compilation enables specific lines of code to be compiled or activated only when necessary. Numba also makes use of cache memory to generate and store the compiled version of all data types entered to a specific function, which eliminates the need for recompilation every time the same data type is called when a function is run.

This blog post will demonstrate a simple examples of using Numba and its most commonly-used decorator, @jit, via Jupyter Notebook. The Binder file containing all the executable code can be found here.

Note: The ‘@‘ flag is used to indicate the use of a decorator

Installing Numba and Setting up the Jupyter Notebook

First, in your command prompt, enter:

pip install numba

Alternatively, you can also use:

conda install numba

Next, import Numba:

import numpy as np
import numba
from numba import jit
from numba import vectorize

Great! Now let’s move onto using the @jit decorator.

Using @jit for executing functions on the CPU

The @jit decorator works best on numerical functions that use NumPy. It has two modes: nopython mode and object mode. Setting nopython=True tell the compiler to overlook the involvement of the Python interpreter when running the entire decorated function. This setting leads to the best performance. However, in the case when:

  1. nopython=True fails
  2. nopython=False, or
  3. nopython is not set at all

the compiler defaults to object mode. Then, Numba will manually identify loops that it can compile into functions to be run in machine code, and will run the remaining code in the interpreter.

Here, @jit is demonstrated on a simple matrix multiplication function:

# a function that does multiple matrix multiplication
@jit(nopython=True)
def matrix_multiplication(A, x):
    b = np.empty(shape=(x.shape[0],1), dtype=np.float64)
    for i in range(x.shape[0]):
        b[i] = np.dot(A[i,:], x)
    return b

Remember – the use of @jit means that this function has not been compiled yet! Compilation only happens when you call the function:

A = np.random.rand(10, 10)
x = np.random.rand(10, 1)
a_complicated_function(A,x)

But how much faster is Numba really? To find out, some benchmarking is in order. Jupyter Notebook has a handy function called %timeit that runs simple functions many times in a loop to get their average execution time, that can be used as follows:

%timeit matrix_multiplication(A,x)

# 11.4 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Numba has a special .py_func attribute that effectively allows the decorated function to run as the original uncompiled Python function. Using this to compare its runtime to that of the decorated version,

%timeit matrix_multiplication.py_func(A,x)

# 35.5 µs ± 3.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

From here, you can see that the Numba version runs about 3 times faster than using only NumPy arrays. In addition to this, Numba also supports tuples, integers, floats, and Python lists. All other Python features supported by Numba can be found here.

Besides explicitly declaring @jit at the start of a function, Numba makes it simple to turn a NumPy function into a Numba function by attaching jit(nopython=True) to the original function. This essentially uses the @jit decorator as a function. The function to calculate absolute percentage relative error demonstrates how this is done:

# Calculate percentage relative error
def numpy_re(x, true):
    return np.abs(((x - true)/true))*100

numba_re = jit(nopython=True)(numpy_re)

And we can see how the Number version is faster:

%timeit numpy_re(x, 0.66)
%timeit numba_re(x, 0.66)

where the NumPy version takes approximately 2.61 microseconds to run, while the Numba version takes 687 nanoseconds.

Inline parallelization with Numba

The @jit decorator can also be used to enable inline parallelization by setting its parallelization pass parallel=True. Parallelization in Numba is done via multi-threading, which essentially creates threads of code that are distributed over all the available CPU cores. An example of this can be seen in the code snippet below, describing a function that calculates the normal distribution of a set of data with a given mean and standard deviation:

SQRT_2PI = np.sqrt(2 * np.pi)

@jit(nopython=True, parallel=True)
def normals(x, means, sds):
    n = means.shape[0]
    result = np.exp(-0.5*((x - means)/sds)**2)
    return (1 / (sds * np.sqrt(2*np.pi))) * result

As usual, the function must be compiled:

means = np.random.uniform(-1,1, size=10**8)
sds = np.random.uniform(0.1, 0.2, size=10**8)

normals(0.6, means, sds)

To appreciate the speed-up that Numba’s multi-threading provides, compare the runtime for this with:

  1. A decorated version of the function with a disabled parallel pass
  2. The uncompiled, original NumPy function

The first example can be timed by:

normals_deco_nothread = jit(nopython=True)(normals.py_func)
%timeit normals_deco_nothread(0.6, means, sds)

# 3.24 s ± 757 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The first line of the code snippet first makes an uncompiled copy of the normals function, and then applies the @jit decorator to it. This effectively creates a version of normals that uses @jit, but is not multi-threaded. This run of the function took approximately 3.3 seconds.

For the second example, simply:

%timeit normals.py_func(0.6, means, sds)

# 7.38 s ± 759 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now, compare both these examples to the runtime of the decorated and multi-threaded normals function:

%timeit normals(0.6, means, sds)

# 933 ms ± 155 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The decorated, multi-threaded function is significantly faster (933 ms) than the decorated function without multi-threading (3.24 s), which in turn is faster than the uncompiled original NumPy function (7.38 s). However, the degree of speed-up may vary depending on the number of CPUs that the machine has available.

Summary

In general, the improvements achieved by using Numba on top of NumPy functions are marginal for simple, few-loop functions. Nevertheless, Numba is particularly useful for large datasets or high-dimensional arrays that require a large number of loops, and would benefit from the one-and-done compilation that it enables. For more information on using Numba, please refer to its official web page.

Basics of data visualization with ggplot2

Basics of data visualization with ggplot2

In my previous post, I showed how wonderful the ggplot2 library in R is for visualizing complex networks. I realized that while there are several posts on this blog going over the advanced visualization capabilities of the ggplot2 library, there isn’t a primer on structuring code for creating graphs in R…yet. In this post, I will go over the syntax for creating pretty ggplot2 graphs and tweaking various parameters. I am a self-declared Python aficionado, but love using ggplot2 because it is intuitive to use, beginner-friendly, and highly customizable all at the same time.

Dataset and setup

For this tutorial, I will be using one of the built-in datasets in R called mtcars which was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Further documentation on this dataset can be found here. We import the data into our RStudio workspace.

# import the library into our workspace
library(ggplot2)

# import dataset
data(mtcars)
head(mtcars)

The resultant dataset looks something like this.

Basic plot

Now that we have the data, we can get to plotting with ggplot2. We can declaratively create graphics using this library. We just have to provide the data, specify how to map properties to graph aesthetics, and the library takes care of the rest for us! We need to specify three things for each ggplot — 1) the data, 2) the aesthetics, and 3) the geometry.

Let us start by creating a basic scatterplot of the mileage (mpg) of each car as a function of its horsepower (hp). In this case the data is our dataframe mtcars, and the aesthetics x and y will be defined as the names of the columns we wish to plot along each axis — hp and mpg. We can also set the color aesthetic to indicate the number of cylinders (cyl) in each car. One of the reasons ggplot2 is so user-friendly is because each graph property can be tacked on to the same line of code with a + sign. Since we want a scatterplot, the geometry will be defined using geom_point().

# basic scatterplot
g <- ggplot(data = mtcars, aes(x = hp, y = mpg, color=cyl))
g + geom_point()

Excellent! The library automatically assigns the column names as axis labels, and uses the default theme and colors, but all of this can be modified to suit our tastes and to create pretty graphs. It is also important to note that we could have visualized the same data (less helpfully) as a line plot instead of a scatterplot, just by tweaking the geometry function.

# basic line plot
g + geom_line()

Well, this looks unpleasant. But wait, we can do so much more. We can also layer multiple geometries on the same graph to make more interesting plots.

# basic scatter+line plot
g + geom_line() + geom_point()

Additionally, we can tweak the geometry properties in each graph. Here is how we can transform the lines to dotted, and specify line widths and marker shape and size.

# change properties of geometry
g + geom_point(shape = "diamond", size = 3) +
  geom_line(color = "black", linetype = "dotted", size = .3) 

While our graph looks much neater now, using a line plot is actually pretty unhelpful for our dataset since each data point is a separate car. We will stick with a scatterplot for the rest of this tutorial. However, the above sort of graph would work great for time series data or other data that measures change in one variable.

Axis labels

One of the cardinal rules of good data visualization is to add axis labels to your graphs. While R automatically set the axis labels to be column headers, we can override this to make the axis labels more informative with just one extra function.

# change axis titles
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)")

Title

This graph is in serious need of a title to provide a reader some idea of what they’re looking at. There are actually multiple ways to add a graph title here, but I find it easiest to use ggtitle().

# add title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") 

Alright, having a title is helpful, but I don’t love it’s placement on the graph. R automatically left-aligns the title, where it clashes with the y-axis. I would much rather have the title right-aligned, in a bigger font, and bolded. Here is how to do that.

# change position of title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

Theme

There are ways to manually change the background and gridlines of ggplot2 graphs using theme(), but an easy workaround is to use the built-in themes. Which theme you use depends greatly on the graph type and formatting guidelines, but I personally like a white background, faint gridlines, and a bounding box. One thing to note here though is that theme_bw() overrides theme() so the order of these two matters.

# add theme
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

We can also use the theme() function to change the base font size and font family. Shown below is how to increase the base font size to 15 and change the base font family to Courier.

# use theme to change base font family and font size
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme_bw(base_size = 15, base_family = "Courier") +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

Legend

It has been bothering me for the last seven paragraphs that my legend title still uses the column name. However, this is an easy fix. All I have to do is add a label to the color aesthetic in the labs() function.

# change legend title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

We can also change the position of the legend. R automatically places legends on the right, and while I like having it to the right in this case, I could also place the legend at the bottom of the graph. This automatically changes the aspect ratio of the graph.

# change legend position
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "bottom")

Margins

The theme() function is of endless use in ggplot2, and can be used to manually adjust the graph margins and add/remove white space padding. The order of arguments in margin() is counterclockwise — top, right, bottom, left (helpfully remembered by the pneumonic TRouBLe).

# add plot margins
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "right") +
  theme(plot.margin = margin(t = 1, r = 1, b = 1, l = 2, unit = "cm"))

Conclusion

I have barely scratched the surface of what can be achieved using ggplot2 in this post. There are hundreds of excellent tutorials online that dive deeper into ggplot2, like this blog post by Cedric Scherer. I have yet to learn so much about this library and data visualization in general, but have hopefully made a solid case for using ggplot2 to create clean and aesthetically-pleasing data visualizations.

MORDM Basics V: WaterPaths Tutorial

The MORDM framework poses four fundamental questions that each corresponds to a section within the framework shown in Figure 1:

  1. What actions can we take to address the problem?
  2. What worlds are we implementing those actions in?
  3. What are acceptable levels of performance for those actions in all worlds?
  4. What controls failures across those worlds?
Figure 1: The MORDM framework (Herman et. al., 2013).

In the previous blog post, we used state-aware ROF triggers to implement drought mitigation and supply management actions for one hydroclimatic and demand scenario. In the first MORDM blog post, we generated multiple deeply-uncertain synthetic realizations of hydroclimatic and demand scenarios. The next logical question would be: how do these actions fare across all worlds and across time?

Setting up the system

To explore this question, we will expand the scope of our test case to include Cary’s two neighboring utilities – Durham and Raleigh – within the Research Triangle. The three utilities aim to form a cooperative water supply and management agreement in which they would like to optimize the following objectives, as stated in Trindade et al (2019):

  1. Maximize supply reliability, REL
  2. Minimize the frequency of implementing water use restrictions, RT
  3. Minimize the net present value of infrastructure investment, NPV
  4. Minimize the financial cost of drought mitigation and debt repayment, FC
  5. Minimize the worst first-percentile cost of the FC

In this example, each objective is a function of short-term risks of failure triggers (stROF) and long term risks of failure triggers (ltROF). The stROFs trigger short-term action, typically on a weekly or monthly basis. The ltROFs trigger action on a multi-year, sometimes decadal, timescale. Recall from a previous post that ROF triggers are state-aware, probabilistic decision rules that dictate how a system responds to risk. Here, we optimize for a Pareto-approximate set of ROF triggers (or risk thresholds) that will result in a range of performance objective tradeoffs. An example of a stROF is the restriction ROF trigger we explored in the post prior to this one.

In addition, an example of an ltROF would be the infrastructure construction ltROF. When this ltROF threshold is crossed, an infrastructure project is ‘built’. Potential infrastructure projects are ordered in a development pathway (Zeff et al 2016), and the ltROF triggers the next infrastructure option in the sequence. The term ‘pathway’ is used as these infrastructure sequences are not fixed, but are state-dependent and can be shuffled to allow the optimization process to discover multiple potential pathways.

Over the next two blog posts, we will explore the interaction between the water-use restriction stROF and the infrastructure ltROF, and how this affects system performance. For now, we will simulate the Research Triangle test case and optimize for the ‘best’ set of ROF triggers using WaterPaths and Borg MOEA.

Using WaterPaths and Borg MOEA

We will be using the WaterPaths utility planning and management tool (Trindade et al, 2020) to simulate the performance of the Research Triangle test case. For clarification, the default simulation within WaterPaths is that of the Research Triangle. The folder that you will be downloading from GitHub already contains all the input, uncertainty and decision variable files required. This tool will be paired with the Borg MOEA (Hadka and Reed, 2013) to optimize the performance objectives in each simulation to obtain a set of Pareto-optimal long- and short-term ROF triggers that result in a non-dominated set of tradeoffs. Jazmin made a few posts that walks through compiling Borg and the installation of different parallel and series versions of Borg that might be helpful to try out before attempting this exercise.

Once you have Borg downloaded and set up, begin by downloading the GitHub repository into a file location on your machine of choice:

git clone https://github.com/lbl59/WaterPaths.git

Once all the files are downloaded, compile WaterPaths:

make gcc

To optimize the WaterPaths simulation with Borg, first move the Borg files into the main WaterPaths/borg folder:

mv -r Borg WaterPaths/borg/

This line of code will make automatically make folder called borg within the WaterPaths folder, and copy all the external Borg files into it.

Next, cd into the /borg folder and run make in your terminal. This should a generate a file called libborgms.a. Make a folder called lib within the WaterPaths folder, and move this file into the WaterPaths/lib folder

cp libborgms.a ../lib/

Next, cd back into the main folder and use the Makefile to compile the WaterPaths executable with Borg:

make borg

Great! Now, notice that the /rof_tables_test_problem folder is empty. You will need to generate ROF tables within the WaterPaths environment. To do so, run the generate_rof_tables.sh script provided in the GitHub repository into your terminal. The script provided should look like this:

#!/bin/bash
#SBATCH -n 16 -N 2 -p normal
#SBATCH --job-name=gen_rof_tables
#SBATCH --output=output/gen_rof_tables.out
#SBATCH --error=error/gen_rof_tables.err
#SBATCH --time=04:00:00
#SBATCH --mail-user=YOURUSERNAME@cornell.edu
#SBATCH --mail-type=all

export OMP_NUM_THREADS=16
cd $SLURM_SUBMIT_DIR
./waterpaths\
	-T ${OMP_NUM_THREADS}\
       	-t 2344\
       	-r 1000\
       	-d /YOURCURRENTDIRECTORY/\
       	-C 1\ 
	-m 0\
	-s sample_solutions.csv\
       	-O rof_tables_test_problem/\
       	-e 0\
       	-U TestFiles/rdm_utilities_test_problem_opt.csv\
       	-W TestFiles/rdm_water_sources_test_problem_opt.csv\
       	-P TestFiles/rdm_dmp_test_problem_opt.csv\
	-p false

Replace all the ‘YOUR…’ parameters with your system-specific details. Make two new folders: output/ and error/. Then run the script above by entering

sbatch ./generate_rof_tables.sh

into your terminal. This should take approximately 50 minutes. Once the ROF tables have been generated, run the optimize_sedento_valley.sh script provided. It should look like this:

#!/bin/bash
#SBATCH -n 16 -N 3 -p normal
#SBATCH --job-name=sedento_valley_optimization
#SBATCH --output=output/sedento_valley_optimization.out
#SBATCH --error=error/sedento_valley_optimization.err
#SBATCH --time=04:00:00
#SBATCH --mail-user=YOURUSERNAME@cornell.edu
#SBATCH --mail-type=all

DATA_DIR=/YOURDIRECTORYPATH/
N_REALIZATIONS=1000
export OMP_NUM_THREADS=16
cd $SLURM_SUBMIT_DIR

mpirun ./waterpaths -T ${OMP_NUM_THREADS}\
  -t 2344 -r ${N_REALIZATIONS} -d ${DATA_DIR}\
  -C -1 -O rof_tables_test_problem/ -e 3\
  -U TestFiles/rdm_utilities_test_problem_opt.csv\
  -W TestFiles/rdm_water_sources_test_problem_opt.csv\
  -P TestFiles/rdm_dmp_test_problem_opt.csv\
  -b true -o 200 -n 1000

As usual, replace all the ‘YOUR…’ parameters with your system-specific details. Run this script by entering

sbatch ./optimize_sedento_valley.sh

into the terminal. This script runs the Borg MOEA optimization for 1,000 function evaluations, and will output a .set file every 200 function evaluations. At the end of the run, you should have two files within your main folder:

  1. NC_output_MS_S3_N1000.set contains the Pareto-optimal set of decision variables and the performance objective values for the individual utilities and the overall region.
  2. NC_runtime_MS_S3_N1000.runtime contains information on the time it took for 1000 simulations of the optimization of the Research Triangle to complete.

The process should take approximately 1 hour and 40 minutes.

Summary

Congratulations, you are officially the Dr Strange of the Research Triangle! You have successfully downloaded WaterPaths and Borg MOEA, as well as run a simulation-optimization of the North Carolina Research Triangle test case across 1000 possible futures, in which you were Pareto-optimal in more than one. You obtained the .set files containing the Pareto-optimal decision variables and their respective performance objective values. Now that we have optimized the performance of the Research Triangle system, we are ready to examine the performance objectives and the Pareto-optimal ROF trigger values that result in this optimal set of tradeoffs.

In the next blog post, we will process the output of the .set files to visualize the objective space, decision variable space, and the tradeoff space. We will also conduct robustness analysis on the Pareto-optimal set to delve further into the question of “What are acceptable levels of performance for those actions in all worlds?”. Finally, we will explore the temporal interactions between the water use restrictions stROF and the infrastructure construction ltROF, and how supply and demand are both impacted by – and have an effect on – these decision variables.

References

Hadka, D., & Reed, P. (2013). Borg: An auto-adaptive Many-objective evolutionary computing framework. Evolutionary Computation, 21(2), 231–259. https://doi.org/10.1162/evco_a_00075

Trindade, B. C., Gold, D. F., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2020). Water pathways: An open source stochastic simulation system for integrated water supply portfolio management and infrastructure investment planning. Environmental Modelling & Software, 132, 104772. https://doi.org/10.1016/j.envsoft.2020.104772

Trindade, B. C., Reed, P. M., & Characklis, G. W. (2019). Deeply uncertain pathways: Integrated multi-city regional water supply infrastructure investment and portfolio management. Advances in Water Resources, 134, 103442. https://doi.org/10.1016/j.advwatres.2019.103442

Zeff, H. B., Herman, J. D., Reed, P. M., & Characklis, G. W. (2016). Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways. Water Resources Research, 52(9), 7327–7346. https://doi.org/10.1002/2016wr018771

Fitting and Simulating from NHMMs in R

The purpose of this post is to give a workflow that one could follow if your goal is to fit Non-Homogeneous Hidden Markov Models (NHMMs) and to simulate sequences of states. Julie Quinn wrote a series of great posts on fitting Hidden Markov Models (HMMs) here but the goal of this post is to discuss the non-homogeneous counterparts of HMM. NHMMs can be distinguished from the former because they involve non-stationary transition probabilities between states. These dynamic transition probability matrices are conditioned on one or more external covariates that influences transitions between states. One example of the use of NHMMs is to model and simulate precipitation. Precipitation models that simulate without using atmospheric information cannot be expected to perform well on conditions that deviate from those on which the model was fit. Thus using a covariate that shows changes in atmospheric circulation (such as geopotential height), can help capture some of the nonstationarity that a solely precipitation-based model alone could not capture.

I have had a lot of time to work with NHMMs in R; however at the beginning, I found overwhelmingly few examples to work off of, so this post is meant to outline a workflow that I have settled on that you can apply to your own application. First off, there are tons of packages available for HMMs, but very few that handle the dynamic transition matrices of NHMMs. My favorite is depmixS4 found here. This package has all the tools needed to fit your NHMM, but it takes some experimenting to understand the syntax and to piece together the functions to create a full workflow.

First we want to fit a depmix model. In the first line of the code, I am creating a model called modNHMM. The data that I am fitting the NHMM with are the first 9 principal components of geopotential height. It is important that you list these components in this syntax and they should be the column titles of your dataset, which is synoptic.pcs in my case. The number of states is how many states you want to fit your model with. For a precipitation/streamflow application, you could fit a two-state NHMM to represent wet/dry conditions, or if you are trying to identify multiple regimes, you may have more.

library(depmixs4)

modNHMM <- depmix(list(PC1~1,PC2~1,PC3~1, PC4~1, PC5~1, PC6~1, PC7~1, PC8~1,
PC9~1),
nstates = num.states,
               family=list(gaussian(),gaussian(),gaussian(),gaussian(),gaussian(),gaussian(),gaussian(),gaussian(),
                              gaussian()),
ntimes =  nrow(synoptic.pcs),
data = data.frame(synoptic.pcs),transition = ~predictor$PC1+predictor$PC2+predictor$PC3+predictor$PC4)

fit.modNHMM.depmix.paleo.check <- fit(modNHMM)

synoptic.state.assignments<- posterior(fit.modHMMs.depmix.paleo.check)$state # state sequence using the Viterbi algorithm

Then we choose a distribution for our responses, which I choose to be a Gaussian distribution. The argument, ntimes, refers to the length of our time series, which is the number of rows in our dataset. Next we specify the dataset that contains the PC data and the transition which is dictated by our external covariates. In this case, my covariates are in a dataframe called predictor and each column corresponds to the 4 covariates (which happen to be PCs of a different dataset) that will influence my transitions. Then we fit the model with our prescribed attributes using the fit function. Finally, we want to calculate the Viterbi sequence, which is the most probable path or sequence of states over the period that the model is fit. This step (last line of code) will return a vector of the length of the dataset with each day classified into one of num.states.

Now we need to locate the transition probability matrices. We use the following command:

fit.modNHMM.depmix.paleo.check@transition

If we were fitting an HMM, we would get one transition probability matrix. However, we get an output that looks like this:

Transition probability matrix coefficients

Now instead of having constant values for transitions, we have equations of the form: Intercept+b1*PC1+b2*PC2+b3*PC3+b4*PC4 where the b values are the coefficient values listed in the table. If were were to look at the first block [[1]], this block dictates transitions from State 1 to the other 5 states. The transitions from the states are as follows:

State 1 to State 1: 0+0+0+0+0 (State 1 is used as a reference variable in this case and the probability would be found by subtracting the other probabilities from 1 at the end)

State 1 to State 2: -3.11+-0.22*PC1+0.22*PC2+0.014*PC3-0.13*PC4

and so on. You have to remember to divide each value by the total across the row in order to return probabilities.

Once you have these symbolic equations for the transition probability matrices, you can create a list of matrices which will allow you to simulate sequences of states for new sets of the PC1, PC2, PC3, PC4 covariates. You can get a sense how you might create n different transition matrices if you have times series of length n of the covariates. Below I am just representing those symbolic equations in code using the getpars function to acquire the coefficients and store the resulting daily matrices in a list called mm. Depending on the number of covariates or states, you will need to adjust the indices accordingly.

n=dim(df)[[1]]
mm<-matrix(list(), 1,n)


for (j in 1:n){
  transition_matrix=matrix(, nrow = 5, ncol = 5)
  for (i in 6:10){
    transition_matrix[1,i-5]=getpars(fit.modHMMs.depmix.paleo.check)[i]+(getpars(fit.modHMMs.depmix.paleo.check)[i+5]*df$PC1[j])+ (getpars(fit.modHMMs.depmix.paleo.check)[i+10]*df$PC2[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+15]*df$PC3[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+20]*df$PC4[j])
     }
  denominator=sum(exp(transition_matrix[1,]))
  for (i in 6:10){
    transition_matrix[1,i-5]=exp(transition_matrix[1,i-5])/denominator
  }
  
  for (i in 31:35){
    transition_matrix[2,i-30]=getpars(fit.modHMMs.depmix.paleo.check)[i]+(getpars(fit.modHMMs.depmix.paleo.check)[i+5]*df$PC1[j])+ (getpars(fit.modHMMs.depmix.paleo.check)[i+10]*df$PC2[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+15]*df$PC3[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+20]*df$PC4[j])
  }
  denominator=sum(exp(transition_matrix[2,]))
  for (i in 31:35){
    transition_matrix[2,i-30]=exp(transition_matrix[2,i-30])/denominator
  }
  for (i in 56:60){
    transition_matrix[3,i-55]=getpars(fit.modHMMs.depmix.paleo.check)[i]+(getpars(fit.modHMMs.depmix.paleo.check)[i+5]*df$PC1[j])+ (getpars(fit.modHMMs.depmix.paleo.check)[i+10]*df$PC2[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+15]*df$PC3[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+20]*df$PC4[j])
    
  }
  denominator=sum(exp(transition_matrix[3,]))
  for (i in 56:60){
    transition_matrix[3,i-55]=exp(transition_matrix[3,i-55])/denominator
    
  }
  for (i in 81:85){
    transition_matrix[4,i-80]=getpars(fit.modHMMs.depmix.paleo.check)[i]+(getpars(fit.modHMMs.depmix.paleo.check)[i+5]*df$PC1[j])+ (getpars(fit.modHMMs.depmix.paleo.check)[i+10]*df$PC2[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+15]*df$PC3[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+20]*df$PC4[j])
    
  }
  denominator=sum(exp(transition_matrix[4,]))
  for (i in 81:85){
    transition_matrix[4,i-80]=exp(transition_matrix[4,i-80])/denominator
    
  }
  
  for (i in 106:110){
    transition_matrix[5,i-105]=getpars(fit.modHMMs.depmix.paleo.check)[i]+(getpars(fit.modHMMs.depmix.paleo.check)[i+5]*df$PC1[j])+ (getpars(fit.modHMMs.depmix.paleo.check)[i+10]*df$PC2[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+15]*df$PC3[j])+(getpars(fit.modHMMs.depmix.paleo.check)[i+20]*df$PC4[j])
    
  }
  denominator=sum(exp(transition_matrix[5,]))
  for (i in 106:110){
    transition_matrix[5,i-105]=exp(transition_matrix[5,i-105])/denominator
    
  }
  mm[[j]]=transition_matrix

}

Once we have these matrices, we can then simulate state sequences that can result from the chain of transition matrices. For this part, we need to create markov lists with our transition matrices:

library(markovchain)
mcObject <- mclapply(X=1:iter,mc.preschedule=TRUE,mc.cores=1,FUN=function(j){ 
  
  mcObject.time.varying <- mclapply(X=1:n.sim,mc.preschedule=TRUE,mc.cores=1,FUN=function(t){
    tr.prob=as.matrix(mm[[t]])
    mcObject.time.varying.out <- new("markovchain", states = c("1","2","3","4","5"),
                                             transitionMatrix = tr.prob, name = paste("McObject",t,sep=""))    
    return(McObject.time.varying.out)
  }
  )
  mcObject.final <- new("markovchainList",markovchains = mcObject.time.varying, name = "mcObject.nh")
  return(

mcObject.final

)
  
}

Finally we simulate using the following:

simulate.mc <- function(mcObject,num.states,dates.sim,last.month,last.day,n.sim,iter) {
  
  #this function will simulate the Markov chain iter times
  
  #Arguments:
  #mcObject = a Markov chain object from the markovchain package
  #num.states = the number of states 
  #date.sim = a time series of dates for the simulation period
  #last.month = last month of the season
  #last.day = last day of the last month of the season
  #iter = the number of iterations
  
  #day and month sequences for the simulation period
  days.sim <- as.numeric(format(dates.sim,"%d"))
  months.sim <- as.numeric(format(dates.sim,"%m"))
  n.sim <- length(dates.sim)  #length of simulation
  
  final.mc.sim <- mclapply(X=1:iter,mc.preschedule=TRUE,mc.cores=1,FUN=function(i){  
    
    mc.sim <- as.numeric(rmarkovchain(n=1,object=mcObject[[i]][[1]]))
    end.state <- paste(mc.sim[length(mc.sim)])
    for (t in 1:n.sim) {
      mc.sim <- c(mc.sim,as.numeric(rmarkovchain(n=1,object=mcObject[[i]][[t]],t0=end.state)))
      end.state <- paste(mc.sim[length(mc.sim)])
      if(months.sim[t]==last.month & days.sim[t]==last.day) {end.state <- paste(sample(1:num.states,size=1))}
    }    
    
    #here is the final mc simulation
    final.mc.sim.iter <- mc.sim[2:(n.sim+1)]
    return(final.mc.sim.iter)
  }
  )
  return(final.mc.sim)
  
}



simulations=matrix(list(), 1,1000)
for (i in 1:1000){

  simulations[[i]] <- simulate.mc(mcObject=mcWeather.Regime,num.states=num.states,
                                  dates.sim=dates.sim,last.month=last.month,last.day=last.day,iter=iter)
}

And that’s it! You can simulate for many different iterations (1000 in my case) and you will be returned a large list with your 1000 sequence of states over the simulation period.