Constructing interactive Ipywidgets: demonstration using the HYMOD model

Constructing interactive Ipywidgets: demonstration using the HYMOD model

Last week, Lillian and I published the first post in a series of training post studying the “Fisheries Game, which is a decision making problem within a complex, non-linear, and uncertain ecological context.

In preparing for that post, I learned about the Ipywidgets python library for widget construction. It stood out to me as a tool for highlighting the influence of parametric uncertainty on model performance. More broadly, I think it has great as an educational or data-narrative device.

This blogpost is designed to highlight this potential, and provide a basic introduction to the library. A tutorial demonstration of how an interactive widget is constructed is provided, this time using the HYMOD rainfall-runoff model.

This post is intended to be viewed through a Jupyter Notebook for interaction, which can be accessed through a Binder at this link!

The Binder was built with an internal environment specification, so it should not be necessary to install any packages on your local machine! Because of this, it may take a minute to load the page.

Alternatively, you can pull the source code and run the Jupyter Notebook from your local machine. All of the source code is available in a GitHub repository: Ipywidget_Demo_Interactive_HYMOD.

If using your local machine, you will first need to install the Ipywidget library:

pip install ipywidgets

Let’s begin!

HYMOD Introduction

HYMOD is a conceptual rainfall-runoff model. Given some observed precipitation and evaporation, a parameterized HYMOD model simulates the resulting down-basin runoff.

This post does not focus on specific properties or performance of the HYMOD model, but rather uses the model as a demonstration of the utility of the Ipywidget library.

I chose to use the HYMOD model for this, because the HYMOD model is commonly taught in introductory hydrologic modeling courses. This demonstration shows how an Ipywidget can be used in an educational context. The resulting widget can allow students to interact in real-time with the model behavior, by adjusting parameter values and visualizing the changes in the resulted streamflow.

If you are interested in the technical details of implementing the HYMOD model, you can dig into the source code, available (and throughly commented/descriptive) in the repository for this post: Ipywidget_Demo_Interactive_HYMOD.

HYMOD represents surface flow as a series of several quick-flow reservoirs. Groundwater flow is represented as a single slow-flow reservoir. The reservoirs have constant flow rates, with the quick-flow reservoir rate, Kq, being greater than the slow-flow reservoir rate, Ks.

Image source: Sun, Wenchao & Ishidaira, Hiroshi & Bastola, Satish. (2010)

HYMOD Parameters:

Like any hydrologic model, the performance of HYMOD will be dependent upon the specified parameter values. There are several parameters that can be adjusted:

  • Cmax: Max soil moisture storage (mm) [10-2000]
  • B: Distribution of soil stores [0.0 – 7.0]
  • Alpha: Division between quick/slow routing [0.0 – 1.0]
  • Kq: Quick flow reservoir rate constant (day^-1) [0.15 – 1.0]
  • Ks: Slow flow reservoir rate constant. (day^-1) [0.0 – 0.15]
  • N: The number of quick-flow reservoirs.

Interactive widget demonstration

I’ve constructed an Ipywidets object which allows a user to visualize the impact of the HYMOD model parameters on the resulting simulation timeseries. The user also has the option to select from three different error metrics, which display in the plot, and toggle the observed timeseries plot on and off.

Later in this post, I will give detail on how the widget was created.

Before provided the detail, I want to show the widget in action so that you know the expectation for the final product.

The gif below shows the widget in-use:

Demonstration of the HYMOD widget.

Ipywidgets Introduction

The Ipywdiget library allows for highly customized widgets, like the one above. As with any new tool, I’d recommend you check out the documentation here.

Below, I walk through the process of generating the widget shown above.

Lets begin!

Import the library

# Import the library
import ipywidgets as widgets

Basic widget components

Consider an Ipywidget as being an arrangement of modular components.

The tutorial walks through the construction of five key widget components:

  1. Variable slider
  2. Drop-down selectors
  3. Toggle buttons
  4. Label objects
  5. Interactive outputs (used to connect the plot to the other three components)

In the last section, I show how all of these components can be arranged together to construct the unified widget.

Sliders

Sliders are one of the most common ipywidet tools. They allow for manual manipulation of a variable value. The slider is an object that can be passed to the interactive widget (more on this further down).

For my HYMOD widget, I would like to be able to manipulate each of the model parameters listed above. I begin by constructing a slider object for each of the variables.

Here is an example, for the C_max variable:

# Construct the slider
Cmax_slider = widgets.FloatSlider(value = 500, min = 10, max = 2000, step = 1.0, description = "C_max",
                                  disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')


# Display the slider
display(Cmax_slider)

Notice that each slider recieves a specified minmax, and step corresponding to the possible values. For the HYMOD demo, I am using the parameter ranges specified in Herman, J.D., P.M. Reed, and T. Wagener (2013), Time-varying sensitivity analysis clarifies the effects of watershed model formulation on model behavior.

I will construct the sliders for the remaining parameters below. Notice that I don’t assign the description parameter in any of these sliders… this is intentional. Later in this tutorial I will show how to arrange the sliders with Label() objects for a cleaner widget design.

# Construct remaining sliders
Cmax_slider = widgets.FloatSlider(value = 100, min = 10, max = 2000, step = 1.0, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
B_slider = widgets.FloatSlider(value = 2.0, min = 0.0, max = 7.0, step = 0.1, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Alpha_slider = widgets.FloatSlider(value = 0.30, min = 0.00, max = 1.00, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Kq_slider = widgets.FloatSlider(value = 0.33, min = 0.15, max = 1.00, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Ks_slider = widgets.FloatSlider(value = 0.07, min = 0.00, max = 0.15, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
N_slider = widgets.IntSlider(value = 3, min = 2, max = 7, disabled = False, continuous_update = False, readout = True)

# Place all sliders in a list
list_of_sliders = [Kq_slider, Ks_slider, Cmax_slider, B_slider, Alpha_slider, N_slider]

The Dropdown() allows the user to select from a set of discrete variable options. Here, I want to give the user options on which error metric to use when comparing simulated and observed timeseries.

I provide three options:

  1. RMSE: Root mean square error
  2. NSE: Nash Sutcliffe efficiency
  3. ROCE: Runoff coefficient error

See the calculate_error_by_type inside the HYMOD_components.py script to see how these are calculated.

To provide this functionality, I define the Dropdown() object, as below, with a list of options and the initial value:

# Construct the drop-down to select from different error metrics
drop_down = widgets.Dropdown(options=['RMSE','NSE','ROCE'], description='',
                                value = 'RMSE', disabled = False)

# Display the drop-down
display(drop_down)

ToggleButton

The ToggleButton() allows for a bool variable to be toggled between True and False. For my streamflow plot function, I have an option plot_observed = False which determines if the observed streamflow timeseries is shown in the figure.

# Construct the button to toggle observed data On/Off
plot_button = widgets.ToggleButton(value = False, description = 'Toggle', disabled=False, button_style='', tooltip='Description')

# Display the button
display(plot_button)

Labels

As mentioned above, I choose to not include the description argument within the slider, drop-down, or toggle objects. This is because it is common for these labels to get cut-off when displaying the widget object.

For example, take a look at this slider below, with a long description argument:

# Make a slider with a long label
long_title_slider = widgets.FloatSlider(value = 2.0, min = 0.0, max = 7.0, step = 0.1, description = 'This slider has a long label!', readout = True)

# Display: Notice how the label is cut-off!
display(long_title_slider)

The ipywidgets.Label() function provides a way of avoiding this while allowing for long descriptions. Using Label() will ultimately provide you with a lot more control over your widget layout (last section of the tutorial).

The Label() function generates a separate object. Below, I create a unique Label() object for each HYMOD parameter.

# Import the Label() function
from ipywidgets import Label

# Make a list of label strings
param_labs = ['Kq : Quick flow reservoir rate constant (1/day)',
            'Ks : Slow flow reservoir rate constant (1/day)',
            'C_max : Maximum soil moisture storage (mm)',
            'B : Distribution of soil stores',
            'Alpha : Division between quick/slow routing',
            'N : Number of quick-flow reservoirs']

# Make a list of Label() objects
list_of_labels = [Label(i) for i in param_labs]

# Display the first label, for example.
list_of_labels[0]

Interactive_output

Now that we have constructed interactive

The interactive_output function takes two inputs, the function to interact with, and a dictionary of variable assignments:

interactive_output( function, {‘variable_name’ : variable_widget, …} )

I have created a custome function plot_HYMOD_results which:

  1. Loads 1-year of precipitation and evaporation data for the Leaf River catchment.
  2. Runs the HYMOD simulation using the provided parameter values.
  3. Calculates the error of the simulated vs. observed data.
  4. Plots the timeseries of runoff.

The source code for this function can be found in the GitHub repository for this post, or specifically here.

The function receives parameter values for each of the HYMOD parameters discussed above, a bool indicator if observed data should be plotted, and a specified error metric.

plot_HYMOD_results(C_max, B, Alpha, Ks, Kq, N_reservoirs, plot_observed = False, error_type = ‘RMSE’):

I have already generated widget components corresponding to each of these variables! (If you are on the Jupyter Notebook version of this post, make sure to have Run every cell before this, or else the following code wont work.

I can now use the interactive_output function to link the widget components generated earlier with the function inputs:

# Import the interactive_output function
from ipywidgets import interactive_output

# Import my custom plotting function
from HYMOD_plots import plot_HYMOD_results

result_comparison_plot = interactive_output(plot_HYMOD_results, {'C_max' : Cmax_slider, 'B': B_slider, 'Alpha':Alpha_slider, 
                                                                 'Ks':Ks_slider, 'Kq':Kq_slider,'N_reservoirs':N_slider, 
                                                                 'plot_observed' : plot_button,'error_type': drop_down})

# Show the output
result_comparison_plot
Output generated by the interactive_output().

Displaying the interactive_output reveals only the plot, but does not include any of the widget functionality…

Despite this, the plot is still linked to the widget components generated earlier. If you don’t believe me (and are reading the Jupyter Notebook version of this post), scroll up and click the ToggleButton a few cells up, then come back and look at the plot again.

Using the interactive_output() function, rather than other variations of the interact() functions available, allows for cleaner widgets to be produced, because now the arrangment of the widget components can be entirely customizable.

Keep reading for more detail on this!

Arranging widget components

Rather than using widget features one-at-a-time, Ipywidgets allow for several widgets to be arranged in a unified layout. Think of everything that has been generated previously as being a cell within the a gridded widget; the best part is that each cell is linked with one another.

Once the individual widget features (e.g., sliders, buttons, drop-downs, and output plots) are defined, they can be grouped using the VBox() (vertical box) and HBox() (horizontal box) functions.

I’ve constructed a visual representation of my intended widget layout, shown below. The dashed orange boxes show those components grouped by the HBox() function, and the blue boxes show those grouped by the VBox() function.

Visual representation of the final widget layout.

Before getting started, import some of the basic layout functions:

# Import the various 
from ipywidgets import HBox, VBox, Layout

Before constructing the entire widget, it is good to get familiar with the basic HBox() and VBox() functionality.

Remember the list of sliders and list of labels that we created earlier?

# Stack the list of label objects vertically:
VBox(list_of_labels)

# Try the same thing with the sliders (remove comment #):
#VBox(list_of_sliders)

In the final widget, I want the column of labels to be located on the left of the column of sliders. HBox() allows for these two columns to be arrange next to one another:

# Putting the columns side-by-side
HBox([VBox(list_of_labels), VBox(list_of_sliders)])

Generating the final widget

Using the basic HBox() and VBox() functions shown above, I arrange all of the widget components I’ve defined previously. I first define each row of the widget using HBox(), and finally stack the rows using VBox().

The script below will complete the arrangement, and call the final widget!

# Define secifications for the widgets: center and wrap 
box_layout = widgets.Layout(display='flex', flex_flow = 'row', align_items ='center', justify_content = 'center')

# Create the rows of the widets
title_row = Label('Select parameter values for the HYMOD model:')
slider_row = HBox([VBox(list_of_labels), VBox(list_of_sliders)], layout = box_layout)
error_menu_row = HBox([Label('Choose error metric:'), drop_down], layout = box_layout)
observed_toggle_row = HBox([Label('Click to show observed flow'), plot_button], layout = box_layout)
plot_row = HBox([result_comparison_plot], layout = box_layout)


# Combine label and slider box (row_one) with plot for the final widget
HYMOD_widget = VBox([title_row, slider_row, plot_row, error_menu_row, observed_toggle_row])


# Call the widget and have fun!
HYMOD_widget

Concluding remarks

If you’ve made it this far, thank you for reading!

I hope that you are able to find some fun/interesting/educational use for the Ipywidget skills learned in this post.

Clustering basics and a demonstration in clustering infrastructure pathways

Introduction

When approaching a new dataset, the complexity may be daunting. If patterns in the data are not readily apparent, one potential first-step would be to cluster the data and search for groupings. Clustering data, or cluster analysis, can reveal relationships between the observations and provide insight into the data.

Figure: Graphical representation of how clustering can be used to identify patterns in a data set.

Motivation

The outcome of a cluster analysis can be highly dependent upon the clustering algorithm being used, the clustering criteria, and the specified number of clusters to be found. Anticipating these influences and how to manipulate them will set you up for success in your data analysis.

In this post, I introduce some fundamental clustering algorithms, the hierarchical and K-Means clustering algorithms. Then I will share some important considerations, such as how to select a suitable number of clusters. Finally, I will demonstrate the power of clustering by applying the methods to an infrastructure pathways dataset, to cluster infrastructure pathways generated for 1,000 different states of the world into different sets of ‘light’, ‘moderate’, and ‘heavy’ infrastructure development.

Let’s begin!

Methods

Hierarchical Clustering

Hierarchical clustering can be performed “bottom-up”, through an agglomerative approach, which starts by placing each observation in its own cluster and then successively links the observations into increasingly larger clusters. Alternatively, the clustering can take a “top-down”, or divisive, approach which begins with all observations within a single cluster and partitions the set into smaller clusters until each observation exists within its own cluster.

Figure: A graphical representation of the hierarchical clustering process. Red circles denote clusters. If read from left to right, then the figure is representative of the agglomerative approach. If read from right to left, then the figure is representative of the divisive approach.

Specifying how clusters are preferentially assigned will influence the final shape and contents of the clusters. When performing hierarchical clustering, the user must specify the linkage criteria, which determines how the distance between clusters is measured, and consequently which clusters will be group during the next iteration. Three common linkage criteria are:

Linkage TypeDescriptionFormulation
Complete (max)the maximum distance between a group
and observations in another group.
max{ d(a, b) : a in A, b in B}
Single (min)the minimum distance between a group
and observations in another group.
max{ d(a, b) : a in A, b in B}
Averagethe average distance between a group and
observations in another group.
average{ d(a, b) : a in A, b in B}

Additionally, when calculating the linkage criteria, different methods of measuring the distance can be used. For example, when working with correlated data, it may be preferable to use the Mahalanobis distance over the regular Euclidean norm.

The results of the clustering analysis can be viewed through a dendrogram, which displays a tree-like representation of the cluster structure. For more detail on dendrograms, and a step-by-step tutorial for creating color-coded dendrograms in R, see Rohini’s (2018) post here.

K-Means Clustering

The K-means algorithm follows three simple steps.

  1. Selection of k centroids within the sample space.

In the traditional naïve K-Means algorithms, centroids do not correspond to specific observations within the space. They can be randomly selected, or chosen through more informed methods (see, for example, K-Means++)

2. Assignment of each observation to the cluster with the nearest centroid.

3. Re-define each cluster centroid as the mean of the data points assigned to that centroid.

Steps 2 and 3 of the algorithm are repeated until the centroids begin to stabilize, such that subsequent iterations produce centroids located within some small, tolerable distance from one another.

Figure: A graphical representation of the k-means clustering process, with 3 clusters. X’s denote the centroids for that respective cluster. The dashed arrows denote the shift in the location of the centroid during the next iteration. Points are colored according to their assigned cluster during that iteration.

The centroids toward which the algorithm converges may be a locally optimal clustering solution, and can be highly dependent upon the initial centroid selection. Often, this sensitivity is handled by performing the clustering processes several times, and choosing the set of clusters which yield the smallest within-cluster sum of squares.

Selecting the number of clusters

In some contexts, it is possible to know how many clusters are required prior to the analysis. For example, when working with Fisher’s famous iris flower data set, three clusters are needed because there are three types of flowers observed within the data.

In an exploratory context, however, the appropriate number of clusters may not be readily apparent. One method of determining the preferred number of clusters is by testing multiple different values and evaluating the separation distances between the clusters. If the number of clusters is well-chosen, then the clusters will be both well-spaced and the intra-cluster distance will be small.

The Silhouette Coefficient, or silhouette score, is one method of measuring the goodness of fit for a set of clusters, and takes into consideration both the spacing between clusters (inter-cluster distance) and the distance between points within a cluster (intra-cluster spacing). The Silhouette scores is defined as:

Where:

a = average intra-cluster distance,

b = average inter-cluster distance.

The silhouette score ranges from [-1, 1]. A value near +1 suggests that the clusters are far from one another, with a small distances within the cluster, and thus the chosen number of clusters is well-suited for the data. A value near -1, however, suggests that the clusters are poorly describing the patterns in the data.

Example: Clustering Infrastructure Pathways

In water supply planning contexts, infrastructure pathways describe the sequencing of infrastructure investment decisions throughout time, in response to changes in system (construction of new reservoirs, expansions to water treatment capacity etc.). Adaptive infrastructure pathways use a risk-of-failure (ROF) based set of policy rules to trigger new infrastructure development. Adaptive infrastructure pathways have been shown to effectively coordinate water infrastructure decisions in complex multi-actor systems (Zeff et al., 2016). While the benefits of this adaptive strategy are clear, ROF based pathways also bring new challenges for decision makers – rather than specifying an single sequence of infrastructure investments, an adaptive rule system will generate a unique sequence of investments for every future state of the world. To understand how a rule system behaves, decision makers must have tools to visualize an ensemble of infrastructure pathways. This is where cluster analysis comes in. Using clustering, we can group similar infrastructure sequences together, allowing decision makers to summarize how a candidate infrastructure investment policy will behave across many future states of the world.

Here, I analyze a set of infrastructure pathways generated for a hypothetical water utility with four infrastructure options:

  • small water treatment plant expansion
  • large water treatment plant expansion
  • construction of a small new run of river intake
  • construction of a large new run of river intake

The utility is examining a candidate policy and has generated pathways for 1,000 different states of the world, or realizations, which are each characterized by unique hydrologic and population dynamics. Given the 1,000 different realizations, each with a unique infrastructure pathway, I am interested in clustering the pathways according the the timing of the infrastructure construction decisions to understand how the policy behaves.

The dataset used for this demonstration is courtesy of David Gold, who is also responsible for providing significant portions of the following code and assistance along the way. Thanks David!

Follow along with this demonstration by downloading .txt data file HERE!

Let’s begin by loading in the data.

import pandas as pd

# Load in data
pathways_df = pd.read_csv('./ClusterData.txt', sep = '\t', header = None)

The data contains information describing the timing of infrastructure construction for 1,000 different simulated realizations. Each column represents a different infrastructure option, and each row is 1 of the 1,000 different simulated realizations. The contents of each cell denote a standardized value corresponding to the timing of the infrastructure construction within each realization, ranging from [0, 1]. Infrastructure options containing the value 1 were not built during the 45-year simulation period, in that specific realization.

RealizationInfra. Opt. #1Infra. Opt. #2Infra. Opt. #M
1W1,1W1,2W1,M
2W2,1W2,2W2,M
NWN,2WN,2WN,M
Table: Example output data from the infrastructure pathways optimization process; within each realization (state of the world (SOW)) infrastructure construction decisions are made during different weeks within the simulation period (45-years), considering a subset of infrastructure options for each utility. Wi,j denotes a standardized value corresponding to the timing within realization i that infrastructure option j is built.

With the data available, we can begin clustering!

Here, I take advantage of the scikit-learn.cluster module, which has both hierarchical and K-Means clustering capabilities.

For the purposes of this demonstration, I am going to only show the process of producing the hierarchical clusters… the curious reader may choose to explore alternative clustering methods available in the scikit-learn toolbox.

### Perform hierarchical clustering with 3 clusters ###

# Import the necessary tools
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import pairwise_distances_argmin

# Begin by assuming 3 clusters
num_clusters = 3

# Initialize the hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters = 3)

# Cluster the data
hierarchical.fit(cluster_input)  

# Produce a array which specifies the cluster of each pathway
hierarchical_labels = hierarchical.fit_predict(cluster_input)

In order to interpret the significance of these clusters, we want find a way to visualize differences in cluster behavior through the infrastructure pathways.

To do this, I am going to consider the median timing of each infrastructure option within each cluster, across all 1,000 realizations. Additionally, I want to specify sort the clusters according to some qualitative characteristic. In this case, I define pathways with early infrastructure investments as “heavy”, and pathways within infrastructure investment later in the period as “light”. The “moderate” infrastructure pathways are contained within the cluster between these these.

The following script calculates the median of the clustered pathways, and sorts them according to their infrastructure timing.

### Interpreting clusters via pathways ###

# Group realizations by pathway
# Calculate the median week each infra. opt. is constructed in each cluster
import numpy as np
cluster_pathways = []
cluster_medians = []

# Loop through clusters 
for i in range(0,num_clusters):
    
    # Select realizations contained within i-th cluster
    current_cluster = cluster_input[hierarchical_labels ==i, :]
    
    # Multiply by 2344 weeks to convert from [0,1] to [0,2344] weeks
    current_cluster = current_cluster*2344
    
    cluster_pathways.append(current_cluster)
    
    # Find the median infrastructure build week in each realization
    current_medians = np.zeros(len(current_cluster[0,:]))
    
    # Loop through infrastructure options and calculate medians
    for j in range(0, len(current_cluster[0,:])):
        current_medians[j] = np.median(current_cluster[:,j])
        
    cluster_medians.append(current_medians)

# Sort clusters by average of median build week 
# to get light, moderate, and heavy infrastructure clusters
cluster_means = np.mean(cluster_medians, axis = 1)

sorted_indices = np.argsort(cluster_means)

# Re-order cluster medians according to sorted mean build week
cluster_medians = np.vstack((cluster_medians[sorted_indices[2]], 
                            cluster_medians[sorted_indices[1]],
                            cluster_medians[sorted_indices[0]]))

With the median timing of each infrastructure option specified for each cluster, I can begin the process of visualizing the behavior. The function below plots a single infrastructure pathway:

import numpy as np
import matplotlib.pyplot as plt

def plot_single_pathway(cluster_medians, cluster_pathways, inf_options_idx,
                        c, inf_names, y_offset, ylabeling, xlabeling, 
                        plot_legend, ax):
    
    """
    Makes a plot of an infrastructure pathway.
    
    PARAMETERS:
        cluster_medians: an array with median weeks each option is built
        cluster_pathways: an array with every pathway in the cluster
        inf_options: an array with numbers representing each option (y-axis vals)
        should start at zero to represent the "baseline"
        c: color to plot pathway
        y_offset: distance to offset each pathway from 0 (easier to visualize stacked paths)
        ylabeling: labeling for infrastructure options
        xlabeling: labeling x axis
        plot_legend: (bool) to turn on or off
        ax: axes object to plot pathway
    """
    
    # Extract all non-baseline infrastructure options
    inf_options_idx_no_baseline=inf_options_idx[1:]
    sorted_inf = np.argsort(cluster_medians)
    
    # Plot Pathways
    # create arrays to plot the pathway lines. To ensure pathways have corners 
    # we need an array to have length 2*num_inf_options
    pathway_x = np.zeros(len(cluster_medians)*2+2)
    pathway_y = np.zeros(len(cluster_medians)*2+2)
    
    # To have corners, each inf. opt. must be located
    # in both the cell it is triggered, and adjacent cell
    
    cluster_medians = np.rint(cluster_medians/45)
    for i in range(0,len(cluster_medians)):
        for j in [1,2]:
            pathway_x[i*2+j] = cluster_medians[sorted_inf[i]]
            pathway_y[i*2+j+1] = inf_options_idx_no_baseline[sorted_inf[i]]

    # Set the end of the pathway at year 45
    pathway_x[-1] = 45

    # plot the pathway line    
    ax.plot(pathway_x, pathway_y + y_offset, color=c, linewidth=5, 
            alpha = .9, zorder=1)

    # Formatting plot elements 
    ax.set_xlim([0,44])
    inf_name_lables = inf_names
    inf_options_idx = np.arange(len(inf_name_lables))
    ax.set_yticks(inf_options_idx)
    ax.set_xlabel('Week')
    
    if ylabeling:
        ax.set_yticklabels(inf_name_lables)
    else:
        ax.set_yticklabels([])
    
    ax.set_ylim([-0.5,len(cluster_medians)+.5])

    plt.show()

The above script can be used to plot the median pathway for a single cluster, but it would be much more worthwhile to compare the median pathways of the different clusters against one another. To do this, we can define another function which will use the plot_single_pathway() iteratively to plot each of the median cluster pathways on the same figure.

import numpy as np
import matplotlib.pyplot as plt
from plot_single_pathway import plot_single_pathway

def plot_many_paths(clust_meds, clust_pathways, n_clusters, cluster_colors, fig, 
                    gspec, fig_col, ylabeling, plot_legend):
    
    
    y_offsets = [-.15, 0, .15]
    
    # Specific information for Utility
    inf_options_idx = np.array([0, 1,2, 3, 4])
    utility_inf = ['Baseline', 'Small intake expans.', 'Large intake expans.',
                    'Small WTP expans.', 
                    'Large WTP expans.']
    
    ax = fig.add_subplot(gspec[0,0])
    
    for i in np.arange(n_clusters):
        plot_single_pathway(clust_meds[i], clust_pathways[i], inf_options_idx, cluster_colors[i],
                            utility_inf, y_offsets[i], ylabeling, False, True, ax)
        plt.show()
    
    if plot_legend and (n_clusters == 3):
        ax.legend(['Light inf.', 'Moderat inf.', 'Heavy inf.'], 
                  loc='upper left')
    elif plot_legend and (n_clusters == 2):
        ax.legend(['Light inf.', 'Heavy inf.'], loc='upper left')
        
    ax.tick_params(axis = "y", which = "both", left = False, right = False)
    
    plt.show()
    return fig

Time for the reveal! Execution of this final portion of the script will plot each of the three clusters, and color-code them according to classifications as “heavy”, “moderate”, or “light” infrastructure.

### Plotting ###
from plot_many_paths import plot_many_paths

fig = plt.figure(figsize = (5,5))
gspec = fig.add_gridspec(ncols=1, nrows=1)

# Colorscale
heavy_inf_color = '#132a13'
mid_inf_color = '#4f772d'
light_inf_color = '#90a955'
cluster_colors = [light_inf_color, mid_inf_color, heavy_inf_color]

plot_many_paths(cluster_medians, cluster_pathways, 3, cluster_colors, 
                fig, gspec, 0, True, True)
Figure: Median infrastructure pathways for clustered pathways.

Voilà!

The figure above shows a clear contrast in the behavior of the light and heavy infrastructure pathways; the heavy infrastructure pathway is not only characterized by early infrastructure construction, but also by more frequent construction throughout the simulation period, in comparison to the median of the light infrastructure cluster which only requires a single infrastructure expansion within the same simulation period.

From the plot above, it is not apparent that three clusters are necessarily appropriate for this data set. Although the heavy and light infrastructure clusters are clearly demarcated from one another, the moderate infrastructure cluster has characteristics of both: a late first-expansion similar to the light infrastructure cluster, but a high frequency of construction throughout the period similar to the heavy infrastructure. Should the moderate infrastructure really be a separate cluster?

Let’s take advantage of the knowledge we developed earlier, and consider the silhouette scores corresponding to different numbers of clusters! Fortunately, scikit-learn has a suite of tools that made the silhouette analysis very easy.

In the following script, I evaluate the silhouette score associated with clustering for 2, 3, and 4 different clusters.

### Performing silhouette analysis ###
from sklearn.metrics import silhouette_score

num_clust = [2, 3, 4]

# Set up hierarchical clustering with different numbers of clusters
ac2 = AgglomerativeClustering(n_clusters = num_clust[0])
ac3 = AgglomerativeClustering(n_clusters = num_clust[1])
ac4 = AgglomerativeClustering(n_clusters = num_clust[2])

# Extract the silhouette score for each hierarchical grouping
silhouette_scores = []
silhouette_scores.append(silhouette_score(cluster_input, ac2.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac3.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac4.fit_predict(cluster_input)))
                                                  
plt.bar(num_clust, silhouette_scores)
plt.xlabel('Number of clusters', fontsize = 14)
plt.ylabel('Silhouette Score', fontsize = 14)
plt.xticks(num_clust)
plt.title("Silhouette analysis")
plt.show()
Figure: Silhouette analysis for the hierarchical clustering of infrastructure pathways.

Interestingly, the silhouette score was highest when the clustering was performed using only two clusters, suggesting that the moderate infrastructure class may not have been sufficiently unique to warrant a separate cluster.

Another method of visualizing the cluster structure is through the use of dendrograms. Below, I show dendrograms color-coded with only two clusters (left) and using three clusters (right).

Figure: Dendrograms showing the cluster structure of infrastructure pathways for a utility, using 2 (left) and 3 (right) clusters.

The dendrogram containing only two clusters (left) reveals that the moderate and light clusters can be combined into a single cluster, because the moderate cluster is less-different from the light infrastructure cluster than it is from the heavy infrastructure cluster. Although, it is important to note that dendrograms are not a reliable method of determining the number of clusters on their own. In fact, given that the difference in silhouette scores between the 2-cluster and 3-cluster hierarchies is relatively modest (~0.05) there is not sufficiently strong justification to choose the 2 clusters over 3 clusters.

For the sake of comparison, here are the 2-cluster and 3-cluster median pathway plots:

Conclusion

I hope you’ve gained some insight into clustering, infrastructure pathways (or preferably both) in this post.

Analysis of infrastructure pathways via statistical clustering presents a unique application of a very common statistical tool. In this analysis I clustered the pathways with respect to the timing of the infrastructure expansions, however there is potential for a much more in-depth analysis of how the hydrologic or urban contexts of each realization shape these infrastructure pathways… possibly justifying another blog post at a later date. Stay tuned!

Thanks again to David Gold, for providing the data and the foundational python code.

References

Zeff, H. B., Herman, J. D., Reed, P. M., & Characklis, G. W. (2016). Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways. Water Resources Research52(9), 7327-7346.

Easy batch parallelization of code in any language using mpi4py

The simplest form of parallel computing is what’s known as “embarrassingly” parallel processes. These processes involve fully independent runs of a model or script where little or no communication is needed across parallel processes. A common example is Monte Carlo evaluation, when we run a model over an ensemble of inputs. To parallelize an embarrassingly parallel application we simply need to send a set of commands to the cluster telling it to run each sample on a different core (or set of cores). For small applications, this can be done by submitting each run individually. For larger applications, SLURM Job Arrays (which are nicely detailed in Antonia’s post, here) can efficiently batch large number of function calls to independent computing cores. While this method is efficient and effective, I find it sometimes can be hard to keep track of, as you may be submitting tens or hundreds of jobs at a time. An alternative approach to submitting embarrassingly parallel tasks is to utilize MPI with Python to dispatch and organize jobs.

I like the MPI / Python combo because it consolidates all parallel applications into a single job, meaning you have one job to keep track of on a cluster at a time, and one output file generated by the batch set of model runs. I also find Python slightly easier to edit and debug than Bash scripts (which are used to create job arrays). Additionally, it’s very easy to assign each computing core a set of function evaluations to run (this can also be done with Job arrays, but again, I find Python easier to work with). Though Python is the language used to coordinate parallel tasks, we can use it to parallelize code in any language, as I’ll demonstrate below.

In this post I’ll first provide some background on MPI and its Python implementation, mpi4py. Next I’ll provide an example I’ve developed to demonstrate how to batch run a Matlab code on a cluster. The examples presented here are derived from some of Bernardo’s code in his post on Parallel programming in C/C++, which you can find here.

A very light introduction to MPI

MPI stands for “Message Passing Interface” and is the standard library for distributed memory parallelization (for background, see this post). To understand how MPI works, it’s helpful to define some of it’s basic components.

  1. Tasks: I’ll use the term task to define a processor (or group of processors) assigned to perform a specific set of instructions. These instructions may by a single evaluation of a function, or a set of function evaluations
  2. Communicators: A communicator is a group of MPI task units that are permitted to communicate with each other. In advanced MPI applications you may have multiple communicators, but for embarrassingly parallel applications we’ll only use one. The default communicator is called “MPI_COMM_WORLD” (I don’t know why, if anyone does please feel free to share in the comments), and that’s what I’ll work with here.
  3. Ranks: Each MPI task is assigned a unique identifier within the communicator called a rank. The processors running each task can access their own rank number, which will play an important role in how we use MPI for embarrassingly parallel applications.

A example schematic of the MPI_COMM_WORLD communicator with six tasks and their associated ranks is shown below.

mpi4py

MPI is implemented in Python with the mpi4py library. When we run an MPI code on a cluster, MPI creates the communicator and assigns each task a rank, then each task unit independently load the script. The processor/s associated with a task can then access their own unique rank.

The following snip of code loads this library, accesses the communicator and stores the rank of the given process:

# load the mpi4py library
from mpi4py import MPI

# access the MPI COMM WORLD communicator and assign it to a variable
comm = MPI.COMM_WORLD

# get the rank of the current process (different for each process on the cluster)
rank = comm.Get_rank()

Example of using mpi4py to batch parallel jobs

Here, I’ll parallelize the submission of a Matlab script called demoScript.m. This script reads an input file from a specific file location and prints out the contents of that file. For example purposes I’ve created 20 input files, each in their own folders. The folders are called “input_sample_0”, “input_sample_1” etc.. Each input_sample folder contains a file called “sample_data.txt”, which contains one line of text reading: “This is data for run <sample_number>”.

All code for this example can be found on Github, here: https://github.com/davidfgold/mpi4py_blog.git

Batching runs of demoScript.m process involves three components:

  1. Write demoScript.m so that it reads the sample number from the input.
  2. Write a Python script that will use mpi4py to distribute calls of demoScript.m. Here I’ll call this script “callDemoScript.py”
  3. Write a Bash script that sets up your MPI run and calls the Python function. Here I’ll call this script “submitDemoScript.sh”

1. demoScript.m

The demo Matlab script is found below. It reads in two arguments that are called from the command line. The first argument is the rank, which will vary for each task, and the second is the sample number, which will specify which input folder to read from.

%%%%%%%%%%%%%%%%%%%%
% demoScript.m
%
% reads an input file from a given sample number (specified via command line)
% prints output from the sample file associated with the sample number
% also prints the rank for demonstration purposes
%%%%%%%%%%%%%%%%%%%%

% read in command line input
arg_list = argv();
rank = arg_list{1,1}; % rank is the first argument
sample = arg_list{2, 1}; % sample number is the second argument

% Create a string that contains the location of the proper sample directory
sample_out = fileread(strcat("input_sample_", sample, "/sample_data.txt"));

% create a string to print the rank number
rank_call = strcat("This is rank_", rank, ", recieving the following input: \n");

% format the output and print
output = strcat(rank_call, sample_out);
fprintf(output)

2. callDemoScript.py

The second component is a Python script that uses mpi4py to call demoScript.m many times across different tasks. Each task will run a number of samples equal to a variable called “N_SAMPLES_PER_TASK” which will be fed to this script when it is called.

'''
callDemoScript.py

Called to batch demoScript.m across multiple MPI tasks

Reads in the total tasks and number of samples per task from command line.
'''
# load necessary libraries
from mpi4py import MPI
import numpy as np
import sys
import os
import time

# locate the COMM WORLD communicator
comm = MPI.COMM_WORLD

# get the number of the current rank
rank = comm.Get_rank()

# read in arguments from the submission script
TOTAL_TASKS = int(sys.argv[1]) # number of MPI processes
N_SAMPLES_PER_TASK = int(sys.argv[2]) # number of runs per/task

# loop through samples assigned to current rank
for i in range(N_SAMPLES_PER_TASK):
	sample= rank + TOTAL_TASKS * i

	# write the command that will be sent to the terminal (here RUN will replace the {})
	terminal_command = "octave-cli ./demoScript.m {} {} ".format(rank, sample)

	# write the terminal command to the process
	os.system(terminal_command)

	# sleep before submitting the next command
	time.sleep(1) # optional, for memory intensive submissions

comm.Barrier()

submitDemoscript.sh

The final component is a Bash script that will send this MPI job to the cluster. Here I’ll use SLURM to create 4 MPI tasks across 2 Nodes (each node will have 2 associated task). This will create a total of 4 MPI tasks, and each task will be assigned 5 samples to run.

I wrote this for a local cluster at Cornell, note that I had to load two modules to run Python and a third to run Octave (which is used to call Matlab scripts on Linux). I’ll call the Python script with mpirun, and then specify the total number of MPI tasks before making the function call. The output of the script is printed to a text file called demoOutput.txt

# Set up your parallel runs
SAMPLES_PER_TASK=5 # number of runs for each MPI task
N_NODES=2 # number of nodes
TASKS_PER_NODE=2 # number of tasks per node

TOTAL_TASKS=$(($N_NODES*$TASKS_PER_NODE)) # total number of tasks

# Submit the parallel job
#!/bin/bash
#SBATCH -n $(TOTAL_TASKS) -N $(N_NODES)
#SBATCH --time=0:01:00
#SBATCH --job-name=demoMPI4py
#SBATCH --output=output/demo.out
#SBATCH --error=output/demo.err
#SBATCH --exclusive
module load py3-mpi4py
module load py3-numpy
module load octave/6.3.0

mpirun -np $TOTAL_TASKS python3 callDemoScript.py $TOTAL_TASKS $SAMPLES_PER_TASK > demoOutput.txt

Additional resources

Putting some thought into how you design a set of parallel runs can save you a lot of time and headache. The example above has worked well for me when submitting sets of embarrassingly parallel tasks, but each application will be different, so take the time to find the procedure that works best for you. Our blog and the internet are full of resources that can help you parallelize your code, below are some suggestions:

Performing Experiments on HPC Systems

Scaling experiments: how to measure the performance of parallel code on HPC systems

Parallel processing with R on Windows

How to automate scripts on a cluster

Parallelization of C/C++ and Python on Clusters

Developing parallelised code with MPI for dummies, in C (Part 1/2)

Cornell CAC glossery on HPC terms: https://cvw.cac.cornell.edu/main/glossary

A great MPI tutorial I found online: https://mpitutorial.com/tutorials/

Writing sharable Python code part II: Unit Testing

When writing Python functions that may be used by others, it is important to demonstrate that your functions work as intended. In my last post I discussed how a proper function specification establishes a sort of contract between developers and users of functions that delineates errors in implementation (user error) from bugs in the code (developer error). While this contract is provides an important chain of accountability, it does not actually contain any proof that the function works as advertised. This is where unit testing comes in. A unit test simply runs a function over a suite of test cases (set of inputs that produce known results) to verify performance. As its name implies, a single unit test is meant to test one basic component of a code. Large or complex codes will have many sets of unit tests, each testing different elements (usually individual functions). Unit testing provides users with proof that your code works as intended, and serves as a powerful tool for finding and removing any errors in your code.

In this post I’ll provide guidance on the development of unit tests and demonstrate an example for a simple python function. Material in this post represents my interpretation of content taught by Professor Walker White at Cornell, in a Python Fundamentals course that I proctored in the summer of 2020.

An example script

To illustrate unit testing I’ll examine the a function called “check_satisficing.py” which tests whether a set of performance objectives meets a set of satisficing criteria (for background on satisficing, see this post). The function returns a boolean and has three parameters:

  • objs: a numpy array containing the two objective values
  • criteria: a numpy array two criteria,
  • directions: a list of strings that specify whether the criteria is meant to be a lower or upper bound.

Note that the actual code for this function only takes eight lines, the rest is the function specification and precondition checks.

def check_satisficing(objs, criteria, directions):
    """
    Returns whether a set of objectives meets a set of satisficing criteria
    
    Value return has type bool
    
    Parameters:
        objs: objective values from a given solution
        criteria: satisficing criteria for each objective
        directions: a list of strings containing "ge" or "le" for each 
            criteria, where ge indicates satisfication for values greater than
            or equal to the criteria and le indicates satisfaction for values 
            less than or equal to the criteria
    
    
    Examples:
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['ge', 'ge'] 
            returns True
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['le', 'le'] 
            returns False     
        objs = [.4, .4], criteria = [0.5, 0.5], directions = ['le', 'le'] 
            returns True
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['ge', 'le'] 
            returns False
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['le', 'ge'] 
            returns False
        objs = [.5, .5], criteria = [0.5, 0.5], directions = ['ge', 'ge'] 
            returns True
        objs = [.5, .5], criteria = [0.5, 0.5], directions = ['le', 'le'] 
            returns True
    
    Preconditions:
        objs is a numpy array of floats with length 2
        criteria is a numpy array of floats with length 2
        directions is a list of strings with length 2 containing either 
            "ge" or "le"
    """
    
    # check preconditions
    assert len(objs) == 2, 'objs has length ' + repr(len(objs)) + ', should be 2'
    assert len(criteria) == 2, 'criteria has length ' + repr(len(criteria)) + \
    ', should be 2'
    assert len(directions) == 2, 'directions has length ' + \
    repr(len(directions)) + ', should be 2'
    
    # check to make sure
    for i in range(2):
        assert type(objs[i])== np.float64, 'objs element ' + str(i) + ': ' + \
        repr(objs[i]) + ', is not a numpy array of floats'
        assert type(criteria[i])== np.float64, 'criteria element ' + str(i) + \
        ': ' + repr(criteria[i]) + ', is not a numpy array of floats'
        assert type(directions[i])== str, 'directions element ' + str(i) + \
        ': ' + repr(directions[i]) + ', is not a string'
        assert directions[i] == 'ge' or directions[i] == 'le', 'directions ' + \
        str(i) + ' is ' + repr(directions[i]) + ', should be either "ge" or "le"' 
    
    
    # loop through objectives and check if each meets the criteria
    meets_criteria = True
    for i in range(2):
        if directions[i] == 'ge':
            meets_criteria = meets_criteria and (objs[i] >= criteria[i])
        else:
            meets_criteria = meets_criteria and (objs[i] <= criteria[i])
    
    return meets_criteria

Developing test cases

If you read my last post, you may notice that this specification includes an extra section called “Examples”. These examples show the user how the function is supposed to perform. They also represent the suite of test cases used to validate the function. Creating test cases is more of an art than a science, and test cases will be unique to each function you write. However, there is a set of basic rule you can follow to guide your implementation of unit testing which I’ll outline below.

  1. Pick the simplest cases first. In this example, the simplest cases are when both objectives are both above or below the criteria
  2. Move to more complex cases. In this example, a more complex case could be when one objective is above and the other below, or vice versa
  3. Think about “weird” possibilities. One “weird” case for this code could be when one or both objectives are equal to the criteria
  4. Never test a precondition violation. Precondition violations are outside the scope of the function and should not be included in unit tests

Test cases should be exhaustive and even simple codes may have many test cases. In my example above I provide seven, can you think of any more that are applicable to this code?

Implementing a testing strategy

After you’ve developed your test cases, it’s time to implement your unit test. For demonstration purposes I’ve written my own unit test code which can be used to test the cases developed above. This function simply utilizes assert statements to check if each test input generates the correct output. A danger of writing your own testing function is that the test function itself may have errors. In practice, it’s easiest to use an established tool such as PyTest to perform unit testing (for in-depth coverage of PyTest see Bernardo’s post from 2019).

def test_check_satisficing():
    """
    Unit test for the function check_satisficing
    """
    import numpy as np
    from check_satisficing import check_satisficing
    
    print("Testing check_satisficing")
    
    
    # test case 1:
    objs1 = np.array([0.5, 0.5])
    criteria1 = np.array([0.4, 0.4])
    directions1 = ['ge','ge']
    result1 = True
    
    assert (check_satisficing(objs1, criteria1, directions1)) == result1, \
    'Test 1 failed ' + repr(objs1) + ', ' + repr(criteria1) + ', ' + \
    repr(directions1) + ' returned False, should be True'
    
    # test case 2:
    objs2 = np.array([0.5, 0.5])
    criteria2 = np.array([0.4, 0.4])
    directions2 = ['ge','le']
    result2 = False
    
    assert (check_satisficing(objs2, criteria2, directions2)) == result2, \
    'Test 2 failed ' + repr(objs2) + ', ' + repr(criteria2) + ', ' + \
    repr(directions2) + ' returned True, should be False'
    
    
     # test case 3:
    objs3 = np.array([0.4, 0.4])
    criteria3 = np.array([0.5, 0.5])
    directions3 = ['le','le']
    result3= True
    
    assert (check_satisficing(objs3, criteria3, directions3)) == result3, \
    'Test 3 failed ' + repr(objs3) + ', ' + repr(criteria3) + ', ' + \
    repr(directions3) + ' returned False, should be True'
    
    
     # test case 4:    
    objs4 = np.array([0.5, 0.5])
    criteria4 = np.array([0.4, 0.4])
    directions4 = ['ge','le']
    result4 = False
    
    assert (check_satisficing(objs4, criteria4, directions4)) == result4, \
    'Test 4 failed ' + repr(objs4) + ', ' + repr(criteria4) + ', ' + \
    repr(directions4) + ' returned True, should be False'
    
    
    # test case 5    
    objs5 = np.array([0.5, 0.5])
    criteria5 = np.array([0.4, 0.4])
    directions5 = ['le','ge']
    result5 = False
    
    assert (check_satisficing(objs5, criteria5, directions5)) == result5, \
    'Test 5 failed ' + repr(objs5) + ', ' + repr(criteria5) + ', ' + \
    repr(directions5) + ' returned True, should be False'
    
    
    # test case 6: 
    objs6 = np.array([0.5, 0.5])
    criteria6 = np.array([0.5, 0.5])
    directions6 = ['ge','ge']
    result6 = True
    
    assert (check_satisficing(objs6, criteria6, directions6)) == result6, \
    'Test 6 failed ' + repr(objs6) + ', ' + repr(criteria6) + ', ' + \
    repr(directions6) + ' returned False, should be True'
    
    # test case 7: 
    objs7 = np.array([0.5, 0.5])
    criteria7 = np.array([0.5, 0.5])
    directions7 = ['le','le']
    result7 = True
    
    assert (check_satisficing(objs7, criteria7, directions7)) == result7, \
    'Test 7 failed ' + repr(objs7) + ', ' + repr(criteria7) + ', ' + \
    repr(directions7) + ' returned False, should be True'
    
    
    print("check_satisficing has passed all tests!")

    return    

Concluding thoughts

When developing a new Python function, it’s good practice to code the test cases while you’re writing the function and test each component during the development cycle. Coding in this matter will allow you to identify and fix errors early an can save a lot of time and headache. Creating test cases during code development may also improve the quality of your code by helping you conceptualize and avoid potential bugs.

Writing sharable Python code

Most of us in academia do not have formal training in computer science, yet for many of us writing and sharing code is an essential part of our research. While the quality of code produced by many graduate students can be quite impressive, many of us never learned the basic CS principles that allow programs to be sharable and inheritable by other programmers. Last summer I proctored an online class in Python fundamentals. The course was for beginner programmers and, though it covered material simpler than the scripts I write for my PhD, I learned a lot about how to properly document and structure Python functions to make them usable by others. In this post I’ll discuss two important concepts I took away from this course: writing proper function specifications and enforcing preconditions using assert statements.

Function Specifications

One of the the most important aspects of writing a quality Python function is proper specification. While the term may sound generic, a specification actually has a very precise definition and implementation for a Python function. In practice, a specification is a docstring, a “string literal” that occurs as the first statement in a function, module, class or method, formed by a bracketed set of “””. Here is an example of a simple function I wrote with a specification:


def radians_to_degrees(theta):
    """
    Returns: theta converted to degrees
    
    Value return has type float
    
    Parameter theta: the angle in radians
    Precondition: theta is a float
    """
    return theta * (180.0/3.14159)

The function specification is everything between the sets of “””. When Python sees this docstring at the front of a function definition, it automatically is stored as the “__doc__” associated with the function. With this specification in place, any user that loads this function can access its __doc__ by typing help(radians_to_degrees), which will print the following to the terminal:

Help on function radians_to_degrees in module __main__:

radians_to_degrees(theta)
    Returns: theta converted to degrees
    
    Value return has type float
    
    Parameter theta: the angle in radians
    Precondition: theta is a float

The help function will print anything in the docstring at the start of a function, but a it is good practice for the specification to have the following elements:

  1. A one-line summary of the function at the very beginning, if the function is a “fruitful function” (meaning it returns something), this line should tell what it returns. In my example I note that my function returns theta converted to degrees.
  2. A more detailed description of the function. In my example this simply provides the type of the return value.
  3. A description of the function’s parameter(s)
  4. Any preconditions that are necessary for the code to run properly (more on this later)

I should note that officially the Python programming language has a more flexible set of requirements for function specifications, which can be found here, but the attributes above are a good starting point for writing clear specifications.

Properly specifying a Python function will clarify the function’s intended use and provide instructions for how new users can utilize it. It will also help you document your code for formal release if you ever publish it. Google any of your favorite Python functions and you’ll likely be brought to a page that has a fancy looking version of the function’s specification. These pages can be automatically generated by tools such as Spinx that create them right from the function’s definition.

Aside from clarifying and providing instructions for your function, specifications provide a means of creating a chain of accountability for any problems with your code. This chain of accountability is created through precondition statements (element four above). A precondition statement dictates requirements for the function to run properly. Preconditions may specify the type of parameter input (i.e. x is a float) or a general statement about the parameter (x < 0).

For large teams of many developers and users of functions, precondition statements create a chain of accountability for code problems. If the preconditions are violated and a code crashes, then it is the responsibility of the user, who did not use the code properly. On the other hand, if the preconditions were met and the code crashes, it is the responsibility of the developer, who did not properly specify the code.

Asserting preconditions to catch errors early

One way to improve the usability of a code is to catch precondition violations with statements that throw specific errors. In Python, this may be done using assert statements. Assert statements evaluate a boolean expression and can display a custom error message if the expression returns false. For example, I can modify my radians_to_degrees function with the following assert statement (line 10):

def radians_to_degrees(theta):
    """
    Returns: theta converted to degrees
    
    Value return has type float
    
    Parameter theta: the angle in radians
    Precondition: theta is a float
    """
    assert type(theta) == float, repr(theta) + ' is not float'
    return theta * (180.0/3.14159)

A helpful function in my assert statement above is “repr”, which will return a printable representation of a given object (i.e. for a string it will keep the quotation marks around it). Now if I were to enter ‘a’ into my function, I would get the following return:

AssertionError: 'a' is not float

This is trivial for my example, but can save a lot of time when running large and complex functions. Rather than seeing 30 lines of “traceback…”, the user will be immediately brought to the source of the error. In general, you should try to make assert statements as precise as possible to pinpoint the violation.

It’s important to note that not every conceivable precondition violation requires an assert statement. There is a tradeoff between code efficiency and number of assert statements (checking a precondition takes time). Determining which preconditions to strictly enforce with assert statements is a balancing act and will be different for each application. It’s important to remember though, that even if you do not have an assert statement for a precondition, you should still list all preconditions in the function specification to preserve the chain of accountability.

Further resources

This post concerns specifications for Python functions. However, the use of docstrings is also important when coding modules, classes and public methods. Guidance on how these docstrings may be formatted can be found here.

While this post focused on Python, informative function specifications are important for all programming languages. Some resources on documentation in other languages can be found below:

More on simple Bash Shell scripts (examples of “find” and “sed”)

When you conduct a large ensemble of computer simulations with several scenarios, you are going to deal with many data, including inputs and outputs.  You also need to create several directories and subdirectories where you can put or generate the inputs and outputs for your model.  For example, you may want to run a cropping system model across a large region, for 3500 grid cells, and you need to feed your model with the input files for each grid cell. Each grid cell has its own weather, soil, crop and management input files. Or you may want to run your model 100,000 times and each time use one set of crop parameters as an input, to conduct a sensitivity analysis. Another common practice of large simulations is looking for any hidden error that happens during the simulations. For example, your running jobs might look normal, without any obvious crash, but you may still get some kind of “error” or “warning” in your log files. So, you need to find those runs, correct them, delete the wrong files and rerun them to have a full set of outputs. These tasks are basic but could be challenging and very time-consuming if you do not know how to complete them efficiently. Linux environment provides facilities that make our lives easier as Dave said in his blog post, and Bernardo also provided some examples for this type of task. Here are a few more instances of simple but useful commands with “find” and “sed.”

find

Sometimes, you want to know how many files with a specific pattern exist in all the subdirectories in a folder. You can type below command at the upper-level folder. “f” means files, and in front of the “name,” we specify the pattern—for example, files that start with “218”. Or we can look for all the files that have the specific extension [i.e. *.csv] or have a specific strings in their name [i.e. *yield*].

find . -type f -name "218*"

Then we can transfer the listed lines of results [-l] to a count function [wc] with pipe [|]:

find . -type f -name "218*" |  wc -l

You may want to find and delete all files with the specific pattern [i.e. 218_wheat.csv] in all directories in which they might exist. So, after we find the files, we can execute [exec] the remove command [rm]:

find . -type f -name "218_wheat*" -exec rm {} \;

If these files are located in different directories and we don’t want to delete them all, we can also filter the find results by specifying the pattern of path [i.e. output] and then delete them:

find . -type f -path "*/output/*" -name "218_wheat *" -exec rm {} \;

Sometimes, we need to find specific paths instead of files. For example, I have a text file, and I want to copy that into the “CO2” folder, which is located in the “Inputs” folders of several scenario folders:

find . -type d -path "*/Inputs/*" -name "CO2" -exec cp ../CO2_concentration_8.5.txt {} \;

 “d” means directories, so we are searching for directories that contain “Inputs” folder and end with “CO2” folder. Then, we execute the copy command [cp], which copies the text file into the found directories.

If we are looking for a specific string inside some files, we can combine “find” and “grep.” For example, here I am looking for any error file [*.err] that starts with “218cs” if it contains this specific warning: “unable to find”

find . -type f -name “218cs*.err” –exec grep -i “unable to find” {} \;

Or we can look for files that do not contain “success.”

find . -type f -name 218cs*.err" -exec grep -L "success" {} \;

sed

“sed” is a powerful text editor. Most of the time it is used to replace specific string in a text file:

sed -i 's/1295/1360/' 218cs.txt

Here, we insert [i] and substitute [s] a new string [1360] to replace it with the original string [1295]. There might be few duplication of “1295” in a file, and we may just want to replace one of them—for example, one located at line 5:

sed -i '5s/1295/1360/' 218cs.txt

There might be more modifications that have to be done, so we can add them in one line using “-e”:

sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/' 218cs.txt

find + sed

If you want to find specific text files (i.e., all the 218cs.txt files, inside RCP8.5 folders) and edit some lines in them by replacing them with new strings, this line will do it:

find . -type f -path "*/RCP8.5/*" -name "218*" -exec sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/'  {} \;

Sometimes, you want to replace an entire line in a text file with a long string, like a path, or keep some space in the new line. For example, I want to replace a line in a text file with the following line, which has the combination of space and a path:

“FORCING1         /home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_”

For this modification, I am going to assign a name to this line and then replace it with the whole string that is located at line 119 in text files [global_param_default.txt], which are located in specific directories [with this pattern “RCP4.5/451812118”].

path_new="FORCING1	/home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_"
find . -type f -path "*RCP4.5/451812118/*" -name "global_param_*" -exec sed -i "119s|.*|$path_new|" global_param_default.txt {} +

Sometimes, you want to add a new line at the specific position (i.e., line 275) to some text files (i.e., global_param_default.txt).

find . -type f -name " global_param_*" -exec sed -i "275i\OUTVAR    OUT_CROP_EVAP  %.4f OUT_TYPE_FLOAT  1" {} \; 

Now, all of the “global_param_default” files have a new line with this content: “OUTVAR    OUT_CROP_EVAP  %.4f OUT_TYPE_FLOAT  1”.

It is also possible that you want to use a specific section of a path and use it as a name of a variable or a file. For example, I am searching for directories that contain an “output” folder. This path would be one of the them: ./453911731_CCF/output/ Now, I want to extract “453911731” and use it as a new name for a file [output__46.34375_-119.90625] that is already inside that path:

for P in $(find . -type d -name "output"); do (new_name="$(echo "${P}"| sed -r 's/..(.{9}).*/\1/')" && cd "${P}" && mv output__46.34375_-119.90625 $ new_name); done

With this command line, we repeat the whole process for each directory (with “output” pattern) by using “for,” “do,” and “done.” At the beginning of the command, the first search result, which is a string of path, is assigned to the variable “P” by adding $ and () around “find” section .Then, the result of “sed –r” is going to be assigned to another variable [new_name]; “-r” in the sed command enables extended regular expressions.

With the “sed” command, we are extracting 9 characters after “./” and removing everything after 9 characters. Each “.” matches any single character. Parentheses are used to create a matching group. Number 9 means 9 occurrences of the character standing before (in this case “.” any character), and “\1” refers to the first matched substring

“&&” is used to chain commands together. “cd” allows you to change into a specified path, which is stored in $P, and “mv” renames the file in this path from “output__46.34375_-119.90625” to “453911731,” which is stored in $new_name.

Profiling your Python script using cProfile

Profiling refers to performing dynamic analysis on a script to measure its execution time, the execution time of its subcomponents, as well as how many times each subcomponent is being called. This produces data on where the script program is spending the most time, and can help with optimizing your script to minimize its execution time. This blog has had two past posts on profiling, one on C++ using Callgrind and one on Python using PyCharm. PyCharm is a Python IDE that’s very useful but unfortunately not free, so if you’re looking for some freeware profiling functionality in Python, this post is for you.

Python has a module called cProfile. A simple example on timing the multiplication of two matrices with cProfile:

import cProfile
import numpy as np

mat1 = ([1, 6, 3],[3, 6, 3],[2, 8, 3]) 
mat2 = ([2, 7, 6],[5, 4, 7],[6, 1, 9]) 
  
cProfile.run('np.dot(mat1,mat2)')

this should print out something like the following:

         4 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.dot}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

I have recently used cProfile on one of my own scripts which I’ll be using here to demonstrate how it can be used in your own work. I have a function called fish_game, which contained my model and took vars as input. This function also calls function hrvSTR which represented my action policy function (it’s extraneous to this post what these functions do exactly, one represents the system and the other represents a policy that we use to act on the system, you can see the full model here). The fish_game function was called by my optimization algorithm during optimization. Running cProfile on it produces this:

cProfile.run('fish_game(vars)',sort='cumulative')
         282814 function calls in 0.698 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.698    0.698 {built-in method builtins.exec}
        1    0.000    0.000    0.698    0.698 <string>:1(<module>)
        1    0.166    0.166    0.698    0.698 <ipython-input-4-df258d5a749f>:55(fish_game)
    20200    0.385    0.000    0.531    0.000 <ipython-input-4-df258d5a749f>:1(hrvSTR)
    20200    0.016    0.000    0.089    0.000 fromnumeric.py:1821(sum)
    20200    0.021    0.000    0.069    0.000 fromnumeric.py:64(_wrapreduction)
   121208    0.055    0.000    0.055    0.000 {built-in method numpy.core.multiarray.zeros}
    20200    0.046    0.000    0.046    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    20200    0.003    0.000    0.003    0.000 {built-in method builtins.isinstance}
    40400    0.003    0.000    0.003    0.000 {built-in method builtins.len}
    20200    0.002    0.000    0.002    0.000 {method 'items' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'normal' of 'mtrand.RandomState' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

This tells me one run of my function takes 0.698 seconds in total (this might vary slightly every time, depending on your processor usage at the time and other factors), and that most of that time, 0.531 seconds, are consumed by the hrvSTR function. Even though 0.7 seconds might not seem long, in the context of optimization, where this function would need to be evaluated tens of thousands of times, an additional 0.1 seconds might add hours of process time to your workflow. Trying to bring that down is probably a worthwhile investment of time that will result in time savings later on. As a result I figured there could be something I could do to reduce the time hrvSTR took. I particularly intrigued by the fact that some numpy process numpy.core.multiarray.zeros was called 121208 times, an order of magnitude more than every other method in my script, which prompted me to think that I might be unnecessarily repeating a process.

Looking at my code more closely (this is a script I have been using for more than a year now), I realized that I was ordering and normalizing and creating arrays for my action policy every single time it was called, something that was unnecessary, as the same policy was used for all time steps. I could instead perform all those steps once, save the outputs and use them for every time step instead of recalculating every time. I spent some time to adjust my script to do that and running cProfile again, produced this:

 cProfile.run('fish_game(vars)',sort='cumulative')
         79282 function calls in 0.379 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.379    0.379 {built-in method builtins.exec}
        1    0.000    0.000    0.379    0.379 <string>:1(<module>)
        1    0.150    0.150    0.379    0.379 <ipython-input-16-3c528334eb55>:62(fish_game)
    19800    0.193    0.000    0.229    0.000 <ipython-input-16-3c528334eb55>:35(hrvSTR)
    59414    0.036    0.000    0.036    0.000 {built-in method numpy.core.multiarray.zeros}
        4    0.000    0.000    0.000    0.000 fromnumeric.py:2817(mean)
        4    0.000    0.000    0.000    0.000 _methods.py:58(_mean)
        2    0.000    0.000    0.000    0.000 <ipython-input-16-3c528334eb55>:4(generate_policy)
        6    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        2    0.000    0.000    0.000    0.000 {method 'normal' of 'mtrand.RandomState' objects}
        2    0.000    0.000    0.000    0.000 fromnumeric.py:1821(sum)
        2    0.000    0.000    0.000    0.000 fromnumeric.py:64(_wrapreduction)
        4    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        4    0.000    0.000    0.000    0.000 _methods.py:48(_count_reduce_items)
        2    0.000    0.000    0.000    0.000 {method 'clip' of 'numpy.generic' objects}
        4    0.000    0.000    0.000    0.000 numeric.py:504(asanyarray)
       10    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        8    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        4    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        2    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

I basically shaved 0.3 seconds off my function evaluation time by investing some time to look more closely at my script. If optimizing with 30000 function evaluations, this translates to some 2+ hours of processing time that I am saving (I spent far less figuring this out for the first time).

You can also run cProfile directly from the command line like so:

python -m cProfile -s cumtime fish_game.py

Parallelization of C/C++ and Python on Clusters

There are multiple ways of parallelizing C/C++ and Python codes to be used on clusters. The quick and dirty way, which often gets the job done, is to parallelize the code with MPI and launch one process per core across multiple nodes. Though easy to implement, this strategy may result in a fair waste of results because:

  • With too many processes on one node, inefficient use of memory may make the code memory bottle-necked, meaning that a lot of time will be spent accessing memory and the cores will be sitting twiddling their thumbs.
  • If you have a 48-core node with 192 GB of memory (as the Skylake nodes on Stampede 2) and your code uses 10 GB of RAM per process, you will not be able to have more than 19 out of 48 cores working at the same time because more than that would exceed the node’s memory, which would crash the node, which would get you kicked out of the cluster for a while without a warning — if you do not know how much RAM your code requires, you should look into the Remora tool.

If you are working on big runs on systems that charge per usage time or if you are creating a code you will have to run multiple times, it may be worth looking into more clever parallelization strategies. Some of the techniques described in the next subsections will also make your code two to four times faster on your desktop/laptop and are rather straight-forward to implement. Most of the code in this blog post can be freely downloaded from this repository. Lastly, if the #include statements in the examples below are not displayed as followed by a library between “<>”, please refer to the mentioned online repository.

Hybrid Parallelization in C/C++

The word hybrid refers in this context to a mixed use of shared and distributed memory parallelization techniques. In distributed parallelization techniques, such as the Message Passing Interface (MPI), each process has its own dedicated chunk of memory within a RAM chip. This means that each process will have its own copy of the input data (repeated across processes), variables within the program, etc. Conversely, shared memory parallelization techniques such as OpenMP allow all threads (parallel unit of computation) to access the same block of memory, which can make a mess in your code with variables being changed seemingly out of nowhere by various parallel runs, but which lessens the amount of RAM needed and in some cases allow for more efficient memory utilization.

If running the Borg Master-Slave multiobjective evolutionary optimization algorithm (Borg MS) with hybrid parallelization, Borg MS will have each function evaluation performed by one MPI process, which could run in parallel with OpenMP (if your model is setup for that) across the cores designated for that process. As soon as a function evaluation is complete, the slave MPI process in charge of it will submit the results back to Borg MS’s master process and be given another set of decision variables to work on. The figure below is a schematic drawing of such hybrid parallelization scheme:

BorgHybridParallelization.png

OpenMP

Parallelizing a for-loop in C or C++ in which each iteration is independent — such as in Monte Carlo simulations — is straight-forward:

#include
#include "path/to/my_model.h"

int main(int argc, char *argv[])
{
    int n_monte_carlo_simulations = omp_get_num_threads(); // One model run per core.
    vector system_inputs = ;
    vector results(n_monte_carlo_simulations, 0);
#pragma omp parallel for
    for (int i = 0; i < n_monte_carlo_simulations; ++i)
    {
        results[i] = RunMonteCarloSimulation(system_inputs[i]);
    }
}

The only caveat is that you have to be sure that the independent simulations are not writing over each other. For example, if by mistake you had written results[0] instead of results[i], the final value on results[0] would be that of whichever thread (parallel simulation) finished last, while the rest of the vector would be of zeros. It is important to notice here that the vector system_inputs is read only once by the parallel code and all n_monte_carlo_simulations threads would have access to it in memory.

Though this example is not rocket science, there is a lot more to OpenMP that can be found here and here.

MPI

MPI stands for Message Passing Interface. It creates processes which pass messages (information packages) to each other. The most basic MPI code is one that creates independent processes — each with its own allocated chunk of memory that cannot be accessed by any other process — and just has them running independently in parallel without communicating with each other. This is great to distribute Monte Carlo runs across various nodes, although it may lead to the limitations mentioned in the beginning of this post. Below is an example of a very basic MPI code to run Monte Carlo simulations in parallel:

#include
#include "path/to/my_model.h"

int main(int argc, char *argv[])
{
    MPI_Init(NULL, NULL);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    vector system_inputs = ;
    RunMonteCarloSimulation(system_inputs[world_rank]);

    MPI_Finalize();
    return 0;
}

I will not cover in this post how to pass the output of the RunMonteCarloSimulation function back to a results vector, as in the previous example, because details of MPI is outside the scope of this post — a good MPI tutorial can be found here and great examples, including ones for Monte Carlo runs, can be found here. To run this code, you would have to compile it with mpicc mpi_demo.c -o MPIDemo and call it with the mpirun or mpiexec commands, which should be followed by the -n flag to specify the number of processes to be created, or mpirun -n 4 ./MPIDemo. The code above would run as many MonteCarlo simulations as there are processes (four processes if -n 4 was used), each with a rank (here a type of ID), and can be spread across as many nodes as you can get on a cluster. Each process of the code above would run on a core, which may result in the memory and/or processing limitations listed in the beginning of this post.

Mixing OpenMP and MPI

The ideal situation for most cases is when you use MPI to distribute processes across nodes and OpenMP to make sure that as many cores within a node as it makes sense to use are being used. To do this, you can use the structure of the code below.

#include
#include
#include
#include "path/to/my_model.cpp"

int main(int argc, char *argv[])
{
    MPI_Init(NULL, NULL);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int n_cores = omp_get_num_threads(); // get number of cores available for MPI process.
    vector system_inputs = ;
    vector results(n_cores , 0);
#pragma omp parallel for
    for (int i = 0; i < n_cores ; ++i)
    {
        // printf("C hybrid. Process-thread: %s-%d\n", world_rank, omp_get_thread_num());
        results[i] = RunMonteCarloSimulation(system_inputs[i]);
    }

    MPI_Finalize();
    return 0;

}

If you comment lines 4 and 20 and uncomment line 19, compile the code with mpicc hybrid.c -o Hybrid -fopenmp, set the number of threads in the terminal to eight by running set OMP_NUM_THREADS=8, and call the resulting executable with the command mpirun -n 4 ./Hybrid, the executable will spawn four processes, each of which spawning eight threads and printing the following output:

C hybrid. Process-thread: 0-0
C hybrid. Process-thread: 0-1
C hybrid. Process-thread: 0-4
C hybrid. Process-thread: 0-5
C hybrid. Process-thread: 0-7
C hybrid. Process-thread: 0-6
C hybrid. Process-thread: 0-2
C hybrid. Process-thread: 0-3
C hybrid. Process-thread: 1-6
C hybrid. Process-thread: 1-0
C hybrid. Process-thread: 1-1
C hybrid. Process-thread: 1-3
C hybrid. Process-thread: 1-5
C hybrid. Process-thread: 1-4
C hybrid. Process-thread: 1-7
C hybrid. Process-thread: 1-2
C hybrid. Process-thread: 2-6
C hybrid. Process-thread: 2-0
C hybrid. Process-thread: 2-3
C hybrid. Process-thread: 2-7
C hybrid. Process-thread: 2-2
C hybrid. Process-thread: 2-4
C hybrid. Process-thread: 2-5
C hybrid. Process-thread: 2-1
C hybrid. Process-thread: 3-0
C hybrid. Process-thread: 3-1
C hybrid. Process-thread: 3-3
C hybrid. Process-thread: 3-4
C hybrid. Process-thread: 3-6
C hybrid. Process-thread: 3-7
C hybrid. Process-thread: 3-2
C hybrid. Process-thread: 3-5

The code above can be used to have one process per node (and therefore one copy of system_inputs per node) using as many threads as there are cores in that node — depending on the number of cores, it is sometimes more efficient to have two or three MPI processes per node, each using half or a third of the cores in that node. Below is a PBS job submission script that can be used to generate the output above — job submission scripts for Slurm to be used on Stampede 2 can be found here.

#!/bin/bash
#PBS -N test_hybrid
#PBS -l nodes=2:ppn=16
#PBS -l walltime=:00:30
#PBS -o ./output/hybrid_c.out
#PBS -e ./error/hybrid_c.err
module load openmpi-1.10.7-gnu-x86_64
export OMP_NUM_THREADS=8

set -x
cd "$PBS_O_WORKDIR"

# Construct a copy of the hostfile with only 16 entries per node.
# MPI can use this to run 16 tasks on each node.
export TASKS_PER_NODE=2
uniq "$PBS_NODEFILE"|awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i nodefile

#cat nodefile
mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS ./Hybrid

Parallelizing Python on a Cluster

Python was not designed with shared-memory parallelization in mind, so its native parallelization library, the Multiprocessing library, does distributed-memory parallelization instead of shared. The issue with the Multiprocessing library is that it does not work across nodes, so you still need something like MPI. Here, we will use MPI4Py.

The code below parallelizes the function call across cores within the same node — please refer to this blog post for details about and other ways to use the Multiprocessing library.

from multiprocessing import Process, cpu_count

def do_something_useful(rank, shared_process_number):
    # Do something useful here.
    print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number)

# Create node-local processes
shared_processes = []
#for i in range(cpu_count()):
for i in range(8):
    p = Process(target=do_something_useful, args=(i,))
    shared_processes.append(p)

# Start processes
for sp in shared_processes:
    sp.start()

# Wait for all processes to finish
for sp in shared_processes:
    sp.join()

By adding MPI4Py to the code above, following the same logic of the C/C++ example, we get the following result:

from mpi4py import MPI
from multiprocessing import Process, cpu_count

def do_something_useful(rank, shared_process_number):
    # Do something useful here.
    print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

print 'Calling Python multiprocessing process from Python MPI rank {}'.format(rank)

# Create shared-ish processes
shared_processes = []
#for i in range(cpu_count()):
for i in range(8):
    p = Process(target=do_something_useful, args=(rank, i))
    shared_processes.append(p)

# Start processes
for sp in shared_processes:
    sp.start()

# Wait for all processes to finish
for sp in shared_processes:
    sp.join()

comm.Barrier()

Calling the code above with mpirun -n 4 python hybrid_pure_python.py will first result in four MPI processes being spawned with ranks 1 through 4, each spawning eight local processes. The output should look similar to the one below:

Calling Python multiprocessing process from Python MPI rank 0
Calling Python multiprocessing process from Python MPI rank 1
Python hybrid, MPI_Process-local_process (not quite a thread): 1-0
Python hybrid, MPI_Process-local_process (not quite a thread): 0-0
Python hybrid, MPI_Process-local_process (not quite a thread): 1-1
Python hybrid, MPI_Process-local_process (not quite a thread): 0-1
Python hybrid, MPI_Process-local_process (not quite a thread): 1-2
Python hybrid, MPI_Process-local_process (not quite a thread): 0-2
Python hybrid, MPI_Process-local_process (not quite a thread): 0-3
Python hybrid, MPI_Process-local_process (not quite a thread): 1-3
Calling Python multiprocessing process from Python MPI rank 2
Calling Python multiprocessing process from Python MPI rank 3
Python hybrid, MPI_Process-local_process (not quite a thread): 1-4
Python hybrid, MPI_Process-local_process (not quite a thread): 0-4
Python hybrid, MPI_Process-local_process (not quite a thread): 0-5
Python hybrid, MPI_Process-local_process (not quite a thread): 1-5
Python hybrid, MPI_Process-local_process (not quite a thread): 1-6
Python hybrid, MPI_Process-local_process (not quite a thread): 0-6
Python hybrid, MPI_Process-local_process (not quite a thread): 1-7
Python hybrid, MPI_Process-local_process (not quite a thread): 0-7
Python hybrid, MPI_Process-local_process (not quite a thread): 2-0
Python hybrid, MPI_Process-local_process (not quite a thread): 3-0
Python hybrid, MPI_Process-local_process (not quite a thread): 3-1
Python hybrid, MPI_Process-local_process (not quite a thread): 2-1
Python hybrid, MPI_Process-local_process (not quite a thread): 3-2
Python hybrid, MPI_Process-local_process (not quite a thread): 2-2
Python hybrid, MPI_Process-local_process (not quite a thread): 3-3
Python hybrid, MPI_Process-local_process (not quite a thread): 2-3
Python hybrid, MPI_Process-local_process (not quite a thread): 3-4
Python hybrid, MPI_Process-local_process (not quite a thread): 2-4
Python hybrid, MPI_Process-local_process (not quite a thread): 2-5
Python hybrid, MPI_Process-local_process (not quite a thread): 3-5
Python hybrid, MPI_Process-local_process (not quite a thread): 3-6
Python hybrid, MPI_Process-local_process (not quite a thread): 3-7
Python hybrid, MPI_Process-local_process (not quite a thread): 2-6
Python hybrid, MPI_Process-local_process (not quite a thread): 2-7

Below is a PBS submission script that can be used with the code above:

#!/bin/bash
#PBS -N test_hybrid
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:01:00
#PBS -o ./output/hybrid_python.out
#PBS -e ./error/hybrid_python.err
module load python-2.7.5
export OMP_NUM_THREADS=8

set -x
cd "$PBS_O_WORKDIR"

# Construct a copy of the hostfile with only 16 entries per node.
# MPI can use this to run 16 tasks on each node.
export TASKS_PER_NODE=2
uniq "$PBS_NODEFILE"|awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i nodefile

#cat nodefile
mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_pure_python.py

Parallelizing Monte Carlo Runs of C/C++ (or anything else) with Python

Very often we have to run the same model dozens to thousands of times for Monte-Carlo-style analyses. One option is to go on a cluster and submit dozens to thousand of jobs, but this becomes a mess for the scheduler to manage and cluster managers tend to not be very fond of such approach. A cleaner way of performing Monte Carlo runs in parallel is to write a code in Python or other language that spawns MPI processes which then call the model to be run. This can be done in Python with the subprocess library, as in the example below:

from mpi4py import MPI
import subprocess

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

print 'Calling OpenmpTest from Python MPI rank {}'.format(rank)
function_call = ['./OpenmpTest', '{}'.format(rank)]
subprocess.call(function_call)

comm.Barrier()

Here, OpenmpTest is an executable compiled from the C-OpenMP code below:

#include
#include 

int main(int argc, char *argv[])
{
#pragma omp parallel for
    for (int i = 0; i < omp_get_num_threads(); ++i)
    {
        printf("C OpenMP. Process-thread: %s-%d\n", argv[1], i);
    }
    return 0;
}

Submitting the code above with following PBS job submission script should result in the output below.

#!/bin/bash
#PBS -N test_hybrid
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:01:00
#PBS -o ./output/hybrid_python_c.out
#PBS -e ./error/hybrid_python_c.err
module load python-2.7.5
export OMP_NUM_THREADS=8

set -x
cd "$PBS_O_WORKDIR"

# Construct a copy of the hostfile with only 16 entries per node.
# MPI can use this to run 16 tasks on each node.
export TASKS_PER_NODE=2
uniq "$PBS_NODEFILE"|awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i nodefile

#cat nodefile
mpiexec –hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_calls_c.py

Output:

Calling OpenmpTest from Python MPI rank 0
Calling OpenmpTest from Python MPI rank 1
Calling OpenmpTest from Python MPI rank 2
Calling OpenmpTest from Python MPI rank 3
C OpenMP. Process-thread: 0-7
C OpenMP. Process-thread: 0-5
C OpenMP. Process-thread: 0-0
C OpenMP. Process-thread: 0-2
C OpenMP. Process-thread: 0-4
C OpenMP. Process-thread: 0-1
C OpenMP. Process-thread: 0-3
C OpenMP. Process-thread: 0-6
C OpenMP. Process-thread: 1-1
C OpenMP. Process-thread: 1-2
C OpenMP. Process-thread: 1-6
C OpenMP. Process-thread: 1-4
C OpenMP. Process-thread: 1-5
C OpenMP. Process-thread: 1-0
C OpenMP. Process-thread: 1-3
C OpenMP. Process-thread: 1-7
C OpenMP. Process-thread: 2-3
C OpenMP. Process-thread: 2-7
C OpenMP. Process-thread: 2-0
C OpenMP. Process-thread: 2-4
C OpenMP. Process-thread: 2-6
C OpenMP. Process-thread: 2-5
C OpenMP. Process-thread: 2-1
C OpenMP. Process-thread: 2-2
C OpenMP. Process-thread: 3-5
C OpenMP. Process-thread: 3-1
C OpenMP. Process-thread: 3-4
C OpenMP. Process-thread: 3-2
C OpenMP. Process-thread: 3-7
C OpenMP. Process-thread: 3-3
C OpenMP. Process-thread: 3-0
C OpenMP. Process-thread: 3-6

End of the Story

As highlighted in the examples above (all of which can be download from here), creating parallel code is not rocket science and requires very few lines of code. The OpenMP and Multiprocessing parallelization tricks can save hours if you have to run a model several times on your personal computer, and paired with MPI these tricks can save you from hours to days in time and thousands of node-hours on clusters. I hope you make good use of it.

Intro to Machine Learning Part 6: Gaussian Naive Bayes and Logistic Regression

Machine Learning problems often involve binary classification, which seeks to use a data point’s features, x, to correctly predict its label, y. In my last post I discussed binary classification with Support Vector Machines (SVM), which formulates the classification problem as a search for the maximum margin hyperplane that divides two classes. Today we’ll take different view on binary classification, we’ll use our training set to construct P(y|x), the probability of class y given a set of features x and classify each point by determining which class it is more likely to be. We’ll examine two algorithms for that use different strategies for estimating P(y|x), Naïve Bayes and Logistic regression. I’ll demonstrate the two classifiers on an example data set I’ve created, shown in Figure 1 below. The data set contains features X = (X1, X2) and  labels Y∈ (+1,-1),  positive points are shown as blue circles and negative as red triangles. This example was inspired by an in class exercise in CS 5780 at Cornell, though I’ve created this data set and set of code myself using python’s scikit-learn package.

raw_points

Figure 1: Example training set

 

Gaussian Naïve Bayes

Naïve Bayes is a generative algorithm, meaning that it uses a set of training data to generate P(x,y) and then uses Bayes Rule to find P(y|x):

P(y|x)=\frac{P(x|y)P(y)}{P(x)}                                (1)

A necessary condition for equation 1 to hold is the Naïve Bayes assumption, which states that feature values are independent given the label. While this is a strong assumption, it turns out that using this assumption can create effective classifiers even if it is violated.

To use Bayes rule to construct a classifier, we need a second assumption regarding the conditional distribution of each feature x on each label y. Here we’ll use a Gaussian distribution such that:

P(x|y) ~ N(\mu_y, \Sigma_y)                                                                                   (2)

Where \Sigma_y is a diagonal covariance matrix with [\Sigma_y]_{\alpha,\alpha}=\sigma^2_{\alpha, y} for each feature \alpha.

For each feature, $\alpha$, and each class, c we can then model P(x_\alpha|y) as:

P(x_\alpha|y=c) ~ N(\mu_{\alpha c},\sigma^2_{\alpha c})=\frac{1}{\sqrt{2\pi}\sigma_\alpha c}e^{-\frac{1}{2}(\frac{x_\alpha-\mu_{\alpha c}}{\sigma_{\alpha c}})^{2}}                              (3)

We can then estimate model parameters:

\mu_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)x_{i \alpha}                                                                   (4)

\sigma^2_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)(x_{i \alpha}-\mu_{\alpha c})^2                                                  (5)

Where:

n_c = \sum^{n}_{i=1}I(y_i=c)                                                                                (6)

Parameters can be estimated with Maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP).

Once we have fit the conditional Gaussian model to our data set, we can derive a linear classifier, a hyperplane that separates the two classes,  which takes the following form:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                             (7)

Where w is a vector of coefficients that define the separating hyperplane and b is the hyperplane’s intercept. W and b are functions of the Gaussian moments derived in equations 4 and 5. For a full derivation of the linear classifier starting with the Naive Bayes assumption, see the excellent course notes from CS 5780.

Logistic Regression

Logistic regression is the discriminative counterpart to Naive Bayes, rather than modeling P(x,y) and using it to estimate P(y|x), Logistic regression models P(y|x) directly:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                              (8)

Logistic regression uses MLE or MAP to directly estimate the parameters of the separating hyperplane, w and b rather than deriving them from the moments of P(x,y). Rather than seeking to fit parameters that best describe the test data, logistic regression seeks to fit a hyperplane that best separates the test data. For derivation of MLE and MAP estimates of logistic regression parameters, see the class notes from CS 5780.

Comparing Gaussian Naive Bayes and Logistic Regression

Below I’ve plotted the estimated classifications by the two algorithms using the Scikit-learn package in Python. Results are shown in Figure 2.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import seaborn as sns
sns.set(style='whitegrid')

## create a test data set ##
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3], \
[5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3], \
[4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
all_points = np.concatenate((pos,neg), 0)
labels = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1])

## compare Naive Bayes and Logistic Regression ##

# Fit Naive Bayes
gnb = GaussianNB()
gnb.fit(all_points, labels)

# make NB predictions and plot
x1_mesh, x2_mesh = np.meshgrid(np.arange(0,11,1), np.arange(0,11,1))
Y_NB = gnb.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_NB = Y_NB.reshape(x1_mesh.shape)

fig1, axes = plt.subplots(1,2, figsize=(10,4))

axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1')
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')
#plt.legend(['Positive Points', 'Negative Points'], scatterpoints=1)
#.savefig('NB_classification.png', bbox_inches='tight')

# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(all_points, labels)

# Make predictions and plot
Y_LR = lr.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_LR = Y_LR.reshape(x1_mesh.shape)

axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1'); 
axes[1].set_ylabel('X2'); axes[1].set_title("Logistic Regression")
plt.savefig('compare_classification.png', bbox_inches='tight')

 

 

compare_classification

Figure 2: Example classification with Gaussian Naive Bayes (left) and Logistic regression. Blue shaded areas represent a prediction of positive labels for the data points, the red shaded areas represent predictions of negative labels.

Figure 2 illustrates an important difference in the treatment of outliers between the two classifiers. Gaussian Naive Bayes assumes that points close to the centroid of class are likely to be members of that class, which leads it to mislabel positive training points with features (3,3), (4,4) and (5,5). Logistic regression on the other hand is only concerned with correctly classifying points, so the signal from the outliers is more influential on its classification.

So which algorithm should you use? The answer, as usual, is that it depends. In this example, logistic regression is able to correctly classify the outliers with positive labels while Naïve Bayes is not. If these points are indeed an indicator of the underlying structure of positive points, then logistic regression has performed better. On the other hand, if they are truly outliers, than Naïve Bayes has performed better. In general, Logistic Regression has been found to outperform Naïve Bayes on large data sets but is prone to over fit small data sets. The two algorithms will converge asymptotically if the Naïve Bayes assumption holds.

Visualizing P(y|x)

One advantage to these methods for classification is that they provide estimates of P(y|x), whereas other methods such as SVM only provide a separating hyperplane. These probabilities can be useful in decision making contexts such as scenario discover for water resources systems, demonstrated in Quinn et al., 2018. Below, I use scikit-learn to plot the classification probabilities for both algorithms.

# plot Naive Bayes predicted probabilities
fig2, axes = plt.subplots(1,2, figsize=(12,4))
axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1'); 
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')

# plot Logistic Regression redicted probabilities
LRcont = axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1')
axes[1].set_ylabel('X2'); axes[1].set_title('Logistic Regression')
cb = fig2.colorbar(LRcont, ax=axes.ravel().tolist())
cb.set_label('Probability of Positive Classification')
cb.set_ticks([0, .25, .5, .75, 1])
cb.set_ticklabels(["0", "0.25", "0.5", "0.75", "1.0"])
plt.savefig('compare_probs.png', bbox_inches='tight')

compare_probs

Figure 3: Conditional probabilities P(y|x) generated by Naive Bayes (left) and Logistic Regression.

Further reading

This post has focused on Gaussian Naive Bayes as it is the direct counterpart of Logistic Regression for continuous data. It’s important to note however, that Naive Bayes frequently used on data with binomial or multinomial features. Examples include spam filters and language classifiers. For more information on Naive Bayes in these context, see these notes from CS 5780.

As mentioned above, logistic regression has been for scenario discovery in water resources systems, for more detail and context see Julie’s blog post.

References

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Course Notes from MIT: https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.Logistic

Course Notes from Cornell: http://www.cs.cornell.edu/courses/cs4780/2018fa/syllabus/index.html

Quinn, J. D., Reed, P. M., Giuliani, M., Castelletti, A., Oyler, J. W., & Nicholas, R. E. (2018). Exploring how changing monsoonal dynamics and human pressures challenge multireservoir management for flood protection, hydropower production, and agricultural water supplyWater Resources Research54, 4638–4662. https://doi.org/10.1029/2018WR022743

Magnitude-varying sensitivity analysis and visualization (Part 2)

In my last post, I talked about producing these flow-duration-curve-type figures for an output time-series one might be interested in, and talked about their potential use in an exploratory approach for the purpose of robust decision making. Again, the codes to perform the analysis and visualization are in this Github repository.

experiment_data_range

Fig. 1: Historical data vs. range of experiment outputs

As already discussed, there are multiple benefits for visualizing the output in such manner: we are often concerned with the levels and frequencies of extremes when making decisions about systems (e.g. “how bad is the worst case?”, “how rare is the worst case?”), or we might like to know how often we exceed a certain threshold (e.g. “how many years exceed an annual shortage of 1000 af?“). The various percentiles tell a different part of the story of how a system operates, the 5th percentile tells as that its level is exceeded 95% of the time, the 99th tells as that its level is only reached once in every 100 years in our records. These might seem obvious to the readers of this blog, but often times we perform our analyses for only some of these percentiles, “the worst event”, “the average”, etc., which is certainly very informative, but can potentially miss part of the bigger picture.

In this post I’m going to walk the reader through performing a sensitivity analysis using the output of an experiment using multiple Latin Hypercube Samples. The analysis will be magnitude-varying, i.e., it will be performed at different magnitudes of our output of interest. For this particular example, we aim to see what are the most significant drivers of shortage at the different levels it’s experienced by this user. In other words, if some factors appear to be driving the frequent small shortages experienced, are those factors the same for the rare large shortages?

To perform the sensitivity analysis, I am going to use SALib (featured in this blog multiple times already), to perform a Delta Moment-Independent Analysis [1] (also produces a first order Sobol sensitivity index [2]). You’ll probably need to install SALib if it’s not a package you’ve used already. I’m also going to use statsmodels, to perform a simple linear regression on the outputs and look at their R2 values. But, why, you might ask, perform not one, not two, but three sensitivity analyses for this? There are nuanced, yet potentially important differences between what the three methods capture:

Delta method: Look for parameters most significantly affecting the density function of observed shortages. This method is moment-independent, i.e., it looks at differences in the entire distribution of the output we’re interested in.
First order Sobol (S1): Look for parameters that most significantly affect the variance of observed outputs, including non-linear effects.
R2: Look for parameters best able to describe the variance of observed outputs, limited to linear effects.

Another important thing to note is that using the First order Sobol index, the total variance resulting from the parameters should equal 1. This means that if we sum up the S1’s we get from our analysis, the sum represents the variance described by the first order effects of our parameters, leaving whatever is left to interactions between our variables (that S1 cannot capture). The same holds using R2, as we are repeatedly fitting our parameters and scoring them on how much of the output variance they describe as a sole linear predictor (with no interactions or other relationships).

The following Python script will produce all three as well as confidence intervals for the Delta index and S1. The script essentially loops through all percentiles in the time-series and performs the two analyses for each one. In other words, we’re are looking at how sensitive each magnitude percentile is to each of the sampled parameters.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from SALib.analyze import delta
# Load parameter samples
LHsamples = np.loadtxt('./LHsamples.txt')
params_no = len(LHsamples[0,:])
param_bounds=np.loadtxt('./uncertain_params.txt', usecols=(1,2))
# Parameter names
param_names=['IWRmultiplier','RESloss','TBDmultiplier','M_Imultiplier',
'Shoshone','ENVflows','EVAdelta','XBM_mu0','XBM_sigma0',
'XBM_mu1','XBM_sigma1','XBM_p00','XBM_p11']
# Define problem class
problem = {
'num_vars': params_no,
'names': param_names,
'bounds': param_bounds.tolist()
}
# Percentiles for analysis to loop over
percentiles = np.arange(0,100)
# Function to fit regression with Ordinary Least Squares using statsmodels
def fitOLS(dta, predictors):
# concatenate intercept column of 1s
dta['Intercept'] = np.ones(np.shape(dta)[0])
# get columns of predictors
cols = dta.columns.tolist()[-1:] + predictors
#fit OLS regression
ols = sm.OLS(dta['Shortage'], dta[cols])
result = ols.fit()
return result
# Create empty dataframes to store results
DELTA = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
DELTA_conf = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
S1 = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
S1_conf = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
R2_scores = pd.DataFrame(np.zeros((params_no, len(percentiles))), columns = percentiles)
DELTA.index=DELTA_conf.index=S1.index=S1_conf.index = R2_scores.index = param_names
# Read in experiment data
expData = np.loadtxt('./experiment_data.txt')
# Identify magnitude at each percentiles
syn_magnitude = np.zeros([len(percentiles),len(LHsamples[:,0])])
for j in range(len(LHsamples[:,0])):
syn_magnitude[:,j]=[np.percentile(expData[:,j], i) for i in percentiles]
# Delta Method analysis
for i in range(len(percentiles)):
if syn_magnitude[i,:].any():
try:
result= delta.analyze(problem, LHsamples, syn_magnitude[i,:], print_to_console=False, num_resamples=2)
DELTA[percentiles[i]]= result['delta']
DELTA_conf[percentiles[i]] = result['delta_conf']
S1[percentiles[i]]=result['S1']
S1_conf[percentiles[i]]=result['S1_conf']
except:
pass
S1.to_csv('./S1_scores.csv')
S1_conf.to_csv('./S1_conf_scores.csv')
DELTA.to_csv('./DELTA_scores.csv')
DELTA_conf.to_csv('./DELTA_conf_scores.csv')
# OLS regression analysis
dta = pd.DataFrame(data = LHsamples, columns=param_names)
# fig = plt.figure()
for i in range(len(percentiles)):
shortage = np.zeros(len(LHsamples[:,0]))
for k in range(len(LHsamples[:,0])):
shortage[k]=syn_magnitude[i,k]
dta['Shortage']=shortage
for m in range(params_no):
predictors = dta.columns.tolist()[m😦m+1)]
result = fitOLS(dta, predictors)
R2_scores.at[param_names[m],percentiles[i]]=result.rsquared
R2_scores.to_csv('./R2_scores.csv')

The script produces the sensitivity analysis indices for each magnitude percentile and stores them as .csv files.

I will now present a way of visualizing these outputs, using the curves from Fig. 1 as context.  The code below reads in the values for each sensitivity index, normalizes them to the range of magnitude at each percentile, and then plots them using matplotlib’s stackplot fuction, which stacks the contribution of each parameter to the sum (in this case the maximum of the resulting range)

I’ll go through what the code does in more detail:

First, we take the range boundaries (globalmax and globalmin) which give us the max and min values for each percentile. We then read in the values for each sensitivity index and normalize them to that range (i.e. globalmaxglobalmin for each percentile). The script also adds two more arrays (rows in the pandas dataframe), one representing interaction and one representing the globalmin, upon which we’re going to stack the rest of the values. [Note: This is a bit of a roundabout way of getting the figures how we like them, but it’s essentially creating a pseudo-stack for the globalmin, that we’re plotting in white.] 

The interaction array is only used when normalizing the S1 and R2 values, where we attribute to it the difference between 1 and the sum of the calculated indices (i.e. we’re attributing the rest to interaction between the parameters). We don’t need to do this for the delta method indices (if you run the code the array remains empty), but the reason I had to put it there was to make it simpler to create labels and a single legend later.

The plotting simply creates three subplots and for each one uses stackplot to plot the normalized values and then the edges in black. It is important to note that the colorblocks in each figure do not represent the volume of shortage attributed to each parameter at each percentile, but rather the contribution of each parameter to the change in the metric, namely, the density distribution (Delta Method), and the variance (S1 and R2). The code for this visualization is provided at the bottom of the post.

experiment_sensitivity_curves.png

Fig. 2: Magnitude sensitivity curves using three sensitivity indeces

The first thing that pops out from this figure is the large blob of peach, which represents the irrigation demand multiplier in our experiment. The user of interest here was an irrigation user, which would suggest that their shortages are primarily driven by increases in their own demands and of other irrigation users. This is important, because irrigation demand is an uncertainty for which we could potentially have direct or indirect control over, e.g. through conservation efforts.

Looking at the other factors, performing the analysis in a magnitude-varying manner, allowed us to explore the vulnerabilities of this metric across its different levels. For example, dark blue and dark green represent the mean flow of dry and wet years, respectively. Across the three figures we can see that the contribution of mean wet-year flow is larger in the low-magnitude percentiles (left hand side) and diminishes as we move towards the larger-magnitude percentiles.

Another thing that I thought was interesting to note was the difference between the S1 and the R2 plots. They are both variance-based metrics, with R2 limited to linear effects in this case. In this particular case, the plots are fairly similar which would suggest that a lot of the parameter effects on the output variance are linear. Larger differences between the two would point to non-linearities between changes in parameter values and the output.

The code to produce Fig. 2:

# Percentiles for analysis to loop over
percentiles = np.arange(0,100)
# Estimate upper and lower bounds
globalmax = [np.percentile(np.max(expData_sort[:,:],1),p) for p in percentiles]
globalmin = [np.percentile(np.min(expData_sort[:,:],1),p) for p in percentiles]
delta_values = pd.read_csv('./DELTA_scores.csv')
delta_values.set_index(list(delta_values)[0],inplace=True)
delta_values = delta_values.clip(lower=0)
bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(delta_values.columns.values))
top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(delta_values.columns.values))
delta_values = pd.concat([top_row,delta_values.loc[:],bottom_row])
for p in range(len(percentiles)):
total = np.sum(delta_values[str(percentiles[p])])-delta_values.at['Min',str(percentiles[p])]
if total!=0:
for param in param_names:
value = (globalmax[p]-globalmin[p])*delta_values.at[param,str(percentiles[p])]/total
delta_values.set_value(param,str(percentiles[p]),value)
delta_values = delta_values.round(decimals = 2)
delta_values_to_plot = delta_values.values.tolist()
S1_values = pd.read_csv('./S1_scores.csv')
S1_values.set_index(list(S1_values)[0],inplace=True)
S1_values = S1_values.clip(lower=0)
bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(S1_values.columns.values))
top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(S1_values.columns.values))
S1_values = pd.concat([top_row,S1_values.loc[:],bottom_row])
for p in range(len(percentiles)):
total = np.sum(S1_values[str(percentiles[p])])-S1_values.at['Min',str(percentiles[p])]
if total!=0:
diff = 1-total
S1_values.set_value('Interaction',str(percentiles[p]),diff)
for param in param_names+['Interaction']:
value = (globalmax[p]-globalmin[p])*S1_values.at[param,str(percentiles[p])]
S1_values.set_value(param,str(percentiles[p]),value)
S1_values = S1_values.round(decimals = 2)
S1_values_to_plot = S1_values.values.tolist()
R2_values = pd.read_csv('./R2_scores.csv')
R2_values.set_index(list(R2_values)[0],inplace=True)
R2_values = R2_values.clip(lower=0)
bottom_row = pd.DataFrame(data=np.array([np.zeros(100)]), index= ['Interaction'], columns=list(R2_values.columns.values))
top_row = pd.DataFrame(data=np.array([globalmin]), index= ['Min'], columns=list(R2_values.columns.values))
R2_values = pd.concat([top_row,R2_values.loc[:],bottom_row])
for p in range(len(percentiles)):
total = np.sum(R2_values[str(percentiles[p])])-R2_values.at['Min',str(percentiles[p])]
if total!=0:
diff = 1-total
R2_values.set_value('Interaction',str(percentiles[p]),diff)
for param in param_names+['Interaction']:
value = (globalmax[p]-globalmin[p])*R2_values.at[param,str(percentiles[p])]
R2_values.set_value(param,str(percentiles[p]),value)
R2_values = R2_values.round(decimals = 2)
R2_values_to_plot = R2_values.values.tolist()
color_list = ["white", "#F18670", "#E24D3F", "#CF233E", "#681E33", "#676572", "#F3BE22", "#59DEBA", "#14015C", "#DAF8A3", "#0B7A0A", "#F8FFA2", "#578DC0", "#4E4AD8", "#F77632"]
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(14.5,8))
ax1.stackplot(percentiles, delta_values_to_plot, colors = color_list, labels=parameter_names_long)
l1 = ax1.plot(percentiles, globalmax, color='black', linewidth=2)
l2 = ax1.plot(percentiles, globalmin, color='black', linewidth=2)
ax1.set_title("Delta index")
ax1.set_xlim(0,100)
ax2.stackplot(np.arange(0,100), S1_values_to_plot, colors = color_list, labels=parameter_names_long)
ax2.plot(percentiles, globalmax, color='black', linewidth=2)
ax2.plot(percentiles, globalmin, color='black', linewidth=2)
ax2.set_title("S1")
ax2.set_xlim(0,100)
ax3.stackplot(np.arange(0,100), R2_values_to_plot, colors = color_list, labels=parameter_names_long)
ax3.plot(percentiles, globalmax, color='black', linewidth=2)
ax3.plot(percentiles, globalmin, color='black', linewidth=2)
ax3.set_title("R^2")
ax3.set_xlim(0,100)
handles, labels = ax3.get_legend_handles_labels()
ax1.set_ylabel('Annual shortage (af)', fontsize=12)
ax2.set_xlabel('Shortage magnitude percentile', fontsize=12)
ax1.legend((l1), ('Global ensemble',), fontsize=10, loc='upper left')
fig.legend(handles[1:], labels[1:], fontsize=10, loc='lower center',ncol = 5)
plt.subplots_adjust(bottom=0.2)
fig.savefig('./experiment_sensitivity_curves.png')

References:

[1]: Borgonovo, E. “A New Uncertainty Importance Measure.” Reliability Engineering & System Safety 92, no. 6 (June 1, 2007): 771–84. https://doi.org/10.1016/j.ress.2006.04.015.

[2]: Sobol, I. M. (2001). “Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates.” Mathematics and Computers in Simulation, 55(1-3):271-280, doi:10.1016/S0378-4754(00)00270-6.