MORDM IX: Discovering scenarios of consequence

In the previous blog post, we performed a simple sensitivity analysis using the Delta Moment-Independent Global Sensitivity Analysis method to identify the decision variables that will most likely cause changes in performance should their recommended values change. In this post, we will end out the MORDM series by performing scenario discovery to identify combinations of socioeconomic and/or hydroclimatic scenarios that result in the utilities being unable to meet their satisficing using Gradient Boosted Trees (Boosted Trees).

Revisiting the concept of the satisficing criteria

The satisficing criteria method is one of three approaches for defining robust strategies as outlined by Lempert and Collins (2007). It defines a robust strategy as a portfolio or set of actions that, according to a set of predetermined criteria outlined by the decision-maker(s), perform reasonably well across a wide range of possible scenarios. It does not have to be optimal (Simon, 1959). In the context of the Research Triangle, this approach is quantified using the domain criterion method (Starr, 1962) where robustness is calculated as the fraction of states of the world (SOWs) in which a portfolio meets or exceeds the criteria (Herman et al, 2014). This approach was selected for two reasons: the nature of the visualization used to communicate alternatives and uncertainty for the test case, and the elimination of the need to know the probability distributions of the uncertain SOWs.

The remaining two approaches are the regret-based approach (choosing a solution that minimizes performance degradation relative to degree of uncertainty in SOWs) and the open options approach (choosing a portfolio that keeps as many options open as possible).

For the Research Triangle, the satisficing criteria are as follows:

  1. Supply reliability (Reliability) should be at least 98% in all SOWs.
  2. The frequency of water-use restrictions (Restriction frequency) should be no more than 10% across all SOWS.
  3. The worst-case cost of drought mitigation actions (Worst-case cost) should be no more than 10% across all SOWs.

This brings us to the following questions: under what conditions do the portfolios fail to meet these satisficing criteria?

Gradient Boosted Trees for scenario discovery

One drawback of using ROF metrics to define the portfolio action rules within the system is the introduction of SOW combinations that cause failure (failure regions) that are non-linear and non-convex (Trindade et al, 2019). It thus becomes necessary to use an model-free, unbiased approach that can classify non-linear success-failure regions while remaining easy to interpret.

Cue Gradient Boosted Trees, or Boosted Trees. This is a tree-based, machine-learn method that uses “rough and moderately inaccurate rules or thumb” to generate “a single, highly-accurate prediction rule”. This aggregation of rough rules of thumb is referred to as a ‘weak learning model’ (Freund and Shapire, 1999), and the final prediction rule is the ‘strong learning model’ that is the result of the boosting process. The transformation from a weak to strong model is conducted via a process of creating weak classifier algorithms that are only slightly better than random guessing and forcing them to continue classifying the scenarios that they previously classified incorrectly. These algorithms are iteratively updated to improve their ability to classify regions of success or failure, turning them into strong learning models.

In this post, we will implement Boosted Trees using the GradientBoostingClassifier function from the SKLearn Python library. Before beginning, we should first install the SKLearn library. In the command line, type in the following:

pip install sklearn

Done? Great – let’s get to the code!

# -*- coding: utf-8 -*-
Created on Sun June 19 18:29:04 2022

@author: Lillian Lau

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from copy import deepcopy
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd

FUNCTION to check whether performance criteria are met
def check_satisficing(objs, objs_col, satisficing_bounds):
    meet_low = objs[:, objs_col] >= satisficing_bounds[0]
    meet_high = objs[:, objs_col] <= satisficing_bounds[1]

    meets_criteria = np.hstack((meet_low, meet_high)).all(axis=1)

    return meets_criteria

Here, we import all required libraries and define a function, check_satisficing, that evaluates the performance of a given objective to see if it meets the criteria set above.

Name all file headers and compSol to be analyzed
obj_names = ['REL_C', 'RF_C', 'INF_NPC_C', 'PFC_C', 'WCC_C', \
        'REL_D', 'RF_D', 'INF_NPC_D', 'PFC_D', 'WCC_D', \
        'REL_R', 'RF_R', 'INF_NPC_R', 'PFC_R', 'WCC_R', \
        'REL_reg', 'RF_reg', 'INF_NPC_reg', 'PFC_reg', 'WCC_reg']

    CRE: Cary restriction efficiency
    DRE: Durham restriction efficiency
    RRE: Raleigh restriction efficiency
    DMP: Demand multiplier
    BTM: Bond term multiplier
    BIM: Bond interest rate multiplier
    IIM: Infrastructure interest rate multiplier
    EMP: Evaporation rate multplier
    STM: Streamflow amplitude multiplier
    SFM: Streamflow frequency multiplier
    SPM: Streamflow phase multiplier

rdm_headers_dmp = ['CRE', 'DRE', 'RRE']
rdm_headers_utilities = ['DMP', 'BTM', 'BIM', 'IIM']
rdm_headers_inflows = ['STM', 'SFM', 'SPM']
rdm_headers_ws = ['EMP', 'CRR PTD', 'CRR CTD', 'LM PTD', 'LM CTD', 'AL PTD',
                  'AL CTD', 'D PTD', 'D CTD', 'NRR PTDD', 'NRR CTD', 'SCR PTD',
                  'SCT CTD', 'GC PTD', 'GC CTD', 'CRR_L PT', 'CRR_L CT',
                  'CRR_H PT', 'CRR_H CT', 'WR1 PT', 'WR1 CT', 'WR2 PT',
                  'WR2 CT', 'DR PT', 'DR CT', 'FR PT', 'FR CT']

rdm_headers_ws_drop = ['CRR PTD', 'CRR CTD', 'LM PTD', 'LM CTD', 'AL PTD',
                       'AL CTD', 'D PTD', 'D CTD', 'NRR PTDD', 'NRR CTD',
                       'SCR PTD', 'SCT CTD', 'GC PTD', 'GC CTD']

rdm_all_headers = ['CRE', 'DRE', 'RRE', 'DMP', 'BTM', 'BIM', 'IIM',
                   'STM', 'SFM', 'SPM', 'EMP', 'CRR_L PT', 'CRR_L CT',
                   'CRR_H PT', 'CRR_H CT', 'WR1 PT', 'WR1 CT', 'WR2 PT',
                   'WR2 CT', 'DR PT', 'DR CT', 'FR PT', 'FR CT']

all_headers = rdm_all_headers

utilities = ['Cary', 'Durham', 'Raleigh', 'Regional']

N_RDMS = 100
N_SOLNS = 69

1 - Load DU factor files
rdm_factors_directory = 'YourFilePath/WaterPaths/TestFiles/'
rdm_dmp_filename = rdm_factors_directory + 'rdm_dmp_test_problem_reeval.csv'
rdm_utilities_filename = rdm_factors_directory + 'rdm_utilities_test_problem_reeval.csv'
rdm_inflows_filename = rdm_factors_directory + 'rdm_inflows_test_problem_reeval.csv'
rdm_watersources_filename = rdm_factors_directory + 'rdm_water_sources_test_problem_reeval.csv'

rdm_dmp = pd.read_csv(rdm_dmp_filename, sep=",", names=rdm_headers_dmp)
rdm_utilities = pd.read_csv(rdm_utilities_filename, sep=",", names=rdm_headers_utilities)
rdm_inflows = pd.read_csv(rdm_inflows_filename, sep=",", names=rdm_headers_inflows)
rdm_ws_full = np.loadtxt(rdm_watersources_filename, delimiter=",")
rdm_ws = pd.DataFrame(rdm_ws_full[:, :len(rdm_headers_ws)], columns=rdm_headers_ws)

rdm_ws = rdm_ws.drop(rdm_headers_ws_drop, axis=1)

dufs = pd.concat([rdm_dmp, rdm_utilities, rdm_inflows, rdm_ws], axis=1, ignore_index=True)
dufs.columns = rdm_all_headers
dufs_np = dufs.to_numpy()

duf_numpy = dufs_np[:100, :]

all_params = duf_numpy

Next, we define shorthand names for the performance objectives and the DU SOWs (here referred to as ‘robust decision-making parameters’, or RDMs. One SOW is defined by a vector that contains one sampled parameter from deeply uncertain demand, policy implementation and hydroclimatic realizations. But how are these SOWs generated?

The RDM files shown in the code block above actually lists multipliers for the baseline input parameters. These multipliers scale the input up or down, widening the envelope of uncertainty and enabling the generate of more challenging scenarios as shown in the figure above. Next, remember how we used 1000 different hydroclimatic realizations to perform optimization with Borg? We will use the same 1000 realizations and pair each of them with one set of these newly-generated, more extreme scenarios as visualized in the image below. This will result in a total of (1000 x 100) different SOWs.

Each portfolio discovered during the optimization step is then evaluated over these SOWs (shown above) to identify if they meet or violate the satisficing criteria. The procedure to assess the ability of a portfolio to meet the criteria is performed below:

Load objectives and robustness files
out_directory = '/scratch/lbl59/blog/WaterPaths/post_processing/'

# objective values across all RDMs
objs_filename = out_directory + 'meanObjs_acrossSoln.csv'
objs_arr = np.loadtxt(objs_filename, delimiter=",")
objs_df = pd.DataFrame(objs_arr[:100, :], columns=obj_names)

Determine the type of factor maps to plot.
factor_names = all_headers
Extract each utility's set of performance objectives and robustness
Cary_all = objs_df[['REL_C', 'RF_C', 'INF_NPC_C', 'PFC_C', 'WCC_C']].to_numpy()
Durham_all = objs_df[['REL_D', 'RF_D', 'INF_NPC_D', 'PFC_D', 'WCC_D']].to_numpy()
Raleigh_all = objs_df[['REL_R', 'RF_R', 'INF_NPC_R', 'PFC_R', 'WCC_R']].to_numpy()
Regional_all = objs_df[['REL_reg', 'RF_reg', 'INF_NPC_reg', 'PFC_reg', 'WCC_reg']].to_numpy()

# Cary satisficing criteria
rel_C = check_satisficing(Cary_all, [0], [0.98, 1.0])
rf_C = check_satisficing(Cary_all, [1], [0.0, 0.1])
wcc_C = check_satisficing(Cary_all, [4], [0.0, 0.1])
satisficing_C = rel_C*rf_C*wcc_C
print('satisficing_C: ', satisficing_C.sum())

# Durham satisficing criteria
rel_D = check_satisficing(Durham_all, [0], [0.98, 1.0])
rf_D = check_satisficing(Durham_all, [1], [0.0, 0.1])
wcc_D = check_satisficing(Durham_all, [4], [0.0, 0.1])
satisficing_D = rel_D*rf_D*wcc_D
print('satisficing_D: ', satisficing_D.sum())

# Raleigh satisficing criteria
rel_R = check_satisficing(Raleigh_all, [0], [0.98,1.0])
rf_R = check_satisficing(Raleigh_all, [1], [0.0,0.1])
wcc_R = check_satisficing(Raleigh_all, [4], [0.0,0.1])
satisficing_R = rel_R*rf_R*wcc_R
print('satisficing_R: ', satisficing_R.sum())

# Regional satisficing criteria
rel_reg = check_satisficing(Regional_all, [0], [0.98, 1.0])
rf_reg = check_satisficing(Regional_all, [1], [0.0, 0.1])
wcc_reg = check_satisficing(Regional_all, [4], [0.0, 0.1])
satisficing_reg = rel_reg*rf_reg*wcc_reg
print('satisficing_reg: ', satisficing_reg.sum())

satisficing_dict = {'satisficing_C': satisficing_C, 'satisficing_D': satisficing_D,
                    'satisficing_R': satisficing_R, 'satisficing_reg': satisficing_reg}

utils = ['satisficing_C', 'satisficing_D', 'satisficing_R', 'satisficing_reg']

Following this, we apply the GradientBoostingClassifier to extract the parameters that most influence the ability of a portfolio to meet the satisficing criteria:

Fit a boosted tree classifier and extract important features
for j in range(len(utils)):
    gbc = GradientBoostingClassifier(n_estimators=200,

    # fit the classifier to each utility's sd
    # change depending on utility being analyzed!!
    criteria_analyzed = utils[j]
    df_to_fit = satisficing_dict[criteria_analyzed], df_to_fit)

    feature_ranking = deepcopy(gbc.feature_importances_)
    feature_ranking_sorted_idx = np.argsort(feature_ranking)
    feature_names_sorted = [factor_names[i] for i in feature_ranking_sorted_idx]

    feature1_idx = len(feature_names_sorted) - 1
    feature2_idx = len(feature_names_sorted) - 2

    feature_figpath = 'YourFilePath/WaterPaths/post_processing/BT_Figures/'

    feature_figname = feature_figpath + 'BT_' + criteria_analyzed + '.jpg'

    fig = plt.figure(figsize=(8, 8))
    ax = fig.gca()
    ax.barh(np.arange(len(feature_ranking)), feature_ranking[feature_ranking_sorted_idx])
    ax.set_xlim([0, 1])
    ax.set_xlabel('Feature ranking')

    7 - Create factor maps
    # select top 2 factors
    fm_figpath = 'YourFilePath/WaterPaths/post_processing/BT_Figures/'
    fm_figname = fm_figpath + 'factor_map_' + criteria_analyzed + '.jpg'

    selected_factors = all_params[:, [feature_ranking_sorted_idx[feature1_idx],
    gbc_2_factors = GradientBoostingClassifier(n_estimators=200,

    # change this one depending on utility and compSol being analyzed, df_to_fit)

    x_data = selected_factors[:, 0]
    y_data = selected_factors[:, 1]

    x_min, x_max = (x_data.min(), x_data.max())
    y_min, y_max = (y_data.min(), y_data.max())

    xx, yy = np.meshgrid(np.arange(x_min, x_max*1.001, (x_max-x_min)/100),
                         np.arange(y_min, y_max*1.001, (y_max-y_min)/100))

    dummy_points = list(zip(xx.ravel(), yy.ravel()))

    z = gbc_2_factors.predict_proba(dummy_points)[:,1]
    z[z<0] = 0
    z = z.reshape(xx.shape)

    fig_factormap = plt.figure(figsize = (10,8))
    ax_f = fig_factormap.gca()
    ax_f.contourf(xx, yy, z, [0, 0.5, 1], cmap='RdBu', alpha=0.6, vmin=0, vmax=1)

    ax_f.scatter(selected_factors[:,0], selected_factors[:,1],
                 c=df_to_fit, cmap='Reds_r', edgecolor='grey',
                 alpha=0.6, s=100, linewidths=0.5)

    ax_f.set_xlabel(factor_names[feature_ranking_sorted_idx[feature1_idx]], size=16)
    ax_f.set_ylabel(factor_names[feature_ranking_sorted_idx[feature2_idx]], size=16)
    ax_f.legend(loc='upper center', bbox_to_anchor=(0.5, -0.08),
                fancybox=True, shadow=True, ncol=3, markerscale=0.5)


The code will result in the following four factor maps:

These factor maps all agree on one thing: the demand growth rate (evaluated via the demand growth multiplier, DMP) is the DU factor that the system is most vulnerable to. This means three things:

  1. Each utility’s success is contingent upon different degrees of demand growth. Cary and Durham, being the smaller utilities with fewer resources, are the most susceptible to changes in demand beyond baseline projections. Raleigh can accommodate a 20% high demand growth rate than initially planned for, likely due to a larger reservoir capacity.
  2. The utilities should be wary of their demand growth projections. Planning and operating their water supply system solely based on the baseline projection of demand growth could be catastrophic both for individual utilities and the region as a whole. This is because as a small deviation from projected demand or the failure to account for uncertainty in planning for demand growth could result in the utility permanently failing to meet its satisficing criteria

MORDM Series Summary

We have come very far since the beginning of our MORDM tutorial using the Research Triangle test case. Here’s a brief recap:

Designing and generating states of the world

  1. We first generated a series of synthetic streamflows using the Kirsch Method to enable the simulation of multiple scenarios to generate more severe, deeply uncertain hydroclimatic events within the region.
  2. Next, we generated a set of risk-of-failure (ROF) tables to approximate the risk that a utility’s ability to meet weekly demand would be compromised and visualized the effects of different ROF thresholds on system storage dynamics.

Searching for alternatives

  1. Using the resulting ROF tables, we performed simulation-optimization to search for a Pareto-approximate set of candidate actions for each of the Triangle’s utilities WaterPaths and the Borg MOEA.
  2. The individual and regional performance of these Pareto-approximate actions were then visualized and explored using parallel axis plots.

Defining robustness measures and identifying controls

  1. We defined robustness using the satisficing criteria and calculated the robustness of these set of actions (portfolios) against the effects of challenging states of the world (SOWs) to discover how they fail.
  2. We then performed a simple sensitivity analysis across the average performance of all sixty-nine portfolios of actions, and identified that each utility’s use of water use restrictions, investment in new supply infrastructure, coupled with uncertain demand growth rates most strongly affect performance.
  3. Finally, we used Boosted Trees to discover consequential states of the world that most affect a portfolio’s ability to succeed in meeting all criteria, or to fail catastrophically.

This brings us to the end of our MORDM series! As usual, all files used in this series can be found in the Git Repository here. Hope you found this useful, and happy coding!


Freund, Y., Schapire, R., Abe, N., 1999. A short introduction to boosting. J.-Jpn. Soc. Artif. Intell. 14 (771–780), 1612

Herman, J., Zeff, H., Reed, P., & Characklis, G. (2014). Beyond optimality: Multistakeholder robustness tradeoffs for regional water portfolio planning under deep uncertainty. Water Resources Research, 50(10), 7692-7713. doi: 10.1002/2014wr015338

Lempert, R., & Collins, M. (2007). Managing the Risk of Uncertain Threshold Responses: Comparison of Robust, Optimum, and Precautionary Approaches. Risk Analysis, 27(4), 1009-1026. doi: 10.1111/j.1539-6924.2007.00940.x

Simon, H. A. (1959). Theories of Decision-Making in Economics and Behavioral Science. The American Economic Review, 49(3), 253–283.

Starr, M. (1963). Product design and decision theory. Journal Of The Franklin Institute, 276(1), 79. doi: 10.1016/0016-0032(63)90315-6

Trindade, B., Reed, P., & Characklis, G. (2019). Deeply uncertain pathways: Integrated multi-city regional water supply infrastructure investment and portfolio management. Advances In Water Resources, 134, 103442. doi: 10.1016/j.advwatres.2019.103442

A step-by-step tutorial for scenario discovery with gradient boosted trees

Our recently published eBook, Addressing Uncertainty in Multisector Dynamics Research, provides several interactive tutorials for hands on training in model diagnostics and uncertainty characterization. The purpose of this post is to expand upon these trainings by providing a tutorial demonstrating gradient boosted trees for scenario discovery. I’ll first provide some brief background on scenario discovery and gradient boosted trees, then demonstrate a Python implementation on a water supply planning problem. All code here is written in Python, but the workflow is model agnostic, and can be paired with simulation models in any language. I’ve included my code within the text below, but all code and data for this post can also be found in this git repository.

Scenario discovery gradient boosted trees

In water resources planning and management, decision makers are often faced with uncertainty about how their system will change in the future. Traditionally, planners have confronted this uncertainty by developing prespecified narrative scenarios, which reduce the multitude of possible future conditions into a small subset of important future states of the world (a prominent example is the ‘scenario matrix framework’ used to evaluate climate change (O’Neill et al., 2014)). While this approach provides intuitive appeal, it may increase system vulnerability if future conditions do not evolve as decision makers expect (for a detailed critique of scenario based planning see Reed et al., 2022). This vulnerability is especially apparent for systems facing deep uncertainty, where decision makers do not know or cannot agree upon the probability density functions of key system inputs (Kwakkel et al., 2016).

Scenario discovery (Groves and Lempert, 2007) is an exploratory modeling centered approach that seeks to discover consequential future scenarios using computational experiments rather than relying on prespecified information. To perform scenario discovery, decision makers first identify a set of relevant uncertainties and their plausible ranges. Next, an ensemble of these uncertainties is developed by sampling across parameter ranges. Candidate policies are then evaluated across this ensemble and machine learning or data mining algorithms are used to examine which combinations of uncertainties generate vulnerability in the system. These combinations can then be used to develop narrative scenarios to inform implementation and monitoring efforts or new policy development.

A core element of the scenario discovery process is the algorithm used to classify future states of the world. Popular algorithms include the PRIM, CART and logistic regression. Recently, gradient boosted trees have been applied as an alternative classificiation algorithm. Gradient boosted trees have advantages over other scenario discovery algorithms because they can easily capture nonlinear and non-differentiable boundaries in the uncertainty space (which are particularly prevalent in water supply planning problems that have discrete capacity expansion options), are highly resistant to overfitting and provide a clear means of ranking the importance of uncertain factors (Trindade et al., 2020). For a comprehensive overview of gradient boosted trees, see Bernardo’s post here.

Test case: the Sedento Valley

To demonstrate gradient boosted trees for scenario discovery we’ll use the Sedento Valley water supply planning test case (Trindade et al., 2020). In the Sedento Valley, three water utilities seek to discover cooperative water supply managment and infrastructure investment portfolios to meet several conflicting objectives in a system facing deep uncertainty. In this post, we’ll investigate how these deep uncertainties (which include demand growth, the efficacy of water use restrictions, financial variables and parameters governing infrastructure permitting and construction time) impact a utility’s ability to maintain three performance criteria: keeping reliability > 98%, restriction frequency < 20% and worst case cost less than 10% of annual revenue. For simplicity, we’ll focus on one regional water utility named Watertown.

Step 1: create a sample of deeply uncertain states of the world

To start the scenario discovery process, we generate an ensemble of deep uncertainties that represent future states of the world (SOWs). Here, we’ll use Latin Hypercube Sampling with an implementation I found in the Surrogate Modeling Toolbox.

import numpy as np
from smt.sampling_methods import LHS

This script will generate 1000 Latin Hypercube Samples (LHS)
of deeply uncertain system parameters for the Sedento Valley

# create an array storing the ranges of deeply uncertain parameters
DU_factor_limits = np.array([
    [0.9, 1.1], # Watertown restriction efficacy 
    [0.9, 1.1], # Dryville restriction efficacy
    [0.9, 1.1], # Fallsland restriction efficacy
    [0.5, 2.0], # Demand growth rate multiplier
    [1.0, 1.2], # Bond term
    [0.6, 1.0], # Bond interest rate
    [0.6, 1.4], # Discount rate
    [0.75, 1.5], # New River Reservoir permitting time
    [1.0, 1.2], # New River Reservoir construction time
    [0.75, 1.5], # College Rock Reservoir (low) permitting time
    [1.0, 1.2], # College Rock Reservoir (low) construction time
    [0.75, 1.5], # College Rock Reservoir (high) permitting time
    [1.0, 1.2], # College Rock Reseroir (high) construction time
    [0.75, 1.5], # Water Reuse permitting time
    [1.0, 1.2], # Water Reuse construction time
    [0.8, 1.2], # Inflow amplitude
    [0.2, 0.5], # Inflow frequency
    [-1.57, 1.57]]) # Inflow phase

# Use the smt package to set up the LHS sampling
sampling = LHS(xlimits=DU_factor_limits)

# We will create 1000 samples
num = 1000

# Create the actual sample
x = sampling(num)

# save to a csv file
np.savetxt('DU_factors.csv', x, delimiter=',')

Step 2: Evaluate performance across SOWs

Next, we’ll evaluate the performance of our policy across the LHS sample of DU factors. For the Sedento Valley test case, we use WaterPaths, an open-source simulation system for integrated water supply portfolio management and infrastructure investment planning (for more see Trindade et al., 2020). This step is not included in the git repository as it requires high-performance computing for this system, but results can be found in the “Model_output.csv” file. For simulation details, see Gold et al., 2022.

Step 3: Convert model output into a boolean array for classification

To perform classification, we need to convert the results of our simulations to a binary array classifying each SOW as a “success” or “failure” based on whether the policy met the performance criteria under the SOW. First, we define a small function to determine if an SOW meets a set of criteria, then we apply this function to our results. We also load the DU factor LHS sample.

# First, define a function to check whether performance criteria are met
def check_criteria(objectives, crit_objs, crit_vals):
    Determines if an objective meets a given set of criteria for a set of SOWs

        objectives: np array of all objectives across a set of SOWs
        crit_objs: the column index of the objective in question
        crit_vals: an array containing [min, max] of the values 
        meets_criteria: an numpy array containing the SOWs that meet both min and max criteria

    # check max and min criteria for each objective
    meet_low = objectives[:, crit_objs] >= crit_vals[0]
    meet_high = objectives[:, crit_objs] <= crit_vals[1]

    # check if max and min criteria are met at the same time
    meets_criteria = np.hstack((meet_low, meet_high)).all(axis=1)

    return meets_criteria

##### Load data and pre-process #####

# load objectives and create input array of boolean values for SD input
Reeval_objectives = np.loadtxt('Model_output.csv', skiprows=1, delimiter=',')
REL = check_criteria(Reeval_objectives, [0], [.979, 1])
RF = check_criteria(Reeval_objectives, [1], [0, 0.10])
WCC = check_criteria(Reeval_objectives, [2], [0, 0.10])
SD_input = np.vstack((REL, RF, WCC)).SD_input(axis=0)

# load DU factors
DU_factors = np.loadtxt('DU_factors.csv', skiprows=1, delimiter=',')
DU_names = ['Watertown Rest. Eff.', 'Dryville Rest. Eff.', 'FSD_inputsland Rest. Eff.',
            'Demand Growth Rate', 'Bond Term', 'Bond Interest',
            'Discount Rate', 'NRR Perm', 'NRR Const', 'CRR L Perm',
            'CRR L Const.',	'CRR H Perm.', 'CRR H Const.', 'WR1 Perm.',
             'WR1 Const.', 'Inflows A', 'Inflows m','Inflows p']

Step 4: Fit a boosted trees classifier

After we’ve formatted the data, we’re ready to perform boosted trees classification. There are several packages for boosted trees in Python, here we’ll use the implementation from scikit-learn. We’ll use an ensemble of 200 trees with depth 3 and a learning rate of 0.1. These parameters need to be tuned for the individual problem, I found this nice post that goes into detail on parameter tuning.

##### Boosted Tree Classification #####

from sklearn.ensemble import GradientBoostingClassifier

# create a gradient boosted classifier object
gbc = GradientBoostingClassifier(n_estimators=200,

# fit the classifier, SD_input)

Step 5: Examine which DU factors have the most impact on performance criteria

Now we’re ready to examine the results of our classification. First, we’ll examine how important each DU factor is to the classification results generated by boosted trees. To rank the imporance of each DU factor, we examine the percentage decrease in impurity of the ensemble of trees that is associated with each factor. In plain english, this is a measure of how helpful each DU factor is to correctly classifying SOWs. This infromation is generated during the fit of the classifier above and is easily accessible as an attribute of our scikit-learn classifier.

For our example, one deep uncertainty, demand growth rate, clearly stands out as the most influential, as shown in the figure below. A second, the restriction efficacy for Watertown (the utility we’re focusing on), also stands out as a higher level of importance. All other DU factors have little impact on the classification in this case.

##### Factor Ranking #####

# Extract the feature importances
feature_importances = deepcopy(gbc.feature_importances_)

# rank the feature importances and plot
importances_sorted_idx = np.argsort(feature_importances)
sorted_names = [DU_names[i] for i in importances_sorted_idx]

fig = plt.figure(figsize=(8,8))
ax = fig.gca()
ax.barh(np.arange(len(feature_importances)), feature_importances[importances_sorted_idx])
ax.set_xlabel('Feature Importance')

Step 6: Create factor maps

Finally, we visualize the results of our classification through factor mapping. In the plot below, we show the uncertainty space projected onto the two most influential factors, demand growth rate and restriciton efficacy. Each point represents a sampled SOW, red points represent SOWs that resulted in failure, while white points represent SOWs that resulted in success. The color in the background shows the predicted regions of success and failure from the boosted trees classification.

Here we observe that high levels of demand growth are the primary source of vulnerability for the water utility. When restriction efficacy is lower than our estimate (multiplier < 1), the utility faces failures at demand growth levels of about 1.7 times the estimated values. When restriction effectiveness is at or above estimates, the acceptable scaling of demand growth raises to about 1.8.

Taken as a whole, these results provide valueable insights for decision makers. From our original 18 deep uncertainties, we have discovered that two are critical for the success of our water supply management policy. Further, we have defined thresholds within the uncertainty space that define scenarios that lead to failure. We can use this information to inform monitoring efforts for the water supply policy, or to inform a new problem formulation that tailors actions to mitigate these vulnerabilities.

##### Factor Mapping #####

# Select the top two factors discovered above
selected_factors = DU_factors[:, [3,0]]

# Fit a classifier using only these two factors
gbc_2_factors = GradientBoostingClassifier(n_estimators=200,
                                 max_depth=3), SD_input)

# plot prediction contours
x_data = selected_factors[:,0]
y_data = selected_factors[:,1]

x_min, x_max = (x_data.min(), x_data.max())
y_min, y_max = (y_data.min(), y_data.max())

# create a grid to makes predictions on
xx, yy = np.meshgrid(np.arange(x_min, x_max * 1.001, (x_max - x_min) / 100),
                        np.arange(y_min, y_max * 1.001, (y_max - y_min) / 100))
dummy_points = list(zip(xx.ravel(), yy.ravel()))

z = gbc_2_factors.predict_proba(dummy_points)[:, 1]
z[z < 0] = 0.
z = z.reshape(xx.shape)

# plot the factor map        
fig = plt.figure(figsize=(10,8))
ax = fig.gca()
ax.contourf(xx, yy, z, [0, 0.5, 1.], cmap='RdBu',
                alpha=.6, vmin=0.0, vmax=1)
ax.scatter(selected_factors[:,0], selected_factors[:,1],\
            c=SD_input, cmap='Reds_r', edgecolor='grey', 
            alpha=.6, s= 100, linewidth=.5)
ax.set_xlim([.5, 2])
ax.set_xlabel('Demand Growth Multiplier')
ax.set_ylabel('Restriction Eff. Multiplier')


Gold, D. F., Reed, P. M., Gorelick, D. E., & Characklis, G. W. (2022). Power and Pathways: Exploring Robustness, Cooperative Stability, and Power Relationships in Regional Infrastructure Investment and Water Supply Management Portfolio Pathways. Earth’s Future, 10(2), e2021EF002472.

Groves, D. G., & Lempert, R. J. (2007). A new analytic method for finding policy-relevant scenarios. Global Environmental Change, 17(1), 73-85.

Kwakkel, J. H., Walker, W. E., & Haasnoot, M. (2016). Coping with the wickedness of public policy problems: approaches for decision making under deep uncertainty. Journal of Water Resources Planning and Management, 142(3), 01816001.

O’Neill, B. C., Kriegler, E., Riahi, K., Ebi, K. L., Hallegatte, S., Carter, T. R., … & van Vuuren, D. P. (2014). A new scenario framework for climate change research: the concept of shared socioeconomic pathways. Climatic change, 122(3), 387-400.

Reed, P.M., Hadjimichael, A., Malek, K., Karimi, T., Vernon, C.R., Srikrishnan, V., Gupta, R.S., Gold, D.F., Lee, B., Keller, K., Thurber, T.B., & Rice, J.S. (2022). Addressing Uncertainty in Multisector Dynamics Research [Book]. Zenodo.

Trindade, B. C., Gold, D. F., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2020). Water pathways: An open source stochastic simulation system for integrated water supply portfolio management and infrastructure investment planning. Environmental Modelling & Software, 132, 104772.

Understanding Information Usage in Machine Learning Models

Deep learning models still largely remain black box concepts, where users have little understanding of how and when input variables are being used for prediction. In some applications, simpler and more interpretable models may be appropriate, but there are instances where deep learning models are simply more accurate. The question is, does one have to sacrifice accuracy for interpretability? Ideally, no. If you truly understand how a model works, this puts you in an optimal position to determine what changes need to be made to make it more accurate. Techniques for better understanding the inner workings of more complex ML models likely will lead to the acceptance of these models in process-driven fields like hydrology. We should aspire to move away from the practice of blindly tuning hyperparameters to get good validation results, and rather to focus more about what these parameters may physically represent (which is certainly not always easy or applicable, otherwise it would be done more often). The goal of this blog post is to outline some concepts within the computer science community that can help users understand information usage in their ML models that can be applied for hydrology applications.

An example of a complex ML model

The current state of the art for recurrent-style neural networks is a Long Short-Term Memory (LSTM) network which has become increasingly popular in hydrology applications. These RNN style networks contain a cell state that has the ability of learn long-term dependencies as well as gates to regulate the flow of information in and out the cell.

A single LSTM cell and its components (Colah, 2020)

The top horizontal line denotes the cell state, Ct ,which is continually being updated, either removing irrelevant information from subsequent time periods or incorporating new recent information. The first gate of the cell state is denoted as the forget gate, or ft. The forget gate takes the previous hidden state ht and the current input xt and decides what to forget from the previous time step’s cell state Ct-1 through the application of a sigmoid function, which effectively can set values in the cell state to a value between 0 and 1 (i.e. representing completely forget to completely retain). Next the input gate, it ,decides which values to update using a sigmoid function. A tanh layer creates new cell state candidate, C~t ,which are multiplied by the values from the input gate to create an update. The cell state can then be updated first by forgetting the prescribed values at the forget gate, and then adding the scaled values, it* C~t. Finally the output gate is used to describe what will be output to the next LSTM unit, which is represented by ht  at the bottom of the unit. The input xt is run through a sigmoidal function. The cell state is pushed through a tanh function from above and then multiplied with the transformed input to decide what the hidden state will be that is relayed to the next unit. The LSTM layer can be made up of hundreds of these cells and multiple layers can be used. Most importantly, the LSTM allows for the preservation and memory of past inputs and outputs, which can help expedite training in a system with slow changing dynamics. 

Clearly an LSTM is one of the more complex deep learning models. However, it has shown great success in hydrology applications, particularly where it’s important to capture physical processes that occur at different time scales. For example, Kratzert et al. (2018) demonstrate how an LSTM network used for rainfall-runoff across 241 catchments is able to perform comparably to SAC-SMA in a fraction of the computational time. The authors also do a great job of unboxing some of the LSTM behavior, including demonstrating how the cell state captures temperature and snow dynamics. However, very few other studies investigate their ML models to this degree.

Techniques for Interpretability

Model-Agnostic Local Models

Many interpretability techniques have been introduced that utilize simple local methods to explain model predictions for a single sample to better understand the behavior of a more complex model. Local interpretable model-agnostic explanations (LIME) is one such technique introduced by Ribeiro et al. (2016) that utilizes a local surrogate model. LIME generates perturbations of input and output data, weighs the samples according to proximity to the baseline sample, and trains a localized (usually linear) model. The goal is to find a model that will minimize loss (minimize the distance between the prediction and estimation) and also minimize complexity.

Another localized technique utilizes Shapley values. Originating from cooperative game theory, Shapley values assign a payout depending on a player’s contribution to the total payout (Shapley, 1953). The analog to ML is that the Shapley value becomes the marginal contribution of a feature across many coalitions (subsets) of possible feature combinations. Because Shapley values are calculated across many possible coalitions, computation time to compute them can be cumbersome.

Model-Specific Local Models

DeepLift is a backpropogation-based approach that propagates the contributions of all neurons in the network to every feature of the input. DeepLIFT compares the activation of each neuron to its ‘reference activation’ and assigns contribution scores according to the difference. DeepSHAP is a modification of the DeepLift algorithm used to efficiently estimate Shapley values over the input feature space for a given instance. DeepShap has been shown to be faster than model-agnostic methods and can explain series of complex models more than model specific local models (Chen et al., 2021). Rather than passing multipliers comparing neuron activation, DeepShap will backpropagate SHAP values.


The authors of DeepShap have created a GitHub repository here that allows you to implement SHAP for any machine learning model and DeepShap for deep learning models implemented using TensorFlow and Keras. The package has some really beautiful figures that you can create with single lines of code. My goal was to see if I could harvest some tools from DeepShap to help make sense of a rainfall runoff LSTM model for an ephemeral subbasin in the Tuolumne River Basin that I had previously coded.

I utilize the following features to predict outflow: Precipitation, Temperature, DOY (represented as sine/cosine of day in this case), prior precipitation aggregated for the past three days ago and two weeks, and then interaction variables. In a prior blog post, I demonstrate why these aggregate variables and interaction variables help to capture different regimes of flow. The results look great, but can we use DeepShap to think more about the parameters of the LSTM?

LSTM Prediction

One parameter of interest is the lag of information that the LSTM feeds in at every time step. In my LSTM, I decide to feed my inputs in in batches of 30, so at every time step, the LSTM is seeing the past 30 days of inputs. I use the DeepExplainer function to calculate the average Shapley values across a set of 500 time steps of the input data and I plot these features for the past 30 days.

#Deep Shap Implementation (Training)
#Choose random points from the dataset to train the DeepExplainer
random_ind = np.random.choice(X_Train.shape[0], 1000, replace=False)
data = X_Train[random_ind[0:500]]
e = shap.DeepExplainer((model.layers[0].input, model.layers[-1].output),data,K.get_session())

#Deep Shap Testing
#Create a test set out of another part of the training set 
test = X_Train[random_ind[500:1000]]
shap_values = e.shap_values(test)
shap_val = np.array(shap_values)
shap_val = np.reshape(shap_val,(int(shap_val.shape[1]),int(shap_val.shape[2]),int(shap_val.shape[3])))
shap_abs = np.absolute(shap_val)
sum_0 = np.mean(shap_abs,axis=0)

#Plot Shapley Values
shap_plot = pd.DataFrame(sum_0, columns=["Precipitation","Temperature","DOY","Three Day Precip","Two Week Precip","Precip_Times_Temperature","Temperature_Times_Day","Precip_Times_Temperature_Times_Day"])
shap_plot['days'] = [i-16 for i in list(range(1,16))]
shap_plot.plot.area(x='days',figsize=(10, 6), cmap='rainbow')
plt.title("Deep SHAP - Feature Importance")
plt.savefig('SHAP.png', bbox_inches='tight')
Feature importance for a 30-day lag

The figure above shows at what times in the batch certain information is being utilized based on the magnitude of the Shapley value associated with that information. We see primarily that current precipitation is the most prominent driver of the next day’s outflow along with the precipitation*temperature interactive variable which intuitively makes sense.

Let’s look back 5 days though. Many of the interactive variables are still relevant but see that temperature and day of the year are now given a higher priority over precipitation .

Shapley factors for 5 days prior to the current time step

Lags of these variables may give the algorithm more information about seasonality that is more informative than looking at these variables in their current state. Furthermore, precipitation at this lag may not be relevant. There’s lots of additional work to be done, but hopefully this tutorial illustrates that If you have the tools to look into your model and a good understanding of the system, it then becomes possible to better attribute your model’s behavior to a process-based understanding.


Chen, H., Lundberg, S. M., & Lee, S. I. (2021). Explaining a Series of Models by Propagating Shapley Values. arXiv preprint arXiv:2105.00108.

Kratzert, F., Klotz, D., Brenner, C., Schulz, K., & Herrnegger, M. (2018). Rainfall–runoff modelling using long short-term memory (LSTM) networks. Hydrology and Earth System Sciences22(11), 6005-6022.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). ” Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135-1144).

Shapley, Lloyd S. “A value for n-person games.” Contributions to the Theory of Games 2.28 (1953): 307-317.

I followed tutorials shown in these blogs as well:

CNNs for Time Series Applications

This post is meant to be an introduction to convolutional neural networks (CNNs) and how they can be applied to continuous prediction problems, such as time series predictions. CNNs have historically been utilized in image classification applications. At a high level, CNNs use small kernels (filters) that can slide over localized regions of an image and detect features from edges to faces, much in the same way as the visual cortex of a brain (Hubel and Wiesel, 1968). The basic concepts of a CNN were first introduced by Kunihiko Fukushima in 1980 and the first use of CNNs for image recognition were carried out by Yann LeCun in 1988. The major breakthrough for the algorithm didn’t happen until 2000 with the advent of GPUs and by 2015, CNNs were favored to win image recognition contests over other deep networks.

It is believed that recurrent style networks such as LSTMs are the most appropriate algorithms for time series prediction, but studies have been conducted that suggest that CNNs can perform equivalently (or better) and that appropriate filters can extract features that are coupled across variables and time while being computationally efficient to train (Bai et al., 2018, Rodrigues et al., 2021). Below, I’ll demonstrate some of the key characteristics of CNNs and how CNNs can be used for time series prediction problems.


Everything You Need to Know About Convolutional Neural Networks

Figure 1: CNN schematic for image classification (Sharma, 2018)

Figure 1 shows a schematic of a CNN’s architecture. The architecture is primarily comprised of a series of convolution and pooling layers followed by a fully connected network. In each convolution layer are kernel matrices that are convolved with the input into the convolution layer. It is up to the user to define the number of kernels and size of the kernels, but the weights in the kernel are learned using backpropagation. A bias is added to the output of the convolution layer and then passed through an activation function, such as ReLU function to yield feature maps. The feature maps are stacked in a cuboid of a depth that equals the number of filters. If the convolution layer is followed by a pooling layer, the feature maps are down-sampled to produce a lower dimensional representation of the feature maps. The output from the final pooling or convolutional layer is flattened and fed to the fully connected layers.

We will now look at the components of the architecture in more detail. To demonstrate how the convolutional layer works, we will use a toy example shown in Figure 2.

Figure 2: Convolution of a 3×3 kernel with the original image

Let’s say that our input is an image is represented as a 5×5 array and the filter is a 3×3 kernel that will be convolved with the image. The result is the array termed Conv1 which is just another array where each cell is the dot product between the filter and the 3×3 subsections of the image. The numbers in color represent the values that the filter is centered on. Note that the convolution operation will result in an output that is smaller than the input and can result in a loss of information around the boundaries of the image. Zero padding, which constitutes adding border of zeros around the input array, can be used to preserve the input size. The kernel matrices are the mechanisms by which the CNN is able to identify underlying patterns. Figure 3 shows examples of what successive output from convolution layers, or feature maps, can look like.

Figure 3: Convolutional layer output for a CNN trained to distinguish between cats and dogs (Dertat, 2017)

The filters in the first convolutional layer of a CNN retain most of the information of the image, particularly edges. The brightest colors represent the most active pixels. The feature maps tend to become more abstract or focused on specific features as you move deeper into the network (Dertat, 2017). For example, Block 3 seems to be tailored to distinguish eyes.

The other key type of layer is a pooling layer. A pooling layer is added after convolution to reduce dimensionality, which can both reduce computational time to train by reducing parameters but can also reduce the chances of overfitting. The most common type of pooling is max pooling which returns the max value in a NxN matrix pooling filter. This type of pooling retains the most active pixels in the feature map. As demonstrated in Figure 4, max pooling, using a 2×2 filter with a stride (or shift) of 2 pixels, reduces our Conv1 layer into a 2×2 lower dimensional matrix. One can also do average pooling instead of max pooling which would take the average of the values in each 2×2 subsection of the Conv1 layer.

Figure 4: Max pooling example

Application to Regression

CNNs are easiest to understand and visualize for image applications which provide a basis for thinking about how we can use CNNs in a regression or prediction application for time series. Let’s use a very simple example of a rainfall-runoff problem that uses daily precipitation and temperature to predict outflow in an ephemeral sub-basin within the Tuolumne Basin. Because the sub-basin features a creek that is ephemeral, this means that the creek can dry up across the simulation period and there can be extended periods of zero flow. This can make predictions in the basin very difficult. Here, we also implement a lag which allows us to consider the residence time of the basin and that precipitation/temperature from days before likely will contribute to predicting the outflow today. We use a lag of 18, meaning that we use the previous 18 values of precipitation and temperature to predict outflow. The CNN model is implemented within Keras in the code below.

#import modules

import numpy as np
import pandas as pd
from keras.utils import to_categorical
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense
from keras.layers.convolutional import Conv1D, Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers import Dropout, Activation, Flatten
from keras.optimizers import SGD
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook
import seaborn as sns
import os

df_ge = pd.read_csv("Sub_0_daily.csv", index_col=0) 

#Check for nulls
print("checking if any null values are present\n", df_ge.isna().sum())

#Specify the training columns by their names
train_cols = ["Precipitation","Temperature"]
label_cols = ["Outflow"]

# This function normalizes the input data
def Normalization_Transform(x):
    x_mean=np.mean(x, axis=0)
    x_std= np.std(x, axis=0)
    xn = (x-x_mean)/x_std
    return xn, x_mean,x_std

# This function reverses the normalization 
def inverse_Normalization_Transform(xn, x_mean,x_std):
    xd = (xn*x_std)+x_mean
    return xd

# building timeseries data with given timesteps (lags)
def timeseries(X, Y, Y_actual, time_steps, out_steps):
    input_size_0 = X.shape[0] - time_steps
    input_size_1 = X.shape[1]
    X_values = np.zeros((input_size_0, time_steps, input_size_1))
    Y_values = np.zeros((input_size_0,))
    Y_values_actual = np.zeros((input_size_0,))
    for i in tqdm_notebook(range(input_size_0)):
        X_values[i] = X[i:time_steps+i]
        Y_values[i] = Y[time_steps+i-1, 0]
        Y_values_actual[i] = Y_actual[time_steps+i-1, 0]
    print("length of time-series i/o",X_values.shape,Y_values.shape)
    return X_values, Y_values, Y_values_actual

df_train, df_test = train_test_split(df_ge, train_size=0.8, test_size=0.2, shuffle=False)
x_train = df_train.loc[:,train_cols].values
y_train = df_train.loc[:,label_cols].values
x_test = df_test.loc[:,train_cols].values
y_test = df_test.loc[:,label_cols].values    
#Normalizing training data
x_train_nor = xtrain_min_max_scaler.fit_transform(x_train)
y_train_nor = ytrain_min_max_scaler.fit_transform(y_train)

# Normalizing test data
x_test_nor = xtest_min_max_scaler.fit_transform(x_test)
y_test_nor = ytest_min_max_scaler.fit_transform(y_test)

# Saving actual train and test y_label to calculate mean square error later after training
y_train_actual = y_train
y_test_actual = y_test

#Building timeseries
X_Train, Y_Train, Y_train_actual = timeseries(x_train_nor, y_train_nor, y_train_actual, time_steps=18, out_steps=1)
X_Test, Y_Test, Y_test_actual = timeseries(x_test_nor, y_test_nor, y_test_actual, time_steps=18, out_steps=1)

#Define CNN model

def make_model(X_Train):
    input_layer = Input(shape=(X_Train.shape[1],X_Train.shape[2]))

    conv1 = Conv1D(filters=16, kernel_size=2, strides=1,
    conv2 = Conv1D(filters=32, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv1)
    conv3 = Conv1D(filters=64, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv2)
    flatten = Flatten()(conv3)
    dense1 = Dense(1152, activation='relu')(flatten)
    dense2 = Dense(576, activation='relu')(dense1)
    output_layer = Dense(1, activation='linear')(dense2)
    return Model(inputs=input_layer, outputs=output_layer)

model = make_model(X_Train)
model.compile(optimizer = 'adam', loss = 'mean_squared_error'), Y_Train, epochs=10)

#Prediction and inverting results 
ypred = model.predict(X_Test)
predict =inverse_Normalization_Transform(ypred,y_mean_train, y_std_train)

#Plot results
plt.figure(figsize=(11, 7))


plt.title('Outflow Prediction (Precipitation+Temperature,Epochs=10, Lag=18 hours)')
plt.ylabel('Outflow (cfs)')
plt.legend(['Actual Values','Predicted Values'], loc='upper right')


Just as with any algorithm, we normalize the input data and split it into testing and training sets. The CNN model is implemented in Keras and consists of three convolutional layers with kernel sizes that are explicitly defined to extract patterns that are coupled across variables and time. A schematic of the setup is shown in Figure 5.

Figure 5: Convolution layer setup for the Tuolumne case

Layer 1 uses a 1D convolutional layer with 16 filters of size 1×2 in order to extract features and interactions across the precipitation and temperature time series as demonstrated in the top left of Figure 5. The result of this is an output layer of 1x18x16. The second convolution layer uses 32, 3×1 filters which now will further capture temporal interactions down the output column vector. The third layer uses 64, 3×1 filters to capture more complex temporal trends which is convolved with the output from the Conv2 layer. Note that zero padding is added (padding =”same” in the code) to maintain the dimensions of the layers. The three convolutional layers are followed by a flattening layer and a three-layer dense network. The CNN was run 20 times and the results from the last iteration are shown in Figure 6. We also compare to an LSTM that has an equivalent 3-layer setup and that is also run 20 times. The actual outflow is shown in blue while predictions are shown in red.

Figure 6: CNN vs LSTM prediction

For all purposes, the visual comparison yields that CNNs and LSTMs work equivalently, though the CNN was considerably faster to train. Notably, the CNN does a better job of capturing the large extremes recorded on day 100 and day 900, while still capturing the dynamics of the lower flow regime. While these results are preliminary and largely un-optimized, the CNN shows the ability to outperform an LSTM for a style of problem that it is not technically designed for. Using the specialized kernels, the CNN learns the interactions (both across variables and temporally) without needing a mechanism specifically designed for memory, such as a cell state in an LSTM. Furthermore, CNNs can greatly take advantage of additional speedups from GPUs which doesn’t always produce large gain in efficiency for LSTM training. For now, we can at least conclude that CNNs are fast and promising alternatives to LSTMs that you may not have considered before. Future blog posts will dive more into the capabilities of CNNs in problems with more input variables and complex interactions, particularly if there seems to be a benefit from CNNs in resolving complex relationships that help to predict extremes.


Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology195(1), 215-243.

Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Rodrigues, N. M., Batista, J. E., Trujillo, L., Duarte, B., Giacobini, M., Vanneschi, L., & Silva, S. (2021). Plotting time: On the usage of CNNs for time series classification. arXiv preprint arXiv:2102.04179.

Sharma, V. (2018).

Dertat, A. (2017).

Visualizing High Dimensional Data Using the T-SNE Method

In this blog post, I am going to go over a machine learning method that is used to visualize high dimensional datasets. The method is called T-distributed Stochastic Neighbor Embedding, or T-SNE for short. The method was developed by the Google Brain Team (van der Maaten & Hinton, 2008). In this blog post, I describe T-SNE and provide a simple example in R.

Two weeks ago, Dave introduced a few useful dimensionality reduction methods to visualize the Pareto front. Last week, Rohini’s blog post provided some very useful information about self-organizing maps. I guess this blog post will officially make this a series of posts on dimensionality reduction and the visualization of high dimensional datasets. If you haven’t already read these two super informative blog posts, I highly recommend you take a look at them.

T-SNE is an extension of the Local Linear Embedding technique, and it uses an interesting method to maintain the local relationships and similarities that exist in the original high-dimensional datasets when transferring that into a 2-D plane. As van der Maaten and Hinton (2008) argue, other dimensionality reduction techniques (e.g., PCA) usually focus on macro distances in the datasets, which might lead to overlooking short-distance relationships. However, T-SNE is able to capture both local and macro distances in the data structure. Therefore, it can give more reasonable results when you are dealing with non-linear high dimensional datasets.

T-SNE uses the following formula, which basically creates a Gaussian distribution over each data point and computes the density of all other data points in respect to the distribution around that point. The denominator of the equation normalizes the nominator.

In this equation, xi is the reference point, and xj is its neighboring points. Also, Sigma is the bandwidth that returns the same perplexity for each point. Perplexity is a measure of uncertainty that has a direct relationship with entropy. For more information about it, you can read this Wikipedia page. Basically, perplexity is a hyper parameter of T-SNE, and the final outcome might be very sensitive to its value. Also, in this equation, pij is the probability of the similarity of each two points in the high dimensional space. Basically, if two points are close to each other, pij will be high. If they are far from each other, the probability of similarity is low.

To make the low dimensional map consistent with its original high dimensional dataset, T-SNE also calculates a similar point-to-point probability of similarity in the 2D map.

Basically, T-SNE moves the point on the 2D plain to find the locations that minimize the Kullback-Leibler divergence metric between the Pi distributions of the original data and the Qi distributions of the 2D plane.

As you probably noticed, in the low dimensional space, T-SNE uses the Student T-distribution curve (with one degree of freedom). The reason is that Student T (compared to Gaussian) is a fat-tailed distribution. This allows the T-SNE to separate dissimilar points with higher margins, which makes the map easier to interpret.

T-SNE Example in R

Here, I provide a simple R code example that demonstrates how you can use T-SNE to visualize high dimensional datasets. This example only has four dimensions, so it doesn’t really represent a high dimensional problem. As such, keep in mind that T-SNE shows its true capabilities when there are tens or even hundreds of dimensions.

First, you need to install “tsne” library and load it.


# We also need to load ggplot2 and ggpubr libraries


In this example, I use R’s famous “iris” data set. Here is how you can activate it.

# Load the iris dataset

Now you can use the following scrip to visualize the data using T-SNE.

# Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=30, max_iter=1000, check_duplicates = FALSE) # Run TSNE

# Create a data frame using IRIS values and the T-SNE values

# Specify column names
names(df_to_gg)<-c("Species", "Y.1", "Y.2")

# Show the objects in the 2D T-SNE representation
ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
  labs(title=paste("T-SNE Visualization- Perplexity Number = 30" )) +
  scale_color_manual(values = c("darkblue", "orange", "pink3"))

Here is the 2-D map that our code generates

Although T-SNE is a powerful method and has become quite popular in recent years, it might have some pitfalls. This blog post discusses some of these issues, including the sensitivity of the final map to perplexity hyperparameters or the fact that it can be tricky to interpret T-SNE maps because they might not represent the actual relationships between different clusters.

To explore the impacts of the perplexity hyperparameter on the final clusters, I use the following code to create T-SNE maps for different perplexity values.

# Here are the perplexity values that I am taking into account in my T-SNE analysis
perplexity_number_values<-c(2,5,10, 15, 30, 40)

# Initializing a plot list object
plot_list = list()

# This loop explores the effect of perplexity values on the T-SNE results
for (i_prp in 1:6){
  # I am setting a seed to makes sure that my results for different perplexities are not sensitive to any random factors
  # Perplexity number
  perplexity_number<- perplexity_number_values[i_prp]
  # Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
  y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=perplexity_number, max_iter=1000, check_duplicates = FALSE) # Run TSNE
  # Create a data frame using IRIS values and the T-SNE values
  # Specify Column names
  names(df_to_gg)<-c("Species", "Y.1", "Y.2")
  # Show the objects in the 2D T-SNE representation
  plt<-ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
    labs(title=paste("T-SNE Visualization- Perplexity Number = ", perplexity_number )) +
    scale_color_manual(values = c("darkblue", "orange", "pink3"))
  # Save figure in the plot list object
# Combine all the plot objects
plt_combined<-ggarrange(plotlist = plot_list[1:6], ncol = 2, nrow = 3, common.legend = T)

Here is the final result that underlines the high sensitivity of the location of our clusters on 2-D map to the choice of the perplexity values. There are other hyperparameters that can be tried as well, for example, the maximum number of iterations might change the final output in some cases.

There are many other very good tutorials on T-SNE, such as here and here. Also, for example, this tutorial provides a nice and easy-to-follow Python code for T-SNE. Finally, there are some great YouTube videos that clearly explain the logic behind T-SNE and its algorithm (for example, here and here).

Deeper Dive into Self-Organizing Maps (SOMs)

This blog post focuses on Self-Organizing Maps (SOMs) or Kohonen maps, first created by Teuvo Kohonen in the 1980s, and most recently introduced in Dave Gold’s latest blog post. SOMs are a type of artificial neural network (ANN) that are trained using unsupervised learning to produce a two-dimensional pattern representation of higher-dimensional datasets. SOMs are therefore a good algorithm for dimension reduction, and inputs that are closer in their higher-dimensional input vectors are also mapped closely together in the resulting 2D representation [1].

Self Organizing Maps. Recently, I learned about SOMs while… | by ...
User-defined neuron grid that represents the basis of SOM  [1]

SOMs can be very effective for clustering/classifying outputs and are advantageous because they “self-organize” to identify clusters. The SOM map is first initialized with a user-defined 2D grid of neurons instead of containing layers like traditional ANNs. The weight vector that characterizes the neurons can be randomly initialized with value or more favorably, by sampling across the eigenvectors of the two largest PCs of the input dataset. The training process that SOMs use to build maps is called “competitive learning”, which is also fundamentally different from backpropagation used for ANNs. During competitive learning, an input training point is fed into the algorithm. Its Euclidean distance from each of the neurons is calculated, and the closest neuron is termed the best matching unit, or BMU. The weight of the neuron and surrounding neurons is adjusted to be closer to the related input vector, using an update formula. This process is repeated for every input in the training data and for a large number of iterations until patterns and clusters begin to emerge [3]. The animation on the right illustrates competitive learning. Note how the locations of neurons are adjusted as training points get assigned to their respective BMUs and how whole grid shifts in response to the new samples.

Competitive learning [2]

At a high level, this is how SOMs are trained. Next, we will go through a clustering example using the kohonen package offered in R.

Test Case: Maryland Trout

For this text case, we will be using a presence/absence dataset of Maryland trout. Each data point is characterized by a feature vector that includes information on agricultural land cover, logarithmic distance to the nearest road, riffle quality, dissolved oxygen, and spring water temperature. The label associated with each point is a 0 or 1, denoting if trout was observed or not. We will train a SOM to cluster the feature vector and then investigate the characteristics of the resulting clusters. Below is a snippet of our dataset. We use columns 3-7 as our feature vector while column 2 denotes our label. SOMs can also be used for classification, which is not discussed in this blog post. Therefore, you can train the SOM only on a subset of the dataset and used the trained SOM to predict the clusters that the testing points will be assigned to.


First we must scale our data in order to make sure that variables in our feature vector that may have different magnitudes are treated equally by the algorithm. Then we define the SOM grid. The choice of the number of neurons generally follows the rule: 5*sqrt(# of datapoints). In our case, we have 84 datapoints, so technically a grid that is smaller than 7×7 will work well. I decide to use 5×5, which will be explained in the next paragraph. I also specify that I want the grid to be in a hexagonal shape.  Finally, we train our model by feeding in our input data, the user-defined grid, and specifying an iteration rate of 1000. This means that the input data will be presented 1000 times to the algorithm.

#Create the training dataset
data_train &amp;amp;lt;-data[, c(3:7)]
# Change the data frame with training data to a matrix
# Also center and scale all variables to give them equal importance during
# the SOM training process.
data_train_matrix &amp;amp;lt;- as.matrix(scale(data_train))
# Create the SOM Grid
som_grid &amp;amp;lt;- somgrid(xdim = 5, ydim=5, topo="hexagonal")
# Finally, train the SOM
som_model &amp;amp;lt;- som(data_train_matrix,
                 grid=som_grid, rlen = 1000,
        = TRUE )

Once the SOM is trained, there are a variety of figures that can be easily created. We can plot the training progress and observe how the distance to the BMU decreases as a function of iteration and starts to level off as we reach many iterations. This is ideal behavior because it means that the training points are associating closely with their BMU.

plot(som_model, type="changes")
Average distance from a test point to its BMU

We can also plot a heat map of the node counts, which shows how many of each of the training inputs are associated with each node. Any node that is grey means that no training points were assigned to that BMU. This may indicate that the map is too big, so you can go back and adjust the grid size adjust accordingly until the grey points are no longer existent. For this dataset, this was around 5×5. Furthermore, we want the number of points associated with each node to be as uniform as possible.

plot(som_model, type="count", main="Node Counts")

Next we can view the neighbor distances for each neuron, which is also called the U-matrix. Small distances indicate similarity between neighboring nodes.

plot(som_model, type="dist.neighbours", main = "SOM neighbor distances")

We can also view the weight vector of the nodes, which are termed codes. The fan in each node indicates the variables of prominence that are linking the datapoints assigned to the neuron. In our dataset, you can see that in the upper left corner of the map, the algorithm is assigning datapoints to these neurons because of the similarity among the agricultural area and water temperature variables in the feature vector. In the lower right corner, these datapoints are linked primarily due to similarities in riffle quality, and dissolved oxygen. This type of map is useful to understand by what features the points are actually starting to cluster.

plot(som_model, type="codes")

In my opinion, one of the most useful standard figures that the kohonen library creates are heatmaps that show the gradient of input values across the nodes. A heatmap can be created for each feature and when placed side by side, they can help demonstrate correlation between inputs. We can clearly see that areas characterized by low agricultural land tend to have a better riffle quality, lower water temperature, and a higher dissolved oxygen. While these patterns are expected and have a strong physical explanation, these maps can also help to discover other unexpected patterns. Note that the values on the side bar are normalized and can be transformed back to the original scale.


#Plot heatmaps for each feature
plot(som_model, type = "property", property = getCodes(som_model)[,1], main=colnames(getCodes(som_model))[1],
plot(som_model, type = "property", property = getCodes(som_model)[,2], main=colnames(getCodes(som_model))[2],
plot(som_model, type = "property", property = getCodes(som_model)[,3], main=colnames(getCodes(som_model))[3],
plot(som_model, type = "property", property = getCodes(som_model)[,4], main=colnames(getCodes(som_model))[4],
plot(som_model, type = "property", property = getCodes(som_model)[,5], main=colnames(getCodes(som_model))[5],
#Unscale the variables- do this for each input column in data_train
var_unscaled &amp;amp;lt;- aggregate(as.numeric(data_train[,1]), by=list(som_model$unit.classif), FUN=mean, simplify=TRUE)[,2]
plot(som_model, type = "property", property=var_unscaled, main=colnames(getCodes(som_model))[1],

Finally we can cluster our neuron codes weight vectors) to add some boundaries to our map. Typically what is done is to use k-means clustering to determine an appropriate number of clusters that, in this case, minimize the within-cluster sum of squares (WCSS). We can then plot the results and look for an elbow in the scree plot.


#k-means clustering code from [4]
mydata = som_model$codes[[1]]
wss = (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) {
wss[i] = sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main="Within cluster sum of squares (WCSS)") #Define a palette and cluster with the number of clusters determined using k-means

The elbow isn’t so clear in this dataset, but I go with 8 clusters and then use hierarchical clustering to assign each neuron to a cluster. Finally, we can obtain the cluster that each point is assigned to and insert that back as a column in our dataset.

#Final clustering code from [4]
#Define a palette and cluster with the number of clusters determined using k-means
pretty_palette &amp;amp;lt;- c("#1f77b4", '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2')
som_cluster &amp;amp;lt;- cutree(hclust(dist(som_model$codes[[1]])), 8)
# plot these results:
plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters")
add.cluster.boundaries(som_model, som_cluster)

# get vector with cluster value for each original data sample
cluster_assignment &amp;amp;lt;- som_cluster[som_model$unit.classif]
# for each of analysis, add the assignment as a column in the original data:
data$cluster &amp;amp;lt;- cluster_assignment



We’re now interested in understanding which of our points actually clustered together, so we generate summary statistics for each cluster. The points highlighted in blue represent the best values for the feature while the points in red represent the worst. Note that the clusters with a large amount of trout presence on average are characterized by best values of features that are favorable for aquatic life, while the clusters that exhibit a lower presence of trout do not have good values for these features. You can also see that clusters with similar characteristics are side by side on the clustering map.


Overall, I found the kohonen library to be a really useful tool to start to internalize how SOMs work and to fairly quickly be able to do a nice analysis on both the features and the resulting clusters. I would highly recommend this package as a starting point. There are packages in Python such as SimpSom and minisom that have snappier graphics and will be a good next step to explore.



[2] By Chompinha – Own work, CC BY-SA 4.0,


[4] Much direction for structuring this tutorial comes from

Getting started with API requests in Python

This is an introductory blogpost on how to work with Application Programming Interfaces (APIs) in Python. Seeing as this is our blog’s first post on this topic I’ll spend some time explaining some basic information I’ve had to learn in the process of doing this and why it might be useful in your research. There are several blogs and websites with tutorials online and there’s no point repeating, but I’ll try to explain this for an audience like me, i.e., (a) has no formal training in computer science/web servers/HTTP but competent with scripting, and (b) interested in using this for research purposes.

What is an API?

Many sites (e.g., Facebook, Twitter, many many more) make their data available through what’s called Application Programming Interfaces or APIs. APIs basically allow software to interact, and send and receive data from servers so as to typically provide additional services to businesses, mobile apps and the like, or allow for additional analysis of the data for research or commercial purposes (e.g., collecting all trending Twitter hashtags per location to analyze how news and information propagates).

I am a civil/environmental engineer, what can APIs do for me?

APIs are particularly useful for data that changes often or that involves repeated computation (e.g., daily precipitation measurements used to forecast lake levels). There’s a lot of this kind of data relevant for water resources systems analysts, easily accessible through APIs:

I’m interested, how do I do it?

I’ll demonstrate some basic information scripts using Python and the Requests library in the section below. There are many different API requests one can perform, but the most common one is GET, which is a request to retrieve data (not to modify it in any way). To retrieve data from an API you need two things: a base URL and an endpoint. The base URL is basically the static address to the API you’re interested in and an endpoint is appended to the end of it as the server route used to collect a specific set of data from the API. Collecting different kinds of data from an API is basically a process of manipulating that endpoint to retrieve exactly what is needed for a specific computation. I’ll demo this using the USGS’s API, but be aware that it varies for each one, so you’d need to figure out how to construct your URLs for the specific API you’re trying to access. They most often come with a documentation page and example URL generators that help you figure out how to construct them.

The base URL for the USGS API is

I am, for instance, interested in collecting streamflow data from a gage near my house for last May. In Python I would set up the specific URL for these data like so:

response = requests.get("¶meterCd=00060&siteType=ST&siteStatus=all")

For some interpretation of how this was constructed, every parameter option is separated by &, and they can be read like so: data in json format; indented so I can read them more easily; site number is the USGS gage number; start and end dates; a USGS parameter code to indicate streamflow; site type ‘stream’; any status (active or inactive). This is obviously the format that works for this API, for a different API you’d have to figure out how to structure these arguments for the data you need. If you paste the URL in your browser you’ll see the same data I’m using Python to retrieve.

Object response now contains several attributes, the most useful of which are response.content and response.status_code. content contains the content of the URL, i.e. your data and other stuff, as a bytes object. status_code contains the status of your request, with regards to whether the server couldn’t find what you asked for or you weren’t authenticated to access that data, etc. You can find what the codes mean here.

To access the data contained in your request (most usually in json format) you can use the json function contained in the library to return a dictionary object:

data = response.json()

The Python json library is also very useful here to help you manipulate the data you retrieve.

How can I use this for multiple datasets with different arguments?

So obviously, the utility of APIs comes when one needs multiple datasets of something. Using a Python script we can iterate through the arguments needed and generate the respective endpoints to retrieve our data. An easier way of doing this without manipulating and appending strings is to set up a dictionary with the parameter values and pass that to the get function:

parameters = {"format": 'json', "indent": 'on',
              "sites": '04234000', "startDT": '2020-05-01',
              "endDT": '2020-05-31', "parameterCd":'00060',
              "siteType": 'ST', "siteStatus":'all'}
response = requests.get("", 

This produces the same data, but we now have an easier way to manipulate the arguments in the dictionary. Using simple loops and lists to iterate through one can retrieve and store multiple different datasets from this API.

Other things to be aware of

Multiple other people use these APIs and some of them carry datasets that are very large so querying them takes time. If the server needs to process something before producing it for you, that takes time too. Some APIs limit the number of requests you can make as a result and it is general good practice to try to be as efficient as possible with your requests. This means not requesting large datasets you only need parts of, but being specific to what you need. Depending on the database, this might mean adding several filters to your request URL to be as specific as possible to only the dates and attributes needed, or combining several attributes (e.g., multiple gages in the above example) in a single request.

Using Docker with Virtual Machines

This post is a continuation of my previous post on accessing virtual machines on RedCloud. In this post, you will learn how to run a Docker container on a VM Ubuntu image, but you can also do this tutorial with Ubuntu on a local machine.

Why Containerize Code?

Let’s use an example: Say that you have three Python applications that you want to run on your computer, but they all use different version of Python or its packages. We cannot host these applications at the same time using Python on our computer…so what do we do? Think of another case- you want to share neural network code with a collaborator, but you don’t want to have to deal with the fact that it’s incredibly difficult for them to get TensorFlow and all its dependencies downloaded on their machine. Containerization allows us to isolate the installation and running of our application without having to worry about the setup of the host machine. In other words, everything someone would need to run your code will be provided in the container. Shown in Figure 1, each container shares the host OS’s kernel but has a virtual copy of some of the components of the OS. This allows the containers to be isolated from each other while running on the same host. Therefore, containers are exceptionally light and take only seconds to start [1].


Fig 1. Docker Container Example [1]


Docker started as a project to build single-application LXC (linux) containers that were more portable and flexible, but is now its own container runtime environment. It is written in GO and developed by DotCloud (a PaaS company). A Docker engine is used to build and manage a Docker image, which is just a template that contains the application and all of its dependencies that are required for it to run. A Docker container is a running instance of a Docker image. A Docker engine is composed of three main parts: a server known as the Docker daemon, a rest API, and a client. The docker daemon creates and manages images and containers, the rest API helps to link the server and applications, and the client (user) interacts with the docker daemon through the command line (Figure 2).


Fig.2 Docker Engine [2]


Running a Docker Container on Ubuntu

  1. We will be setting up and running a Docker container that contains code to train a rainfall-runoff LSTM model built in Python using Tensorflow. The Github repo with the LSTM code and data can be downloaded here.
  2. Spin up a VM instance (Ubuntu 18.04) or use your own machine if you have a partition.
  3. I use MobaXterm to SSH into the VM instance and drag the HEC_HMS_LSTM folder into my home directory.
  4. Within the directory, you will see 2 additional files that are not part of the main neural network code: requirements.txt and jupyter.dockerfile. A requirements.txt file contains a list of only the packages that are necessary to run your application. You can make this file with just two lines of code: pip install pipreqs and then pipreqs  /path/to/project. The created file will look like this:


requirements.txt file

The jupyter.dockerfile is our Dockerfile. It is a text file that contains all the commands a user could call on the command line to assemble an image.



Here in our Dockerfile, we are using the latest Ubuntu image and setting our working directory to the HEC_HMS_LSTM folder that has the neural network code,, that we would like to execute. We start by looking for updates and then install Python, pip, and jupyter. Then we need to take our working directory contents and copy them into the container. We copy the requirements.txt file into the container and then add the whole HEC_HMS_LSTM folder into the container. We then use pip to install all of the packages listed in requirements.txt. Finally, we instruct the docker daemon to run the python script,

5. Next we have to download docker onto Ubuntu using the command line. This is the only thing that anyone will have to download to run your container. The steps to download Docker for Ubuntu through the command line are pretty easy using this link. You may be asked to allow automatic restarts during the installation, and you should choose this option. When Docker is done downloading, you can check to see that the installation was successful by seeing if the service is active.


Successful Docker Installation

6. Now that Docker is downloaded, we can build our Docker image and then run the container. We build the image using: 

sudo docker build -f jupyter.dockerfile -t jupyter:latest .

In this command, the -f flag denotes the name of the dockerflile and -t is the name that we would like to tag our image with. If you’re running multiple containers, tagging is helpful to distinguish between the containers. The build will go through each step of the dockerfile. Be cognizant of how much space you have on your disk to store the downloaded Docker, the python libraries and the data you will generate; you may have to go back and resize your VM instance. We can then run our image using:

sudo docker run jupyter:latest

You’ll see the neural network training and the results of the prediction.


Successfully training the neural network in a Docker Container on a virtual machine









Beginner’s Guide to TensorFlow and Keras

This post is meant to provide a very introductory overview of TensorFlow and Keras and concludes with example of a how to use these libraries to implement a very basic neural network.


At core, TensorFlow is a free, open-source symbolic math library that expresses calculations in terms of dataflow graphs that are composed of nodes and tensors. TensorFlow is particularly adept at handling operations required to train neural networks and thus has become a popular choice for machine learning applications1. All nodes and tensors as well as the API are Python-based. However, the actual mathematical operations are carried out efficiently using high-performance C++ binaries2. TensorFlow was originally developed by the Google Brain Team for internal use but since has been released for public use. The library has a reputation of being on the more technical side and less intuitive to new users, so based on user feedback from initial versioning, TensorFlow decided to add Keras to their core library in 2017.


Keras is an open-source neural network Python library that has become a popular API to run on top of TensorFlow and other deep learning frameworks. Keras is useful especially for beginners in machine learning because it offers a user-friendly, intuitive, and quick way to build models. Particularly, users tend to most interested in quickly implementing neural networks. In Keras, neural network models are composed of standalone modules of neural network layers, cost functions, optimizers, activation functions, and regularizers that can be built separately and combined3.

Example: Predicting the age of abalone

In the example below, I will implement a very basic neural network to illustrate the main components of Keras. The problem we are interested in solving is predicting the age of abalone (sea snails) given a variety of physical characteristics. Traditionally, the age of abalone is calculated by cutting the shell, staining it, and counting the number of rings in the shell, so the goal is to see if less time-consuming and intrusive measurements, such as the diameter, length, and gender, can be used to predict the age of the snail. This dataset is one of the most popular regression datatsets from the UC Irvine Machine Learning Repository. The full set of attributes and dataset can be found here.

Step 1: Assessing and Processing Data

The first step to approaching this (and any machine learning) problem is to assess the quality and quantity of information that is provided in the dataset. We first import relevant basic libraries. More specific libraries will be imported above their respective functions for illustrative purposes. Then we import the dataset which has 4177 observations of 9 features and also has no null values.

#Import relevant libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#Import dataset
dataset = pd.read_csv('Abalone_Categorical.csv')
#Print number of observations and features
print('This dataset has {} observations with {} features.'.format(dataset.shape[0], dataset.shape[1]))
#Check for null values

Note that the dataset is composed of numerical attributes except for the “Gender” column which is categorical. Neural networks cannot work with categorical data directly, so the gender must be converted to a numerical value. Categorical variables can be assigned a number such as Male=1, Female=2, and Infant=3, but this is a fairly naïve approach that assumes some sort of ordinal nature. This scale will imply to the neural network that a male is more similar to a female than an infant which may not be the case. Instead, we instead use one-hot encoding to represent the variables as binary vectors.

#One hot encoding for categorical gender variable

Gender = dataset.pop('Gender')

dataset['M'] = (Gender == 'M')*1.0
dataset['F'] = (Gender == 'F')*1.0
dataset['I'] = (Gender == 'I')*1.0

Note that we now have 10 features because Male, Female, and Infant are represented as the following:





0 0




0 0


It would be beneficial at this point to look at overall statistics and distributions of each feature and to check for multicollinearity, but further investigation into these topics will not be covered in this post. Also note the difference in the ranges of each feature. It is generally good practice to normalize features that have different units which likely will correspond to different scales for each feature. Very large input variables can lead to large weights which, in turn, can make the network unstable. It is up to the user to decide on the normalization technique that is appropriate for their dataset, with the condition that the output value that is returned also falls in the range of the chosen activation function. Here, we separate the dataset into a “data” portion and a “label” portion and use the MinMaxScalar from Scikit-learn which will, by default, transform the data into the range: (0,1).

#Reorder Columns

dataset = dataset[['Length', 'Diameter ', 'Height', 'Whole Weight', 'Schucked Weight','Viscera Weight ','Shell Weight ','M','F','I','Rings']]

#Separate input data and labels

#Normalize the data using the min-max scalar

from sklearn.preprocessing import MinMaxScaler
scalar= MinMaxScaler()
X= scalar.fit_transform(X)
y= y.reshape(-1,1)

We then split the data into a training set that is composed of 80% of the dataset. The remaining 20% is the testing set.

#Split data into training and testing 

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Step 2: Building and Training the Model

Then we build the Keras structure. The core of this structure is the model, which is of the “Sequential” form. This is the simplest style of model and is composed of a linear stack of layers.

#Build Keras Model

import keras
from keras import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=10, input_dim=10,activation='relu'))
model.compile(optimizer='adam', loss='mean_squared_error',  metrics=['mae','mse'])

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10),y_train,batch_size=5, validation_split = 0.2, callbacks=[early_stop], epochs=100)

# Model summary for number of parameters use in the algorithm

Stacking layers is executed with model.add(). We stack a dense layer of 10 nodes that has 10 inputs feeding into the layer. The activation function chosen for this layer is a ReLU (Rectified Linear Unit), a popular choice due to better gradient propagation and sparser activation than a sigmoidal function for example. A final output layer with a linear activation function is stacked to simply return the model output. This network architecture was picked rather arbitrarily, but can be tuned to achieve better performance. The model is compiled using model.compile(). Here, we specify the type of optimizer- in this case the Adam optimizer which has become a popular alternative to the more traditional stochastic gradient descent. A mean-squared-error loss function is specified and the metrics reported will be “mean squared error” and “mean absolute error”. Then we call to iterate training in batch sizes. Batch size corresponds to the number of training instances that are processed before the model is updated. The number of epochs is the number of complete passes through the dataset. The more times that the model can see the dataset, the more chances it has to learn the patterns, but too many epochs can also lead to overfitting. The number of epochs appropriate for this case is unknown, so I can implement a validation set that is 20% of my current training set. I set a large value of 100 epochs and add early stopping criteria that stops training when the validation score stops improving and helps prevent overfitting.

Training and validation error can be plotted as a function of epochs4 .

def plot_history(history):
  hist = pd.DataFrame(history.history)
  hist['epoch'] = history.epoch

  plt.ylabel('Mean Square Error [$Rings^2$]')
  plt.plot(hist['epoch'], hist['mean_squared_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_squared_error'],
           label = 'Val Error')


This results in the following figure:


Error as function of epochs

Step 3: Prediction and Assessment of Model

Once our model is trained, we can use it to predict the age of the abalone in the test set. Once the values are predicted, then they must be re-scaled back which is performed using the inverse_transform function from Scikit-learn.

#Predict testing labels

y_pred= model.predict(X_test)

#undo normalization 


Then the predicted and actual ages are plotted along with a 1-to-1 line to visualize the performance of the model. An R-squared value and a RMSE are calculated as 0.554 and 2.20 rings respectively.

#visualize performance
fig, ax = plt.subplots()
ax.scatter(y_test_transformed, y_pred_transformed)
ax.plot([y_test_transformed.min(), y_test_transformed.max()], [y_test_transformed.min(), y_test_transformed.max()], 'k--', lw=4)
ax.set_xlabel('Measured (Rings)')
ax.set_ylabel('Predicted (Rings)')

#Calculate RMSE and R^2
from sklearn.metrics import mean_squared_error
from math import sqrt
rms = sqrt(mean_squared_error(y_test_transformed, y_pred_transformed))

from sklearn.metrics import r2_score


Predicted vs. Actual (Measured) Ages

The performance is not ideal, but the results are appropriate given that this dataset is notoriously hard to use for prediction without other relevant information such as weather or location. Ultimately, it is up to the user to decide if these are significant/acceptable values, otherwise the neural network hyperparameters can be further fine-tuned or more input data and more features can be added to the dataset to try to improve performance.


[1]Dean, Jeff; Monga, Rajat; et al. (November 9, 2015). “TensorFlow: Large-scale machine learning on heterogeneous systems” (PDF) Google Research

[2] Yegulalp, Serdar. “What Is TensorFlow? The Machine Learning Library Explained.” InfoWorld, InfoWorld, 18 June 2019,

[3]“Keras: The Python Deep Learning Library.” Home – Keras Documentation,

[4] “Basic Regression: Predict Fuel Efficiency  :   TensorFlow Core.” TensorFlow,

Dynamic Emulation in Water Resources Applications

The purpose of this blog post is to introduce dynamic emulation in the context of applications to hydrology. Hydrologic modelling involves implementing mathematical equations to represent physical processes such as precipitation, runoff, and evapotranspiration and to characterize energy and water flux within the hydrologic system (Abbott et al., 1986). Users of a model might be interested in using it to approach a variety of problems related to, for instance, modeling the rainfall-runoff process in a river basin. The user might try to validate the model, perform sensitivity or uncertainty analysis, determine optimal planning and management of the basin, or test how hydrology of the basin is affected by different climate scenarios. Each of these problems requires a numerically intensive Monte Carlo style approach for which a physical model is not conducive. A solution to this problem is to create an emulator for the physical model. An emulator is a computationally efficient model whose response characteristics mimic those of a complex model as closely as possible. This model can then replace the more complicated model in computationally intensive exercises (Tych & Young, 2012).

There are many different approaches to creating an emulator; one particularly unified approach is Dynamic Emulation Modelling (DEMo) (Castelletti et al., 2012). DEMo seeks to accomplish three goals when creating an emulator:

  1. The emulator is less computationally intensive than the physical model.
  2. The emulator’s input-output behavior approximates as well as possible the behavior of the physical model.
  3. The emulator is credible to users.

DEMo has five main steps:

    1. Design of Computer Experiments: Determine a set of input data to build the emulator off of that will cover the range of responses of the physical model
    2. Variable Aggregation: Reduce dimensionality of the input data
    3. Variable Selection: Select components of the reduced inputs that are most relevant to explaining the output data
    4. Structure Identification and Parameter Estimation: In the case of a rainfall runoff model, choose a set of appropriate black box models that can capture the complex, non-linear process and fit the parameters of these models accordingly.
    5. Evaluation and Physical Interpretation: Evaluate the model on a validation set and determine how well the model’s behavior and structure can be explained or attributed to physical processes.

The next section outlines two data-driven style models that can be used for hydrologic emulation.

Artificial Neural Networks (ANNs)

Rainfall-runoff modelling is one of the most complex and non-linear hydrologic phenomena to comprehend and model. This is due to tremendous spatial and temporal variability in watershed characteristics. Because ANNs can mimic high-dimensional non-linear systems, they are a popular choice for rainfall-runoff modeling (Maier at al., 2010). Depending on the time step of interest as well as the complexity of the hydrology of the basin, a simple feedforward network may be sufficient for accurate predictions. However, the model may benefit from having the ability to incorporate memory that might be inherent in the physical processes that dictate the rainfall-runoff relationship. Unlike feedforward networks, recurrent neural networks are designed to understand temporal dynamics by processing inputs in sequential order (Rumelhart et al., 1986) and by storing information obtained from previous outputs. However, RNNs have trouble learning long-term dependencies greater than 10 time steps (Bengio, 1994). The current state of the art is Long Short-Term Memory (LSTM) models. These RNN style networks contain a cell state that has the ability of learn long-term dependencies as well as gates to regulate the flow of information in and out the cell, as shown in Figure 1.


Figure 1: Long Short-Term Memory Module Architecture1

LSTMs are most commonly used in speech and writing recognition but have just begun to be implemented in hydrology applications with success especially in modelling rainfall-runoff in snow-influenced catchments. Kratzert et al., 2018, show that the LSTM is able to outperform the Sacramento Soil Moisture Accounting Model (SAC-SMA) coupled with a Snow-17 routine to compute runoff in 241 catchments.

Support Vector Regression (SVR)

Support vector machines are commonly used for classification but have been successfully implemented for working with continuous values prediction in a regression setting. Support vector regression relies on finding a function within a user specified level of precision, ε, from the true value of every data point, shown in Figure 2.


Figure 2: Support Vector Regression Model2

It is possible that this function may not exist, and so slack variables, ξ , are introduced which allow errors up to ξ to still exist. Errors that lie within the epsilon bounds are treated as  0, while points that lie outside of the bounds will have a loss equal to the distance between the point and the ε bound. Training an SVR model requires solving the following optimization problem:


Where w is the learned weight vector and xand yare training points. C is a penalty parameter imposed on observations that lie outside the epsilon margin and also serves as a method for regularization. In the case that linear regression is not sufficient for the problem, the inner products in the dual form of the problem above can be substituted with a kernel function that maps x to a higher dimensional space, This allows for estimation of non-linear functions. Work by Granata et al., 2016 compares an SVR approach with EPA’s Storm Water Management Model (SWMM) and finds comparable results in terms of RMSE and R2 value.


Abbott, M.b., et al. “An Introduction to the European Hydrological System — Systeme Hydrologique Europeen, ‘SHE’, 1: History and Philosophy of a Physically-Based, Distributed Modelling System.” Journal of Hydrology, vol. 87, no. 1-2, 1986, pp. 45–59., doi:10.1016/0022-1694(86)90114-9.

Bengio, Y., et al. “Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 157–166., doi:10.1109/72.279181.

Castelletti, A., et al. “A General Framework for Dynamic Emulation Modelling in Environmental Problems.” Environmental Modelling & Software, vol. 34, 2012, pp. 5–18., doi:10.1016/j.envsoft.2012.01.002.

Castelletti, A., et al. “A General Framework for Dynamic Emulation Modelling in Environmental Problems.” Environmental Modelling & Software, vol. 34, 2012, pp. 5–18., doi:10.1016/j.envsoft.2012.01.002.

Granata, Francesco, et al. “Support Vector Regression for Rainfall-Runoff Modeling in Urban Drainage: A Comparison with the EPA’s Storm Water Management Model.” Water, vol. 8, no. 3, 2016, p. 69., doi:10.3390/w8030069.

Kratzert, Frederik, et al. “Rainfall–Runoff Modelling Using Long Short-Term Memory (LSTM) Networks.” Hydrology and Earth System Sciences, vol. 22, no. 11, 2018, pp. 6005–6022., doi:10.5194/hess-22-6005-2018.

Maier, Holger R., et al. “Methods Used for the Development of Neural Networks for the Prediction of Water Resource Variables in River Systems: Current Status and Future Directions.” Environmental Modelling & Software, vol. 25, no. 8, 2010, pp. 891–909., doi:10.1016/j.envsoft.2010.02.003.

Rumelhart, David E., et al. “Learning Representations by Back-Propagating Errors.” Nature, vol. 323, no. 6088, 1986, pp. 533–536., doi:10.1038/323533a0.

Tych, W., and P.c. Young. “A Matlab Software Framework for Dynamic Model Emulation.” Environmental Modelling & Software, vol. 34, 2012, pp. 19–29., doi:10.1016/j.envsoft.2011.08.008.



(2) Kleynhans, Tania, et al. “Predicting Top-of-Atmosphere Thermal Radiance Using MERRA-2 Atmospheric Data with Deep Learning.” Remote Sensing, vol. 9, no. 11, 2017, p. 1133., doi:10.3390/rs9111133.