Sensitivity Analysis Tools

Sensitivity analysis (SA) is one of the main themes of the Water Programming Blog. Several blog posts cover theoretical aspects of sensitivity analysis (for example, here, here, and here), and many others explain how to efficiently and elegantly visualize sensitivity analysis results (for example, here and here). In addition, there are many blog posts related to SALib, a widely used Python library developed at Cornell University by former members of Dr. Reed's research group (for example, here, here, and here).

Recently, I have been trying to put together a comprehensive list of other SA tools, and I thought it might be useful to write a blog post on the topic. I organized the following list based on the platforms I have explored so far: MATLAB, Python, and R. After that, I introduce some other open-source and commercial SA tools.

MATLAB

Many MATLAB packages have been developed to perform sensitivity analysis and uncertainty quantification. As the following table shows, they have been created by a variety of universities and research institutes, and several of them cover different sensitivity analysis methods, such as regression-based SA, variance-based SA (e.g., Sobol), and derivative-based SA. All of them support at least two sampling techniques, such as Latin Hypercube sampling. Many of them are generic (discipline-free) and can be used to answer different types of questions; however, a few (e.g., PeTTSy and DyGloSA) have been tailored to specific applications, such as biological models. Almost all of them also include some post-processing and visualization components.

There are two toolboxes that work in platforms other than MATLAB. The SAFE package developed by Pianosi et al. (2015) has R and Python versions, and the SaSAT package developed at the University of New South Wales works in Microsoft Excel.

Abbreviation | Full Name | Example of Methods Supported | Institution
GSAT | Global Sensitivity Analysis Toolbox | Sobol and FAST | MATLAB
SAFE | Sensitivity Analysis For Everybody | EET (Morris method), RSA, Sobol, FAST, and PAWN | University of Bristol
GSUA | Global Sensitivity and Uncertainty Analysis Toolbox | Sobol | MATLAB
GUI-HDMR | Global Sensitivity Analysis Toolbox | Global sensitivity analysis using HDMR | University of Leeds
DyGloSA | Dynamical Global Sensitivity Analysis Toolbox | Dynamical global parameter sensitivity analysis (GPSA) of ODE models | University of Luxembourg
PeTTSy | Perturbation Theory Toolbox for Systems | Perturbation analysis of complex systems biology models | University of Cambridge
SaSAT | Sampling and Sensitivity Analysis Tools | Regression-based (Pearson, Spearman, and partial rank correlation coefficients) | The University of New South Wales
SensSB | Sensitivity Analysis in Systems Biology models | Local SA, derivative- and variance-based global SA | Process Engineering Group at IIM-CSIC (Vigo, Spain)
SobolGSA | Global Sensitivity Analysis and Metamodeling Software | Morris, Sobol, FAST, and derivative-based | Imperial College London
SUMO | SUrrogate MOdeling Toolbox | Surrogate models, sensitivity analysis | Ghent University
UQLab | The Framework for Uncertainty Quantification | Morris, Kucherenko, ANCOVA, Borgonovo, Sobol | ETH Zurich
FAST: Fourier Amplitude Sensitivity Testing
EET: Elementary Effects Test
RSA: Regional Sensitivity Analysis 

Python

Interestingly, I was not able to find many Python libraries, and most of the ones that I did find were developed for specific applications. Please leave a comment if you are aware of any other packages that are not listed here. Among these packages, SALib seems to cover the most SA and sampling methods (a minimal usage sketch follows the table below). There are two SA and QU packages that have C++ versions (OpenTURNS and UQTk). Also, uncertainpy was originally developed for neuroscience applications.

Abbreviation | Description | Example of Methods Supported | Institution
SALib | Python sensitivity analysis library | Sobol, Morris, FAST, RBD-FAST, Delta Moment-Independent Measure, derivative-based, factorial | Cornell University
uncertainpy | Uncertainty quantification and sensitivity analysis library | Sobol | University of Oslo
MATK | Model Analysis ToolKit | FAST, Sobol | Los Alamos National Laboratory
UQTk | Quantification of uncertainty in numerical models | Sobol | Sandia National Lab
OpenTURNS | Open source initiative for the Treatment of Uncertainties | Spearman correlation coefficients, Sobol, ANCOVA, UQ | Technical University of Denmark
varsens | Variance Based Sensitivity Analysis | Sobol | Vanderbilt University
FAST: Fourier Amplitude Sensitivity Testing
QU: Quantification of Uncertainty
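
For readers new to SALib, below is a minimal sketch of a Sobol analysis. The two-parameter test function and its bounds are made up purely for illustration; a fuller SALib workflow (with the Delta method) appears in the factor prioritization post below.

# Minimal SALib sketch: Sobol indices for a made-up two-parameter test function
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    'num_vars': 2,
    'names': ['x1', 'x2'],
    'bounds': [[0.0, 1.0], [0.0, 1.0]]
}

X = saltelli.sample(problem, 1024)   # Saltelli sampling scheme
Y = X[:, 0]**2 + 0.5*X[:, 1]         # hypothetical model output
Si = sobol.analyze(problem, Y, print_to_console=True)
print(Si['S1'], Si['ST'])            # first-order and total-order indices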

R

I was able to find about fifty R packages that have sensitivity analysis features. The following table lists the ones with the most comprehensive SA functionality; the rest appear to have been developed for specific areas of science and offer limited SA functionality. Some of these are RMut, pksensi, ivmodel, FME, episensr, and pse.

Based on what I found, the sensitivity package seems to cover the widest range of SA methods. Readers can refer to this blog post for more information about the sensitivity package.

Name | Example of Methods Supported
sensobol | Third-order Sobol
sensitivity | Sobol, Morris, FAST, RBD-FAST, DELSA, derivative-based, factorial
ODEsensitivity | Morris, Sobol
multisensi | SA on models with multivariate outputs
konfound | Robustness and sensitivity of empirical models
fast | FAST
BASS | Sobol
FAST: Fourier Amplitude Sensitivity Testing

Other Platforms

There are many other SA tools that have been developed on other platforms, and the following table lists only a few of them. There are also several commercial SA platforms, such as SDI, VISYOND, and SMARTUQ, that seem to have nice graphical user interfaces (GUIs); however, because they are not free and their source code is not available, they may have limited use in academic research.

Abbreviation | Main Applications | Programming Language | Institution
Dakota | Optimization, QU, SA (Sobol, FAST, Morris) | C++ | Sandia National Laboratory
PSUADE | QU, Spearman and Pearson correlation coefficients, Sobol, Morris, FAST | C++ | Lawrence Livermore National Laboratory
SIMLab | Sobol, FAST, Morris | GUI-based | The European Commission's science and knowledge service
QU: Quantification of Uncertainty
FAST: Fourier Amplitude Sensitivity Testing

Please leave a comment and let me know if you are aware of any other useful tools that I did not list here.

Factor prioritization and factor fixing: how to know what’s important

There have been several blog posts on sensitivity analysis (SA) on this blog, focusing primarily on tools to perform it (e.g., SALib) and to visualize its outputs. Today I'll be providing some more information on how to decide which factors are most important in affecting our output and which are largely inconsequential. Picking what is actually important for what we care about is obviously largely subjective and case-dependent, but this post is meant to provide some support for that exercise. I will be performing a global sensitivity analysis of a system, resulting in a rank-ordering of the most important factors driving variability in the output (i.e., factor prioritization), which can be used to decide which least influential factors can be fixed to simplify the model (i.e., factor fixing) [1].

The scripts I’ll be using can be found here, and I’ll be using a fishery model to demonstrate, as a simplified representation of a socio-ecological system we’re trying to manage. The procedure I’ll be following has been based on the work found in [2-4].

The idea is this:
  1. I generate 1000 samples of uncertain factors that might be driving variability in my outcome (let's call this Set 1). I apply a certain SA method on the samples and the outcomes and get sensitivity indices for each of my factors, ranking them from most important to least. Where do I draw the line between important and not important?
  2. We can create a Set 2, using only the T most important factors from our Set 1 sample, and fixing all other factors to their default values.
  3. We can also create a Set 3, now fixing the T most important factors to defaults and using the sampled values of all other factors from Set 1.

If we classified our important and unimportant factors correctly, then the correlation coefficient between the model outputs of Set 2 and Set 1 should approximate 1 (since we’re fixing all factors that don’t matter), and the correlation coefficient between outputs from Set 3 and Set 1 should approximate 0 (since the factors we sampled are inconsequential to the output).

Here's how it's done using SALib and the Delta method (in the interest of space I'll only share the most important snippets of code; you need the full scripts, which are in this repository, to make it run):

First we set up our problem using SALib nomenclature, generate 1000 samples using all factors (which will be our Set 1) and run the model for all 1000 samples. Finally we analyze our output using the Delta method. (This should take a couple minutes to run on your personal computer.)

# Imports needed for this snippet; fish_game() is defined in the full repository scripts
import numpy as np
from SALib.sample import latin
from SALib.analyze import delta

# Set up dictionary with system parameters
problem = {
  'num_vars': 9,
  'names': ['a', 'b', 'c', 'd', 'h',
            'K', 'm', 'sigmaX', 'sigmaY'],
  'bounds': [[0.002, 2],
             [0.005, 1],
             [0.2, 1],
             [0.05, 0.2],
             [0.001, 1],
             [100, 5000],
             [0.1, 1.5],
             [0.001, 0.01],
             [0.001, 0.01]]
}

# Default values at which unimportant factors will later be fixed
defaultvalues = np.array([0.005, 0.5, 0.5, 0.1, 0.1, 2000, 0.7, 0.004, 0.004])

# Generate samples with Latin Hypercube sampling
nsamples = 1000
X_Set1 = latin.sample(problem, nsamples) # This is Set 1

# Run model for all samples
output = [fish_game(*X_Set1[j,:]) for j in range(nsamples)]

# Perform the Delta Moment-Independent analysis
results = delta.analyze(problem, X_Set1, np.asarray(output), print_to_console=True)

This will produce output like that shown below, giving us the delta indices of each of the sampled parameters and their confidence intervals, as well as the first-order Sobol indices of the parameters and their respective confidence intervals.

Parameter delta delta_conf S1 S1_conf
a 0.102206 0.021648 0.052453 0.033510
b 0.139056 0.018379 0.065019 0.022922
c 0.090550 0.016505 0.006749 0.007823
d 0.076542 0.005375 0.003923 0.009140
h 0.097057 0.016910 0.021070 0.009275
K 0.267461 0.020434 0.190670 0.057397
m 0.252351 0.040149 0.315562 0.031664
sigmaX 0.076175 0.014001 0.005930 0.005333
sigmaY 0.075390 0.015346 0.004970 0.011557

Without further analysis, one simple way of determining whether a parameter is unimportant is to check whether the confidence interval of its index overlaps 0 (i.e., subtract delta_conf from delta). For our particular results, this doesn't happen for any of the delta values, though it does happen for some of the S1 values (c, d, sigmaY). You can refer to this post for a discussion of what this might mean.

Looking at the delta values, we can clearly see two factors coming out on top (K and m), followed by b, with a close behind it. The importance of the remaining parameters decreases in small decrements after that. So where should we draw the line of importance? Another simple way is to use a threshold (say, 0.1) as a cutoff value [3], but one could argue over including a and not h, given how close their indices are and the wider confidence interval of a (see also the appendix below on this).
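
As a quick illustration, both screening checks can be applied directly to the dictionary returned by delta.analyze(); this is just a sketch, and the 0.1 cutoff is the example threshold mentioned above.

# Screening checks on the results dictionary returned by delta.analyze()
delta_vals = np.array(results['delta'])
delta_confs = np.array(results['delta_conf'])

overlaps_zero = (delta_vals - delta_confs) <= 0   # does the confidence interval reach zero?
above_cutoff = delta_vals >= 0.1                  # simple threshold cutoff [3]

for name, oz, ac in zip(problem['names'], overlaps_zero, above_cutoff):
    print(name, '| CI overlaps zero:', oz, '| delta >= 0.1:', ac)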

But, let’s continue with our analysis. What I am doing below is the following. First, I sort the factors from most to least important based on my results for the delta indices. Then, I create my Sets 2 and 3 on which I’ll be iteratively replacing the values of important factors with either those from Set 1 or with defaults. Finally, I loop through all possible numbers of important factors (1 to 9), generate Sets 2 and 3, calculate outputs for all samples in each, and calculate their correlation with the outputs from Set 1. (This should take 20-30 minutes to run on your personal computer.)

# Sort factors from most to least important based on their delta indices
factors_sorted = np.argsort(results['delta'])[::-1]

# Set up array of default values to use for the experiment
X_defaults = np.tile(defaultvalues, (nsamples, 1))

# Create initial Sets 2 and 3
X_Set2 = np.copy(X_defaults)
X_Set3 = np.copy(X_Set1)

# Arrays to store the correlation coefficients for each number of important factors
coefficient_S1_S2 = np.zeros(len(factors_sorted))
coefficient_S1_S3 = np.zeros(len(factors_sorted))

for f in range(1, len(factors_sorted)+1):
    ntopfactors = f

    for i in range(ntopfactors): # Loop through the f most important factors
        X_Set2[:, factors_sorted[i]] = X_Set1[:, factors_sorted[i]] # Use sampled values for important factors
        X_Set3[:, factors_sorted[i]] = X_defaults[:, factors_sorted[i]] # Fix important factors to defaults

    # Run model for all samples
    output_Set2 = [fish_game(*X_Set2[j,:]) for j in range(nsamples)]
    output_Set3 = [fish_game(*X_Set3[j,:]) for j in range(nsamples)]

    # Calculate coefficients of correlation with the Set 1 outputs
    coefficient_S1_S2[f-1] = np.corrcoef(output, output_Set2)[0][1]
    coefficient_S1_S3[f-1] = np.corrcoef(output, output_Set3)[0][1]

I can also plot the outputs from each iteration, which should look something like this (this is animated to show all figures, in the interest of space):
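
In case it's useful, here is a minimal matplotlib sketch of how one such side-by-side comparison could be generated inside the loop above; the figure layout, file names, and axis labels are my own choices rather than those of the original scripts.

# Scatter the Set 2 and Set 3 outputs against the Set 1 outputs for this iteration
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(output, output_Set2, s=5)
ax1.set_xlabel('Set 1 output')
ax1.set_ylabel('Set 2 output (unimportant factors fixed)')
ax2.scatter(output, output_Set3, s=5)
ax2.set_xlabel('Set 1 output')
ax2.set_ylabel('Set 3 output (important factors fixed)')
fig.suptitle(str(f) + ' most important factor(s)')
plt.savefig('comparison_' + str(f) + '_factors.png')
plt.close(fig)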

The figures above tell us the following:
If we choose one important factor (K) and fix all other parameters, our outputs don't really capture the variability of outcomes produced when considering all nine (this is also a case against one-at-a-time analyses). The coefficient of correlation between Sets 1 and 2 is pretty low (0.44), suggesting we're still missing important parameters. We're doing a better job by actually fixing our most important parameter and varying all others (figure on the right, with R=0.763).
Adding the second most important factor (m) shifts things significantly in the right direction, increasing the correlation between Sets 1 and 2 and reducing the correlation between Sets 1 and 3 to R=0.203.
There is only a slight improvement with the addition of the third factor (b), but with the inclusion of the fourth (a), our reduced model is already looking very close to the full one, with R=0.94. Our counter model excluding these four factors (on the right) also has a very low coefficient of R=0.025.
One could consider this performance sufficient, with the model reduced to four parameters instead of nine. Adding parameter h and then c would further improve the values to a near-perfect match between Set 2 and Set 1, but this is where subjectivity takes over, depending on the cost of adding these variables and how much we care about fidelity in this case.
It is also clear that it is likely safe to fix the last three parameters, as in this case they don't have any consequential effects on our outcomes.

References:
[1] A. Saltelli et al., Global Sensitivity Analysis: The Primer. Wiley, 2008.
[2] T. H. Andres, "Sampling methods and sensitivity analysis for large parameter sets," Journal of Statistical Computation and Simulation, vol. 57, no. 1–4, pp. 77–110, Apr. 1997, doi: 10.1080/00949659708811804.
[3] Y. Tang, P. Reed, T. Wagener, and K. van Werkhoven, "Comparing sensitivity analysis methods to advance lumped watershed model identification and evaluation," Hydrology and Earth System Sciences, vol. 11, no. 2, pp. 793–817, Feb. 2007, doi: 10.5194/hess-11-793-2007.
[4] J. Nossent, P. Elsen, and W. Bauwens, "Sobol' sensitivity analysis of a complex environmental model," Environmental Modelling & Software, vol. 26, no. 12, pp. 1515–1525, Dec. 2011, doi: 10.1016/j.envsoft.2011.08.010.


Appendix:
Another way to identify a threshold of importance for classifying parameters is to add a dummy parameter to your model that does nothing. Re-performing the SA for this same system, now including the dummy, produces this:

Parameter delta delta_conf S1 S1_conf
a 0.105354 0.019236 0.040665 0.020949
b 0.144955 0.023576 0.050471 0.014810
c 0.075516 0.009578 0.003889 0.006113
d 0.081177 0.011604 0.004186 0.007235
h 0.101583 0.010008 0.032759 0.021343
K 0.261329 0.022876 0.174340 0.038246
m 0.258345 0.024750 0.325690 0.052234
sigmaX 0.071862 0.008620 0.001681 0.006720
sigmaY 0.077337 0.009344 0.003131 0.006918
dummy 0.072546 0.008313 0.004176 0.009567

Even though the dummy does absolutely nothing in our model, it was still given a non-zero delta index by the analysis (0.07). One could use this as the cutoff value of non-importance and choose to fix parameters c, sigmaX, and sigmaY.
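
For reference, a minimal sketch of how such a dummy factor can be appended to the SALib problem definition is shown below; the dummy's bounds are arbitrary, and its column is simply dropped before calling fish_game so that it truly does nothing.

# Extend the problem definition with a dummy factor that the model ignores
problem_dummy = {
    'num_vars': problem['num_vars'] + 1,
    'names': problem['names'] + ['dummy'],
    'bounds': problem['bounds'] + [[0, 1]]   # arbitrary bounds; the sampled value is never used
}

X_dummy = latin.sample(problem_dummy, nsamples)

# Run the model on the original nine factors only, discarding the dummy column
output_dummy = [fish_game(*X_dummy[j, :9]) for j in range(nsamples)]

# Re-run the Delta analysis, now including the dummy factor
results_dummy = delta.analyze(problem_dummy, X_dummy, np.asarray(output_dummy),
                              print_to_console=True)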

Accessing a Virtual Machine in Red Cloud

This blog post is an introduction to Red Cloud, a cloud computing service maintained by Cornell's Center for Advanced Computing (CAC). Red Cloud is a private research cloud and can only be accessed by those with a subscription, but exploratory accounts are available for free to Cornell students, faculty, and staff.

Subscriptions to cloud systems such as Red Cloud allow access to a variety of remote computing resources within seconds. Users can request instances, or virtual machines (VMs), in a variety of configurations ranging from CPUs to GPUs with varying amounts of RAM. In Red Cloud, users can access instances with up to 28 cores and 240 GB of RAM. In this post, I'll go through the very basic steps you need to access a VM through Red Cloud. These steps should generally apply to any cloud system that uses OpenStack as its cloud computing platform.

Step 1: Accessing OpenStack

OpenStack is a cloud operating system that will allow us to access the Red Cloud resources through a simple web interface. Log in with your CAC username and password (for the Reed Group: your credentials to access the Cube). This will lead you to an overview page that shows your usage of the resources.

[Figure: OpenStack Login]

Click on the Images tab. This shows the virtual machine images that are available for use. You can access images with Linux distributions such as CentOS (a -cuda suffix means the image supports GPUs) or Ubuntu. VMs usually come with very minimal software installed, so there are also various images with pre-loaded software such as MATLAB.

[Figure: OpenStack Overview Page]

[Figure: Available Images]

Step 2: Creating a Key Pair

A key pair needs to be set up before launching to allow secure access to your instance through SSH authentication. You can create a new key pair under the Key Pairs tab.

[Figure: Creating a Key Pair (Source: CAC)]

Give it a meaningful name and copy the private key to a text file. Change the extension to a .pem file and store it somewhere convenient on your computer.

Step 3: Creating a Security Group

A security group allows you to specify what internet traffic can come into (ingress) or go out of (egress) the instance. You can create your own, but for now we will use Red Cloud's default security group and add rules to it. Click on "Manage Rules".

[Figure: Overview of Security Groups]

You’ll see a variety of Ingress/Egress rules already in the group.

[Figure: Adding Rules to the Security Group]

However, if you're accessing a Linux-based VM, you will also need to allow access through SSH. Click on "Add Rule", choose "SSH" in the first drop-down menu, and then click "Add". The SSH rule will now be listed as one of your rules. There are many other options for rules, including restricting access to Cornell IP addresses only.

[Figure: Adding an SSH Rule]

Step 4: Launching an Instance

Now we have all the tools to launch an instance. Under the Compute and then Instances tab, you will have the option to launch an instance.

[Figure: Launching an Instance]

Under the Details tab, give your instance a name.

[Figure: Naming Your Instance]

Under the Source tab, choose your image. I'll go with the latest stable version of Ubuntu and then click the up arrow.

[Figure: Choosing an Image]

Then, choose your flavor. It's recommended to start with the lowest RAM (8 GB), especially if you are just starting to explore cloud computing. You can always scale up after your instance is launched if you need to.

[Figure: Choosing a Flavor]

Under the Security Group tab, check to see that the default security group is selected. Then choose your key pair under the Key Pair tab. Great, now we can launch the instance by clicking the blue “Launch Instance” button.

The instance will go through a variety of tasks until it stabilizes at “Running”.

[Figure: Instance Status]

Now we can SSH into our remote server using the IP address that is listed for the instance. I'll use MobaXterm to start a local terminal and navigate into the directory where I saved my private key. Use an SSH command of the form ssh -i yourkey.pem ubuntu@<instance IP> (for Ubuntu images the default username is typically ubuntu), inserting the IP address of your instance and the name of your key file.

[Figure: SSH-ing into the Ubuntu VM]

Now we're logged into our Ubuntu machine and we can interact with it using the command line. Enjoy!

[Figure: Ubuntu Image!]

Once you are done, make sure to either shelve your instance (if you want its disk contents to be preserved) or delete it. Even when idle, a running instance ties up resources that other users could be using, so you will still be billed.

[Figure: Shelving an Instance]

In the next tutorial, I’ll describe Docker and how you can use VMs to run containerized code.

Acknowledgements: Much of the information shared in this tutorial comes from many conversations with staff at CAC, particularly Peter Vaillancourt, and from the Red Cloud wiki.

 

EnGauge: R Code Repository for Environmental Gauge Data Acquisition, Processing, and Visualization

Introduction and Motivation

Gauge data is an essential component of water systems research projects; however, data acquisition, processing, and exploratory (spatio-temporal) data analysis often consume a large chunk of limited project research time. I developed the EnGauge GitHub repository to reduce the time required to download, process, and explore streamflow, water quality, and weather station gauge data that are hosted primarily on U.S. government servers. This repository compiles and modifies functions from other Packages for Hydrological Data Retrieval and Statistical Analysis, and develops new functions for processing and exploring the data.

Data Acquisition

Given a polygon shapefile of the region of interest and an optional radial buffer size, the types of gauge data downloaded can include:

  1. USGS streamflow from the NWIS portal
  2. EPA STORET, USGS, USDA and other water quality data via the water quality portal
  3. NOAA ACIS, GHCN weather station data

The USGS R package dataRetrieval and the NOAA rnoaa package contain the primary functions used for data acquisition. Additional references to learn about these packages are available in the EnGauge README file and at the provided web links.

Data Processing

Significant processing is required to use some of these gauge datasets for environmental modeling. The EnGauge repository has functions that may be used to address the following common data processing needs:

  1. Check for duplicate records
  2. Check for zeros and negative values
  3. Check detection limits
  4. Fill date gaps (add NAs to dates missing from timeseries)
  5. Aggregate to daily, monthly, and/or annual timeseries
  6. Project spatial data to a specified coordinate system
  7. Write processed data to shapefiles, .txt files, and lists that can be loaded into other software for further analysis and/or modeling.

Data Visualization and Exploratory Data Analysis – From GitHub Example

This example is applied to the Gwynns Falls watershed in the Baltimore Ecosystem Study Long Term Ecological Research site. The following figures are some of the output from the EnGauge USGSdataRetrieval.R script (as of commit 2fc84cd).

  1. Record lengths at each gauge
  2. Locations of sites with zero and/or negative values
  3. Locations of sites with different water quality information: total nitrogen and total phosphorus in this example
  4. Locations of sites with certain weather station data: maximum temperature in this example
  5. Visualizing quality codes on timeseries
  6. Summary exploratory spatial data analysis for sites
  7. Summary daily, monthly, and annual information
  8. Monthly heatmap
  9. Outlier visualization: currently implements a simplistic global spatio-temporal method defined by flows greater than a selected quantile. Plots offer qualitative support for the flows at other stations on the dates with high outliers at the reference station.
  10. DEM vs. gauge elevation: if you supply a DEM, the reported gauge elevation can be compared to the DEM elevation within the region of interest (ROI)
  11. Seasonal scatterplot with histograms: if you have two timeseries of different data types, e.g., streamflow and water quality, a scatterplot by season may be made (not in the example code, but a function is available in the repository)

Concluding Thoughts

This repository can be used to download gauge data from several sources, to apply standard data processing methods across those sources, and to explore the resulting data. Spend less time getting your data ready to do your research, and more time thinking about what your data are telling you and actually using them for modeling. Check out the EnGauge repository for your next research project!