Welcome to our blog!

Welcome to Water Programming! This blog is by Pat Reed’s group at Cornell, who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software: Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email dfg42 “at” cornell.edu. You’ll need a WordPress.com account.

Determining the appropriate number of samples for a sensitivity analysis

Sensitivity analysis aims at assessing the relative contributions of the different sources of uncertainty to the variability in the output of a model. Several mathematical techniques are available in the literature, with variance-based approaches being the most popular and their "total effects" indices the most widely used. The literature also provides several estimators/approximators for these indices (reviews can be found here [1]), which typically need N = n × (M + 2) model evaluations (and resulting outputs), where M is the number of uncertain inputs and n is a factor of proportionality selected ad hoc, depending on the complexity of the model (e.g., linear or not, monotonic or not, continuous or not). [Note: The literature typically refers to n as the number of samples and to N as the number of model evaluations, and this is how I'll be using them here as well.]

The generation of this set of N model runs is by far the most computationally demanding step in the calculation of variance-based indices, and a major limitation to their applicability, as n is typically on the order of thousands [2] (n typically > 1000). For example, consider a model with 5 uncertain inputs that takes 1 minute to run. To appropriately estimate sensitivity indices for such a model, we would potentially need about N = 7000 model evaluations, which would take almost 5 days on a personal computer, excluding the time for the estimator calculation itself (which is typically very short).
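To make that arithmetic concrete, here is a quick back-of-the-envelope calculation in Python (a hypothetical helper, not part of the analysis below):

# Rough cost of a variance-based sensitivity analysis: N = n * (M + 2) model runs
def total_evaluations(n, M):
    return n * (M + 2)

N = total_evaluations(n=1000, M=5)  # 7000 model evaluations
print(N, 'runs =', N * 1 / 60 / 24, 'days at 1 minute per run')  # ~4.9 days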

The aim is therefore to pick the minimum n that ensures our index calculation is reliable. Unfortunately, there is no hard and fast rule for doing that, but this post tries to illuminate the question a little and provide some guidance. Following the work presented in [3] and elsewhere, the idea is to estimate the sensitivity indices repeatedly, with increasing n, until the index values converge.

I will be doing this using a fishery model, which is a nonlinear and nonmonotonic model with 9 parameters. Based on previous results suggesting that 3 of these parameters are largely inconsequential to the variability in the output, I’ll be fixing them to their nominal values. I’ll be using SALib to perform the analysis. My full script can be found here, but I’ll only be posting the most important snippets of code in this post.

First, I set up my SALib ‘problem’ and create arrays to store my results:

import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol
# fish_game() is the model being evaluated; it is defined in the full script linked above

# Set up dictionary with system parameters
problem = {
  'num_vars': 6,
  'names': ['a', 'b', 'c','h',
            'K','m'],
  'bounds': [[ 0.002, 2],
             [0.005, 1],
             [0.2, 1],
             [0.001, 1],
             [100, 5000],
             [0.1, 1.5]]
}

# Array with n's to use
nsamples = np.arange(50, 4050, 50)

# Arrays to store the index estimates
S1_estimates = np.zeros([problem['num_vars'],len(nsamples)])
ST_estimates = np.zeros([problem['num_vars'],len(nsamples)])

I then loop through all possible n values and perform the sensitivity analysis:

# Loop through all n values, create sample, evaluate model and estimate S1 & ST
for i in range(len(nsamples)):
    print('n = ' + str(nsamples[i]))
    # Generate samples
    sampleset = saltelli.sample(problem, nsamples[i], calc_second_order=False)
    # Run model for all samples
    output = [fish_game(*sampleset[j, :]) for j in range(len(sampleset))]
    # Perform analysis
    results = sobol.analyze(problem, np.asarray(output), calc_second_order=False,
                            print_to_console=False)
    # Store estimates
    ST_estimates[:, i] = results['ST']
    S1_estimates[:, i] = results['S1']

I can then plot the evolution of these estimates as n increases:
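Below is a minimal sketch of how such convergence plots could be drawn with matplotlib (the styling of the actual figures differs; this is just one possible way):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4), sharex=True)
for p in range(problem['num_vars']):
    ax1.plot(nsamples, S1_estimates[p, :], label=problem['names'][p])
    ax2.plot(nsamples, ST_estimates[p, :], label=problem['names'][p])
ax1.set(xlabel='n', ylabel='S1', title='First order indices')
ax2.set(xlabel='n', ylabel='ST', title='Total order indices')
ax1.legend()
plt.savefig('convergence.png')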

Evolution of first order and total order indices with increasing number of samples (n)

What these figures tell us is that choosing an n below 1000 for this model would potentially misestimate our indices, especially the first order ones (S1). As n increases, the estimates become less noisy and converge to their final values. For more complex models, say with more interactive effects, the minimum n before convergence could be a lot higher. A similar experiment by Nossent et al. [3] found that convergence was reached only after n = 12,000.
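One rough way to quantify that convergence (my own simple check, not a formal criterion) is to look at how much the estimates still change between successive values of n:

# Maximum absolute change in the index estimates between successive n values,
# taken across all parameters; small values suggest the estimates have stabilized
S1_step_change = np.max(np.abs(np.diff(S1_estimates, axis=1)), axis=0)
ST_step_change = np.max(np.abs(np.diff(ST_estimates, axis=1)), axis=0)
print(ST_step_change[-5:])  # changes over the last few increments of n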

One observation here is that the values of the total order indices (ST) are higher than those of the first order indices (S1), which makes sense, as ST includes the first order effects (captured by S1) plus the effects of each factor's interactions with the other factors. Another observation is that the parameters with the most significant effects (m and K) converge much faster than parameters with less impact on the output (a and b). This was also observed by Nossent et al. [3].

Finally, sensitivity analysis is often performed for the purposes of factor prioritization, i.e., determining (and often rank-ordering) the most important parameters, for example to decide where to devote most of the calibration effort or which parameter uncertainties warrant further analysis. The figures below show the evolution of that rank-ordering as we increase n.
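The rankings themselves can be computed directly from the stored estimates; here is a sketch (rank 1 being the most important factor):

# Rank parameters at each n: rank 1 = largest total order index
ST_ranks = np.zeros_like(ST_estimates, dtype=int)
for i in range(len(nsamples)):
    order = np.argsort(-ST_estimates[:, i])  # positions from largest to smallest index
    ST_ranks[order, i] = np.arange(1, problem['num_vars'] + 1)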

Evolution of parameter ranking based on first order and total order indices with increasing number of samples (n)

These figures show that with too small a number of samples, we could misclassify a factor as important when it is not, or as unimportant when it actually matters.

Now, one might ask: how is this useful? I’m trying to minimize my n, but you’ve just made me run way too many model evaluations, multiple times, just to determine how I could have done it faster? Isn’t that backwards?

Well, yes and no. I've devoted this time to running this bigger experiment to gain insight into the behavior of my model, and I have established confidence in my index values and factor prioritization. Further, I now know that n > 1500 would probably be unnecessary for this system, even if the model itself or my parameter ranges change somewhat. As long as the parameter interactions and overall model complexity remain broadly the same, I can leverage this information in future sensitivity analyses, with a known minimum n.

References:
[1] A. Saltelli, P. Annoni, I. Azzini, F. Campolongo, M. Ratto, and S. Tarantola, “Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index,” Computer Physics Communications, vol. 181, no. 2, pp. 259–270, Feb. 2010, doi: 10.1016/j.cpc.2009.09.018.
[2] F. Pianosi and T. Wagener, “A simple and efficient method for global sensitivity analysis based on cumulative distribution functions,” Environmental Modelling & Software, vol. 67, pp. 1–11, May 2015, doi: 10.1016/j.envsoft.2015.01.004.
[3] J. Nossent, P. Elsen, and W. Bauwens, “Sobol’ sensitivity analysis of a complex environmental model,” Environmental Modelling & Software, vol. 26, no. 12, pp. 1515–1525, Dec. 2011, doi: 10.1016/j.envsoft.2011.08.010.

Parallel processing with R on Windows

Parallel programming can save you a lot of time when you are processing large amounts of data. Modern computers have multiple processors, cores, and hyper-threading, and R can exploit these resources to run multiple computations simultaneously. There is some debate about when it is worth parallelizing, because run time does not scale linearly with the number of processors and cores used. In this blog post, I use two R packages that enable parallelization for a basic case in which each computation is standalone, so the cores being used in parallel do not need to communicate with each other.

install.packages("parallel")
install.packages("doParallel")
library(doParallel)
library(parallel)

If you press Ctrl+Shift+Esc on your keyboard and click on the Performance tab in the Task Manager window, you will see how many logical processors (the combination of physical cores and hyper-threading) are available on your local Windows machine and can be used simultaneously for your analysis. We can also detect this number with the following command:

no_cores <- detectCores(logical = TRUE)  # number of logical processors (hardware threads); with logical = FALSE it returns the number of physical cores

Now we need to create a cluster with a given number of workers and register it so R can use it. If you give R all of the available cores, you may have trouble doing anything else on your machine, so it is better to leave at least one core free.

cl <- makeCluster(no_cores-1)  
registerDoParallel(cl)  

First, we are going to create a list of files that we want to analyze. You can download the example dataset here.

all_samples <- as.data.frame(list.files("your directory/R_Parallel_example/"))
seq_id_all <- seq_len(nrow(all_samples))

Then we will create the function that we will use to process our data. Each file in this set contains daily values of several variables over 31 years. Columns 1, 2, and 3 are year, month, and day, respectively. I am going to extract the yield value for each year from the column "OUT_CROP_BIOMYELD" and calculate the average yield over the entire period. All the libraries that you are going to use for your data processing should be loaded inside the function. I am going to use the "data.table" library to read my data into R efficiently. At the end of the function, I put the two outputs that I am interested in ("annual_yield" and "average_yield") into one list to return from the function.

myfunction <- function(i) {
  library(data.table)
  setwd("your directory/R_Parallel_example/")
  sample <- fread(paste(all_samples[i,]))
  # OUT_CROP_BIOMYELD is zero except when the yield is reported, which is a value above zero
  annual_yield <- subset(sample, sample$OUT_CROP_BIOMYELD > 0)
  # Extract part of the file name and use it as an identifier for this dataset
  annual_yield$No <- as.numeric(gsub("_47.65625_-117.96875", "", all_samples[i,]))
  annual_yield <- annual_yield[, c(1, 13, 17)]  # keep just the "Year", "Yield" and "No" columns
  colnames(annual_yield) <- c("Year", "Yield", "No")
  average_yield <- colMeans(annual_yield[, c("Yield", "No")])  # average yield for this dataset
  return(list(annual_yield, average_yield))
}

Now we need to export our function to the cluster. Because the function uses the "all_samples" data frame, which was created outside the function, that object also needs to be exported to the cluster:

clusterExport(cl,list('myfunction','all_samples'))

With the command below, we run the function across the number of cores that we specified earlier, and wrapping it in "system.time" prints the processing time at the end:

system.time(
  results <- parLapply(cl, seq_id_all, fun = myfunction)
)

The function outputs are saved in the list "results", from which we can extract what we need:

k_1 <- list()
k_2 <- list()
for (k in 1:nrow(all_samples)) {
  k_1[[k]] <- results[[k]][[1]]
  k_2[[k]] <- results[[k]][[2]]
}
annual_yield <- data.table::rbindlist(k_1)
period_yield <- data.table::rbindlist(k_2)

Using Docker with Virtual Machines

This post is a continuation of my previous post on accessing virtual machines on Red Cloud. In this post, you will learn how to run a Docker container on an Ubuntu VM image, but you can also follow this tutorial with Ubuntu on a local machine.

Why Containerize Code?

Let's use an example: say you have three Python applications that you want to run on your computer, but they each require a different version of Python or of its packages. Managing multiple Python versions and conflicting dependencies on the same machine quickly gets messy, so what do we do? Consider another case: you want to share neural network code with a collaborator, but you don't want to deal with how difficult it is for them to get TensorFlow and all of its dependencies installed on their machine. Containerization allows us to isolate the installation and running of our application without having to worry about the setup of the host machine. In other words, everything someone needs to run your code is provided in the container. As shown in Figure 1, each container shares the host OS's kernel but has a virtual copy of some of the components of the OS. This allows the containers to be isolated from each other while running on the same host. As a result, containers are exceptionally light and take only seconds to start [1].


Fig 1. Docker Container Example [1]

Docker

Docker started as a project to build single-application LXC (Linux) containers that were more portable and flexible, but it is now its own container runtime environment. It is written in Go and was developed by dotCloud (a PaaS company). The Docker engine is used to build and manage a Docker image, which is just a template that contains the application and all of the dependencies required for it to run. A Docker container is a running instance of a Docker image. The Docker engine is composed of three main parts: a server known as the Docker daemon, a REST API, and a client. The Docker daemon creates and manages images and containers, the REST API links the server and applications, and the client (user) interacts with the Docker daemon through the command line (Figure 2).


Fig.2 Docker Engine [2]

 

Running a Docker Container on Ubuntu

  1. We will be setting up and running a Docker container that contains code to train a rainfall-runoff LSTM model built in Python using TensorFlow. The GitHub repo with the LSTM code and data can be downloaded here.
  2. Spin up a VM instance (Ubuntu 18.04) or use your own machine if you have a partition.
  3. I use MobaXterm to SSH into the VM instance and drag the HEC_HMS_LSTM folder into my home directory.
  4. Within the directory, you will see 2 additional files that are not part of the main neural network code: requirements.txt and jupyter.dockerfile. A requirements.txt file lists only the packages that are necessary to run your application. You can create this file with just two commands: pip install pipreqs and then pipreqs /path/to/project. The created file will look like this:


requirements.txt file

The jupyter.dockerfile is our Dockerfile. It is a text file that contains all the commands a user could call on the command line to assemble an image.


Dockerfile

Here in our Dockerfile, we are using the latest Ubuntu image and setting our working directory to the HEC_HMS_LSTM folder that holds the neural network code, Hourly_LSTM.py, that we would like to execute. We start by updating the package lists and then install Python, pip, and Jupyter. Then we copy our working directory contents into the container: first the requirements.txt file, and then the whole HEC_HMS_LSTM folder. We then use pip to install all of the packages listed in requirements.txt. Finally, we instruct the Docker daemon to run the Python script, Hourly_LSTM.py.

5. Next we have to install Docker on Ubuntu using the command line. This is the only thing that anyone will have to install to run your container. The steps to install Docker for Ubuntu through the command line are laid out at this link. You may be asked to allow automatic restarts during the installation, and you should choose this option. When Docker is done installing, you can check that the installation was successful by confirming that the service is active.


Successful Docker Installation

6. Now that Docker is downloaded, we can build our Docker image and then run the container. We build the image using: 

sudo docker build -f jupyter.dockerfile -t jupyter:latest .

In this command, the -f flag gives the name of the dockerfile and -t is the name that we would like to tag our image with. If you're running multiple containers, tagging is helpful for distinguishing between them. The build will go through each step of the dockerfile. Be mindful of how much disk space you have to store the Docker images, the Python libraries, and the data you will generate; you may have to go back and resize your VM instance. We can then run our image using:

sudo docker run jupyter:latest

You’ll see the neural network training and the results of the prediction.


Successfully training the neural network in a Docker Container on a virtual machine

 

Credits:

[1] https://www.freecodecamp.org/news/docker-simplified-96639a35ff36/

[2] https://docs.docker.com

 

 

 

 

Make your Git repository interactive with Binder

Have you ever tried to demo a piece of software you wrote, only to have the majority of participants get stuck trying to configure their computational environment? Difficulty replicating computational environments can prevent effective demonstration or distribution of even simple codes. Luckily, new tools are emerging that automate this process for us. This post will focus on Binder, a tool for creating custom computing environments that can be distributed and used by many remote users simultaneously. Binder is a language-agnostic tool and can be used to create custom environments for R, Python, and Julia. Binder is powered by BinderHub, an open source service in the cloud. At the bottom of this post, I'll provide an example of an interactive Python Jupyter Notebook that I created using BinderHub.

BinderHub

BinderHub combines two useful libraries: repo2docker and JupyterHub. repo2docker is a tool to build, run, and push Docker images from source code repositories. This allows you to create copies of custom environments that users can replicate on any machine. These copies can be stored and distributed along with the remote repository. JupyterHub is a scalable system that can spawn multiple Jupyter Notebook servers. JupyterHub takes the Docker image created by repo2docker and uses it to spawn a Jupyter Notebook server in the cloud. This server can be accessed and run by multiple users at once. By combining repo2docker and JupyterHub, BinderHub allows users to both replicate complex environments and easily distribute code to large numbers of users.

Creating your own BinderHub deployment

Creating your own BinderHub deployment is incredibly easy. To start, you need a remote repository containing two things: (1) a Jupyter notebook with supporting code and (2) configuration files for your environment. Configuration files can either be an environment.yml file (a standard configuration file that can be generated with conda, see example here) or a requirements.txt file (a simple text file that lists dependencies, see example here).

To create an interactive BinderHub deployment:

  1. Push your code to a remote repository (for example, GitHub)
  2. Go to mybinder.org and paste the repository's URL into the dialog box (make sure to select the proper hosting service)
  3. Specify the branch if you are not on the master branch
  4. Click "Launch"

The website will generate a URL that you can copy and share with users. I’ve created an example for our Rhodium tutorial, which you can find here:

https://mybinder.org/v2/gh/dgoldri25/Rhodium/master?filepath=DMDU_Rhodium_Demo.ipynb

To run the interactive Jupyter Notebook, click on the file titled “Rhodium_Demo.ipynb”. Happy sharing!

Sensitivity Analysis Tools

Sensitivity analysis (SA) is one of the main themes of the Water Programming Blog. There are several decent blog posts that go over theoretical aspects of sensitivity analysis (for example, here, here, and here). Also, many blog posts explain how to efficiently and elegantly visualize sensitivity analysis results (for example, here and here). In addition, there are many blog posts related to SALib, a widely used Python library developed at Cornell University by former members of Dr. Reed's research group (for example, here, here, and here).

Recently, I have been trying to put together a comprehensive list of other SA tools, and I thought it might be useful to write a blog post on this topic. I organized the following list based on the platforms I have explored so far, including MATLAB, Python, and R. After that, I will introduce some other open-source and commercialized SA tools.

MATLAB

Many MATLAB packages have been developed to perform sensitivity analysis and uncertainty quantification. As the following table shows, they have been created by a variety of universities and research institutes, and several of them cover different sensitivity analysis methods, such as regression-based SA, variance-based SA (e.g., Sobol), and derivative-based SA. All of them support at least two sampling techniques, such as Latin hypercube sampling. Many of them are generic (discipline-free) and can be used to answer different types of questions; however, a few of them (e.g., PeTTSy and DyGloSA) have been tailored to specific applications, such as biological models. Almost all of them include some post-processing and visualization components.

There are two toolboxes that work in platforms other than MATLAB. The SAFE package developed by Pianosi et al. (2015) has R and Python versions, and the SaSAT package developed at the University of New South Wales works in Microsoft Excel.

| Abbreviation | Full Name | Example of Methods Supported | Institution |
| --- | --- | --- | --- |
| GSAT | Global Sensitivity Analysis Toolbox | Sobol and FAST | MATLAB |
| SAFE | Sensitivity Analysis For Everybody | EET (or Morris method), RSA, Sobol, FAST, and PAWN | University of Bristol |
| GSUA | Global Sensitivity and Uncertainty Analysis Toolbox | Sobol | MATLAB |
| GUI-HDMR | Global Sensitivity Analysis Toolbox | Global sensitivity analysis using HDMR | University of Leeds |
| DyGloSA | Dynamical Global Sensitivity Analysis Toolbox | Dynamical global parameter sensitivity analysis (GPSA) of ODE models | University of Luxembourg |
| PeTTSy | Perturbation Theory Toolbox for Systems | Perturbation analysis of complex systems biology models | University of Cambridge |
| SaSAT | Sampling and Sensitivity Analysis Tools | Regression-based (Pearson, Spearman, and partial rank correlation coefficients) | The University of New South Wales |
| SensSB | Sensitivity Analysis in Systems Biology models | Local SA, derivative- and variance-based global sensitivity analysis | Process Engineering Group at IIM-CSIC (Vigo, Spain) |
| SobolGSA | Global Sensitivity Analysis and Metamodeling Software | Morris, Sobol, FAST, and derivative-based | Imperial College London |
| SUMO | SUrrogate Modeling Toolbox | Surrogate models, sensitivity analysis | Ghent University |
| UQLab | The Framework for Uncertainty Quantification | Morris, Kucherenko, ANCOVA, Borgonovo, Sobol | ETH Zurich |

FAST: Fourier Amplitude Sensitivity Testing
EET: Elementary Effects Test
RSA: Regional Sensitivity Analysis

Python

Interestingly, I was not able to find many Python libraries, and most of the ones that I did find were developed for specific applications. Please leave a comment if you are aware of any other packages that are not listed here. Among these packages, SALib seems to be the one that covers the most SA and sampling methods. Two of the SA and QU packages have C++ versions (OpenTURNS and UQTk). Also, uncertainpy was originally developed for neuroscience applications.

| Abbreviation | Description | Example of Methods Supported | Institution |
| --- | --- | --- | --- |
| SALib | Python sensitivity analysis library | Sobol, Morris, FAST, RBD-FAST, Delta Moment-Independent Measure, derivative-based, factorial | Cornell University |
| uncertainpy | Uncertainty quantification and sensitivity analysis library | Sobol | University of Oslo |
| MATK | Model Analysis ToolKit | FAST, Sobol | Los Alamos National Laboratory |
| UQTk | Quantification of uncertainty in numerical models | Sobol | Sandia National Lab |
| OpenTURNS | Open source initiative for the Treatment of Uncertainties | Spearman correlation coefficients, Sobol, ANCOVA, UQ | Technical University of Denmark |
| varsens | Variance Based Sensitivity Analysis | Sobol | Vanderbilt University |

FAST: Fourier Amplitude Sensitivity Testing
QU: Quantification of Uncertainty

R

I was able to find about fifty R packages that have sensitivity analysis features. The following table lists the ones that have the most comprehensive SA functionalities. It seems that the rest of them were developed for specific areas of science and have limited SA functionality. I list some of these here (RMut, pksensi, ivmodel, FME, episensr, pse).

Based on what I found, the sensitivity package seems to cover the widest range of SA methods. Readers can refer to this blog post for more information about the sensitivity package.

| Name | Example of Methods Supported |
| --- | --- |
| sensobol | Third-order Sobol |
| sensitivity | Sobol, Morris, FAST, RBD-FAST, DELSA, derivative-based, factorial |
| ODEsensitivity | Morris, Sobol |
| multisensi | SA on models with multivariate outputs |
| konfound | Robustness and sensitivity of empirical models |
| fast | FAST |
| BASS | Sobol |

FAST: Fourier Amplitude Sensitivity Testing

Other Platforms

There are many other SA tools that have been developed on other platforms, and the following table lists only a few of them. There are also several commercial SA platforms, such as SDI, VISYOND, and SMARTUQ, that seem to have nice graphical user interfaces (GUIs); however, because they are not free and their source code is not available, they may have limited applications in academic research.

| Abbreviation | Main Applications | Programming Language | Institution |
| --- | --- | --- | --- |
| Dakota | Optimization, QU, SA (Sobol, FAST, Morris) | C++ | Sandia National Laboratory |
| PSUADE | QU, Spearman and Pearson correlation coefficients, Sobol, Morris, FAST | C++ | Lawrence Livermore National Laboratory |
| SIMLab | Sobol, FAST, Morris | GUI-based | The European Commission's science and knowledge service |

QU: Quantification of Uncertainty
FAST: Fourier Amplitude Sensitivity Testing

Please leave a comment and let me know if you are aware of any other useful tools that I did not list here.

Factor prioritization and factor fixing: how to know what’s important

There have been several blog posts on sensitivity analysis (SA) on this blog, focusing primarily on tools to perform it (e.g., SALib) and on visualizing its outputs. Today I'll be providing some more information on how to decide which factors are most important in affecting our output and which are largely inconsequential. Picking what is actually important for what we care about is obviously largely subjective and case-dependent, but this post is meant to provide some support for that exercise. I will be performing a global sensitivity analysis of a system, resulting in a rank-ordering of the most important factors driving variability in the output (i.e., factor prioritization), which can then be used to decide which of the least influential factors can be fixed to simplify the model (i.e., factor fixing) [1].

The scripts I’ll be using can be found here, and I’ll be using a fishery model to demonstrate, as a simplified representation of a socio-ecological system we’re trying to manage. The procedure I’ll be following has been based on the work found in [2-4].

The idea is this:
I generate 1000 samples of the uncertain factors that might be driving variability in my outcome (let's call this Set 1). I apply a certain SA method to the samples and the outcomes and get sensitivity indices for each of my factors, ranking them from most to least important. Where do I draw the line between important and not important?
We can create a Set 2, using only the T most important factors from our Set 1 sample and fixing all other factors to their default values.
We can also create a Set 3, now fixing the T most important factors to their defaults and using the sampled values of all other factors from Set 1.

If we classified our important and unimportant factors correctly, then the correlation coefficient between the model outputs of Set 2 and Set 1 should approximate 1 (since we’re fixing all factors that don’t matter), and the correlation coefficient between outputs from Set 3 and Set 1 should approximate 0 (since the factors we sampled are inconsequential to the output).

Here's how it's done using SALib and the Delta method (in the interest of space I'll only share the most important snippets of code; you need the full scripts, found in this repository, to make it run):

First we set up our problem using SALib nomenclature, generate 1000 samples using all factors (this will be our Set 1), and run the model for all 1000 samples. Finally, we analyze the output using the Delta method. (This should take a couple of minutes to run on your personal computer.)

import numpy as np
from SALib.sample import latin
from SALib.analyze import delta
# fish_game() is the model being evaluated; it is defined in the full scripts linked above

# Set up dictionary with system parameters
problem = {
  'num_vars': 9,
  'names': ['a', 'b', 'c', 'd','h',
            'K','m','sigmaX','sigmaY'],
  'bounds': [[ 0.002, 2],
             [0.005, 1],
             [0.2, 1],
             [0.05, 0.2],
             [0.001, 1],
             [100, 5000],
             [0.1, 1.5],
             [0.001, 0.01],
             [0.001, 0.01]]
}

defaultvalues = np.array([0.005, 0.5, 0.5, 0.1, 0.1, 2000, 0.7, 0.004, 0.004])

# Generate samples
nsamples = 1000
X_Set1 = latin.sample(problem, nsamples) # This is Set 1

# Run model for all samples
output = [fish_game(*X_Set1[j,:]) for j in range(nsamples)]

# Perform analysis
results = delta.analyze(problem, X_Set1, np.asarray(output), print_to_console=True)

This will produce output like that shown below, telling us the Delta indices of each of the sampled parameters, their confidence intervals, the first order Sobol indices of the parameters, and their equivalent confidence intervals.

Parameter delta delta_conf S1 S1_conf
a 0.102206 0.021648 0.052453 0.033510
b 0.139056 0.018379 0.065019 0.022922
c 0.090550 0.016505 0.006749 0.007823
d 0.076542 0.005375 0.003923 0.009140
h 0.097057 0.016910 0.021070 0.009275
K 0.267461 0.020434 0.190670 0.057397
m 0.252351 0.040149 0.315562 0.031664
sigmaX 0.076175 0.014001 0.005930 0.005333
sigmaY 0.075390 0.015346 0.004970 0.011557

Without further analysis, one simple way of determining whether a parameter is unimportant is to check whether the confidence interval of its index overlaps 0 (i.e., whether delta minus delta_conf drops below zero). For our particular results, this doesn't happen for any of our delta values, though it does for some of the S1 values (c, d, sigmaY). You can refer to this post for discussion of what this might mean.

Looking at the delta values, we can clearly see two factors coming out on top (K and m), followed by b, with a close behind it. The importance of the remaining parameters drops off in small decrements after that. So where should we draw the line of importance? Another simple way is to use a threshold (say, 0.1) as a cutoff value [3], but one could argue over including a and not h, given how close their indices are and the wider confidence interval of a (see also the appendix below on this).
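Both of those quick screening checks can be applied directly to the SALib output; here is a sketch (the 0.1 threshold is the ad hoc choice discussed above):

# Factors whose delta confidence interval does not overlap zero
nonzero = results['delta'] - results['delta_conf'] > 0
# Factors exceeding an ad hoc importance threshold
important = results['delta'] >= 0.1
for name, nz, imp in zip(problem['names'], nonzero, important):
    print(name, '| CI excludes 0:', nz, '| delta >= 0.1:', imp)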

But let's continue with our analysis. What I am doing below is the following. First, I sort the factors from most to least important based on my results for the delta indices. Then I create my Sets 2 and 3, in which I'll be iteratively replacing the values of important factors with either those from Set 1 or with the defaults. Finally, I loop through all possible numbers of important factors (1 to 9), generate Sets 2 and 3 for each, calculate outputs for all samples in each, and calculate their correlation with the outputs from Set 1. (This should take 20-30 minutes to run on your personal computer.)

# Sort factors by importance
factors_sorted = np.argsort(results['delta'])[::-1]

# Set up DataFrame of default values to use for experiment
X_defaults = np.tile(defaultvalues,(nsamples, 1))

# Create initial Sets 2 and 3
X_Set2 = np.copy(X_defaults)
X_Set3 = np.copy(X_Set1)

for f in range(1, len(factors_sorted)+1):
    ntopfactors = f
    
    for i in range(ntopfactors): #Loop through all important factors
        X_Set2[:,factors_sorted[i]] = X_Set1[:,factors_sorted[i]] # Use sampled values for the important factors
        X_Set3[:,factors_sorted[i]] = X_defaults[:,factors_sorted[i]] # Fix the important factors to their defaults
    
    # Run model for all samples    
    output_Set2 = [fish_game(*X_Set2[j,:]) for j in range(nsamples)]
    output_Set3 = [fish_game(*X_Set3[j,:]) for j in range(nsamples)]
    
    # Calculate coefficients of correlation
    coefficient_S1_S2 = np.corrcoef(output,output_Set2)[0][1]
    coefficient_S1_S3 = np.corrcoef(output,output_Set3)[0][1]

I can also plot the outputs from each iteration, which should look something like this (this is animated to show all figures, in the interest of space):
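For reference, here is a sketch of how each pair of scatter plots could be generated inside the loop above, assuming matplotlib (the actual animated figure was produced with a more polished version of this):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(output, output_Set2, s=2)
ax1.set(xlabel='Set 1 output', ylabel='Set 2 output', title='R = {:.3f}'.format(coefficient_S1_S2))
ax2.scatter(output, output_Set3, s=2)
ax2.set(xlabel='Set 1 output', ylabel='Set 3 output', title='R = {:.3f}'.format(coefficient_S1_S3))
plt.savefig('comparison_' + str(f) + '_factors.png')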

The figures above tell us the following:
- If we choose one important factor (K) and fix all other parameters, our outputs don't really capture the variability of outcomes produced when considering all nine (this is also a case against one-at-a-time type analyses). The coefficient of correlation between Sets 1 and 2 is pretty low (0.44), suggesting we're still missing important parameters. We actually do a better job reproducing Set 1 by fixing our most important parameter and varying all the others (figure on the right, with R=0.763).
- Adding the second most important factor (m) shifts things significantly in the right direction, increasing our coefficient on the left and reducing the one on the right to R=0.203.
- There is only a slight improvement with the addition of the third factor (b), but with the inclusion of the fourth (a), our reduced model is already looking very close to the full one, with R=0.94. Our counter model excluding these four factors (on the right) also has a very low coefficient of R=0.025.
- One could consider this performance sufficient, with the model reduced to four parameters instead of nine. Further adding parameter h and then c would improve the values to a near perfect match between Set 2 and Set 1, but this is where subjectivity takes over, depending on the cost of adding these variables and how much we care about fidelity in this case.
- It is also clear that it is likely safe to fix the last three parameters, as they don't have any consequential effects on our outcomes.

References:
[1] A. Saltelli, M. Ratto, T. Andres, F. Campolongo, J. Cariboni, D. Gatelli, M. Saisana, and S. Tarantola, Global Sensitivity Analysis: The Primer. Wiley, 2008.
[2] T. H. Andres, “Sampling methods and sensitivity analysis for large parameter sets,” Journal of Statistical Computation and Simulation, vol. 57, no. 1–4, pp. 77–110, Apr. 1997, doi: 10.1080/00949659708811804.
[3] Y. Tang, P. Reed, T. Wagener, and K. van Werkhoven, “Comparing sensitivity analysis methods to advance lumped watershed model identification and evaluation,” Hydrology and Earth System Sciences, vol. 11, no. 2, pp. 793–817, Feb. 2007, doi: 10.5194/hess-11-793-2007.
[4] J. Nossent, P. Elsen, and W. Bauwens, “Sobol’ sensitivity analysis of a complex environmental model,” Environmental Modelling & Software, vol. 26, no. 12, pp. 1515–1525, Dec. 2011, doi: 10.1016/j.envsoft.2011.08.010.


Appendix:
Another way to identify a threshold of importance for classifying parameters is to add a dummy parameter to your model that does nothing. Re-performing my SA for this same system, now including the dummy, produces this:

Parameter delta delta_conf S1 S1_conf
a 0.105354 0.019236 0.040665 0.020949
b 0.144955 0.023576 0.050471 0.014810
c 0.075516 0.009578 0.003889 0.006113
d 0.081177 0.011604 0.004186 0.007235
h 0.101583 0.010008 0.032759 0.021343
K 0.261329 0.022876 0.174340 0.038246
m 0.258345 0.024750 0.325690 0.052234
sigmaX 0.071862 0.008620 0.001681 0.006720
sigmaY 0.077337 0.009344 0.003131 0.006918
dummy 0.072546 0.008313 0.004176 0.009567

Even though the dummy does absolutely nothing in our model, the analysis still assigned it a non-zero delta index (0.07). One could use this as the cutoff value for non-importance and choose to fix parameters c, sigmaX, and sigmaY.
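For completeness, here is a sketch of how the dummy factor can be appended to the SALib problem defined earlier; its bounds are arbitrary (I assume [0, 1] here) and the model never sees its column:

# Re-run the Delta analysis with an extra 'dummy' factor that the model ignores
problem_dummy = {
  'num_vars': 10,
  'names': ['a', 'b', 'c', 'd', 'h', 'K', 'm', 'sigmaX', 'sigmaY', 'dummy'],
  'bounds': problem['bounds'] + [[0, 1]]  # arbitrary bounds for the dummy
}

X_dummy = latin.sample(problem_dummy, nsamples)
# Only the first nine columns are passed to the model; the dummy column is never used
output_dummy = [fish_game(*X_dummy[j, :9]) for j in range(nsamples)]
results_dummy = delta.analyze(problem_dummy, X_dummy, np.asarray(output_dummy), print_to_console=True)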

Accessing a Virtual Machine in Red Cloud

This blog post is an introduction to Red Cloud, a cloud computing service maintained by Cornell's Center for Advanced Computing (CAC). Red Cloud is a private research cloud and can only be accessed by those with a subscription, but exploratory accounts are available for free to Cornell students, faculty, and staff.

Subscriptions to cloud systems such as Red Cloud allow access to a variety of remote computing resources within seconds. Users can request instances, or virtual machines (VMs), in a variety of configurations, ranging from CPUs to GPUs with varying amounts of RAM. In Red Cloud, users can access instances with up to 28 cores and 240 GB of RAM. In this post, I'll go through the very basic steps you need to access a VM through Red Cloud. These steps should generally apply to any cloud system that uses OpenStack as its cloud computing platform.

Step 1: Accessing OpenStack

OpenStack is a cloud operating system that allows us to access the Red Cloud resources through a simple web interface. Log in with your CAC username and password (for the Reed Group: your credentials for accessing the Cube). This will lead you to an overview page that shows your usage of the resources.


OpenStack Login

Click on the Images tab. This shows the virtual machine images that are available for use. You can access machines with Linux distributions such as CentOS (a "-cuda" suffix means the image can support GPUs) or Ubuntu. VMs usually come with very minimal software installed, so there are also various images with pre-loaded software such as MATLAB.


OpenStack Overview Page


Available Images

Step 2: Creating a Key Pair

A key pair needs to be set up before launching to allow secure access to your instance through SSH authentication. You can create a new key pair under the Key Pairs tab.


Creating a Key Pair (Source: CAC)

Give it a meaningful name and copy the private key to a text file. Change the extension to a .pem file and store it somewhere convenient on your computer.

Step 3: Creating a Security Group

A security group allows you to specify what internet traffic can come into (ingress) or go out of (egress) the instance. You can create your own, but for now we will use Red Cloud's default security group and add rules to it. Click on "Manage Rules".


Overview of Security Groups

You’ll see a variety of Ingress/Egress rules already in the group.


Adding Rules to the Security Group

However, if you're accessing a Linux-based VM, you will also need to allow SSH access. Click on "Add Rule", choose "SSH" in the first drop-down menu, and then click "Add". The SSH rule will now be listed as one of your rules. There are many other options for rules, including restricting access to Cornell IP addresses only.


Adding an SSH Rule

Step 4: Launch an Instance

Now we have all the tools to launch an instance. Under the Compute and then Instances tab, you will have the option to launch an instance.


Launching an Instance

Under the Details tab, give your instance a name.


Naming Your Instance

Under the Source tab, choose the image for your instance. I'll go with the latest stable version of Ubuntu and then click the up arrow.


Choosing an Image

Then, choose your flavor. It's recommended to start with the lowest RAM (8 GB), especially if you are just starting to explore cloud computing. You can always scale up after your instance is launched if you need to.


Choosing a Flavor

Under the Security Group tab, check to see that the default security group is selected. Then choose your key pair under the Key Pair tab. Great, now we can launch the instance by clicking the blue “Launch Instance” button.

The instance will go through a variety of tasks until it stabilizes at “Running”.


Instance Status

Now we can SSH into our remote server using the IP address listed for the instance. I'll use MobaXterm to start a local terminal and navigate into the directory where I saved my private key. Use the following command, inserting the IP address of your instance and the name of your key.


SSH-ing into the Ubuntu VM

Now we're in our Ubuntu machine and can interact with it using the command line. Enjoy!


Ubuntu Image!

Once you are done, make sure to either shelve your instance (if you want your disk contents preserved) or delete it. Even an idle instance reserves resources that other users could be using, so you will still be billed for it.


Shelving an Instance

In the next tutorial, I’ll describe Docker and how you can use VMs to run containerized code.

Acknowledgements: Much of the information shared in this tutorial comes from many conversations with staff at CAC, particularly Peter Vaillancourt and the Red Cloud wiki.