There are 315 blog posts on this Water Programming Blog. Chances are, if you have a question, it has already been answered in one of these posts. However, when I first joined the group, it was sometimes hard for me to know what I was even supposed to be searching for. Here are some blog posts that I found particularly useful when I started out or ones that I continue to regularly refer to.

What even is a cluster? I had no idea when I first arrived but this post brought me up to speed.

- Using MobaXterm as a terminal is incredibly intuitive, especially for someone like me who had rarely touched a terminal in undergrad. MobaXterm allows you to drag and drop files from your computer directly into your directory on the Cube. Furthermore, with the MobaXterm graphical SFTP browser you can navigate through your directories similarly to a Windows environment. I found that it was easier to use other terminal environments like Cygwin after I had gotten used to the terminal through MobaXterm. See Dave’s post here.
- Once you are oriented with how the terminal works, the best thing to do is practice navigating using Linux commands. Linux commands can also be very helpful for file manipulation and processing. When I first started training, I was much more comfortable opening text files, for example, in Excel, and making the necessary changes. However, very quickly, I was confronted with manipulating hundreds of text files or set files at a time, which forced me to learn Linux commands. After I learned how to properly use these commands, I wished I had started using them a long time ago. You will work much more efficiently if you start practicing the Linux commands listed in Bernardo’s blog post.

Most of my second semester was spent reproducing Julie Quinn’s Lake Problem paper, which is when I first started to understand how to use Borg. It took me entirely too long to realize that the commands in Jazmin’s tutorials here and here are completely generalizable for any application requiring the MOEA framework or Borg. Since these tutorials are done so early in training, it is very easy to forget that they may be useful later and applied to problems other than DTLZ. I found myself referring back many times to these posts to remember the commands needed to generate a reference set from multiple seeds and how to execute Borg using the correct flags.

I had heard GitHub tossed around by CS majors in undergrad but it never occurred to me that I would be using it one day. Now, I have realized what a great tool it is for code version control. If used correctly, it makes sharing code with collaborators much cleaner and more organized. However, before you can “clone” the contents of anyone’s repository to your own computer, you need an SSH key, which was not obvious to me as a newbie to both GitHub and Bitbucket. You also need a different SSH key for every computer that you use. To generate an SSH key, refer to 2) of this post. Then you can add the generated keys in your profile settings on your GitHub and Bitbucket accounts.

Once you have keys, you can start cloning directories and pushing changes from your local version to the repository that you cloned from using Git commands outlined in this blog post.

A consolidation of notes that I wrote down from interactions with senior students in the group that have proven to be useful:

- If you can’t get your set files to merge, make sure there is a # sign at the end of each set file.
- If a file is too big to view, use the *head* or *tail* command to see the first few lines or last lines of a file to get an idea of what the contents of the file look like.
- Every time you submit a job, a file with the name of the job script and job number will appear in your directory. If your code crashes and you aren’t sure where to start, this file is a good place to see what might be going on. I was using Borg and couldn’t figure out why it was crashing after just 10 minutes of running because no errors were being returned. When I looked at this file, hundreds of outputs had been printed that I had forgotten to comment out. This had overloaded the system and caused it to crash.
- If you want to compile a file or series of files, use the command *make*. If you have multiple make files in one folder, then you’ll need to use the command *make -f*. If you get odd errors when using the *make* command, try *make clean* first and then recompile.
- Most useful Cube commands:
  - *qsub* to submit a job
  - *qdel job number* if you want to delete a job on the Cube
  - *qsub -I* to start an interactive node. If you start an interactive node, you have one node all to yourself. If you want to run something that might take a while but not necessarily warrant submitting a job, then use an interactive node (don’t run anything large on the command line). However, be aware that you won’t be able to use your terminal until your job is done. If you exit out of your terminal, then you will be kicked out of your interactive node.
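The *head* and *tail* tip above can also be mirrored from inside a script when a file is too large to open. This is only a minimal sketch using the Python standard library; the function names `head` and `tail` are hypothetical helpers chosen to match the Linux commands, not part of any group tool.

```python
from collections import deque
from itertools import islice


def head(path, n=5):
    """Return the first n lines of a text file without reading the rest."""
    with open(path) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def tail(path, n=5):
    """Return the last n lines; deque(maxlen=n) keeps only n lines in memory."""
    with open(path) as f:
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]
```

Calling `head("myjob.o12345", 10)` on a job output file (the file name here is made up) would return its first ten lines as a list of strings.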

In retrospect, I see just how much I have learned in just one year of being in the research group. When you start, it can seem like a daunting task. However, it is important to realize that all of the other students in the group were in your position at one point. By making use of all the resources available to you and with time and a lot of practice, you’ll get the hang of it!

For this example, we will assume the state each year is either wet or dry, and the distribution of annual streamflows under each state is modeled by a Gaussian distribution. More states can be considered, as well as other distributions, but we will use a two-state, Gaussian HMM here for simplicity. Since streamflow is strictly positive, it might make sense to first log-transform the annual flows at the state line so that the Gaussian models won’t generate negative streamflows, so that’s what we do here.

After installing hmmlearn, the first step is to load the Gaussian hidden Markov model class with `from hmmlearn.hmm import GaussianHMM`. The class constructor takes the number of states (*n_components*, here 2 for wet and dry) and the number of iterations to run of the Baum-Welch algorithm described in Part I (*n_iter*; I chose 1000), and its `fit` function takes the time series to which the model is fit (here a column vector, Q, of the annual or log-transformed annual flows). You can also set initial parameter estimates before fitting the model and use the *init_params* argument to list only those parameters that hmmlearn should still initialize itself. This is a string of characters where ‘s’ stands for startprob (the probability of being in each state at the start), ‘t’ for transmat (the probability transition matrix), ‘m’ for means (mean vector) and ‘c’ for covars (covariance matrix). As discussed in Part I, it is good to test several different initial parameter estimates to prevent convergence to a local optimum. For simplicity, here I simply use default estimates, but this tutorial shows how to pass your own. I call the fitted model `model`; it is created in the first statement of the `fitHMM` function below.

Among other attributes and methods, `model` will have associated with it the means (`means_`) and covariances (`covars_`) of the Gaussian distributions fit to each state, the state probability transition matrix (`transmat_`), a method for computing the log-likelihood of the model (`score`), and methods for simulating from the HMM (`sample`) and predicting the states of observed values with the Viterbi algorithm described in Part I (`predict`). The `score` method could be used to compare the performance of models fit with different initial parameter estimates.

It is important to note that which state (wet or dry) is assigned a 0 and which state is assigned a 1 is arbitrary, and different assignments may be made with different runs of the algorithm. To avoid confusion, I choose to reorganize the vectors of means and standard deviations and the transition probability matrix so that state 0 is always the dry state, and state 1 is always the wet state. This is done in the final `if` block of `fitHMM`, which swaps the two states whenever the mean of state 0 is greater than the mean of state 1.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM
    def fitHMM(Q, nSamples):
        # fit Gaussian HMM to Q
        model = GaussianHMM(n_components=2, n_iter=1000).fit(np.reshape(Q,[len(Q),1]))

        # classify each observation as state 0 or 1
        hidden_states = model.predict(np.reshape(Q,[len(Q),1]))

        # find parameters of Gaussian HMM
        mus = np.array(model.means_)
        sigmas = np.array(np.sqrt(np.array([np.diag(model.covars_[0]),np.diag(model.covars_[1])])))
        P = np.array(model.transmat_)

        # find log-likelihood of Gaussian HMM
        logProb = model.score(np.reshape(Q,[len(Q),1]))

        # generate nSamples from Gaussian HMM
        samples = model.sample(nSamples)

        # re-organize mus, sigmas and P so that first row is lower mean (if not already)
        if mus[0] > mus[1]:
            mus = np.flipud(mus)
            sigmas = np.flipud(sigmas)
            P = np.fliplr(np.flipud(P))
            hidden_states = 1 - hidden_states

        return hidden_states, mus, sigmas, P, logProb, samples

    # load annual flow data for the Colorado River near the Colorado/Utah state line
    AnnualQ = np.loadtxt('AnnualQ.txt')

    # log transform the data and fit the HMM
    logQ = np.log(AnnualQ)
    hidden_states, mus, sigmas, P, logProb, samples = fitHMM(logQ, 100)

Okay great, we’ve fit an HMM! What does the model look like? Let’s plot the time series of hidden states. Since we made the lower mean always represented by state 0, we know that hidden_states == 0 corresponds to the dry state and hidden_states == 1 to the wet state.

    from matplotlib import pyplot as plt
    import seaborn as sns
    import numpy as np

    def plotTimeSeries(Q, hidden_states, ylabel, filename):

        sns.set()
        fig = plt.figure()
        ax = fig.add_subplot(111)

        xs = np.arange(len(Q))+1909
        masks = hidden_states == 0
        ax.scatter(xs[masks], Q[masks], c='r', label='Dry State')
        masks = hidden_states == 1
        ax.scatter(xs[masks], Q[masks], c='b', label='Wet State')
        ax.plot(xs, Q, c='k')

        ax.set_xlabel('Year')
        ax.set_ylabel(ylabel)
        fig.subplots_adjust(bottom=0.2)
        handles, labels = plt.gca().get_legend_handles_labels()
        fig.legend(handles, labels, loc='lower center', ncol=2, frameon=True)
        fig.savefig(filename)
        fig.clf()

        return None

    plt.switch_backend('agg') # turn off display when running with Cygwin
    plotTimeSeries(logQ, hidden_states, 'log(Flow at State Line)', 'StateTseries_Log.png')

Wow, looks like there’s some persistence! What are the transition probabilities?

print(P)

Running that we get the following:

[[ 0.6794469   0.3205531 ]
 [ 0.34904974  0.65095026]]

When in a dry state, there is a 68% chance of transitioning to a dry state again in the next year, while in a wet state there is a 65% chance of transitioning to a wet state again in the next year.
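These self-transition probabilities can also be read as typical spell lengths. In a Markov chain the number of consecutive years spent in a state is geometrically distributed, so its mean is 1/(1 - p), where p is the probability of staying. The sketch below applies that to the matrix printed above; `expected_spell_length` is a hypothetical helper name.

```python
# Transition matrix as printed above (row = current state, column = next state).
P = [[0.6794469, 0.3205531],
     [0.34904974, 0.65095026]]


def expected_spell_length(p_stay):
    """Mean number of consecutive years in a state (geometric run length)."""
    return 1.0 / (1.0 - p_stay)


dry_spell = expected_spell_length(P[0][0])
wet_spell = expected_spell_length(P[1][1])
print("Expected dry spell: %.2f years" % dry_spell)
print("Expected wet spell: %.2f years" % wet_spell)
```

Under this model, dry spells last a little over three years on average and wet spells a little under three.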

What does the distribution of flows look like in the wet and dry states, and how do these compare with the overall distribution? Since the probability distributions of the wet and dry states are Gaussian in log-space, and each state has some probability of being observed, the overall probability distribution is a mixed, or weighted, Gaussian distribution in which the weight of each of the two Gaussian models is the unconditional probability of being in their respective state. These probabilities make up the stationary distribution, π, which is the vector solving the equation π = π**P**, where **P** is the probability transition matrix. As briefly mentioned in Part I, this can be found using the method described here: π = (1/Σ*_{i}* *e_{i}*) **e**, where **e** is the eigenvector of **P**ᵀ corresponding to an eigenvalue of 1 and *e_{i}* is its *i*th element, which is how the code below computes it.

    from scipy import stats as ss

    def plotDistribution(Q, mus, sigmas, P, filename):

        # calculate stationary distribution
        eigenvals, eigenvecs = np.linalg.eig(np.transpose(P))
        one_eigval = np.argmin(np.abs(eigenvals-1))
        pi = eigenvecs[:,one_eigval] / np.sum(eigenvecs[:,one_eigval])

        x_0 = np.linspace(mus[0]-4*sigmas[0], mus[0]+4*sigmas[0], 10000)
        fx_0 = pi[0]*ss.norm.pdf(x_0,mus[0],sigmas[0])

        x_1 = np.linspace(mus[1]-4*sigmas[1], mus[1]+4*sigmas[1], 10000)
        fx_1 = pi[1]*ss.norm.pdf(x_1,mus[1],sigmas[1])

        x = np.linspace(mus[0]-4*sigmas[0], mus[1]+4*sigmas[1], 10000)
        fx = pi[0]*ss.norm.pdf(x,mus[0],sigmas[0]) + \
            pi[1]*ss.norm.pdf(x,mus[1],sigmas[1])

        sns.set()
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.hist(Q, color='k', alpha=0.5, density=True)
        l1, = ax.plot(x_0, fx_0, c='r', linewidth=2, label='Dry State Distn')
        l2, = ax.plot(x_1, fx_1, c='b', linewidth=2, label='Wet State Distn')
        l3, = ax.plot(x, fx, c='k', linewidth=2, label='Combined State Distn')

        fig.subplots_adjust(bottom=0.15)
        handles, labels = plt.gca().get_legend_handles_labels()
        fig.legend(handles, labels, loc='lower center', ncol=3, frameon=True)
        fig.savefig(filename)
        fig.clf()

        return None

    plotDistribution(logQ, mus, sigmas, P, 'MixedGaussianFit_Log.png')
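For a two-state chain the stationary distribution also has a simple closed form, which makes a handy sanity check on the eigenvector calculation. The sketch below is not part of the original script; it plugs in the off-diagonal probabilities printed earlier and confirms that the result satisfies π = π**P**. The variable names are my own.

```python
# Off-diagonal transition probabilities from the matrix printed earlier.
p01 = 0.3205531   # dry -> wet
p10 = 0.34904974  # wet -> dry

# Closed-form stationary distribution of a two-state Markov chain.
pi_dry = p10 / (p01 + p10)
pi_wet = p01 / (p01 + p10)

# Check that pi is unchanged by one transition, i.e. pi = pi P.
P = [[1 - p01, p01], [p10, 1 - p10]]
pi = [pi_dry, pi_wet]
pi_next = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]

print("pi      =", pi)
print("pi * P  =", pi_next)
```

With these numbers the chain spends slightly more than half of all years in the dry state.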

Looks like a pretty good fit – seems like a Gaussian HMM is a decent model of log-transformed annual flows in the Colorado River at the Colorado/Utah state line. Hopefully you can find relevant applications for your work too. If so, I’d recommend reading through this hmmlearn tutorial, from which I learned how to do everything I’ve shown here.

For example, imagine you are a doctor trying to diagnose when an individual has the flu. On any given day, this person is in one of two states: sick or healthy. These states are likely to exhibit great persistence; when the person gets the flu, he/she will likely have it for several days or weeks, and when he/she is healthy, he/she will likely stay healthy for months. However, suppose you don’t have the ability to test the individual for the flu virus and can only observe his/her temperature. Different (overlapping) distributions of body temperatures may be observed depending on whether this person is sick or healthy, but the state itself is not observed. In this case, the person’s temperature can be modeled by an HMM.
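To make the flu example concrete, here is a minimal simulation of such a process using only the Python standard library. Every number in it (the persistence probabilities, the mean temperatures and their spreads) is a made-up illustration rather than a clinical value, and `simulate_hmm` is a hypothetical helper name.

```python
import random


def simulate_hmm(n_days, p_stay, means, sds, seed=0):
    """Simulate hidden states (0 = healthy, 1 = sick) and observed temperatures."""
    rng = random.Random(seed)
    state = 0  # assume the person starts out healthy
    states, temps = [], []
    for _ in range(n_days):
        states.append(state)
        # the observation depends only on the current hidden state
        temps.append(rng.gauss(means[state], sds[state]))
        # stay in the current state with probability p_stay[state], else switch
        if rng.random() > p_stay[state]:
            state = 1 - state
    return states, temps


states, temps = simulate_hmm(
    n_days=365,
    p_stay=[0.99, 0.85],   # assumed: healthy persists for months, flu for about a week
    means=[36.8, 38.5],    # assumed mean temperatures in degrees Celsius
    sds=[0.3, 0.6],        # assumed standard deviations
)
print("Days sick: %d of %d" % (sum(states), len(states)))
```

A doctor would only ever see `temps`; the fitting problem is to recover the parameters, and then `states`, from those observations alone.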

So why are HMMs useful for describing hydro-climatological variables? Let’s go back to the example of ENSO. Maybe El Niño years in a particular basin tend to be wetter than La Niña years. Normally we can observe whether or not it is an El Niño year based on SST anomalies in the tropical Pacific, but suppose we only have paleodata of tree ring widths. We can infer from the tree ring data (with some error) what the total precipitation might have been in each year of the tree’s life, but we may not know what the SST anomalies were those years. Or even if we do know the SST anomalies, maybe there is another more predictive regime-shifting teleconnection we haven’t yet discovered. In either case, we can model the total annual precipitation with an HMM.

What is the benefit of modeling precipitation in these cases with an HMM as opposed to, say, an autoregressive model? Well, often the year-to-year correlation of annual precipitation may not actually be that high, but several consecutive wet or consecutive dry years are observed [Bracken et al., 2014]. Furthermore, paleodata suggests that greater persistence (e.g. megadroughts) in precipitation is often observed than would be predicted by autoregressive models [Ault et al., 2013; Ault et al., 2014]. This is where HMMs may come in handy.

Here I will explain how to fit HMMs generally, and in Part II I will show how to apply these methods using the Python package hmmlearn. To understand how to fit HMMs, we first need to define some notation. Let *Y _{t}* be the observed variable at time

In fitting a two-state Gaussian HMM, we therefore need to estimate the following vector of parameters: *θ* = [*μ _{0}*,

1)

Since *X _{t} *does not depend on future observations, P(

Why do we care about the probability of ending up in state *i* at time *t* given all of the observations (the left hand side of this equation)? In fitting an HMM, our goal is to find a set of parameters, *θ*, that maximizes this probability, i.e. the likelihood function of the state trajectories given our observations. Since the denominator on the right hand side is just a normalizing constant, our goal is therefore to maximize the numerator, or the probability estimates from the forward-backward algorithm. We can maximize this product using Expectation-Maximization.

Expectation-Maximization is a two-step process for maximum likelihood estimation when the likelihood function cannot be computed directly, for example, because some of the variables are unobserved, like the hidden states of an HMM. The first step is to calculate the expected value of the log likelihood function with respect to the conditional distribution of *X* given *Y* and *θ* (the left hand side of equation 1, or proportionally, the numerator of the right hand side). The second step is to find the parameters that maximize this function. These parameter estimates are then used to re-implement the forward-backward algorithm and the process repeats iteratively until convergence or some specified number of iterations. It is important to note that the maximization step is a local optimization around the current best estimate of *θ*. Hence, the Baum-Welch algorithm should be run multiple times with different initial parameter estimates to increase the chances of finding the global optimum.
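When comparing runs started from different initial estimates, a natural yardstick is the log-likelihood of the observations under each fitted parameter set. The sketch below computes it with a scaled forward pass for a Gaussian HMM in plain Python. It is an illustration under my own naming (`normal_pdf`, `forward_loglik`) and toy data, not the code a package such as hmmlearn uses internally.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def forward_loglik(ys, mus, sigmas, P, pi0):
    """Log-likelihood of observations ys under a Gaussian HMM (scaled forward pass)."""
    n = len(mus)
    loglik = 0.0
    alpha = None  # filtered state probabilities after the previous observation
    for y in ys:
        if alpha is None:
            prior = pi0  # initial state distribution
        else:
            # propagate the state probabilities one step through the transition matrix
            prior = [sum(alpha[i] * P[i][j] for i in range(n)) for j in range(n)]
        unnorm = [prior[j] * normal_pdf(y, mus[j], sigmas[j]) for j in range(n)]
        scale = sum(unnorm)  # probability of y given all earlier observations
        loglik += math.log(scale)
        alpha = [u / scale for u in unnorm]
    return loglik


# Toy data: three low values, three high values, two low values.
ys = [0.1, -0.2, 0.3, 2.1, 1.8, 2.4, 0.0, -0.1]
P = [[0.8, 0.2], [0.3, 0.7]]
pi0 = [0.6, 0.4]  # stationary distribution of P

good = forward_loglik(ys, [0.0, 2.0], [0.3, 0.3], P, pi0)
poor = forward_loglik(ys, [5.0, 9.0], [0.3, 0.3], P, pi0)
print("plausible means: %.1f, implausible means: %.1f" % (good, poor))
```

The parameter set whose state means match the data receives a far higher log-likelihood, which is the comparison one would make across Baum-Welch restarts.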

Another interesting question beyond fitting HMMs to observations is diagnosing which states the observations were likely to have come from given the estimated parameters. This is often performed using the Viterbi algorithm, which employs dynamic programming (DP) to find the most likely state trajectory. In this case, the “decision variables” of the DP problem are the states at each time step, *X _{t}*, and the “future value function” being optimized is the probability of observing the true trajectory, (
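As a rough illustration of the dynamic programming idea behind the Viterbi algorithm, here is a compact log-space version for a Gaussian HMM in plain Python. The function names and the toy parameters are my own assumptions for the sketch; in practice a library routine would normally be used instead.

```python
import math


def log_normal_pdf(y, mu, sigma):
    """Log-density of a Gaussian with mean mu and standard deviation sigma at y."""
    return -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi(ys, mus, sigmas, P, pi0):
    """Most likely hidden state sequence for observations ys under a Gaussian HMM."""
    n = len(mus)
    # best[j] = log-probability of the best path that ends in state j at the current step
    best = [math.log(pi0[j]) + log_normal_pdf(ys[0], mus[j], sigmas[j]) for j in range(n)]
    backpointers = []
    for y in ys[1:]:
        new_best, pointers = [], []
        for j in range(n):
            # pick the predecessor state that maximizes the path probability into j
            scores = [best[i] + math.log(P[i][j]) for i in range(n)]
            i_star = max(range(n), key=lambda i: scores[i])
            pointers.append(i_star)
            new_best.append(scores[i_star] + log_normal_pdf(y, mus[j], sigmas[j]))
        best = new_best
        backpointers.append(pointers)
    # trace the best path backwards from the most likely final state
    state = max(range(n), key=lambda j: best[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


ys = [0.1, -0.2, 0.3, 2.1, 1.8, 2.4, 0.0, -0.1]
path = viterbi(ys, mus=[0.0, 2.0], sigmas=[0.3, 0.3],
               P=[[0.8, 0.2], [0.3, 0.7]], pi0=[0.6, 0.4])
print(path)  # [0, 0, 0, 1, 1, 1, 0, 0]
```

Because the two toy distributions barely overlap, the decoded path simply follows the observations; with overlapping distributions the transition probabilities would start to smooth out isolated outliers.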

Now that you know how HMMs are fit using the Baum-Welch algorithm and decoded using the Viterbi algorithm, read Part II to see how to perform these steps in practice in Python!

But what happens when you are working between Python 2.7 and Python 3.x due to collaboration, using Python 3.4 because the last time you updated your script was four years ago, collaborating with others and want to ensure reproducibility and compatible environments, or banging your head against the wall because that one Python library installation is throwing up errors (shakes fist at PIL/Pillow)?

Creating Python environments is a straightforward solution to save you headaches down the road.

Python environments are a topic that many of us have feared through the years due to ambiguous definitions filled with waving hands. An environment is simply the domain in which users run software or scripts. Following this same train of thought, a Python environment is the domain in which all of the Python packages are installed and where a user (you!) executes a script (usually interfacing through an IDE or Terminal/Command Prompt).

However, different scripts will work or fail in different environments. To avoid having to use all of these packages at once or having to completely reinstall Python, what we want to do is create new and **independent** Python environments. Applications of these environments include:

- Have multiple versions of Python (e.g. 2.7 and 3.4 and 3.6) installed on your machine at once that you can easily switch between
- Work with specific versions of packages and ensure they don’t update for the specific script you’re developing
- Allow for individuals to install the same, reproducible environment between workstations
- Create standardized environments for seamless collaboration
- Use older versions of packages to utilize outdated code

One problem that recently arose in Ithaca was that someone was crunching towards deadlines and could only run PIL (Python Imaging Library) on their home machine and not their desktop on campus due to package installation issues. This individual had the following packages they needed to install while using Python 2.7.5:

- PIL
- matplotlib
- numpy
- pandas
- statsmodels
- seaborn

To start, let’s first create an environment! To do this, we will be using Conda (install Anaconda if you are a new user, or Miniconda if you don’t want your default Python environment to be jeopardized; if you want to avoid using Conda, feel free to explore Pipenv). As a quick note on syntax, I will be running everything in Windows 7 and every command I am using can be found on the Conda Cheatsheet. Only slight variations are required for macOS/Linux.

First, with your Command Prompt open, type the following command to create the environment we will be working in:

conda create --name blog_pil_example python=2.7.5

At this point, a new environment titled blog_pil_example with Python 2.7.5 has been created. Congrats! Don’t forget to take screenshots to add to your new environment’s baby book (or just use the one above if it’s not your first environment).

From here, we need to activate the environment before interacting with it. To see which environments are available, use the following:

conda env list

Now, let’s go ahead and activate the environment that we want (blog_pil_example):

activate blog_pil_example

To leave the environment you’re in, simply use the following command:

deactivate

(For Linux and macOS, put ‘source ’ prior to these commands.)

We can see in the screenshot above that multiple other environments exist, but the selected/activated environment is shown in parentheses. Note that you’re still navigating through the same directories as before, you’re just selecting and running a different version of Python and installed packages when you’re using this environment.
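A quick way to confirm which interpreter an activated environment is actually running is to ask Python itself. The sketch below uses only the standard library; `describe_interpreter` is a hypothetical helper name, and the string formatting avoids Python 3-only syntax so that it should also run under the 2.7 environment created above.

```python
import sys


def describe_interpreter():
    """Collect details that identify which Python installation is running."""
    return {
        "executable": sys.executable,   # full path to the running python binary
        "prefix": sys.prefix,           # root folder of the active installation
        "version": "%d.%d.%d" % sys.version_info[:3],
    }


for key, value in sorted(describe_interpreter().items()):
    print("%s: %s" % (key, value))
```

With blog_pil_example activated, I would expect the reported prefix to sit inside that environment’s folder rather than the base installation, although the exact path depends on where Conda is installed.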

Now onto the real meat and potatoes: installing the necessary packages. While you can use pip at this point, I’ve found Conda has run into fewer issues over the past year. (Read into channel prioritization if you’re interested in where package files are being sourced from and how to change this.) As a quick back to basics, we’re going to install one of the desired packages, matplotlib, using Conda (or pip). Using these ensures that the proper versions of the packages for your environment (i.e. the Python version and operating system) are retrieved. At the same time, all dependent packages will also be installed (e.g. numpy). Use the following command when in the environment and confirm you want to install matplotlib:

conda install matplotlib

Note that you can specify a version much like how we specified the python version above for library compatibility issues:

conda install matplotlib=2.2.0

If you wish to remove matplotlib, use the following command:

conda remove matplotlib

If you wish to update a specific package, run:

conda update matplotlib

Or to update all packages:

conda update --all

Additionally, you can prevent specific packages from updating by creating a pinned file in the environment’s conda-meta directory. *Be sure to do this prior to running the command to update all packages! *

After installing all of the packages that were required at the start of this tutorial, let’s look into which packages are actually installed in this environment:

conda list

By only installing the required packages, Conda was kind and installed all of the dependencies at the same time. Now you have a Python environment that you’ve created from scratch and developed into a hopefully productive part of your workflow.

The simplest way to utilize your newly created Python environment is to simply run python directly in the Command Prompt above. You can run any script when this environment is activated (shown in the parentheses on the left of the command line) to utilize this setup!

If you want to use this environment in your IDE of choice, you can simply point the interpreter to this new environment. In PyCharm, you can easily create a new Conda Environment when creating a new project, or you can point the interpreter to a previously created environment (instructions here).

For a good ground-up and more in depth tutorial with visualizations on how Conda works (including directory structure, channel prioritization) that has been a major source of inspiration and knowledge for me, please check out this blog post by Gergely Szerovay.

If you’re looking for a great (and nearly exhaustive) source of Python packages (both current and previous versions), check out Gohlke’s webpage. To install these packages, download the associated file for your system (32/64 bit and then your operating system), then use pip to install the file (in Command Prompt, navigate to the folder the .whl file is located in, then type ‘pip install <file_name>’). I’ve found that installing packages this way sometimes allows me to step around errors I’ve encountered while using

You can also create environments for R. Check it out here.

If you understand most of the materials above, you can now claim to be environmentally conscious!

Modern High Performance Computing (HPC) resources are usually composed of a cluster of computing nodes that provide the user the ability to parallelize tasks and greatly reduce the time it takes to perform complex operations. A **node** is usually defined as a discrete unit of a computer system that runs its own instance of an operating system. Modern nodes have multiple chips, often known as Central Processing Units or CPUs, which each contain multiple **cores** each capable of processing a separate stream of instructions (such as a single Monte Carlo run). An example cluster configuration is shown in Figure 1.

To efficiently make use of a cluster’s computational resources, it is essential to allow multiple users to use the resource at one time and to have an efficient and equitable way of allocating and scheduling computing resources on a cluster. This role is performed by job scheduling software. The scheduling software is accessed via a shell script called in the command line. A scheduling script does not actually run any code; rather, it provides a set of instructions for the cluster specifying what code to run and how the cluster should run it. Instructions called from a scheduling script may include but are not limited to:

- What code would you like the cluster to run
- How would you like to parallelize your code (i.e., MPI, OpenMP, etc.)
- How many nodes would you like to run on
- How many cores per processor would you like to run (normally you would use the maximum allowable per processor)
- Where would you like error and output files to be saved
- Set up email notifications about the status of your job

This post will highlight two commonly used Job Scheduling Languages, PBS and SLURM and detail some simple example scripts for using them.

The Portable Batch System (PBS) was originally developed by NASA in the early 1990s [1] to facilitate access to computing resources. The intellectual property associated with the software is now owned by Altair Engineering. PBS is a fully open source system and the source code can be found here. PBS is the job scheduler we use for the Cube Cluster here at Cornell.

An annotated PBS submission script called “PBSexample.sh” that runs a C++ code called “triangleSimulation.cpp” on 128 cores can be found below:

    #PBS -l nodes=8:ppn=16          # how many nodes, how many cores per node (ppn)
    #PBS -l walltime=5:00:00        # what is the maximum walltime for this job
    #PBS -N SimpleScript            # Give the job this name.
    #PBS -M email.cornell.edu       # email address for notifications
    #PBS -j oe                      # combine error and output file
    #PBS -o outputfolder/output.out # name output file

    cd $PBS_O_WORKDIR               # change working directory to current folder
    #module load openmpi/intel      # load MPI (Intel implementation)
    time mpirun ./triangleSimulation -m batch -r 1000 -s 1 -c 5 -b 3
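The 128 cores quoted for this script come from multiplying the number of nodes by the cores requested per node. As a small worked check, the sketch below parses a resource string of the form used above and does that arithmetic; `total_cores` is a hypothetical helper, not a PBS utility.

```python
def total_cores(resource_spec):
    """Parse a PBS resource string such as 'nodes=8:ppn=16' and return nodes * ppn."""
    fields = dict(part.split("=") for part in resource_spec.split(":"))
    return int(fields["nodes"]) * int(fields["ppn"])


print(total_cores("nodes=8:ppn=16"))  # 8 nodes x 16 cores per node = 128
```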

To submit this PBS script via the command line one would type:

qsub PBSexample.sh

Other helpful PBS commands for UNIX can be found here. For more on PBS flags and options, see this detailed post from 2012 and for more example PBS submission scripts see Jon Herman’s Github repository here.

A second common job scheduler is known as SLURM. SLURM stands for “Simple Linux Utility for Resource Management” and is the scheduler used on many XSEDE resources such as Stampede2 and Comet.

An example SLURM submission script named “SLURMexample.sh” that runs “triangleSimulation.cpp” on 128 cores can be found below:

    #!/bin/bash
    #SBATCH --nodes=8             # specify number of nodes
    #SBATCH --ntasks-per-node=16  # specify number of cores per node
    #SBATCH --export=ALL
    #SBATCH -t 5:00:00            # set max wallclock time
    #SBATCH --job-name="triangle" # name your job
    #SBATCH --output="outputfolder/output.out"

    #ibrun is the command for MPI
    ibrun -v ./triangleSimulation -m batch -r 1000 -s 1 -c 5 -b 3 -p 2841

To submit this SLURM script from the command line one would type:

sbatch SLURMexample.sh

The Cornell Center for Advanced Computing has an excellent SLURM training module within the introduction to Stampede2 workshop that goes into detail on how to most effectively make use of SLURM. More examples of SLURM submission scripts can be found on Jon Herman’s Github. Billy also wrote a blog post last year about debugging with SLURM.

We have had some success using this for HEC programs and EPA programs, but sometimes there are issues that preclude this solution working for all programs. But it is nice when it does work!

But what makes a good sample, and how can we understand the strengths and weaknesses of the sampling techniques (and also of the associated sensitivity techniques we are using) through quick visualization of some associated metrics?

This post aims to answer this question. It will first look at what makes a good sample using some examples from a sampling technique called latin hypercube sampling. Then it will show some handy visualization tools for quickly testing and visualizing a sample.

Intuitively, the first criterion for a good sample is how well it covers the space from which to sample. The difficulty, though, is how we define “how well” in practice, and the implications that has.

Let us take an example. A quick and popular way to generate a sample that covers the space fairly well is latin hypercube sampling (LHS; McKay et al., 1979). This algorithm relies on the following steps for drawing *N* samples from a hypercube-shaped space of dimension *p*:

1) Divide each dimension of the space in *N* equiprobable bins. If we want uniform sampling, each bin will have the same length. Number the bins from 1 to *N* in each dimension.

2) Randomly draw points such that you have exactly one in each bin in each dimension.
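The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration for uniform sampling on the unit hypercube, with bins numbered from 0 rather than 1 for convenience; `latin_hypercube` is a hypothetical function name, and for real work an established tool such as the SALib sampler mentioned later would be the safer choice.

```python
import random


def latin_hypercube(n_samples, n_dims, seed=None):
    """Draw n_samples points in the unit hypercube [0, 1)^n_dims with LHS."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # step 1: bins 0..N-1 of equal width 1/N along this dimension
        bins = list(range(n_samples))
        # step 2: visit the bins in a random order so that each bin is used once
        rng.shuffle(bins)
        # place one point uniformly at random inside each bin
        columns.append([(b + rng.random()) / n_samples for b in bins])
    # transpose so that each row is one sampled point
    return [list(point) for point in zip(*columns)]


sample = latin_hypercube(6, 2, seed=1)
for point in sample:
    print(["%.3f" % coordinate for coordinate in point])
```

Each column of `sample` has exactly one value in each of the six bins, but nothing in the procedure controls how the bins of one dimension pair up with those of the other.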

For instance, for 6 points in 2 dimensions, this is a possible sample (points are selected randomly in each square labelled A to F):

It is easy to see that by definition, LHS has a good space coverage when projected on each individual axis. But space coverage in multiple dimensions all depends on the luck of the draw. Indeed, this is also a perfectly valid LHS configuration:

In the above configuration, it is easy to see that on top of poor space coverage, correlation between the sampled values along both axes is also a huge issue. For instance, if output values are hugely dependent on values of input 1, there will be large variations of the output values as values of input 2 change, regardless of the real impact of input 2 on the output.

Therefore, there are two kinds of issues to look at. One is correlation between sampled values of the input variables. We’ll look at it first because it is pretty straightforward. Then we’ll look at space coverage metrics, which are more numerous, do not look exactly at the same things, and can be sometimes conflicting. In fact, it is illuminating to see that sample quality metrics sometimes trade-off with one another, and several authors have turned to multi-objective optimization to come up with Pareto-optimal sample designs (e.g., Cioppa and Lucas, 2007; De Rainville et al., 2012).

One can look at authors such as Sheikholeslami and Razavi (2017) who summarize similar sets of variables. The goal here is not to write a summary of summaries but rather to give a sense that there is a relationship between which indicators of sampling quality matter, which sampling strategy to use, and what we want to do.

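In what follows we note $x_{k,i}$ the $k^{th}$ sampled value of input variable $i$, with $1 \leq k \leq N$ and $1 \leq i \leq p$.

Sample correlation is usually measured through the Pearson statistic. For input variables $i$ and $j$ among the $p$ input variables, we note $x_{k,i}$ and $x_{k,j}$ the values of these variables in sample $k$, and have:

$$\rho_{ij} = \frac{\sum_{k=1}^{N} (x_{k,i} - \bar{x}_i)(x_{k,j} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{N} (x_{k,i} - \bar{x}_i)^2 \, \sum_{k=1}^{N} (x_{k,j} - \bar{x}_j)^2}}$$

In the above equation, $\bar{x}_i$ and $\bar{x}_j$ are the average sampled values of inputs $i$ and $j$.

Then, the indicator of sample quality looks at the maximal level of correlation across all variables:

$$\rho_{\max} = \max_{1 \leq i < j \leq p} |\rho_{ij}|$$

This definition relies on the remark that $\rho_{ij} = \rho_{ji}$.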
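As a rough illustration, the maximal pairwise correlation can be computed with NumPy's `corrcoef`. The helper name `max_abs_correlation` is hypothetical, and this sketch is not the SampleVis implementation.

```python
import numpy as np

def max_abs_correlation(sample):
    """Largest absolute Pearson correlation between any two input columns."""
    # rowvar=False: columns are input variables, rows are sample members
    corr = np.corrcoef(sample, rowvar=False)
    # The matrix is symmetric with ones on the diagonal, so the strict
    # upper triangle contains each pair of variables exactly once.
    upper = np.triu_indices_from(corr, k=1)
    return np.abs(corr[upper]).max()

rng = np.random.default_rng(1)
independent = rng.random((100, 3))

# Worst case: the second input is a linear function of the first
x = np.linspace(0, 1, 50)
aligned = np.column_stack([x, 2 * x + 1])

rho_independent = max_abs_correlation(independent)
rho_aligned = max_abs_correlation(aligned)
```

For the aligned sample the indicator equals 1 (up to rounding), while a well-behaved sample should give a value close to 0.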

There are different measures of space coverage.

We are best equipped to visualize space coverage via 1D or 2D projections of a sample. In 1D, a measure of space coverage is obtained by dividing each dimension into *N* equiprobable bins and counting the fraction of bins that contain at least one point. Since *N* is the sample size, this measure is maximized when there is exactly one point in each bin; it is a measure that LHS maximizes.
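A possible sketch of this 1D coverage measure is given below. It assumes a uniform sample in the unit hypercube, so that equiprobable bins are simply equal-length bins; `coverage_1d` is a hypothetical name.

```python
import numpy as np

def coverage_1d(sample):
    """Fraction of the N equal-length bins occupied along each dimension."""
    n_points, n_dims = sample.shape
    # Bin index of each point; clip so that a value of exactly 1.0
    # falls in the last bin
    bins = np.clip(np.floor(sample * n_points).astype(int), 0, n_points - 1)
    return np.array([np.unique(bins[:, j]).size / n_points
                     for j in range(n_dims)])

# Dimension 0: one point at the centre of each of the 4 bins.
# Dimension 1: three points crowded into the first bin, one in the last.
demo = np.column_stack([(np.arange(4) + 0.5) / 4, [0.1, 0.15, 0.2, 0.9]])
coverage = coverage_1d(demo)
```

Here `coverage` is 1.0 for the first dimension and 0.5 for the second, since only two of the four bins are occupied there.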

Other measures of space coverage consider all dimensions at once. A straightforward measure of space filling is the minimum Euclidean distance between two sampled points $X_k$ and $X_l$ in the generated ensemble:

$$d_{\min} = \min_{1 \leq k < l \leq N} \lVert X_k - X_l \rVert_2$$
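The minimum distance indicator can be sketched with plain NumPy broadcasting. This builds the full N x N distance matrix, which is fine for samples of a few hundred points; the function names are hypothetical.

```python
import numpy as np

def pairwise_distances(sample):
    """Euclidean distance matrix between all sampled points (N x N)."""
    diff = sample[:, None, :] - sample[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def min_distance(sample):
    """Smallest Euclidean distance between two distinct sampled points."""
    dist = pairwise_distances(sample)
    # Ignore the zero distance between each point and itself
    np.fill_diagonal(dist, np.inf)
    return dist.min()

points = np.array([[0.0, 0.0], [0.3, 0.4], [1.0, 1.0]])
d_min = min_distance(points)
```

In this small example the two closest points are (0, 0) and (0.3, 0.4), so `d_min` is 0.5. A larger value indicates that no two points sit almost on top of each other.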

Other indicators measure discrepancy, which is a concept closely related to space coverage. In simple terms, a low discrepancy means that when we look at a subset of a sampled input space, its volume is roughly proportional to the number of points that are in it. In other words, there is no large subset with relatively few sampled points, and there is no small subset with a relatively large density of sampled points. A low discrepancy is desirable, and in fact Sobol’ sequences, which form the basis of the Sobol’ sensitivity analysis method, are meant to minimize discrepancy.
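One common discrepancy indicator is the L2-star discrepancy, which has a closed-form expression often attributed to Warnock. The sketch below implements that formula for points in the unit hypercube; it is an illustration only, and the implementation in any given toolbox may differ in scaling or numerical details.

```python
import numpy as np

def l2_star_discrepancy(sample):
    """L2-star discrepancy of points in the unit hypercube (Warnock's formula)."""
    n_points, n_dims = sample.shape
    term1 = 3.0 ** (-n_dims)
    term2 = (2.0 / n_points) * np.prod((1.0 - sample ** 2) / 2.0, axis=1).sum()
    # Coordinate-wise maxima for every pair of points, shape (N, N, d)
    pair_max = np.maximum(sample[:, None, :], sample[None, :, :])
    term3 = np.prod(1.0 - pair_max, axis=2).sum() / n_points ** 2
    # Guard against tiny negative values caused by rounding
    return np.sqrt(max(term1 - term2 + term3, 0.0))

# Two 1D samples of two points each: evenly spread vs. crowded near zero
centered = np.array([[0.25], [0.75]])
clustered = np.array([[0.05], [0.10]])

disc_centered = l2_star_discrepancy(centered)
disc_clustered = l2_star_discrepancy(clustered)
```

The evenly spread sample has the lower discrepancy, as expected. Note that the pairwise term costs O(N²d) in time and memory, so this direct form is best kept for moderately sized samples.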

The figures that follow can be easily reproduced by cloning a little repository, SampleVis, that I put together, and by entering `python readme.py &> output.txt` on the command line. That Python routine can be used with both Latin hypercube and Sobol’ sampling (using the SALib sampling tool; SALib is a Python library developed primarily by Jon Herman and Will Usher, which is extensively discussed in this blog).

In what follows I give examples using a random draw of Latin hypercube sampling with 100 members and 7 sampled variables.

No luck! There is statistically significant pairwise correlation in three pairs of variables: x1 and x4, x4 and x6, and x5 and x6. Using LHS, it can take some time to be lucky enough to draw a correlation-free sample (alternatively, methods to minimize correlations have been extensively researched over the years, though no “silver bullet” really emerges).

This means any inference that works for both variables in any of these pairs may be suspect. The SampleVis toolbox also contains tools to plot whether these correlations are positive or negative.

The toolbox makes it possible to plot several indicators of space coverage, assuming that the sampled space is the unit hypercube of dimension *p* (*p=7* in this example). It computes discrepancy and minimal distance indicators. Ironically, my random LHS with 7 variables and 100 members has a better discrepancy (here I use an indicator called L2-star discrepancy) than a Sobol’ sequence with as many variables and members. The minimal Euclidean distance is also better than for Sobol’ (0.330 vs. 0.348). This means that if space coverage is more important than correlation for our experiment, the drawn LHS is pretty good.

To better grasp how well points cover the whole space, it is interesting to plot the distance from each point to its closest neighbor, and to represent these distances in increasing order:
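The quantity plotted in such a diagram can be computed as follows. This is a minimal NumPy sketch with a hypothetical function name, not the SampleVis routine itself.

```python
import numpy as np

def sorted_nearest_neighbor_distances(sample):
    """Distance from each point to its closest neighbor, in increasing order."""
    diff = sample[:, None, :] - sample[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A point is not its own neighbor
    np.fill_diagonal(dist, np.inf)
    return np.sort(dist.min(axis=1))

# Three points close to the origin and one isolated point in the far corner
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [1.0, 1.0]])
nn = sorted_nearest_neighbor_distances(points)
```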

This means that some points are not evenly spaced, and some are more isolated than others. When dealing with a limited number of variables, it can also be interesting to visualize 2D projections of the sample, like this one:

This again goes to show that the sample is pretty well distributed in space. We can compare with the same diagram for a Sobol’ sampling with 100 members and 7 variables:

It is pretty clear that the deterministic nature of Sobol’ sampling, for so few points, leaves more systematic holes in the sampled space. Of course, this sample is too small for any serious Sobol’ sensitivity analysis, and holes are plugged by a larger sample. But again, this comparison is a visual heuristic that tells a similar story as the global coverage indicator: this LHS draw is pretty good when it comes to coverage.

Campolongo, F., Cariboni, J. & Saltelli, A. (2007). An effective screening design for sensitivity analysis of large models. *Environmental Modelling & Software*, 22, 1509 – 1518.

Cioppa, T. M. & Lucas, T. W. (2007). Efficient Nearly Orthogonal and Space-Filling Latin Hypercubes. *Technometrics*, 49, 45-55.

De Rainville, F.-M., Gagné, C., Teytaud, O. & Laurendeau, D. (2012). Evolutionary Optimization of Low-discrepancy Sequences. *ACM Trans. Model. Comput. Simul., ACM*, 22, 9:1-9:25.

Herman, J. D., Kollat, J. B., Reed, P. M. & Wagener, T. (2013). Technical Note: Method of Morris effectively reduces the computational demands of global sensitivity analysis for distributed watershed models. *Hydrology and Earth System Sciences*, 17, 2893-2903.

Joe, S. & Kuo, F. (2008). Constructing Sobol Sequences with Better Two-Dimensional Projections. *SIAM Journal on Scientific Computing*, 30, 2635-2654.

McKay, M. D., Beckman, R. J. & Conover, W. J. (1979). A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. *Technometrics*, 21(2), 239-245.

Morris, M. D. (1991). Factorial Sampling Plans for Preliminary Computational Experiments. *Technometrics*, 33, 161-174.

Saltelli, A., Annoni, P., Azzini, I., Campolongo, F., Ratto, M. & Tarantola, S. (2010). Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. *Computer Physics Communications*, 181, 259 – 270.

Sheikholeslami, R. & Razavi, S. (2017). Progressive Latin Hypercube Sampling: An efficient approach for robust sampling-based analysis of environmental models. *Environmental Modelling & Software*, 93, 109 – 126.

Sobol’, I. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. *Mathematics and Computers in Simulation*, 55, 271 – 280.


As hinted at above, I originally created the plot to show bivariate data, with one variable plotted as the location on the dial and the other as the color. You could also plot the same variable as both color and location if you wanted to emphasize the meaning of increasing value on the dial. An example dial created with the code is shown below.

The color distribution, location of the arrow, and labeling of the gauge and colorbar are all fully customizable. I created the figure by first making a pie chart using matplotlib, inscribing a small white circle in the middle, and then cropping the image in half using the Python image processing library PIL (also known as Pillow). The arrow is created using the matplotlib “arrow” function and will point to a specified location on the dial. The code is written so that you can pass an array of any length to specify your colors. The array does not have to be monotonic like the one shown above, but its values must lie between zero and one (if your data is not in this range I’d suggest normalizing).

Annotated code is below:

    import math

    import matplotlib.pyplot as plt
    import numpy as np
    from matplotlib import cm
    from PIL import Image

    # set your color array and name of figure here:
    dial_colors = np.linspace(0, 1, 1000)  # using linspace here as an example
    figname = 'myDial'

    # specify which index you want your arrow to point to
    arrow_index = 750

    # create labels at desired locations
    # note that the pie plot plots from right to left
    labels = [' '] * len(dial_colors) * 2
    labels[25] = '100'
    labels[250] = '75'
    labels[500] = '50'
    labels[750] = '25'
    labels[975] = '0'


    # function plotting a colored dial
    def dial(color_array, arrow_index, labels, ax):
        # Create bins to plot (equally sized)
        size_of_groups = np.ones(len(color_array) * 2)

        # Create a pieplot, half white, half colored by your color array
        white_half = np.ones(len(color_array)) * .5
        color_half = color_array
        color_pallet = np.concatenate([color_half, white_half])

        cs = cm.RdYlBu(color_pallet)
        pie_wedge_collection = ax.pie(size_of_groups, colors=cs, labels=labels)

        # color each wedge edge like its face so no outlines show on the dial
        for i, pie_wedge in enumerate(pie_wedge_collection[0]):
            pie_wedge.set_edgecolor(cm.RdYlBu(color_pallet[i]))

        # create a white circle to make the pie chart a dial
        my_circle = plt.Circle((0, 0), 0.3, color='white')
        ax.add_artist(my_circle)

        # create the arrow, pointing at specified index
        # (the x component is negated, so the angle is measured from the left)
        arrow_angle = (arrow_index / float(len(color_array))) * math.pi
        arrow_x = 0.2 * math.cos(arrow_angle)
        arrow_y = 0.2 * math.sin(arrow_angle)
        ax.arrow(0, 0, -arrow_x, arrow_y, width=.02, head_width=.05,
                 head_length=.1, fc='k', ec='k')


    # create figure and specify figure name
    fig, ax = plt.subplots()

    # make dial plot and save figure
    dial(dial_colors, arrow_index, labels, ax)
    ax.set_aspect('equal')
    plt.savefig(figname + '.png', bbox_inches='tight')

    # create a figure for the colorbar (crop so only colorbar is saved)
    fig, ax2 = plt.subplots()
    cmap = cm.ScalarMappable(cmap='RdYlBu')
    cmap.set_array([min(dial_colors), max(dial_colors)])
    # recent matplotlib versions need to be told which axes the colorbar belongs to
    cbar = plt.colorbar(cmap, ax=ax2, orientation='horizontal')
    cbar.ax.set_xlabel("Risk")
    plt.savefig('cbar.png', bbox_inches='tight')
    cbar_im = Image.open('cbar.png')
    c_width, c_height = cbar_im.size
    # keep only the bottom 20% of the image, where the colorbar sits
    cbar_im.crop((0, int(.8 * c_height), c_width, c_height)).save('cbar.png')

    # open figure and crop bottom half
    im = Image.open(figname + '.png')
    width, height = im.size

    # crop bottom half of figure
    # function takes top corner and bottom corner coordinates
    # of image to keep, (0,0) in python images is the top left corner
    # (the box is wider than the figure, so Pillow pads the extra width with empty pixels)
    im.crop((0, 0, width + c_width, int(height / 2.0))).save(figname + '.png')
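If your data is not already between zero and one, a min-max normalization is one simple option. The sketch below also shows one way to turn a data value into an `arrow_index`, based on how the arrow angle is computed in the code above (index 0 at the far left of the dial, `len(color_array)` at the far right). Both helper names are hypothetical and are not part of the original script.

```python
import numpy as np

def normalize(values):
    """Rescale values linearly so that they span the interval [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # Constant data: put every value at the middle of the color map
        return np.full_like(values, 0.5)
    return (values - values.min()) / span

def index_for_value(value, lower, upper, n_colors):
    """Arrow index for `value` on a dial that runs from `lower` to `upper`."""
    fraction = (value - lower) / (upper - lower)
    return int(round(fraction * n_colors))

colors = normalize([2.0, 4.0, 6.0])
arrow_index = index_for_value(75, 0, 100, 1000)
```

With the labels used above (0 on the left, 100 on the right, 1000 colors), a value of 75 maps to index 750, which matches the `arrow_index` in the example.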

This code was my way of making a dial plot, and I think it works well for plotting gradients on the dial. In the course of writing this I came across a couple of similar codes, which I’m listing below. They both have advantages if you want to plot a small number of colors on your dial, but I had trouble getting them to scale.

Here’s an example that creates dials using matplotlib patches. This method looks useful for plotting a small number of categorical data, and I like the customization of the labels: http://nicolasfauchereau.github.io/climatecode/posts/drawing-a-gauge-with-matplotlib/

Here’s another alternative using the plotly library. I like the aesthetics, but if you’re unfamiliar with plotly there’s a lot to learn before you can nicely customize the final product: https://plot.ly/python/gauge-charts/

First and foremost, your data must be in an appropriate form for hierarchical clustering to be conducted. Table 1 shows an example of how your data can be set up. Four different spatial temperatures projected by CMIP5 models are shown along with various attributes that could be potential driving forces behind clustering: the institution the model comes from, the RCP (radiative forcing scenario) used in the model, and the initial conditions with which the model was run.

At this point, it is helpful to add the model names as the row names (shown in the leftmost column) of your data frame, otherwise the dendrogram function will use the row number as a label on the dendrogram which can make it hard to interpret the clustering results.

Next, create a distance matrix, which will be composed of Euclidean distances between pairs of model projections. This is what clustering will be based on. We first create a new data frame composed of just the temperature values (shown below) by removing columns from the Model Attributes table.

The following code can be used to create Table 2 from the original table and then the distance matrix.

    # Create a new data frame with just temperature values
    just_temperature=Model_Attributes[ -c(1:4) ]

    # Create a distance matrix
    d=dist(just_temperature)
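To make concrete what the distance matrix contains, here is a small Python illustration of the same calculation: the Euclidean distance between every pair of rows. The model names and temperature values are made up for the example and do not come from the CMIP5 table.

```python
import numpy as np

# Hypothetical projections: one row per model, one column per temperature
model_names = ["model_A", "model_B", "model_C"]
temperatures = np.array([
    [14.0, 15.0, 16.0, 17.0],
    [14.0, 15.0, 16.0, 19.0],
    [17.0, 19.0, 16.0, 17.0],
])

# Difference between every pair of rows, then the Euclidean norm of each
diff = temperatures[:, None, :] - temperatures[None, :, :]
distance_matrix = np.sqrt((diff ** 2).sum(axis=-1))
```

Here `distance_matrix[0, 1]` is 2 and `distance_matrix[0, 2]` is 5, so the first two models would be grouped together before the third. R's `dist` stores only the lower triangle of this symmetric matrix.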

Now, one can make the clustering diagram. Here I chose to use complete linkage clustering as the agglomeration method and wanted my dendrogram to be horizontal.

    # Perform clustering
    complete_linkage_cluster=as.dendrogram(hclust(d,method="complete"))

    # Adjust dimensions of dendrogram so that it fits in plotting window
    par(mar=c(3,4,1,15))
    plot(complete_linkage_cluster,horiz =TRUE)

And that’s it! Here is the most basic dendrogram.

Now for customization. You will first need to install the “dendextend” library in R.

We have 11 institutions that the models can come from and we want to visualize if institution has some impact on clustering, by assigning a color to the label. Here we use the rainbow color palette to assign each model a color and then replot the dendrogram.

    library(dendextend)

    # Create a vector of colors with one color for each institution
    col=rainbow(max(Model_Attributes$Institution))

    # Add colors to the ordered dendrogram
    labels_colors(complete_linkage_cluster)= col[Model_Attributes$Institution][order.dendrogram(complete_linkage_cluster)]

    # Replot the dendrogram
    par(mar=c(3,4,1,15)) # Dendrogram parameters
    plot(complete_linkage_cluster,horiz =TRUE)

Now suppose we wanted to change the branch colors to show what RCP each model was run with. Here, we assign a color from the rainbow palette to each of the four RCPs and add it to the dendrogram.

    col=rainbow(max(Model_Attributes$RCP))
    col_branches= col[Model_Attributes$RCP][order.dendrogram(complete_linkage_cluster)]
    colored_dendrogram=color_branches(complete_linkage_cluster,col=col_branches)
    par(mar=c(3,4,1,15))
    plot(colored_dendrogram,horiz =TRUE)

Now finally, we can change the node shapes to reflect the initial condition. There are 10 total initial conditions, so we’re going to use the first 10 standard pch (plot character) elements to represent the individual nodes.

    pch=c(1:max(Model_Attributes$Initial_Conditions))
    # One plot character per model, reordered to match the dendrogram leaves
    nodes=pch[Model_Attributes$Initial_Conditions][order.dendrogram(complete_linkage_cluster)]
    nodePar = list(lab.cex = 0.6, pch = c(NA,19),cex = 0.7, col = "black") # node parameters
    dend1 = colored_dendrogram %>% set("leaves_pch", c(nodes))
    par(mar=c(3,4,1,15))
    plot(dend1,horiz =TRUE)

And that’s how you customize a dendrogram in R!

- Plotting multiple datasets,
- Displaying dataset names,
- Choosing columns to be plotted,
- Coloring each dataset based on a column and a different Matplotlib color map,
- Specifying ranges to be plotted,
- Inverting multiple axes,
- Brushing by intervals in multiple axes,
- Choosing different fonts for title and rest of the plot, and
- Exporting the result as a figure file or viewing the plot in Matplotlib’s interactive window.
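For readers unfamiliar with brushing, the idea is to keep (or highlight) only the solutions whose values fall inside chosen intervals on one or more axes. The NumPy sketch below illustrates the concept with a hypothetical `brush` helper. It is not the plotting function's own implementation, and treating the two bounds as an unordered pair is an assumption made for this example.

```python
import numpy as np

def brush(data, criteria):
    """Boolean mask of rows whose values fall inside every brushing interval.

    `criteria` maps a column index to a pair of bounds (in either order).
    """
    mask = np.ones(data.shape[0], dtype=bool)
    for column, bounds in criteria.items():
        lower, upper = min(bounds), max(bounds)
        mask &= (data[:, column] >= lower) & (data[:, column] <= upper)
    return mask

data = np.array([
    [0.5, -1.0, 2.0],
    [0.1,  1.5, 0.3],
    [0.9, -0.2, 5.0],
])
# Keep rows with column 1 in [-10, 0] and column 2 in [0, 3]
selected = brush(data, {1: [-10.0, 0.0], 2: [3.0, 0.0]})
```

Only the first row satisfies both intervals, so `selected` is `[True, False, False]`.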

The source code can be found **here**, and below is an example of how to use it:

    import numpy as np
    from plotting.parallel_axis import paxis_plot
    from matplotlib.colors import LinearSegmentedColormap
    from matplotlib import cm

    # custom blue-to-cyan color map and its reverse
    bu_cy = LinearSegmentedColormap.from_list('BuCy', [(0, 0, 1), (0, 1, 1)])
    bu_cy_r = bu_cy.reversed()

    # two random example datasets with 8 columns each
    data1 = np.random.normal(size=(100, 8))
    data2 = np.random.normal(size=(100, 8))

    columns_to_plot = [0, 1, 3, 5, 7]
    color_column = 0
    axis_labels = ['axes ' + str(i) for i in range(8)]
    dataset_names = ['Data set 1', 'Data set 2']
    plot_ranges = [[-3.5, 3.5]] * 3 + [[-2.9, 3.1]] + [[-3.5, 3.5]] * 4
    axis_to_invert = [1, 5]
    brush_criteria = {1: [-10., 0.], 7: [10., 0.]}

    # cm.autumn_r replaces cm.get_cmap('autumn_r'), which recent matplotlib versions removed
    paxis_plot((data1, data2),
               columns_to_plot,
               color_column,
               [bu_cy_r, cm.autumn_r],
               axis_labels,
               'Title Here',
               dataset_names,
               axis_ranges=plot_ranges,
               fontname_title='Gill Sans MT',
               fontname_body='CMU Bright',
               file_name='test.png',
               axis_to_invert=axis_to_invert,
               brush_criteria=brush_criteria)

The output of this script should be a file named “test.png” that looks similar to the plot below:
