A template for reproducible papers

Writing fully reproducible papers is something everyone talks about but very few people actually do. Following nice examples I’ve seen developed by others (see here and here), I wanted to develop a GitHub template that I could easily use to organize the analysis I perform for each paper. I wanted it to be useful for the Reed group in general, but also for anyone else who’d like to use it, so the version I’m presenting today is an initial one that will be adapted and evolve as our needs grow.

The template can be found here: https://github.com/antonia-had/paper_template and this blogpost will discuss its contents. The repository is set up as a template, so you can use “Import repository” when you create a new repository for your project or click on the green “Use this template” button on the top right.

The idea is that everything is organized and documented well enough that another person can easily replicate your work. This helps your own tools become more widely used and cited, and it also helps future group members pick up from where you left off. The other, more selfish, way in which this has helped me is that it forces me to spend some time arranging things from the beginning, so I can be more organized (and therefore more productive) during the project. Most importantly, when a paper does get accepted, you don’t need to go back and reorganize everything so it looks halfway decent for a public repository. For these reasons I try to use a template like this from the early stages of a project.

A lot of the template is self-explanatory, but I’ll briefly go through what’s in it. The idea is that you take it, replace the text in the README files with your own, and use it as a guide to organize your paper’s analysis and results.

There are directories to organize your content into code, data, and results (or anything else that works for you). Every directory has its own README listing its contents and how they should be used. All code that you didn’t write and all data that you didn’t generate need to be cited. Again, this is useful to document from the beginning so you don’t need to search for it later.

Most of my work is done in Python, so I wrote up how to handle Python dependencies. The way I suggest going about it is through a .yml file that specifies all the dependencies (i.e., all the packages and versions your scripts use) for your project. I believe the best way to handle this is to create a Python environment for every project you work on, so you can keep a separate list of dependencies for each. We have a nice blogpost on how to create and manage Python environments here.

When the project is done and you’re ready to submit or publish your paper, export all dependencies by running:

conda env export --no-builds > environment.yml

and store your environment.yml in the GitHub repository. When someone else needs to replicate your results, they would just need to create the same Python environment (by running conda env create --file environment.yml) before executing your scripts.
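For illustration, an exported environment.yml might look something like the sketch below; the environment name, package names, and versions here are hypothetical placeholders for whatever your project actually uses.

name: my_paper_env
channels:
  - defaults
dependencies:
  - python=3.8
  - numpy=1.19.2
  - pandas=1.1.3
  - matplotlib=3.3.2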

Finally, you could automate the analysis and figure production with a makefile that executes your scripts, so whoever is replicating your work does not need to run them all manually. This also helps avoid somebody executing the scripts in the wrong order. An example of this can be found in this template. The makefile can also be written in Python, like Julie did here.
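As a rough sketch of what such a makefile could look like (the script, data, and figure names below are made up for illustration; note that recipe lines must start with a tab):

all: figures/figure1.png

# The results depend on the analysis script and the raw data
results/output.csv: code/run_analysis.py data/input.csv
	python code/run_analysis.py

# The figure depends on the plotting script and the results
figures/figure1.png: code/make_figure1.py results/output.csv
	python code/make_figure1.py

Running make from the repository root then rebuilds only what is out of date, in the correct order.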

In recognition that this is not an exhaustive template of everything one might need, the aim is for this blog post and the template itself to evolve as the group identifies needs and material to be added.

Information Theory and Moment-Independent Sensitivity Indices

In this post, I will go over the concept of information theory and its relevance to sensitivity analysis. Then, I will talk about two other methods that are closely related to the concept of information theory. These methods are also often categorized as moment-independent sensitivity analysis methods.

What are the advantages of moment-independent sensitivity analysis indices?

Many methods exist for sensitivity analysis, for example variance-based methods such as the one proposed by Sobol (1990), which quantify how much of the variance of the outputs can be explained by each input variable (Saltelli et al., 2008). They have been broadly implemented, and numerous studies show their capabilities. However, as Auder and Iooss (2008) and Zadeh et al. (2017) explain, because these approaches rely on variance alone, variance-based methods can be unreliable for heavy-tailed distributions or when the outputs tend to have outliers. They also become less reliable when they encounter multi-modal distributions. To respond to these issues, moment-independent methods have been developed that are based on the full PDFs and CDFs of the outputs. In this blog post, I go over some of them.

But before I introduce these methods, I would like to mention that there are studies showing that moment-independent methods can be outperformed by other sensitivity analysis methods (e.g., variance-based Sobol). For example, Puy et al. (2020) show that a moment-independent method (i.e., PAWN, which is explained below) does not necessarily produce better answers. Also, Auder and Iooss (2008) discuss how, at least in some cases, moment-independent methods can significantly increase computational costs.

Entropy-Based Methods

Shannon, the founder of information theory, first introduced the concept of entropy in his famous 1948 paper, A Mathematical Theory of Communication. To put it simply, entropy is a measure of missing information: higher entropy indicates higher uncertainty. In the last several decades, the concept of entropy has found its way into many scientific areas, including economics, statistics, engineering, environmental science, and machine learning.

A famous example that helps explain the concept of entropy is coin tossing. Let’s say that you have two coins. One of them is a fair coin, meaning that the probabilities of getting heads or tails are equal (P = 0.5). The other coin is not fair, as the probability of getting heads (P = 0.8) is higher than that of getting tails (P = 0.2). Now, you are going to flip one of these coins; which one has higher uncertainty? The answer is that the fair coin has the higher entropy. Entropy has also been described as a measure of surprise: in our coin example, the coin with higher entropy is more likely to surprise you. The following sections explain some of the basic concepts of information theory.

Entropy

The following formula can be used to calculate entropy:
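H(X) = -\sum_{x} p(x) \log_2 p(x)

where p(x) is the probability of outcome x; using the base-2 logarithm gives entropy in bits.

As a quick numerical check of the coin example above, here is a minimal Python sketch (using only NumPy) that computes the entropy of the fair and the biased coin:

import numpy as np

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    probs = np.asarray(probs)
    return -np.sum(probs * np.log2(probs))

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.8, 0.2]))  # biased coin: about 0.72 bits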

Conditional Entropy

Another important concept in information theory is conditional entropy. Conditional entropy is the amount of information needed to describe a random variable Y when a random variable X is known. Intuitively, low conditional entropy implies a higher dependence of the random variable Y on the random variable X. The following formula can be used to calculate conditional entropy:
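H(Y|X) = -\sum_{x} p(x) \sum_{y} p(y|x) \log_2 p(y|x) = -\sum_{x,y} p(x, y) \log_2 p(y|x)

where p(x, y) is the joint probability of X and Y, and p(y|x) is the conditional probability of Y given X.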

Mutual Information

Mutual information is another concept that is closely related to entropy; it quantifies how much information about Y is gained by knowing X. The following shows how mutual information is related to entropy and conditional entropy:
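I(X; Y) = H(Y) - H(Y|X) = H(X) - H(X|Y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}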

Entropy-Based Sensitivity Analysis

The concept of entropy has been used in sensitivity analysis. Two popular entropy-based indices have been used for sensitivity analysis: i) Krzykacz-Hausmann (2001) and ii) Liu et al. (2006). Both of them are based on analysis of the PDFs of inputs and outputs. In this blog post, I will only explain the first one; Liu et al. adapted the Kullback-Leibler (K-L) relative entropy concept, which is conceptually similar but has a more complex mathematical formulation that can be solved using Monte Carlo sampling. More information on K-L sensitivity methods can be found in Liu et al. (2006) and Auder and Iooss (2008).

Krzykacz-Hausmann Sensitivity Index

Krzykacz-Hausmann (2001) works directly with the definitions of Shannon’s information theory to construct an entropy-based sensitivity index.
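A common way of writing this index is as the mutual information between the input X and the output Y, normalized by the entropy of the output:

\eta = \frac{I(X; Y)}{H(Y)}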

Higher values of η indicate higher sensitivity of the output Y to the input X.

Moment-Independent Methods

There are some other moment-independent methods that have been widely used in recent years, and I am including two of them here: i) PAWN and ii) the δ moment-independent method.

PAWN

PAWN is another moment-independent sensitivity analysis metric, developed by Pianosi and Wagener (2015). The main difference between PAWN and other moment-independent approaches is that PAWN calculates the difference between cumulative distribution functions (CDFs) instead of PDFs. The main advantage is that CDFs are generally easier and faster to calculate. PAWN can also be used to focus on specific parts of the distribution.

To calculate the PAWN index, they used the Kolmogorov–Smirnov statistic, which measures the distance between the conditional CDF and the unconditional CDF:
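KS(x_i) = \max_y \left| F_y(y) - F_{y|x_i}(y) \right|

T_i = \operatorname{stat}_{x_i} \left[ KS(x_i) \right]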

where KS(xi) is the Kolmogorov–Smirnov index for factor xi, Fy is the unconditional CDF, Fy|x is the conditional CDF, and stat is either the maximum or the median of the values calculated for the different values that xi was conditioned on. The PAWN index varies between 0 and 1, with higher values of Ti indicating a more important factor. PAWN can also be used to zoom into a specific range of the output, as explained in Pianosi and Wagener (2015).

δ-Moment Independent Method

The delta (δ) moment-independent method is conceptually similar to the PAWN method. The main difference is that, instead of CDFs, it calculates the difference between the unconditional and conditional PDFs. The method was first introduced by Borgonovo (2006) and has become very popular ever since. The following equation is used to calculate the δ index:
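\delta_i = \frac{1}{2} \, E_{X_i}\left[ \int \left| f_Y(y) - f_{Y|X_i}(y) \right| dy \right]

where f_Y is the unconditional PDF of the output and f_{Y|X_i} is the PDF of the output conditioned on the input X_i.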

The δ index has all of the advantages of the PAWN index, but in many cases, it can be more computationally expensive.

In my next two blog posts, I will introduce some open-source moment-independent and entropy-based software packages and will give you some practical examples. I will also go over the application of information theory in causal analysis and inference.

Interactive Lake Problem Visualizations Using Altair

Introduction

This post is an introduction to Altair, a Python package built on Vega and Vega-Lite that allows high-level visualization using a JSON syntax. I came across Altair while looking for a package that would allow interactive linking of plots. Altair allows the user to build sophisticated static and interactive plots using very little code. It is modular and has a simple “visualization grammar” that allows plots to be built up layer by layer. Initially, there is a learning curve to understand the syntax, and almost every plot type requires data to be in a Pandas dataframe. Furthermore, Altair has yet to introduce 3D plotting functionality. However, I believe that it can be a very useful interactive tool for teaching. The Altair API does not render figures itself, but rather outputs JSON data structures that can be saved as HTML files or viewed in Jupyter Notebooks.

In this blog post, I re-create/create some figures for the Lake Problem using Altair. Below you will see videos (be sure to hit “enable HD” on each video). After installing Altair (instructions in Blog Post.ipynb), you can create the figures yourself by downloading the following GitHub tutorial and stepping through the Jupyter Notebook, or you can simply click on the HTML files in the GitHub folder to open the interactive figures in your browser.
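To give a sense of the workflow, here is a minimal sketch (the dataframe and column names are made up for illustration): a chart is built from a Pandas dataframe, can be displayed directly in a Jupyter Notebook, and can be saved as a standalone HTML file.

import altair as alt
import pandas as pd

# Hypothetical Pareto front data: one row per solution, two objectives
front = pd.DataFrame({
    'economic_benefit': [0.1, 0.2, 0.3, 0.4],
    'reliability': [0.99, 0.97, 0.90, 0.80],
})

chart = alt.Chart(front).mark_point().encode(
    x='economic_benefit:Q',
    y='reliability:Q',
)

chart.save('pareto_front.html')  # or simply display `chart` in a notebook cell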

Linked Scatter and Line Plot

This type of figure allows you to link data across two figures using a common id, so that when you click a point, the corresponding line is shown in the second figure. I used this figure to highlight the discharge policy associated with each point in the DPS Pareto front. I find this very useful for understanding the general objective characteristics that lead to different policy shapes. Note how the high economic benefit solutions keep a steady release, while the solutions that favor low P concentration adjust accordingly. This style of figure creates a more cohesive experience; without it, the user is left to manually output a figure for each policy and keep track of which point on the front is associated with which policy.
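A rough sketch of how this kind of click-to-link figure can be wired up in Altair (using Altair 4-style selections; the dataframes `front` and `policies` and their columns are hypothetical, joined by a shared `policy_id` column):

import altair as alt

# Clicking a point selects its policy_id; empty='none' means nothing is shown until a click
click = alt.selection_single(fields=['policy_id'], empty='none')

scatter = alt.Chart(front).mark_point().encode(
    x='economic_benefit:Q',
    y='p_concentration:Q',
    color=alt.condition(click, alt.value('firebrick'), alt.value('steelblue')),
).add_selection(click)

lines = alt.Chart(policies).mark_line().encode(
    x='lake_P_concentration:Q',
    y='release:Q',
).transform_filter(click)

linked = scatter | lines  # side by side; clicking a point reveals its discharge policy
linked.save('linked_scatter_line.html')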

Linked Scatter and Histogram

Using this type of figure allows the user to highlight sections of the scatter plot that contain points from different categories and to see a histogram of the highlighted points. I used this type of plot to recreate the first panel of Figure 10, which shows the parameter combinations of b and q that lead to success and failure in different states of the world for the criteria of economic benefit > 0.2 and reliability > 0.95. For any given rectangular cross section, you can see the breakdown of success and failure in the linked histogram. The histogram provides a more quantitative interpretation of the results, which may not be clear from just looking at the scatter plot.
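A sketch of this pattern, again with hypothetical column names (a `samples` dataframe with `b`, `q`, and an `outcome` column labeling each sample as success or failure):

import altair as alt

brush = alt.selection_interval()  # rectangular brush on the scatter plot

points = alt.Chart(samples).mark_point().encode(
    x='b:Q',
    y='q:Q',
    color='outcome:N',
).add_selection(brush)

bars = alt.Chart(samples).mark_bar().encode(
    x='count()',
    y='outcome:N',
    color='outcome:N',
).transform_filter(brush)  # the bars count only the brushed points

(points & bars).save('scatter_histogram.html')  # stacked vertically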

Linked Scatter Plots

By brushing one scatter plot, you see the corresponding behavior of points on the other linked plot. This style of interactive figure might be useful to view 2D sub-problems of more complex multi-objective problems side by side.
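A sketch of linked brushing across two scatter plots, sharing one interval selection (the `solutions` dataframe and its objective column names are hypothetical):

import altair as alt

brush = alt.selection_interval()

# Base chart: points outside the brush are greyed out
base = alt.Chart(solutions).mark_point().encode(
    color=alt.condition(brush, alt.value('steelblue'), alt.value('lightgray')),
).add_selection(brush)

# Re-encode the same base chart with different objective pairs
linked = (base.encode(x='objective_1:Q', y='objective_2:Q')
          | base.encode(x='objective_3:Q', y='objective_4:Q'))
linked.save('linked_scatter_plots.html')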

Linked Legend Histograms

This type of plot allows you to isolate parts of the figure by clicking on the legend. Here we have histograms of the objective values for the Intertemporal (orange) and DPS (blue) formulations. By clicking on a legend label, the respective histogram is kept while the other becomes transparent. This is just another convenient way of representing the objectives, rather than in a scatter plot; it gives a sense of which values are most common and draws the user’s attention to the shape of each histogram through the isolation technique.
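A sketch of the legend-linked pattern, binding a selection to the legend and fading the unselected category (the `objectives` dataframe with `objective_value` and `formulation` columns is hypothetical):

import altair as alt

# Clicking a legend entry selects that formulation
selection = alt.selection_multi(fields=['formulation'], bind='legend')

hist = alt.Chart(objectives).mark_bar().encode(
    x=alt.X('objective_value:Q', bin=True),
    y='count()',
    color='formulation:N',
    opacity=alt.condition(selection, alt.value(1.0), alt.value(0.2)),
).add_selection(selection)

hist.save('linked_legend_histograms.html')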


Beyond Hypervolume: Dynamic visualization of MOEA runtime

Multi-objective evolutionary algorithms (MOEAs) have become an essential tool for decision support in water resources systems. A core challenge we face when applying them to real-world problems is that we don’t have analytic solutions to evaluate algorithmic performance; since we don’t know what solutions are possible beforehand, we don’t have a point of reference to assess how well our algorithm is performing. One way we can gain insight into algorithmic performance is by examining runtime dynamics. To aid our understanding of the dynamics of the Borg MOEA, I’ve written a small Python library to read Borg runtime files and build a dynamic dashboard that visualizes algorithmic progress.

The Borg MOEA produces runtime files which track algorithmic parameters and the evolving Pareto-approximate set over an optimization run. Often we use these data to calculate performance metrics, which provide information on the relative convergence of an approximation set and the diversity of solutions within it (for background on metrics, see Joe’s post here). Commonly, generational distance, epsilon indicator, and hypervolume are used to examine the quality of the approximation set. An animation of these metrics for the 3-objective DTLZ2 test problem is shown in Figure 1 below.

Figure 1: Runtime metrics for the DTLZ2 test problem. The x-axis is the number of function evaluations; the y-axis shows each individual metric.

While these metrics provide a helpful picture of general algorithmic performance, they don’t provide insight into how the individual objectives are evolving or Borg’s operator dynamics.

Figure 2 shows a diagnostic dashboard of the same 3-objective DTLZ2 test problem run. I used the Celluloid Python package to animate the figures. I like this package because it allows you to fully control each frame of the animation.
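As a small illustration of how Celluloid works (independent of the Borg runtime parsing itself, with made-up data standing in for the evolving approximation set), each frame is drawn on the same figure and captured with a snapshot call:

import numpy as np
import matplotlib.pyplot as plt
from celluloid import Camera

fig, ax = plt.subplots()
camera = Camera(fig)

for snapshot in range(10):
    # Stand-in for the approximation set recorded at this point in the run
    points = np.random.rand(50, 2) / (snapshot + 1)
    ax.scatter(points[:, 0], points[:, 1], color='steelblue')
    camera.snap()  # capture the current state of the figure as one frame

animation = camera.animate()
animation.save('runtime_dynamics.gif', writer='pillow')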

Figure 2: DTLZ2 runtime dynamics. The three objectives are shown in a scatter plot and a parallel axis plot. The third panel plots hypervolume over the optimization run, and the bottom panel shows Borg’s operator dynamics. (A higher resolution version of this file can be found here: https://www.dropbox.com/s/0nus5xhwqoqtghk/DTLZ2_runtime.gif?dl=0)

One thing we can learn from this dashboard is that though hypervolume starts to plateau around 3500 NFE, the algorithm is still working to find solutions that provide an adequately diverse representation of the Pareto front. We can also observe that for this DTLZ2 run, the SPX and SBX operators were dominant. SBX is an operator tailored to problems with independent decision variables, like DTLZ2, so this result makes sense.

I’m planning on building off this dashboard to include a broader suite of visualization tools, including pairwise scatter plots and radial plots. The library can be found here: https://github.com/dgoldri25/runtimeDiagnostics

If anyone has suggestions or would like to contribute, I would love to hear from you!