Welcome to Water Programming! This blog is by Pat Reed’s group at Cornell, who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.
Exploratory spatial data analysis and statistics help us to interpret maps in a more efficient way by finding trends, enabling pattern mining in space and time, identifying spatial outliers, etc. This information beyond the maps can help us to understand the characteristics of places, phenomena, and the relationships between them. Therefore, we can use it for predication and, decision-making.
For analyzing the spatial data, many tools and libraries are available such as ArcGIS, Gdal , etc. that utilize raster and vector geospatial data formats. In my next blogposts, I am going to introduce some types of spatial statistics using different libraries in R. In this blogpost, I am going to explore an auto-correlation between county-level 10-years average (2006-2016) of total potential evapotranspiration (ET) from March to the end of August. These datasets are aggregated from 4*4 km grid cells for each county across the US and can be downloaded from here. Applying spatial autocorrelation explores if the potential ET in counties near each other are more similar. In other word, it measures how distance can affect our variable of interest. The existence of autocorrelation in data might lead to incorrect statistical inferences for some spatial statistical analysis. Therefore, it is important to test for spatial autocorrelation.
First, we are going to inspect the distribution of the data. To read the spatial data (shapefile), we need to use the “rgdal” library. Set your working directory to the path where you downloaded the data.
We can see the name of columns by names(ET), and the coordinate system of the data by crs(ET).To plot the spatial objects, I am using “spplot” from the “sp” library, which is the specialized plot methods for spatial data with attributes. The first argument is object of spatial-class, which has coordinates, and the second argument “zcol” is the attribute name or column number in the attribute table associated with the data.
spplot(ET, zcol='MEAN_ETp_n',col.regions = topo.colors(20), main="Average Potential ET during March-August (mm)")
We can also use quantiles in order to better distinguish the distribution of data and potentially dilute the effect of outliers. To do this we need another library:
breaks_quantile <- classIntervals(ET$MEAN_ETp_n, n = 10, style = "quantile")
breaks <- breaks_quantile $brks
ET$MEAN_ETp_qun<- cut(ET$MEAN_ETp_n, breaks)
p2<- spplot(ET, "MEAN_ETp_qun", col.regions = topo.colors(20), main = "Average Potential ET during March-August (mm)_Quantile")
As shown in both maps, the counties in the southern half of the US have higher potential ET during March-August. To explore the spatial autocorrelation we need to define the neighbors of counties.
neighbor<- poly2nb(ET,queen = FALSE)
#With “queen” option we define if a single shared boundary point meets the contiguity condition (queen =TRUE), or more than one shared point is required (queen =FALSE).
plot(ET, border = 'lightgrey')
plot(neighbor, coordinates(ET), add=TRUE, col='red')
Now, we are going to present local spatial autocorrelation that explores spatial clustering across the US. First, we need to convert the neighbor data to a listw object. The “nb2listw” adds a neighbor list with spatial weights for the chosen coding scheme. Here, I chose “W”, which is row standardized (sums over all links to n).
lw <- nb2listw(neighbor, style = "W")
To get the autocorrelations a Moran’s test is used. The local spatial statistic Moran’s index estimates a correlation for each county based on the spatial weight object used. It is a statistical way to find out the potential local clusters and spatial outliers. A positive index indicates that a feature has neighboring features with similar high or low attributes that can be part of a cluster. A negative value indicates the outlier feature.
From this, we get some statistical measures: Ii: local moran statistic, E.Ii: expectation of local moran statistic, Var.Ii: variance of local moran statistic, Z.Ii: standard deviate of local moran statistic, and Pr() p-value of local moran statistic. Now we can add this result to our shapefile (ET) and plot the local moran statistic. I am going to add the Us_States boundary to the map.
moran_local_map <- cbind(ET, local_moran)
spplot(moran_local_map, zcol='Ii',col.regions = topo.colors(20),main="Local Moran's I statistic for Potential ET (March-August)", sp.layout =list("sp.polygons",states,lwd=3, first = FALSE))
Again, we can present this data with quantiles:
breaks_quantile_new <- classIntervals(moran_local_map$Ii, n = 10, style = "quantile")
breaks_new <- breaks_quantile_new$brks
moran_local_map$Ii_qun <- cut(moran_local_map$Ii, breaks_new)
spplot(moran_local_map, zcol='Ii_qun',col.regions = topo.colors(20),main="Local Moran's I statistic for Potential ET (March-August)_Quantile", sp.layout =list("sp.polygons",states,lwd=3, first = FALSE))
Since the indices include negative and zero values and there are a large variations between the positive values, I modified the breaks manually in order to be able to see the variations and potential clustering with the higher Moran values:
Note that the p-value should be small enough (statistically significant) for the feature to be considered as part of a cluster.
Anselin, L. 1995. Local indicators of spatial association, Geographical Analysis, 27, 93–115; Getis, A. and Ord, J. K. 1996 Local spatial statistics: an overview. In P. Longley and M. Batty (eds) Spatial analysis: modelling in a GIS environment (Cambridge: Geoinformation International), 261–277; Sokal, R. R, Oden, N. L. and Thomson, B. A. 1998. Local Spatial Autocorrelation in a Biological Model. Geographical Analysis, 30. 331–354; Bivand RS, Wong DWS 2018 Comparing implementations of global and local indicators of spatial association. TEST, 27(3), 716–748 https://doi.org/10.1007/s11749-018-0599-x
In my previous blog post, I explained information theory. I also talked about the application of information theory in sensitivity analysis. In this blog post, I briefly explain how information theory can be used in causal inference of time-series data.
First, I’ll offer a little perspective on causal analysis. We’re often interested to know what factors can cause a specific phenomenon. For example, let’s say that we want to understand what environmental conditions can cause a significant flood event in a specific region. Also, let’s assume that we have a time series of flood, precipitation, soil moisture, snow cover in upstream mountains, and temperature. There are many ways that this problem can be addressed. We can use regression analysis and correlation, principle component analysis, and variance-based analysis approaches to find out which situation best explains the flood events. However, these methods can potentially overlook many of the relationships that might exist in complex systems. Although the methods may give you a statistically meaningful equation, they cannot provide any causal insights. For example, there could be a strong relationship between the flood and precipitation two weeks prior to the flood event, or there might even be a relationship between snowfalls in winter and flood. Information-theoretic causal analysis can provide an alternative approach to exploring questions like these. Such methods have been used in many research areas, including environmental science (Rouge et al., 2019), neuroscience (Tononi et al., 2016), and economics (Yao and Li 2020).
There are two schools of thought in these causal analyses (Goodwell et al., 2020): (1) Pearl causality and (2) Granger causality. Pearl causality was introduced by Judea Pearl (here) and is based on the idea that interventions in a complex system provide information that can potentially lead to a causal inference. However, Granger causality (here) focuses on exploring the transfer of information from the past to current states of the system. This blog post concentrates on Granger causality.
Many methods have been developed and used to study how factors can Granger-cause other variables. For example, the simplest way, which I discussed in the previous blog post, is mutual information. It is a pairwise causal-analysis method and basically tells us which lag times of variable X provide information to the current state of variable Y. In other words, we gain information about which time lag of variable X can improve the prediction of variable Y. However, mutual information alone is not informative enough for problems that deal with multiple variables and variables that depend on their own conditions at previous time-steps—conditions that usually exist in the real world. Partial information decomposition provides a way to incorporate unique information that each variable provides as well as mutual and redundant information that each variable can provide. Transfer entropy (Schreiber 2000), on the other hand, is for a pairwise analysis that also takes into account the information transfer from the previous time-step of that variable. In mathematical language,
Where, Y and X are time series, t denotes time-step, and τ is the time lag.
In this example, I am using an R package called “TransferEntropy” that calculates transfer entropy. There is a Python package called “TIGRAMITE” that has been developed for causal analysis of time-series data. The following will install and load the R library.
Then, you can use the following to activate an R dataset that includes information about the European stock market. This is the dataset that we are using to see if data about past states of two stock indicators—FTSE and SMI—were systematically related.
The following code can be used to calculate the transfer entropy:
The results show that there is a significant relationship between the two variables (in both directions), as the following figure also indicates.
Finally, studies have also discussed potential issues that causal inference using these methods could create. For example, James et al. (2016) reported a poor level of reliability of these methods and suggested network science–oriented analysis as an alternative for causal inference.
This is an introductory blogpost on how to work with Application Programming Interfaces (APIs) in Python. Seeing as this is our blog’s first post on this topic I’ll spend some time explaining some basic information I’ve had to learn in the process of doing this and why it might be useful in your research. There are several blogs and websites with tutorials online and there’s no point repeating, but I’ll try to explain this for an audience like me, i.e., (a) has no formal training in computer science/web servers/HTTP but competent with scripting, and (b) interested in using this for research purposes.
What is an API?
Many sites (e.g., Facebook, Twitter, many many more) make their data available through what’s called Application Programming Interfaces or APIs. APIs basically allow software to interact, and send and receive data from servers so as to typically provide additional services to businesses, mobile apps and the like, or allow for additional analysis of the data for research or commercial purposes (e.g., collecting all trending Twitter hashtags per location to analyze how news and information propagates).
I am a civil/environmental engineer, what can APIs do for me?
APIs are particularly useful for data that changes often or that involves repeated computation (e.g., daily precipitation measurements used to forecast lake levels). There’s a lot of this kind of data relevant for water resources systems analysts, easily accessible through APIs:
I’ll demonstrate some basic information scripts using Python and the Requests library in the section below. There are many different API requests one can perform, but the most common one is GET, which is a request to retrieve data (not to modify it in any way). To retrieve data from an API you need two things: a base URL and an endpoint. The base URL is basically the static address to the API you’re interested in and an endpoint is appended to the end of it as the server route used to collect a specific set of data from the API. Collecting different kinds of data from an API is basically a process of manipulating that endpoint to retrieve exactly what is needed for a specific computation. I’ll demo this using the USGS’s API, but be aware that it varies for each one, so you’d need to figure out how to construct your URLs for the specific API you’re trying to access. They most often come with a documentation page and example URL generators that help you figure out how to construct them.
For some interpretation of how this was constructed, every parameter option is separated by &, and they can be read like so: data in json format; indented so I can read them more easily; site number is the USGS gage number; start and end dates; a USGS parameter code to indicate streamflow; site type ‘stream’; any status (active or inactive). This is obviously the format that works for this API, for a different API you’d have to figure out how to structure these arguments for the data you need. If you paste the URL in your browser you’ll see the same data I’m using Python to retrieve.
Object response now contains several attributes, the most useful of which are response.content and response.status_code. content contains the content of the URL, i.e. your data and other stuff, as a bytes object. status_code contains the status of your request, with regards to whether the server couldn’t find what you asked for or you weren’t authenticated to access that data, etc. You can find what the codes mean here.
To access the data contained in your request (most usually in json format) you can use the json function contained in the library to return a dictionary object:
data = response.json()
The Python json library is also very useful here to help you manipulate the data you retrieve.
How can I use this for multiple datasets with different arguments?
So obviously, the utility of APIs comes when one needs multiple datasets of something. Using a Python script we can iterate through the arguments needed and generate the respective endpoints to retrieve our data. An easier way of doing this without manipulating and appending strings is to set up a dictionary with the parameter values and pass that to the get function:
This produces the same data, but we now have an easier way to manipulate the arguments in the dictionary. Using simple loops and lists to iterate through one can retrieve and store multiple different datasets from this API.
Other things to be aware of
Multiple other people use these APIs and some of them carry datasets that are very large so querying them takes time. If the server needs to process something before producing it for you, that takes time too. Some APIs limit the number of requests you can make as a result and it is general good practice to try to be as efficient as possible with your requests. This means not requesting large datasets you only need parts of, but being specific to what you need. Depending on the database, this might mean adding several filters to your request URL to be as specific as possible to only the dates and attributes needed, or combining several attributes (e.g., multiple gages in the above example) in a single request.
This semester, I have had the opportunity to build a multi-objective optimization model in R that uses the Borg evolutionary algorithm framework for the optimization component to discover tradeoffs of the modeled system. The R wrapper historically has been the most under-developed of the wrappers offered for Borg, which has primarily been limiting for visiting students or groups on our campus in atmospheric or biological science that tend to use R. While fuller/faster Borg capabilities may be realized by transitioning to C++ or Python, I spent some time fixing up the R wrapper for serial Borg in an effort to help broaden our target users.
Acquiring Access to Borg
To request access to the Borg MOEA, complete step 2 of Jazmin’s introduction to Borg, found here. Once you receive access, within the Borg folder, navigate to /plugins/R. You will find borg.R (the borg R wrapper), DTLZ2.R (a test function), and a script called test.R that calls the wrapper and performs an optimization.
Setting up the Borg R Wrapper
The R wrapper pulls out relevant files, libraries, and functions from borg.c, borg.dll, and libborg.so, so these files must be in the same directory as the above files in order for the R wrapper to work. The R wrapper requires an R package called rdyncall, which allows R to call C libraries that Borg uses. This package is unfortunately not offered within the current CRAN repository, so you can find an archived version of the package here and download it manually. In order to install a tar.gz package, you can do so using the following R command: install.packages(“directory/to/package”, type=”source”). This command has other variants that work specifically for Windows vs. Mac vs. Unix users and has given other students trouble in the past, so feel free to reach out if you can’t get this step to work. Once this is done, you can run test.R and get your optimization results. The way the wrapper is set up, it will return the decision variables and objective function values in a dataframe called “result”.
Running Multiple Seeds
The R wrapper is set up to only run with Borg’s default parameterization, so you have to go into the wrapper and adjust Borg’s random state in order to facilitate using a different random initialization of Borg’s population or operator dynamics. Borg is a “seeded” algorithm, so the initialization of the algorithm is determined by a number shown in “rngstate”. This allows the runs to be reproducible, if the same seed is specific. In order to facilitate new seeds, you can just change this number. If we want to run 50 seeds, we wouldn’t want to change this number manually, so instead I subtract a variable “s” that will run from 1-50 depending on the seed.
We can feed this number in as an argument to specify seeds. Assuming you want to run this on the cluster, a bash script can be written to loop through the seeds and fed in as inputs into the test.R script as shown below:
The variable, “s” must be defined in your test.R script at the top, in the correct argument number. This number happens to be the 6th value that is found when calling commandArgs( ). This may differ across computers, so it’s important to check that you’re pulling in the correct argument. Provided you are running the seeds in parallel, you’ll also want to save the result in a file for its respective seed. The arguments into the Borg call are (number of variables, number of objectives, number of constraints, the name of the function that you are trying to optimize in DTLZ2.R which is named “DTLZ2”, NFE, and epsilons for each objective).
As shown at the top of DTLZ.R, you need to actually define nvars and nobjs and nconstrs if you have them. Objective functions must be listed as obj=… etc. and the function must return(objs, constrs) if necessary.
Creating Runtime Files
Another important tool that has been missing is the creation of runtime files which allow you to track how the Pareto front is evolving and to get a sense of convergence. In order to implement your own runtime files, you need to navigate to the part of the wrapper where the Borg MOEA is actually running (Lines 314-318). Inside this while loop, you need to implement the following statement block. First initialize the runtime file and write the problem characteristics at the time. Then, create a runtime vector that outlines the frequency that you want your output. Here I’m running 1000 FE and I want the runtime every 100 FE. Because Borg can be a little fickle and not print at the specific FE that you want, I specify a range of +2 around each value, so that if something doesn’t print quite at 100 FE, you’ll likely get output at 101 or 102 FE. When running jobs on the Cube, it’s important to have your output file stored somewhere that any external node can access; therefore I create my file in my scratch directory instead of my home directory.
When you land on a FE value in your output frequency range, you have to use the appropriate getter functions to first get the archive and get the variables and objectives out of that result. The NFE is printed along with the archive and then a # is added at the end to separate out solutions. An example runtime file is shown below.
This style of file might not be the easiest to pull data from, but it’s the structure that is necessary to be recognized by MOEAFramework to create a reference set and runtime metrics. Happy coding!
During many-objective search we seek to find approximate Pareto fronts that are convergent, meaning they are close to the “true” Pareto front, and diverse, meaning they contain solutions that they uniformly cover the full range of the “true” front. This post will focus on examining the latter criteria for high dimensional problems using Radial Visualization (RadVis; Hoffman et al., 1997), a visualization technique that transforms a solution set into radial coordinates. I’ll discuss the theory behind RadVis, provide some example code, and discuss potential applications, limitations and extensions.
When performing runtime diagnostics (a subject of two of my recent posts, which you can find here and here), we often asses MOEA performance using hypervolume because it captures both the convergence and diversity of the approximate Pareto front. Though tracking hypervolume provides us with an aggregate measure of diversity, it does not provide information on how the solutions are spread across the objective space. For low dimensional problems (two or three objectives), the diversity of the approximation set can easily be discerned through visualization (such as 2-D or 3-D scatter plots). For many-objective problems however, this becomes challenging. Tools such as parallel coordinate plots, bubble plots and glyph plots can allow us to visualize the solutions within an approximation set in their full dimensionality, but they are limited in their ability to convey information on diversity in high dimensional space. An alternative strategy for examining diversity within a solution set is to transform the objective space into a form coordinate system that highlights diversity. One common transformation is to view the set using radial coordinates.
RadVis uses an analogy from physics to plot many dimensional data. Each of n-objectives are plotted uniformly around the circumference of a unit circle while, each solution is represented as a point that is attached to all objectives by a set of n hypothetical springs (one spring is attached to each objective-solution pair). A solution’s objective value for the ith objective defines the spring constant exerted by the ith spring. A point’s location thus represents the equilibrium point between all springs (Hoffman et al., 1997). For example, a point that has equal objective values for all n-objectives will lie directly in the center of the plot, while a point that maximizes one objective and minimizes all others will lie on the circumference of the circle at the location specified for the maximized objective.
Below, I’ve plotted RadVis plots for a 5 objective instance of DTLZ2 at runtime snapshots from 1,000 function evaluations (NFE), 5,000 NFE, 10,000 NFE and 50,000 NFE. Notice how the approximate front for 1,000 NFE is sparse and skewed toward one side, while the front for 50,000 NFE is fairly uniform and centered. DTLZ2 is a test problem with a known true Pareto front which is continuous and uniformly distributed, using this information, we can see that the algorithm has likely improved its approximation set over the course of the run.
Caution! RadVis provides no information on convergence
It’s important to note that RadVis provides no information about solution convergence since the location of each point is dependent on equilibrium between all objectives. A set of solutions that performs poorly across all objectives will look similar to a set of solutions that performs well across all objectives. For this reason, RadVis should never be used alone to assess the quality of a Pareto front approximation. To include convergence information in RadVis, Ibrahim et al,. (2016) and Ibrahim et al., 2018 have proposed 3d-RadVis and 3D-RadVis antenna, extensions of RadVis methodology which augment Radvis by providing convergence information to decision makers. These methodologies may be subject to future posts.
Hoffman, P., Grinstein, G., Marx, K., Grosse, I., & Stanley, E. (1997, October). DNA visual and analytic data mining. In Proceedings. Visualization’97 (Cat. No. 97CB36155) (pp. 437-441). IEEE.
Ibrahim, A., Rahnamayan, S., Martin, M. V., & Deb, K. (2016, July). 3D-RadVis: Visualization of Pareto front in many-objective optimization. In 2016 IEEE Congress on Evolutionary Computation (CEC) (pp. 736-745). IEEE.
Ibrahim, A., Rahnamayan, S., Martin, M. V., & Deb, K. (2018). 3D-RadVis Antenna: Visualization and performance measure for many-objective optimization. Swarm and evolutionary computation, 39, 157-176.
Sensitivity analysis is a powerful tool to find out which parameters in a model have the largest effect on the results. Besides the impact of the main parameters, exploring the interactions between parameters is also important because complex systems usually have many active interactions. Learning from them will improve our understanding of how a model works at different underlying conditions. This could potentially lead to improving our inferences of the model’s results and model evaluation. Previously, Antonio showed how to create a Radial convergence plot for visualizing Sobol indices in Python. In this blog post, I am going to use a library in R to create a Chord diagram for Sobol interaction indices for annual baseflow. Then, I will create an animated GIF file that shows the interactions for a few years. The dataset (download from here) that I am using came from the simulation of an agro-biophysical model that has 33 parameters.
To work with this library, the format of the data should be a matrix or list.
As I mentioned above, the interactions are between 33 parameters, so we have 33 rows and columns in this dataset. I am going to assign a name to each row and column based on the actual order in the dataset:
Now, we are going to assign a color for each parameter. For better visualization and interpretability of the results, these parameters are classified into six main groups (H, Ph, B, T, W, and Phy) based on how they are related to the model. I am going to assign the same color to all of the parameters within each group.
As we see, there are many small interactions between parameters. Therefore, you may want to set a limit to show some specific interactions that are large, or you may just want to clean up some of the interactions that are less than specific values. To do this task, we can use the result of the above function without any more adjustments in order to get the final format of a dataset that is created within the function. Then, we can modify it based on what we want to show. The “interactions_plot” is actually a plot, but you can also look at its data frame. The “col” column has the color code that is assigned to each interaction. We can extract that column.
Then get the index of values that are less than for example 0.005:
And we can replace the actual color in “col2” with “transparent” for those rows for which their index was selected above.
col2[idx] <- 'transparent'
Now, we can create a final graph with some more adjustments: with “order,” we can manually sort a grid (sectors). “link.sort” and “link.decreasing” are used to sort the links based on the value of the interaction (which is shown by the width of the sector). This might help in detecting interesting interactions: “annotationTrack” set to “grid”, print the sectors, “preAllocateTracks”, specified the track, “circos.track” creates plotting regions for a track: track.index is the index for the track, which is going to be updated. “panel.fun” is used to add graphics and customize sector labels (x and y correspond to the data points in each cell).
We may want to add another layer to highlight a specific sector and add a group name. In this case, we can use “highlight.sector” to specify the color, label, and components of this new layer. With “padding,” we can also change the width of this layer.
To save the graph use this line of code before calling the chordDiagram():
png(file = paste("…(your path)…./sobol_Baseflow_S2_1989.png",sep="") ,height = 7, width = 7, units = "in", res = 500)
And at the end run dev.off(). Now, I am going to run this code for all of the other years and save them to make a GIF file that shows the interactions for all years. To make an animation from these plots, first we need to install “magick package.”
This simple code reads the first image and then adds the rest of the images to that in a loop. You can also add a message for each image. (Here, I added the year.) The “delay” option is used to adjust the delay after each frame.
Writing fully reproducible papers is something everyone talks about but very few people actually do. Following nice examples I’ve seen developed by others (see here and here), I wanted to develop a GitHub template that I could easily use to organize the analysis I perform for each paper. I wanted it to be useful for the Reed group in general, but also anyone else who’d like to use it, so the version I’m presenting today is an initial version that will be adapted and evolve as our needs grow.
The template can be found here: https://github.com/antonia-had/paper_template and this blogpost will discuss its contents. The repository is set up as a template, so you can use “Import repository” when you create a new repository for your project or click on the green “Use this template” button on the top right.
The idea is that everything is organized and documented well so that another person can easily replicate your work. This will help with your own tools being more widely used and cited, but also future group members to easily pick up from where you left. The other selfish way in which this has helped me is that it forces me to spend some time and arrange things from the beginning so I can be more organized (and therefore more productive) during the project. Most importantly, when a paper does get accepted you don’t need to go back and organize everything so it looks halfway decent for a public repository. For these reasons I try to use a template like this from the early stages of a project.
A lot of the template is self explanatory, but I’ll go through to explain what is in it in short. The idea is you take it and just replace the text with your own in the README files and use it as a guide to organize your paper analysis and results.
There are directories to organize your content to code, data, and results (or anything else that works for you). Every directory has its own README listing its contents and how they should be used. All code that you didn’t write and data that you didn’t generate need to be cited. Again, this is useful to document from the beginning so you don’t need to search for it later.
Most of my work is done in Python, so I wrote up how to handle Python dependencies. The way I suggest going about it is through a ‘.yml‘ file that specifies all the dependencies (i.e. all the packages and versions your script uses) for your project. I believe the best way to handle this is by creating a Python environment for every project you work on so you can create a separate list of dependencies for each. We have a nice blogpost on how to create and manage Python environments here.
When the project is done and you’re ready to submit or publish your paper, export all dependencies by running:
conda env export > environment.yml --no-builds
and store your environment.yml in the GitHub repository. When someone else needs to replicate your results, they would just need to create the same Python environment (running conda env create --file environment.yml) before executing your scripts.
Finally, you could automate the analysis and figure production with a makefile that executes your scripts so who ever is replicating does not need to manually execute all of them. This also helps avoiding somebody executing the scripts in the wrong order. An example of this can be found in this template. The makefile can also be in Python, like Julie did here.
In recognition that this is not an exhaustive template of everything one might need, the aim is to have this blog post and the template itself evolve as the group identifies needs and material to be added.
In this post, I will go over the concept of information theory and its relevance to sensitivity analysis. Then, I will talk about two other methods that are closely related to the concept of information theory. These methods are also often categorized as moment-independent sensitivity analysis methods.
What are the advantages of moment-independent sensitivity analysis indices?
Many methods exist for sensitivity analysis: for example, variance-based methods, such as the one proposed by Sobol (1990), that clarify how much of the variance of outputs can be explained by each input variable (Saltelli et al., 2008). They have been broadly implemented, and numerous studies show their capabilities. However, as Auder and Iooss (2008) and Zadeh et al., (2017) explain, as these approaches are not moment indifferent, variance-based methods might be inefficient in heavy-tailed distributions or when the outputs tend to have outliers. They also become less reliable when they encounter multi-modal distributions. To respond to these issues, moment-independent methods have been developed that are based on PDFs and CDFs. In this blog post, I go over some of them.
But before I introduce these methods, I would like to mention that there are studies that show that these moment-independent methods can be outperformed by other sensitivity analysis methods (e.g., variance-based Sobol). For example, Puy et al., (2020) shows that a moment-independent method (i.e., PAWN, which is explained below) does not necessarily produce better answers. Also, Auder and Iooss (2008) discuss how, at least in some cases, moment-independent methods can significantly increase computational costs.
Shannon, the founder of information theory, first introduced the concept of entropy in his famous 1948 paper, A Mathematical Theory of Communication. To put it simply, entropy is the opposite of information. Higher entropy indicates higher uncertainty. In the last several decades, the concept of entropy has found its way into many scientific areas, including economics, statistics, engineering, environmental science, and machine learning.
A famous example that can help to explain the concept of entropy is coin tossing. Let’s say that you have two coins. One of them is a fair coin, meaning that the probabilities of getting heads or tails are equal (P=0.5). The other coin is not fair, as the probability of getting heads (P=0.8) is higher than that of getting tails (P=0.2). Now, you are going to flip one of these coins; which one has a higher uncertainty? The answer is that the fair coin has a higher entropy. Entropy has also been described as a measure of surprise. In our coin example, a coin with higher entropy has a higher chance to surprise you. The following explains some of the basic concepts related to information theory.
The following formula can be used to calculate entropy:
Another important concept in information theory is conditional entropy. Conditional entropy is the amount of information needed to estimate a random variable Y when a random variable X is known. Intuitively, low-conditional entropy implies higher dependence of random variable Y on random variable X. The following formula can be used to calculate conditional entropy:
Mutual information is another concept that is closely related to entropy and explains how much of the information in Y can be estimated when X is known. The following shows how mutual information is related to entropy and conditional entropy:`
Entropy-Based Sensitivity Analysis
The concept of entropy has been used in sensitivity analysis. There are two popular entropy-based indices that have been used for sensitivity analysis: i) Krzykacz-Hausmann (2001) and ii) Liu et al., (2006). Both of them are based on analysis of PDFs of inputs and outputs. In this blog post, I will only explain the first one, Liu et al., adapted Kullback-Leibler (K-L) relative entropy concept which is conceptually similar and has a more complex mathematical formulation that can be solved using Monte Carlo sampling. More information on K-L-sensitivity methods can be found in Liu et al., (2006) and Auder and Iooss (2008).
Krzykacz-Hausmann Sensitivity Index
These authors use the direct definitions of Shannon’s information theory and use the following formula to calculate it:
Higher values of η indicate higher sensitivity of RV Y to RV X.
There are some other moment-independent methods that have been widely used in recent years, and I am including two of them here: i) PAWN and ii) δ-moment independent.
PAWN is another moment-independent sensitivity analysis metric, developed by Pianosi and Wagener (2015). The main difference between PAWN and other moment-independent approaches is that PAWN calculates the difference between Cumulative Density Function (CDF) instead of PDF. The main advantage is that CDF is generally easier and faster to calculate. PAWN can be also used to focus on specific parts of the distribution.
where KS (xi) is the Kolmogorov–Smirnov index for factor xi, Fy is the unconditional CDF, Fy|x is the conditional CDF, and stat is either the maximum or the median of values calculated for different values that xi was conditioned on. The PAWN index varies between 0 and 1, while higher values of Ti indicate a higher importance for a factor. Pawn can also be used to zoom into a specific range of the output surface, as has been explained in Pianosi and Wagener (2015).
The delta (δ) moment-independent method is conceptually similar to the PAWN method. The main difference is that, instead of the CDF curve, it calculates the difference between unconditional and conditional PDFs. The method was first introduced by Borgonovo (2006) and has become very popular ever since. The following equation is used to calculate the δ index.
The δ index has all of the advantages of the PAWN index, but in many cases, it can be more computationally expensive.
In my next two blog posts, I will introduce some open-source moment-independent and entropy-based software packages and will give you some practical examples. I will also go over the application of information theory in causal analysis and inference.
This post is an introduction to Altair, a python package derived from Vega and Vega-Lite, which allows high level visualization using a JSON syntax. I came across Altair while trying to look for a package that would allow interactive linking of plots. Altair allows the user to build sophisticated static and interactive plots using very little code. It is modular and has a simple “visualization grammar” that allows plots to be built up layer by layer. Initially, there is a learning curve to understand the syntax and almost every plot type requires data to be in a Pandas dataframe. Furthermore, Altair has yet to introduce 3D plotting functionality. However, I believe that it can be a very useful interactive tool for teaching. The Altair API does not render figures, but rather outputs JSON data structures that can be saved as HTML files or viewed in Jupyter Notebooks.
In this blog post, I re-create/create some figures for the Lake Problem using Altair. Below you will see videos (be sure to hit “enable HD” on each video), but after installing Altair (instructions in Blog Post.ipynb) you can create the figures yourself by downloading the following GitHub tutorial and stepping through the Jupyter Notebook or you can simply click on the html files in the GitHub folder to open the interactive figures in your browser.
Linked Scatter and Line Plot
This type of figure allows you to link data across two figures using a common id, so when you click a point, the corresponding line is shown in the second figure. I used this figure to highlight the discharge policy that is associated with each point in the DPS Pareto front. I find this to be very useful to be able to understand the general objective characteristics that lead to different policy shapes. Note how the high economic benefit solutions keep a steady release, while the solutions that favor low P concentration adjust accordingly. This style of figure creates a more wholesome experience; without it, the user is left to manually output a figure for each policy and keep track of which point on the front is associated with their respective policies.
Linked Scatter and Histogram
Using this type of figure allows the user to highlight sections of the scatter plot which contains points from different categories and to see a histogram of the highlighted points. I used this type of plot to recreate the first panel of Figure 10, which shows the parameter combinations of b and q that lead to success and failure in different states of the world for the criteria of economic benefit > 0.2 and reliability > 0.95. For any given rectangular cross section you can see the breakdown of success and failure in the linked histogram. The histogram provides a more quantitative interpretation of the results, which may not be clear just looking a the scatter plot.
Linked Scatter Plots
By brushing one scatter plot, you see the corresponding behavior of points on the other linked plot. This style of interactive figure might be useful to view 2D sub-problems of more complex multi-objective problems side by side.
Linked Legend Histograms
This type of plot allows you to isolate parts of the figure by clicking on the legend. Here we have histograms of the objective values for the Intertemporal (orange) and DPS (blue) formulations. By clicking on the legend label, the respective histogram will be kept while the other becomes transparent. This is just a another convenient way of representing the objectives rather than in a scatter plot, gives a sense of which values are most common, and also draws the user’s attention to the shape of each histogram through the isolation technique.
Multi-objective evolutionary algorithms have become an essential tool for decision support in water resources systems. A core challenge we face when applying them to real world problems is that we don’t have analytic solutions to evaluate algorithmic performance, i.e. since we don’t know what solutions are possible before hand, we don’t have a point of reference to assess how well our algorithm is performing. One way we can gain insight into algorithmic performance is by examining runtime dynamics. To aid our understanding of the dynamics of the Borg MOEA, I’ve written a small Python library to read Borg runtime files and build a dynamic dashboard that visualizes algorithmic progress.
The Borg MOEA produces runtime files which track algorithmic parameters and the evolving Pareto approximate set over an optimization run. Often we use these data to calculate performance metrics, which provide information on the relative convergence of an approximation set and the diversity of solutions within it (for background on metrics see Joe’s post here). Commonly, generational distance, epsilon indicator and hypervolume are used to examine quality of the approximation set. An animation of these metrics for the 3 objective DTLZ2 test problem is shown in Figure 1 below.
While these metrics provide a helpful picture of general algorithmic performance, they don’t provide insight into how the individual objectives are evolving or Borg’s operator dynamics.
Figure 2 shows a diagnostic dashboard of the same 3 objective DTLZ2 test problem run. I used the Celluloid python package to animate the figures. I like this package because it allows you to fully control each frame of the animation.
One thing we can learn from this dashboard is that though hypervolume starts to plateau around 3500 NFE, the algorithm is still working to to find solutions that represent an adequately diverse representation of the Pareto front. We can also observe that for this DTLZ2 run, the SPX and SBX operators were dominant. SBX is an operator tailored to problems with independent decision variables, like DTLZ2, so this results make sense.