Potential Biases and Pitfalls of Machine Learning

Artificial intelligence has already changed the world in many ways. It is becoming difficult to find a modern digital tool that does not use AI in one way or another. It can be argued that many of these changes have been positive; however, it can also easily be shown that AI has not benefited everybody equally.

The application of AI in science has also grown rapidly. I searched Google Scholar for the words water, machine, and learning, limited the search to studies with all of these words in their titles, and created the following figure from the results. As you can see, the number of such publications has increased significantly in the last few years. This is obviously not a thorough study, but it shows how important artificial intelligence has become in academic research.

However, there are many pitfalls that can degrade the usefulness of AI in academic and non-academic applications. Some people even argue that these pitfalls could lead us to a new AI winter that nobody will enjoy. In this blog post, I go over some of these issues and difficulties. Some of the problems have emerged recently, but others have been known to statisticians for decades. For many reasons, they continue to exist and negatively affect artificial intelligence projects.

Any machine learning project usually includes the following components: 1) sample collection and preparation, 2) ML model generation, and 3) result interpretation and extraction of actionable insights. Here, I will talk about some of the mistakes and biases that can happen in each of these steps.

1- Sample Collection and Preparation

The input dataset is a crucial source of bias in any ML project. The problem basically occurs when the dataset is not a good representation of the real world. Here are a few variants of this issue:

Exclusion Bias

This happens when specific groups that exist in the real world are systematically absent in the dataset. For example, a specific part of the population can be absent in the dataset, or the dataset can only contain information about particular countries, climates, etc. Having a diverse group that represents various segments of society and different worldviews can reduce the consequences of this type of bias.

Measurement Bias

There are different reasons that observations can be faulty and unreliable. For example, observations can be sensed and collected through malfunctioning measurement devices that can bias the entire dataset. Also, human reading errors and mistakes can cause measurement errors. Additionally, there are many things that can go wrong during the post-processing of raw measurements and can lead to measurement bias.

Other Sampling Biases

There are also other types of data collection biases that have been discussed in statistical literature for years. One example of these is self-selection bias. Let’s imagine that you are running an online poll and want to analyze the results, but your participants are not fully random. People who choose to participate in your poll might have specific personalities or worldviews and not represent the population at large. Other biases that have been widely discussed in statistical literature include survivor bias, prejudice bias, observer bias, and recall bias. You can find more information about these biases here, here, and here. If possible, these biases should be avoided. Otherwise, their effects should be estimated and removed from the analysis. Also, acquiring more comprehensive datasets can reduce these types of biases.

Imbalanced Datasets

If the data collection process is fair and reliable, an imbalanced dataset is not really a bias: it's more of a challenge that is inherent to specific problems. Some events just don't happen frequently. Problems and research questions that focus on rare events need their own appropriate treatments. In a previous blog post, I explained this issue and the ways that we can deal with it. You can also refer to this article and this article for information about imbalanced datasets.

Data Leakage

Data leakage is a situation in which we mistakenly share a portion (or the entirety) of the data between the training and validation datasets. It can happen through different types of mistakes during data preparation. Time series can also implicitly include data leakage: if training and testing datasets are selected in a way that lets the training data carry information about the test period, we have implicitly leaked data. For example, if we extract the testing dataset from the middle of the original time series, we basically give the model insight into what happens before and after the testing period, which is essentially cheating and inflates the apparent accuracy of the model relative to its real-world performance. One way to avoid this is to make sure that our training data period precedes the validation period.
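
As a minimal sketch of this idea (using a hypothetical data frame, flow_data, with a date column and a flow column), a chronological split keeps every training row strictly before the validation period:

# Hypothetical daily data: a date column and a target column
flow_data <- data.frame(date = seq(as.Date("2000-01-01"), as.Date("2009-12-31"), by = "day"))
flow_data$flow <- runif(nrow(flow_data))

# Split chronologically so the training period strictly precedes the validation period
cutoff_date    <- as.Date("2008-01-01")
training_set   <- flow_data[flow_data$date <  cutoff_date, ]   # fit the model (and any scaling) here
validation_set <- flow_data[flow_data$date >= cutoff_date, ]   # evaluate only here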

Another type of leakage is called feature leakage, and it happens when our feature list includes a variable that is either closely correlated with the data labels or is a duplicate of the data label. Obviously, we won't have that feature at prediction time, so the trained model will look far more accurate than it really is.

More information about data leakage can be found here and here.

2- ML and Statistical Models

Lack of Physical Understanding of the System

In machine learning, we use statistical relationships to explore how the system would behave under a specific condition. However, ML models do not gain any explicit knowledge of the actual physical laws that govern the system. For example, in hydrology, it is entirely possible to violate water and energy conservation laws while training and using machine learning-driven models. There are studies (here and here) that try to incorporate these physical constraints into ML models, and this issue is likely to attract more attention in the future.

Reproducibility

There is a general consensus among ML researchers that many ML-based studies are not reproducible, meaning that an independent researcher cannot replicate the ML process and obtain the same results. Many recent studies have shown that ML scientists are often not able to reproduce the results of other studies. While some people refer to this as the reproducibility crisis in ML studies, it is really a broader issue in academic research. Based on a 2016 survey published by Nature, 70% of scientists across various disciplines were not able to reproduce the results of other studies, and more than 50% even failed to reproduce their own previous results. This is obviously a big challenge and needs to be taken into consideration. More information about the reproducibility challenge can be found in these blog posts (here and here).

Instability of the Model

Model stability is also an important challenge. A model can be efficient and accurate under specific random conditions (e.g., a particular random seed or data sample) and fail under different conditions. This problem is closely related to the reproducibility challenge. Using different random seeds and developing model ensembles can help, and the right remedy also depends on the inherent nature of the ML algorithm. For example, this blog post nicely explains the issue of stability and its solution in K-Means clustering.
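
As a minimal sketch of this behavior (using R's built-in kmeans and the iris data, which appears again later in this post), results from a single random start can differ across seeds, while multiple starts stabilize them:

# Total within-cluster error of k-means for five different seeds, one random start each
wss_per_seed <- sapply(1:5, function(s) {
  set.seed(s)
  kmeans(iris[, 1:4], centers = 3, nstart = 1)$tot.withinss
})
wss_per_seed   # values can differ from seed to seed

# Using many random starts (an ensemble of initializations) gives a stable, better solution
set.seed(1)
kmeans(iris[, 1:4], centers = 3, nstart = 25)$tot.withinss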

Simpson’s Paradox

Simpson's Paradox describes a situation in which our analysis reaches one conclusion when we investigate each group separately and a different (sometimes opposite) conclusion when the groups are combined. The following figure shows an example of this situation under a simple linear regression. The implications of this phenomenon should be taken into account when we decide on labels and when we interpret our results.

Correlation Bias

It is important to check for strong correlations between features, which can lead to unstable and unreliable ML models. This is essentially the widely discussed multicollinearity problem in regression. However, removing feature columns needs to be done carefully in order to avoid a significant loss of information.
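
As a minimal sketch (using the iris columns as stand-in features), we can inspect the feature correlation matrix before model fitting and flag pairs above a chosen threshold:

# Pairwise correlations between candidate features
feature_cor <- cor(iris[, 1:4])
round(feature_cor, 2)

# Flag feature pairs whose absolute correlation exceeds a chosen threshold (e.g., 0.9)
high_cor_pairs <- which(abs(feature_cor) > 0.9 & upper.tri(feature_cor), arr.ind = TRUE)
high_cor_pairs   # candidates to drop or combine, taking care not to lose information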

Other Statistical and ML Biases

There are various ways that machine learning and statistical analyses can be biased and misleading, and P-hacking is one of the most widely discussed pathways in hypothesis testing. P-hacking basically means that the researcher keeps changing the definition of the problem or the significance threshold until a positive relationship appears, and then reports only the successful experiment to the scientific community.

A related issue occurs when the researcher works on a big dataset with many features. The researcher keeps exploring different parts of the dataset in search of a statistically significant relationship; after enough digging, some apparently significant relationship will eventually be found by chance. However, when the results are reported in scientific publications, only the significant positive findings appear, not the number of attempts it took to find them. This issue is often referred to as the multiple comparisons problem or the look-elsewhere effect.

There are many studies that show that p-hacking is a common problem and has significantly biased many scientific findings, but other studies exist that argue that, while p-hacking is common in scientific studies, it does not significantly change the big picture of what we eventually learn from scientific literature.

We should also always keep in mind that ML, and data mining analyses in general, are inherently exploratory, not confirmatory. This means that ML studies can only generate a hypothesis, while separate statistical testing is necessary to confirm it (more on this can be found here and here). This can also be seen as another aspect of the reproducibility problem in ML. Machine learning methods do not provide insights into causal relationships. Therefore, ML-driven methods can give us high prediction accuracy under a specific condition, but their accuracy might drop significantly under a different condition, and we need to pay attention to this issue.

Performance Metrics

It is crucial to take into account different performance aspects of an ML model, and, to do that, we often need to consider various metrics (refer to this blog post for more information). As I discussed in this previous blog post, it is also important to think carefully about which metrics are really desirable for the problem at hand. For example, the decision about focusing on specificity vs. sensitivity should be driven by the nature of our problem. The multiple-metrics issue has been widely discussed in the hydrology literature, yet it is still ignored in many water-related studies.

3- Results Interpretation and Extraction of Actionable Insights

Overinterpretation and Generalization Bias

ML models are trained and tested on a specific dataset. When they are used for prediction in real-world applications, they may or may not be as reliable, so we should be careful about how far we extrapolate the fidelity of our model to new datasets.

Confirmation Bias

This bias takes various forms. For example, we are subconsciously drawn toward results that confirm what we already believe to be true. We can also unknowingly lean toward results that are shinier and more groundbreaking, or shape findings to be more interesting for our stakeholders, clients, and the organizations we work for. These are some of the biases that we need to avoid in any type of research, and certainly in ML projects.

Interpretability

ML models can benefit from powerful and sophisticated algorithms that are built on comprehensive lists of features. However, that does not guarantee a decent level of interpretability. In fact, it might make the model and its results harder to understand. I won't get into the details of this here, but there are many approaches that have been used to control model complexity and interpretability (here and here).

Publication Bias

Scientific journals and media outlets are less interested in publishing results that show non-significant causal relationships. In other words, research efforts that find negative results are less likely to get published. This phenomenon affects academic culture and creates pressure to find and submit positive results, and it also leads to the publication of positive results that have been found by chance. Publication bias has been recognized and talked about for a long time. The issue was first introduced by Theodore Sterling (1959), but it continues to harm the products of academia today.

There is another type of bias that frequently happens in publications. There is always a tendency to show that the model developed in the study outperforms popular state-of-the-art algorithms. The issue is that the models developed by the authors often benefit from careful fine-tuning of hyperparameters, while a generic version of the popular model is used in the comparison. This creates an unfair comparison. Refer to this blog post and this publication for more on this type of bias.

Also, ML models can yield a very good response under a specific condition (e.g., a random seed) and fall apart under a different condition (e.g., a different random seed). For example, this problem frequently happens in unsupervised clustering. Running the model with multiple random seeds and starting points is one of the ways to avoid this.

How Can We Address These Limitations?

As discussed in this blog post, ML studies can be negatively affected by various dimensions of these problems, but there are ways to respond to them. For example, biases in data collection can be reduced by promoting diversity in research teams (more on this here and here) or by paying more attention to the design of the experiment. Careful definition of the problem and the scope of the study also seems to be critical. It may be idealistic, but a cultural shift in the academic world toward giving more credit to studies that find negative results would alleviate some of these problems. Finally, effectively using multiple performance metrics, synthetic data, and visual analytics should also benefit ML models.

How do we deal with extreme events and imbalanced datasets in machine learning?

One of the most common machine learning tasks is binary classification, and many of its difficulties arise when we need to identify and classify rare events in our datasets. Binary classification of rare events comes up frequently in the detection of rare diseases, fraudulent financial activities, and defective manufactured products. In water systems research, we can take the detection of flood risk as an example: we might want to identify flood days from a combination of precipitation, snow coverage, temperature, time of year, soil moisture, and many other factors.

Training a machine learning model like this can be very challenging because our historical record might contain only 1% flood days (or far fewer), with the rest being non-flood (normal) days. In this blog post, I go over some of the considerations that need to be taken into account when dealing with rare events, and I discuss some of the techniques that can help us address the issues of imbalanced datasets.

Confusion Matrix

In machine learning, we often use confusion matrices (also known as error matrices) to investigate the performance of classification models. To make this more understandable, I am going to use the example of floods. For example, let’s imagine that our goal is to predict whether each day is a flood day or a non-flood day. We will train a machine learning model to identify flood days. In this case, the predicted label of each day can fall into one of the following categories:

True positive (TP): We correctly classify a flood day as a flood day.

True negative (TN): We correctly classify a non-flood day as a normal day.

False positive (FP): We misclassify a normal day as a flood day.

False negative (FN): We misclassify a flood day as a normal day.

We then count the number of TPs, TNs, FPs, and FNs. After that, we can draw the following table, called a confusion matrix, in order to visualize the outcomes of our binary classification model and use this to better understand the performance of our model.
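
As a minimal sketch with made-up labels for ten days (hypothetical data, only to show the mechanics), the four counts can be tabulated directly in R:

# Hypothetical observed and predicted labels for ten days
observed  <- factor(c("flood", "normal", "normal", "flood", "normal",
                      "normal", "flood", "normal", "normal", "normal"),
                    levels = c("flood", "normal"))
predicted <- factor(c("flood", "normal", "flood", "normal", "normal",
                      "normal", "flood", "normal", "normal", "normal"),
                    levels = c("flood", "normal"))

# Rows are the predicted class, columns are the observed class
confusion_matrix <- table(Predicted = predicted, Observed = observed)
confusion_matrix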

Accuracy of Prediction

One of the most intuitive ways to measure performance is to count how many times our model classification is right and divide it by the total number of events. This is called accuracy and can be calculated from the following equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy Paradox

By definition, extreme and rare events are a small portion of our dataset, and using a single accuracy measure as the performance metric can cause significant bias in our model evaluation. I’ll provide an example to make this clearer. Let’s say that we need to identify a specific fraudulent activity in an online purchase dataset that only happens 0.1% of the time on average; as such, our goal is to create a high-accuracy model to identify these events. Let’s assume that we have a model that classifies all activities as normal activities. In this case, the accuracy of our prediction is 99.9%, which looks quite decent. However, this model misclassifies all of the rare activities that we are looking for. Therefore, we have a very bad model with a very high accuracy. This is called the accuracy paradox. To respond to this paradox, different methods and error metrics have been introduced.

Other Performance Indicators

  1. True Positive Rate (TPR), which is also known as sensitivity or recall: TPR = TP / (TP + FN)

  2. True Negative Rate (TNR), which is also known as specificity: TNR = TN / (TN + FP)

  3. False Positive Rate (FPR): FPR = FP / (FP + TN)

  4. Positive Predictive Value (PPV), which is also known as precision: PPV = TP / (TP + FP)
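
Continuing the minimal confusion-matrix sketch from above, these rates can be computed directly from the four counts:

# Extract the four counts from the confusion matrix built earlier
TP <- confusion_matrix["flood",  "flood"]
FN <- confusion_matrix["normal", "flood"]
FP <- confusion_matrix["flood",  "normal"]
TN <- confusion_matrix["normal", "normal"]

c(accuracy = (TP + TN) / (TP + TN + FP + FN),
  TPR      = TP / (TP + FN),   # sensitivity / recall
  TNR      = TN / (TN + FP),   # specificity
  FPR      = FP / (FP + TN),
  PPV      = TP / (TP + FP))   # precision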

There are many other performance measures that focus on different aspects of model performance, and more information on these binary classification metrics can be found here.

The selection of these metrics depends on the actual problem at hand and the concerns of stakeholders. In some cases, we have to give priority to reducing false negatives. In the flood example, reducing false negatives might improve the reliability of the system for flood protection. In other situations, reducing false positives can be more important, because incorrect flood alarms can trigger flood safety measures, such as reducing the water volume in dams, which can put more pressure on irrigation supply systems.

Therefore, one of the main issues that we need to take into account in classifying rare events is carefully selecting the performance metrics that we use to train our model.

There are other ways to investigate the performance of binary classification models that I am not covering in this blog post, such as developing a Receiver Operating Characteristic (ROC) curve or calculating the area under the ROC curve (AUC). More information about these methods can be found here and here.

Manipulating the Training Dataset

As expected, rare events do not appear in our dataset frequently; however, they can cause important ramifications. When we train our model for an imbalanced dataset that only has 5% TRUE labels, for example, the model tends to learn more about FALSE labels (the majority) and do poorly in identifying TRUE labels. To avoid this, we can modify or resample our training dataset to force the model to focus on the TRUE labels (the minority class). There are various approaches to do that, and I am going to introduce four of these methods here: 1) undersampling; 2) oversampling; 3) SMOTE (Synthetic Minority Over-sampling TEchnique); and 4) increasing the cost of misclassifying the minority class.

Undersampling

Undersampling reduces the size of the majority class in our training dataset in order to balance the minority and majority classes. There are different resampling methods; for example, we can randomly select samples from the majority class and remove them. However, this method can lead to a loss of useful information and reduce our model's performance. There are other, more intelligent approaches that tend to preserve useful information and focus on removing redundant samples. The Condensed Nearest Neighbor Rule (CNN), Tomek Links, and One-Sided Selection are some examples. You can refer to this blog post for more information about resampling techniques.

Oversampling

Another way of modifying the training dataset is to increase the size of the minority class. The most basic way of doing that is to duplicate minority class members until the dataset is no longer imbalanced.
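
As a minimal sketch of both random resampling approaches (on a hypothetical imbalanced training set; the packages listed later in this post provide more careful implementations):

# Hypothetical imbalanced training set: 950 "normal" rows and 50 "flood" rows
set.seed(1)
train <- data.frame(x = rnorm(1000),
                    label = factor(c(rep("normal", 950), rep("flood", 50))))

majority <- train[train$label == "normal", ]
minority <- train[train$label == "flood", ]

# Random undersampling: keep only as many majority rows as there are minority rows
under_sampled <- rbind(minority, majority[sample(nrow(majority), nrow(minority)), ])
table(under_sampled$label)

# Random oversampling: duplicate minority rows (sampling with replacement) to match the majority
over_sampled <- rbind(majority, minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])
table(over_sampled$label)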

SMOTE

SMOTE (Chawla et al., 2002) is an oversampling technique that synthetically generates new samples from the original minority group. Here is how the SMOTE algorithm works: 1) it selects a random minority sample; 2) it finds the k nearest minority neighbors of that point and randomly chooses one of them; 3) it computes the difference vector between the two points; 4) it generates a random number between 0 and 1; 5) it multiplies the difference vector by that random number and adds the result to the original sample, creating a synthetic point along the line segment between the two; and 6) it repeats this process until the minority class reaches the desired size. There are some other variants of the basic SMOTE technique, which this blog post discusses. We can also use a combination of oversampling and undersampling to achieve a better result.
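
As a minimal sketch of the interpolation step described above (not a full SMOTE implementation; the DMwR package listed later in this post provides a complete version), using two iris columns of one species as a stand-in minority class:

# Stand-in minority class: two feature columns of one iris species
minority_x  <- as.matrix(iris[iris$Species == "setosa", 1:2])
k           <- 5    # number of nearest neighbours to consider
n_synthetic <- 20   # number of synthetic samples to generate

set.seed(1)
synthetic <- t(sapply(1:n_synthetic, function(i) {
  idx   <- sample(nrow(minority_x), 1)                # pick a random minority sample
  point <- minority_x[idx, ]
  dists <- sqrt(rowSums((minority_x - matrix(point, nrow(minority_x), 2, byrow = TRUE))^2))
  nbrs  <- order(dists)[2:(k + 1)]                    # its k nearest minority neighbours
  nbr   <- minority_x[sample(nbrs, 1), ]              # randomly choose one of them
  point + runif(1) * (nbr - point)                    # interpolate along the connecting segment
}))
head(synthetic)   # synthetic minority samples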

Penalized Models

Another method to reduce the impact of imbalanced datasets is to increase the cost of misclassifying the minority class. To do that, we can assign a weight to each training sample: a low value for the majority class (e.g., 1) and a much greater value for the minority class (e.g., 100). This forces the model to pay more attention to the minority class in order to reduce the training error.
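
As a minimal sketch of this idea (reusing the hypothetical imbalanced training set from the resampling sketch above), the weights argument of glm makes each minority-class error count as much as 100 majority-class errors; many other learners expose similar class-weight options:

# Cost-sensitive logistic regression: minority ("flood") samples get a much larger weight
train$is_flood <- as.numeric(train$label == "flood")
obs_weights    <- ifelse(train$label == "flood", 100, 1)

weighted_fit <- glm(is_flood ~ x, data = train, family = binomial, weights = obs_weights)
summary(weighted_fit)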

Software Packages for Manipulation of Imbalanced Datasets

Here, I introduce some of the Python and R libraries that have been developed to address the problems that arise when dealing with imbalanced datasets.

Python:

  1. scikit-learn offers the confusion_matrix, roc_curve, and auc functions for building confusion matrices and performing ROC/AUC analysis.
  2. imbalanced-learn offers several over- and undersampling techniques as well as combined methods.

R:

  1. pROC can be used to calculate ROC curves and AUC.
  2. ROSE performs simple over-, under-, and combined sampling.
  3. DMwR performs SMOTE.

Taylor Diagram

The evaluation of model performance is a widely discussed and yet controversial topic in hydrology, atmospheric science, and environmental studies. Generally, it is not straightforward to quantify the accuracy of tools that simulate complex systems. One reason is that these systems usually have various sub-systems; a model of a complex river system, for example, will simulate regulated streamflow, water allocation, dam operations, and so on. This implies that there are multiple goodness-of-fit objectives that cannot be fully optimized simultaneously, so there will always be tradeoffs in the accuracy of the system representation.

The other reason is that these complex systems are often state-dependent, and their behavior changes non-linearly as the system evolves. Finally, there are variables in the system that have several properties, and it is usually impossible to improve all of those properties at the same time (Gupta et al., 1998). Take streamflow as an example. In calculating model accuracy, we can focus on the daily fluctuation of streamflow or look into weekly, monthly, or annual changes. We can also concentrate on the seasonality of streamflow, extreme low and high flows, and the persistence of these extreme events. Therefore, our results are not fully defensible if we only focus on one performance metric and make inferences that ignore many other components of the system.

In this blog post, I am going to explain a visualization technique that allows us to draw three or more metrics of the system on the same plot: the Taylor Diagram. The Taylor Diagram is not really a solution to the problem that I explained, but it does provide a systematic and mathematically sound way of demonstrating a few popular goodness-of-fit measures. The original version of the Taylor Diagram includes three performance metrics: i) the standard deviation of the simulated vs. observed series; ii) the correlation coefficient between observed and simulated; and iii) the centered root mean square error (CRMSE). There are other versions of the Taylor Diagram that include more than these three metrics (e.g., here). The Taylor Diagram was developed by Karl E. Taylor (original paper) and has been frequently used by meteorologists and atmospheric scientists. In this blog post, I focus on streamflow.

Underlying Mathematical Relationships

As mentioned above, the Taylor Diagram draws the following three statistical metrics on a single plot: standard deviation, CRMSE, and correlation. The equation below relates these three:

Equation 1:

CRMSE^2 = \sigma_o^2 + \sigma_s^2 - 2\,\sigma_o\,\sigma_s\,R

In this equation, \sigma_o and \sigma_s are the standard deviations of the observed and simulated time series, and R is the correlation coefficient between the two series.

This relationship can be easily derived using the definition of CRMSE, which is:

Equation 2:

CRMSE^2 = \frac{1}{N}\sum_{i=1}^{N}\left[(s_i-\bar{s})-(o_i-\bar{o})\right]^2

In this equation, s_i and o_i are the simulated and observed values at time step i, \bar{s} and \bar{o} are their respective means, and N is the number of time steps.

We can expand this equation to the following:

Equation 3:

CRMSE^2 = \sigma_s^2 + \sigma_o^2 - \frac{2}{N}\sum_{i=1}^{N}(s_i-\bar{s})(o_i-\bar{o})

The first two components of this equation are the variances (squared standard deviations) of the simulated and observed time series. To understand the third component, we need to recall the mathematical definition of the correlation coefficient.

Equation 4:

R = \frac{\frac{1}{N}\sum_{i=1}^{N}(s_i-\bar{s})(o_i-\bar{o})}{\sigma_s\,\sigma_o}

If we multiply both sides of Equation 4 by two times the standard deviations of the observed and simulated series, we see that 2\sigma_s\sigma_o R equals the summation term in the third component of Equation 3; substituting it back yields Equation 1. The Taylor Diagram uses polar coordinates to visualize all of these components. Readers can refer to the original paper for more discussion and the trigonometric reasoning behind this form of plot.
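
As a quick numerical sanity check of Equation 1, here is a minimal R sketch with synthetic series (using the population form of the standard deviation, so the identity holds exactly):

# Synthetic observed and simulated series
set.seed(1)
obs <- rnorm(1000)
sim <- obs + rnorm(1000, sd = 0.5)

sd_o  <- sqrt(mean((obs - mean(obs))^2))   # population standard deviation of observed
sd_s  <- sqrt(mean((sim - mean(sim))^2))   # population standard deviation of simulated
R     <- cor(obs, sim)                     # correlation coefficient
crmse <- sqrt(mean(((sim - mean(sim)) - (obs - mean(obs)))^2))

crmse^2
sd_o^2 + sd_s^2 - 2 * sd_o * sd_s * R      # matches crmse^2, as in Equation 1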

Code Example

In this blog post, I am presenting a simple R package that can be used to create Taylor Diagrams. In this example, I use the same streamflow datasets that I explained in the previous blog post. First, you need to download the time series from GitHub and use the following commands to read these two time series into R.

Observed_Arrow<-read.table("~/Taylor Diagram/Arrow_observed.txt", header = T)
Simulated_Arrow<-read.table("~/Taylor Diagram/Arrow_simulated.txt", header = T)

In this post, I use the “openair” R library, which was originally developed for atmospheric chemistry data analysis. In addition to the Taylor Diagram, the package provides many helpful visualization options that can be accessed from here. Note that I was able to find at least one more R package that can be used to create Taylor Diagrams (i.e., plotrix). There are also Python and MATLAB libraries that can be used to create Taylor Diagrams.

You can use the following to install the “openair” package and activate the library.

install.packages("openair")
library(openair)

The following command can be used to create a Taylor Diagram.

combined_dataset<-cbind(Observed_Arrow[ 18263:length(Observed_Arrow[,1]),], Arrow_sim=Simulated_Arrow[1:10592,4])

TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim")

Interpretation of the Taylor Diagram

The Taylor Diagram marks the baseline observed point, where the correlation is 1 and the CRMSE is 0 (shown in purple). If the simulation point is close to the observed point, the two series are similar in terms of standard deviation, their correlation is high, and their CRMSE is close to zero. There is also a black dashed line that represents the standard deviation of the observed time series; if the red dot lies beyond it, the simulated dataset has higher variability than the observations. The other helpful information we gain from the Taylor Diagram is the correlation between observed and simulated values: higher correlation indicates better agreement, and the correlation decreases as a point moves away from the horizontal axis toward larger azimuthal angles. The centered root mean square error also reflects the quality of the simulation, although it puts more weight on outliers; the golden contour lines on the polar plot show the value of CRMSE. On this figure, the red point has a higher standard deviation than the observations, the correlation is around 0.9, and the CRMSE is close to 20,000.

The package also allows us to look at the performance of the model during different months or years, and we can compare different simulation scenarios with the observed data. Here, I use the function to draw a point for each month. The “group” argument can be used to do that.

TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim", group = "Month")

The function also provides a simple way to break down the data into different plots for other attributes. For example, I created four panels for four different annual periods:

TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim", group = "Month", type = "Year")

Apply Functions in R and Python

In this post, I will go over some simple tools which can be used to create more efficient and concise R and Python scripts. First, I will explain the apply function in R and Python. Then, I will briefly go over anonymous functions in Python.

The Apply Function in R

The apply function is used to manipulate data frames, matrices, and lists. It takes a data frame and a function as inputs, and applies that function to each row or column of the data frame. In essence, the apply function is an alternative to “for” loops. 

The apply function has three main inputs: a data object, a margin variable, and a function. As mentioned earlier, the data object can have different formats. The margin variable specifies whether the function applies to rows (MARGIN=1) or columns (MARGIN=2). The function can either be a built-in R function (e.g., sum or max) or a function that the user defines, and it can be defined either inside or outside the apply call.

Example Problem

Here I will define a simple problem as our test case. The task is to find the maximum of each column and divide all the elements of that column by the maximum. We will use the iris data set, because it is available in R and in Python’s seaborn package.

# Load the iris data set
data(iris)

# Assign first four columns of the iris data set to a data frame
iris_df<-as.data.frame(iris[,1:4])

# Use the apply function to do the calculations of the example problem
output_max=as.data.frame(apply(iris_df, MARGIN = 2, FUN = function (x) x/max(x)))

Sometimes there are other, easier ways to do these calculations. However, when what you want to do is more complicated, this method comes in handy. The apply function has some other variants such as lapply, sapply, and mapply. Refer to this post (here) for more information about these functions.

The Apply Function in Python

The pandas package for Python also has a function called apply, which is equivalent to its R counterpart; the following code illustrates how to use it. In pandas, axis=0 applies the function to each column and axis=1 applies it to each row. Note that in this example I have defined a function outside of apply and passed it in to calculate the maximum and the ratio-to-maximum. In the next section, I will present an alternative way of defining in-line functions in Python.

# The iris data set is available in the seaborn package in python
import seaborn as sns
import pandas

# The following script loads the iris data set into a data frame
iris = sns.load_dataset('iris')

# Define an external function to calculate the ratio-to-maximum 
def ratio_to_max (data):
    maximum=max(data)
    print(maximum)
    ratio=data/maximum
    return ratio

# Use the built-in apply function in Python to calculate the ratio-to-maximum for all columns
output_df=iris.iloc[:,0:4].apply(ratio_to_max, axis=0)


Anonymous Functions in Python

Python provides an easy alternative to external functions like the one used above. This method is called an anonymous or “lambda” function. A lambda is a tool to conduct a specific task on a data object, similar to a regular function; however, it can be defined within other functions and doesn’t need to be assigned a name. Therefore, in many cases, lambdas offer a cleaner and more efficient alternative to regular functions. A history of the lambda function can be found in this post (here), which also provides a comprehensive list of lambda’s functionalities. Here is an example of the lambda function used instead of the regular function defined before:

# The iris data set is available in the seaborn package in python
import seaborn as sns
import pandas

# The following script loads the iris data set into a data frame
iris = sns.load_dataset('iris')

# Here we use lambda to create an anonymous function and use that within panda's apply function 
output_df=iris.iloc[:,0:4].apply(lambda x:x/max(x), axis=0)

Note that R also supports anonymous functions (the function defined inline within the apply call earlier is one example), although it does not have a lambda keyword. Also, there are other widely used Python built-in functions that work nicely with lambdas; for example, the map, filter, and reduce functions can take advantage of lambda's simplicity in complex data mining tasks. You can refer to here and here for more information about these functions.

Introducing the R Package “biascorrection”

For a variety of reasons, we need hydrological models for our short- and long-term predictions and planning. However, it is no secret that these models always suffer from some degree of bias. This bias can stem from many different and often interacting sources, such as biases in underlying model assumptions, missing processes, model and calibration parameters, and imperfections in input data (Beven and Binley, 1992).

The question of how to use models, given all these uncertainties, has been an active area of research for at least 50 years and will probably remain so for the foreseeable future, but going through that is not the focus of this blog post.

In this post, I explain a technique called bias correction that is frequently used in an attempt to improve model predictions, and I introduce an R package for bias correction that I recently developed, called "biascorrection." Although most of the examples in this post are about hydrological models, the arguments and the R package might be useful in other disciplines, such as atmospheric modeling, which has been one of the hotspots of bias correction applications (here, here, and here). The reason is that the algorithm follows a series of simple mathematical procedures that can be applied to other questions and research areas.

What Do I Mean by Bias Correction?

Bias correction in the context of hydrological modeling usually means “manual” improvement of model simulations to ensure that they match the observed data. Here’s a simple example: Let’s say we know that our model tends to underestimate summer flow by 30%. Then we might reason that to improve our predictions, we could add 30% to our simulated summer flows. In this blog post, I use a more sophisticated method, but this example should give you an idea of what I’m talking about.

Bias Correction using Quantile Mapping

Quantile mapping is a popular bias correction method that has been used in various applications. In this method, we first create quantiles of observed and simulated data. After that, whenever we have a simulated value we can find its simulated quantile and replace it with the value of the closest observed quantile.

Figure 1 – A simplified workflow of the quantile mapping technique

We generally follow these steps to do the bias correction (see here):

  • Monthly Quantiles

We need to have two monthly time series of observed and simulated streamflow. If both series use daily time steps, we must aggregate the daily values to monthly values first, to create the monthly quantiles. We then sort the observed and simulated streamflows and assign each value to a quantile. This process generates streamflow quantiles for each month (January, February, etc.).

  • Monthly Bias Correction

When the quantiles are ready, we can start from the first month of the simulated results and determine which quantile its value belongs to. Then we can go to the same quantile of the observed values and use that value instead of the simulated one. This creates a monthly bias-corrected streamflow series (a minimal code sketch of this mapping step follows this list).

  • Annual Adjustment

Hamlet and Lettenmaier (1999; also here) argue that monthly bias correction can dramatically change the magnitude of annual streamflow predictions. Although hydrologic models usually perform relatively poorly at monthly time steps, they are pretty good at capturing annual variations, so we tend to rely on their annual totals. We calculate the annual difference between the bias-corrected and simulated flows and then apply that difference to each individual month. This way, the monthly variations stay consistent with the recorded streamflow while the model still determines what the average annual flow looks like.

  • Disaggregation to Daily

After we apply the annual adjustments, we can use simulated or historical observed values to disaggregate the monthly time series to a daily one. The "biascorrection" package provides two methods for doing that: (1) rescaling the raw simulated daily time series to match the monthly bias-corrected values (for example, if the total simulated streamflow for a month is half of the bias-corrected value for that month, the disaggregation module multiplies the raw daily simulated values by two for that specific month); and (2) sampling from the daily observed historical time series, in which case the model uses KNN to find some of the closest months in the historical period (in terms of average monthly values) and randomly selects one of them.
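
As a minimal sketch of the monthly quantile-mapping step described in the list above (for a single month, with made-up flows; the biascorrection package wraps this logic for full time series):

# Hypothetical observed and simulated flows for one calendar month across many years
set.seed(1)
observed_jan  <- rlnorm(50, meanlog = 5.0, sdlog = 0.4)
simulated_jan <- rlnorm(50, meanlog = 4.7, sdlog = 0.6)   # biased simulation

# Empirical quantiles of the observed flows for this month
probs <- seq(0.01, 0.99, by = 0.01)
q_obs <- quantile(observed_jan, probs)

# Find the quantile of each simulated value in the simulated distribution,
# then replace it with the observed value at the same quantile
sim_probs      <- ecdf(simulated_jan)(simulated_jan)
sim_probs      <- pmin(pmax(sim_probs, 0.01), 0.99)       # keep probabilities inside the quantile grid
bias_corrected <- approx(x = probs, y = q_obs, xout = sim_probs)$y

summary(simulated_jan)
summary(bias_corrected)   # now resembles the observed distribution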

In some situations, in large river basins, several upstream stations contribute to a downstream station. Bias correction can add or remove water from the system, which can cause spatial inconsistencies and water budget problems. To ensure that the basin-wide water balance isn't violated in these cases, we start from the furthest downstream station and move upstream, making sure that the total water generated upstream of each station and the incremental flow between stations sum to the same total downstream flow. This step is not included in the package.


In climate change studies, if the historical (simulated and observed) quantiles do not differ widely from the projected future scenarios, we can use the historical quantiles. However, if the future values are fundamentally different from the historical time series, this might not be justifiable; in such a case, synthetic generation of future data may be the way to go. The R package doesn't include this either.

There are just two quick disclaimers:

  1. The R package has been tested and seems to perform well, but its complete soundness is not guaranteed.
  2. This blog post and the R package only introduce this bias correction as a common practice; I do not endorse it as a remedy for all the problems of hydrological models. Readers should keep in mind that there are serious criticisms of bias correction (for example here). I will discuss some in the following sections.

Arguments Against Bias Correction

One of the main advantages of hydrological models is that they can simultaneously and, arguably, consistently simulate different water and energy balance processes. However, bias correction manually perturbs streamflow without taking into account its effects on other components of the system. Therefore, it takes away one of the main advantages of hydrological models. In some cases, this can even distort climate change signals.

The other problem is that bias correction tries only to match the overall, aggregate statistics of the simulated flow with the observed flow, although streamflow has many more attributes. For example, current bias-correction algorithms can systematically ignore extremes that occur in daily to weekly time steps.

The “biascorrection” Package

I recently developed an R package that can be used for bias correction in simulated streamflow. The package has four main functions. Its workflow is consistent with the four bias-correction steps described above: (1) quantile creation, (2) monthly quantile mapping, (3) annual adjustment, (4) disaggregation to daily. The package doesn’t have a prescribed unit, and it can be used in different applications that require bias correction.

How to Install the "biascorrection" Package

The package is available on GitHub (here) and it can be installed using the following command:

devtools::install_github("keyvan-malek/biascorrection")

You can also go to my "blog" folder on GitHub to download simulated and observed streamflow datasets at the inlet of the Arrow dam in British Columbia (also known as the Keenleyside Dam). The simulated flow was generated using the Variable Infiltration Capacity (VIC) model, and the observed flow comes from the Bonneville Power Administration's No Regulation-No Irrigation streamflow datasets.

observed_input<-read.table("sample_data/Arrow_observed.txt", header = T)
simulated_input<-read.table("sample_data/Arrow_simulated.txt", header = T)

Note that the two datasets have different starting and ending dates. I intentionally used them to show how the biascorrection package handles these types of datasets.

I plotted the overlapping period of the two datasets to demonstrate the difference between them. The simulated dataset tends to underestimate streamflow during low-flow periods and overestimate it during high-flow periods.

Figure 2 – Observed vs simulated inflow to Arrow dam

Monthly Bias-Corrections

To run the monthly bias-correction function, you first need to define the starting and ending dates of your observed and simulated data frames:

start_date_observed<-"1929-01-01"
end_date_observed<-"2007-12-31"
start_date_simulated<-"1979-01-01"
end_date_simulated<-"2015-12-31"

You also need to define the following two conditions:

time_step<-"day"
date_type<-"JY" ## Water year (WY) or Julian Year (JY)

Finally, we can use the following to calculate the monthly bias corrected flow:

observed_flow=observed_input$ARROW_obs
simulated_flow=simulated_input$ARROW_sim

df_bc_month<-bias_correct_monthly(observed_flow, simulated_flow, start_date_observed, end_date_observed, start_date_simulated, end_date_simulated, time_step, date_type)

Note that the observed and simulated inputs to the "bias_correct_monthly" function are numeric vectors, not data frames. Here is how bias correction changes the simulated streamflow: as you can see in the following figure, the bias-corrected flow no longer shows the underestimation problem during low-flow periods.

Figure 3- Monthly VIC simulated flow vs. bias-corrected flow

Annual Adjustment

You can use the following command to bring the average annual flow back to what the model originally simulated. Note that the inputs to this function are outputs of the monthly function.

df_bc_annual<-bias_correct_annual(df_bc_month$simulated, df_bc_month$bias_corrected, start_date_simulated, end_date_simulated)

Daily Disaggregation

The following function first runs the monthly function, then applies the annual adjustment, and finally disaggregates the monthly streamflow to daily. The package has two options for converting monthly to daily values: (1) it can multiply the simulated streamflow of each month by its bias-correction coefficient, or (2) it can use the K-nearest neighbors method to find the closest month in the observed record.

disaggregation_method<-"scaling_coeff" # "scaling_coeff" or "knn"

df_bc_daily<-bias_correct_daily(observed_flow, simulated_flow, start_date_observed, end_date_observed, start_date_simulated, end_date_simulated, time_step, date_type, disaggregation_method)

Here is how the entire bias-correction procedure affects the simulated streamflow:

Figure 4 – Simulated vs. bias-corrected flow after annual adjustment and daily disaggregation

Known Issues and Future Plans

  • Currently, the biascorrection package only accepts complete years. For example, if your year starts in January it has to end in December; it cannot continue to, say, the following February.
  • I am thinking about adding more functionality to the biascorrection package in a few months. Some built-in post-processing options, built-in example datasets, and at least one more bias-correction technique are among the options.
  • For the next version, I am also thinking about releasing the model through CRAN.

Visualizing High Dimensional Data Using the T-SNE Method

In this blog post, I am going to go over a machine learning method that is used to visualize high dimensional datasets. The method is called T-distributed Stochastic Neighbor Embedding, or T-SNE for short, and was developed by Laurens van der Maaten and Geoffrey Hinton (van der Maaten & Hinton, 2008). In this blog post, I describe T-SNE and provide a simple example in R.

Two weeks ago, Dave introduced a few useful dimensionality reduction methods to visualize the Pareto front. Last week, Rohini’s blog post provided some very useful information about self-organizing maps. I guess this blog post will officially make this a series of posts on dimensionality reduction and the visualization of high dimensional datasets. If you haven’t already read these two super informative blog posts, I highly recommend you take a look at them.

T-SNE is an extension of the Stochastic Neighbor Embedding (SNE) technique, and it uses an interesting approach to maintain the local relationships and similarities that exist in the original high-dimensional dataset when projecting it onto a 2-D plane. As van der Maaten and Hinton (2008) argue, other dimensionality reduction techniques (e.g., PCA) usually focus on large-scale distances in the dataset, which can lead to overlooking short-distance relationships. T-SNE, however, is able to capture both local and large-scale structure, so it can give more reasonable results when you are dealing with non-linear, high-dimensional datasets.

T-SNE uses the following formula, which basically places a Gaussian distribution over each data point and computes the density of all other data points with respect to the distribution around that point; the denominator of the equation normalizes the numerator:

p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}

In this equation, x_i is the reference point and x_j is one of its neighboring points. \sigma_i is the bandwidth, chosen for each point so that every point has the same perplexity. Perplexity is a measure of uncertainty that has a direct relationship with entropy; for more information, you can read this Wikipedia page. Perplexity is a hyperparameter of T-SNE, and the final outcome can be very sensitive to its value. The pairwise similarity p_{ij} is obtained by symmetrizing these conditional probabilities (averaging p_{j|i} and p_{i|j}): if two points are close to each other in the high dimensional space, p_{ij} is high, and if they are far apart, the probability of similarity is low.

To make the low dimensional map consistent with its original high dimensional dataset, T-SNE calculates an analogous point-to-point probability of similarity in the 2-D map:

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

where y_i and y_j are the locations of the points on the 2-D map.

T-SNE then moves the points around the 2-D plane to find the locations that minimize the Kullback-Leibler divergence between the P distribution of the original data and the Q distribution of the 2-D map.

As you probably noticed, in the low dimensional space, T-SNE uses the Student T-distribution curve (with one degree of freedom). The reason is that Student T (compared to Gaussian) is a fat-tailed distribution. This allows the T-SNE to separate dissimilar points with higher margins, which makes the map easier to interpret.

T-SNE Example in R

Here, I provide a simple R code example that demonstrates how you can use T-SNE to visualize high dimensional datasets. This example only has four dimensions, so it doesn’t really represent a high dimensional problem. As such, keep in mind that T-SNE shows its true capabilities when there are tens or even hundreds of dimensions.

First, you need to install the "Rtsne" library and load it.

install.packages("Rtsne")
library(Rtsne)

# We also need to load ggplot2 and ggpubr libraries

library(ggplot2)
library(ggpubr)

In this example, I use R’s famous “iris” data set. Here is how you can activate it.

# Load the iris dataset
data(iris)

Now you can use the following script to visualize the data using T-SNE.

# Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=30, max_iter=1000, check_duplicates = FALSE) # Run TSNE

# Create a data frame using IRIS values and the T-SNE values
df_to_gg<-as.data.frame(cbind(iris$Species, as.data.frame(y_tsne$Y)))

# Specify column names
names(df_to_gg)<-c("Species", "Y.1", "Y.2")

# Show the objects in the 2D T-SNE representation
ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
  labs(title=paste("T-SNE Visualization- Perplexity Number = 30" )) +
  scale_color_manual(values = c("darkblue", "orange", "pink3"))

Here is the 2-D map that our code generates.

Although T-SNE is a powerful method and has become quite popular in recent years, it might have some pitfalls. This blog post discusses some of these issues, including the sensitivity of the final map to perplexity hyperparameters or the fact that it can be tricky to interpret T-SNE maps because they might not represent the actual relationships between different clusters.

To explore the impacts of the perplexity hyperparameter on the final clusters, I use the following code to create T-SNE maps for different perplexity values.

# Here are the perplexity values that I am taking into account in my T-SNE analysis
perplexity_number_values<-c(2,5,10, 15, 30, 40)

# Initializing a plot list object
plot_list = list()

# This loop explores the effect of perplexity values on the T-SNE results
for (i_prp in 1:6){
  
  # I am setting a seed to make sure that my results for different perplexities are not sensitive to any random factors
  set.seed(10)
  
  # Perplexity number
  perplexity_number<- perplexity_number_values[i_prp]
  
  # Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
  y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=perplexity_number, max_iter=1000, check_duplicates = FALSE) # Run TSNE
  
  # Create a data frame using IRIS values and the T-SNE values
  df_to_gg<-as.data.frame(cbind(iris$Species, as.data.frame(y_tsne$Y)))
  
  # Specify Column names
  names(df_to_gg)<-c("Species", "Y.1", "Y.2")
  
  # Show the objects in the 2D T-SNE representation
  plt<-ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
    labs(title=paste("T-SNE Visualization- Perplexity Number = ", perplexity_number )) +
    scale_color_manual(values = c("darkblue", "orange", "pink3"))
  
  # Save figure in the plot list object
  plot_list[[i_prp]]=plt
  
  
}
# Combine all the plot objects
plt_combined<-ggarrange(plotlist = plot_list[1:6], ncol = 2, nrow = 3, common.legend = T)

Here is the final result, which underlines how sensitive the location of the clusters on the 2-D map is to the choice of perplexity value. There are other hyperparameters that can be tuned as well; for example, the maximum number of iterations might change the final output in some cases.

There are many other very good tutorials on T-SNE, such as here and here. Also, for example, this tutorial provides a nice and easy-to-follow Python code for T-SNE. Finally, there are some great YouTube videos that clearly explain the logic behind T-SNE and its algorithm (for example, here and here).

Causal Inference Using Information-Theoretic Approaches

In my previous blog post, I explained information theory. I also talked about the application of information theory in sensitivity analysis. In this blog post, I briefly explain how information theory can be used in causal inference of time-series data.

First, I'll offer a little perspective on causal analysis. We're often interested in knowing what factors can cause a specific phenomenon. For example, let's say that we want to understand what environmental conditions can cause a significant flood event in a specific region, and let's assume that we have time series of flood occurrence, precipitation, soil moisture, snow cover in the upstream mountains, and temperature. There are many ways this problem can be addressed: we can use regression analysis and correlation, principal component analysis, or variance-based approaches to find out which conditions best explain the flood events. However, these methods can potentially overlook many of the relationships that might exist in complex systems. Although they may give you a statistically meaningful equation, they cannot provide any causal insights. For example, there could be a strong relationship between the flood and precipitation two weeks prior to the flood event, or there might even be a relationship between winter snowfall and flooding. Information-theoretic causal analysis provides an alternative approach to exploring questions like these. Such methods have been used in many research areas, including environmental science (Rouge et al., 2019), neuroscience (Tononi et al., 2016), and economics (Yao and Li 2020).

Readers of this blog post can refer to these papers (Rouge et al., 2019, Goodwell et al., 2020, and Schreiber 2000) for further details about the basics and applications of information-theoretic causal analysis in different areas of science.

There are two schools of thought in these causal analyses (Goodwell et al., 2020): (1) Pearl causality and (2) Granger causality. Pearl causality was introduced by Judea Pearl (here) and is based on the idea that interventions in a complex system provide information that can potentially lead to causal inference. Granger causality (here), in contrast, focuses on exploring the transfer of information from past to current states of the system. This blog post concentrates on Granger causality.

Many methods have been developed and used to study how factors can Granger-cause other variables. The simplest, which I discussed in the previous blog post, is mutual information. It is a pairwise causal-analysis method that basically tells us which lag times of variable X provide information about the current state of variable Y; in other words, which time lags of variable X can improve the prediction of variable Y. However, mutual information alone is not informative enough for problems that involve multiple variables and variables that depend on their own previous time steps, conditions that usually exist in the real world. Partial information decomposition provides a way to separate the unique information that each variable provides from the redundant and synergistic information shared among variables. Transfer entropy (Schreiber 2000), on the other hand, is a pairwise measure that also accounts for the information carried by the previous time step of the target variable itself. In mathematical language:

T_{X \rightarrow Y}(\tau) = \sum p\left(y_{t+1}, y_t, x_{t-\tau}\right) \log \frac{p\left(y_{t+1} \mid y_t, x_{t-\tau}\right)}{p\left(y_{t+1} \mid y_t\right)}

where Y and X are time series, t denotes the time step, and τ is the time lag.

Example

In this example, I am using an R package called "RTransferEntropy" that calculates transfer entropy. There is also a Python package called "TIGRAMITE" that has been developed for causal analysis of time-series data. The following will install and load the R library.

install.packages("RTransferEntropy")
library(RTransferEntropy)

Then, you can use the following to load an R dataset that includes information about European stock markets. We will use this dataset to see whether the past states of two stock indices, the FTSE and the SMI, are systematically related.

data(EuStockMarkets)

The following code can be used to calculate the transfer entropy:

# convert the EuStockMarkets time-series object to a data frame
data_TE <- as.data.frame(EuStockMarkets)

# estimate transfer entropy between the SMI and FTSE indices
TE <- transfer_entropy(x = data_TE$SMI, y = data_TE$FTSE,
                       lx = 1, ly = 1, q = 0.1,
                       entropy = c('Shannon', 'Renyi'), shuffles = 100,
                       type = c('quantiles', 'bins', 'limits'),
                       quantiles = c(5, 95), bins = NULL, limits = NULL,
                       nboot = 300, burn = 50, quiet = FALSE, seed = NULL)
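Printing the returned object summarizes the estimated transfer entropy in both directions, along with the bootstrap-based significance measures:

# summarize the estimated transfer entropy in both directions
print(TE)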

And here is the output that I got:

The results show that there is a significant relationship between the two variables (in both directions), as the following figure also indicates.

Finally, studies have also discussed potential shortcomings of causal inference with these methods. For example, James et al. (2016) reported poor reliability of these methods and suggested network science–oriented analysis as an alternative for causal inference.

Information Theory and Moment-Independent Sensitivity Indices

In this post, I will go over the concept of information theory and its relevance to sensitivity analysis. Then, I will talk about two other methods that are closely related to the concept of information theory. These methods are also often categorized as moment-independent sensitivity analysis methods.

What are the advantages of moment-independent sensitivity analysis indices?

Many methods exist for sensitivity analysis: for example, variance-based methods, such as the one proposed by Sobol (1990), quantify how much of the variance of the output can be explained by each input variable (Saltelli et al., 2008). They have been broadly implemented, and numerous studies show their capabilities. However, as Auder and Iooss (2008) and Zadeh et al., (2017) explain, because these approaches describe output uncertainty through a single moment (the variance), variance-based methods can be inefficient for heavy-tailed distributions or when the outputs tend to have outliers. They also become less reliable when they encounter multi-modal distributions. To respond to these issues, moment-independent methods have been developed that are based on PDFs and CDFs. In this blog post, I go over some of them.
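For reference, the first-order Sobol index of an input X_i is simply the share of the output variance that X_i alone explains:

S_i = \frac{V_{X_i}\left( E_{X_{\sim i}}\left[ Y \mid X_i \right] \right)}{V(Y)}

Because this ratio is built entirely on variances, features such as heavy tails or multiple modes in the output distribution are invisible to it, which is exactly the gap the moment-independent indices below try to fill.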

But before I introduce these methods, I would like to mention that there are studies showing that these moment-independent methods can be outperformed by other sensitivity analysis methods (e.g., variance-based Sobol). For example, Puy et al., (2020) show that a moment-independent method (i.e., PAWN, which is explained below) does not necessarily produce better answers. Also, Auder and Iooss (2008) discuss how, at least in some cases, moment-independent methods can significantly increase computational costs.

Entropy-Based Methods

Shannon, the founder of information theory, introduced the concept of information entropy in his famous 1948 paper, A Mathematical Theory of Communication. To put it simply, entropy is the opposite of information: higher entropy indicates higher uncertainty. In the last several decades, the concept of entropy has found its way into many scientific areas, including economics, statistics, engineering, environmental science, and machine learning.

A famous example that can help to explain the concept of entropy is coin tossing. Let’s say that you have two coins. One of them is a fair coin, meaning that the probabilities of getting heads or tails are equal (P=0.5). The other coin is not fair, as the probability of getting heads (P=0.8) is higher than that of getting tails (P=0.2). Now, if you flip one of these coins, which one has the higher uncertainty? The answer is the fair coin, which has the higher entropy. Entropy has also been described as a measure of surprise: in our coin example, the coin with the higher entropy has a higher chance of surprising you. The following sections explain some of the basic concepts related to information theory.

Entropy

The following formula can be used to calculate entropy:
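In its standard discrete (Shannon) form, for a random variable X with probability mass function p(x),

H(X) = -\sum_{x} p(x) \log p(x)

Using base-2 logarithms, the fair coin in the example above has H = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit, while the unfair coin has H = −(0.8 log2 0.8 + 0.2 log2 0.2) ≈ 0.72 bits, which is why the fair coin is the more uncertain one.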

Conditional Entropy

Another important concept in information theory is conditional entropy. Conditional entropy is the amount of information needed to describe a random variable Y when a random variable X is known. Intuitively, a low conditional entropy implies a higher dependence of random variable Y on random variable X. The following formula can be used to calculate conditional entropy:
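In the discrete case,

H(Y \mid X) = -\sum_{x} \sum_{y} p(x, y) \log p(y \mid x)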

Mutual Information

Mutual information is another concept that is closely related to entropy; it quantifies how much information about Y can be gained when X is known. The following shows how mutual information is related to entropy and conditional entropy:
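One common way to write this relationship is

I(X; Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)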

Entropy-Based Sensitivity Analysis

The concept of entropy has been used in sensitivity analysis. There are two popular entropy-based indices that have been used for sensitivity analysis: i) Krzykacz-Hausmann (2001) and ii) Liu et al., (2006). Both of them are based on the analysis of the PDFs of inputs and outputs. In this blog post, I will only explain the first one. Liu et al. adapted the Kullback-Leibler (K-L) relative entropy concept, which is conceptually similar but has a more complex mathematical formulation that can be solved using Monte Carlo sampling. More information on K-L sensitivity methods can be found in Liu et al., (2006) and Auder and Iooss (2008).

Krzykacz-Hausmann Sensitivity Index

These authors use the direct definitions from Shannon’s information theory and the following formula to calculate their index:
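In terms of the quantities defined above, the index is usually written as the mutual information between an input X and the output Y normalized by the output entropy (the exact normalization varies slightly across papers):

\eta = \frac{I(X; Y)}{H(Y)} = \frac{H(Y) - H(Y \mid X)}{H(Y)}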

Higher values of η indicate a higher sensitivity of the random variable Y to the random variable X.

Moment-Independent Methods

There are some other moment-independent methods that have been widely used in recent years, and I am including two of them here: i) PAWN and ii) the δ moment-independent method.

PAWN

PAWN is another moment-independent sensitivity analysis metric, developed by Pianosi and Wagener (2015). The main difference between PAWN and other moment-independent approaches is that PAWN works with differences between cumulative distribution functions (CDFs) instead of PDFs. The main advantage is that CDFs are generally easier and faster to estimate. PAWN can also be used to focus on specific parts of the distribution.

To calculate the PAWN index, they used the Kolmogorov–Smirnov statistic, which measures the distance between the conditional and unconditional CDFs.
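A sketch of the two-step definition from Pianosi and Wagener (2015), in the notation used below, is

KS(x_i) = \max_{y} \left| F_y(y) - F_{y \mid x_i}(y) \right|, \qquad T_i = \underset{x_i}{\mathrm{stat}} \left[ KS(x_i) \right]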

where KS(xi) is the Kolmogorov–Smirnov statistic for factor xi, Fy is the unconditional CDF, Fy|xi is the conditional CDF, and stat is either the maximum or the median of the values calculated for the different values on which xi was conditioned. The PAWN index varies between 0 and 1, and higher values of Ti indicate a higher importance of a factor. PAWN can also be used to zoom into a specific range of the output surface, as explained in Pianosi and Wagener (2015).

δ-Moment Independent Method

The delta (δ) moment-independent method is conceptually similar to the PAWN method. The main difference is that, instead of CDFs, it works with the difference between the unconditional and conditional PDFs. The method was first introduced by Borgonovo (2006) and has become very popular ever since. The following equation is used to calculate the δ index.
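In its usual form, the δ index is half the expected area between the unconditional and conditional output PDFs:

\delta_i = \frac{1}{2} E_{X_i} \left[ \int \left| f_Y(y) - f_{Y \mid X_i}(y) \right| \, dy \right]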

The δ index has all of the advantages of the PAWN index, but in many cases, it can be more computationally expensive.

In my next two blog posts, I will introduce some open-source moment-independent and entropy-based software packages and will give you some practical examples. I will also go over the application of information theory in causal analysis and inference.

Publication-Quality Illustrations in R

There are many ways to create high-quality figures for scientific publications, such as user-friendly, code-free tools like Adobe Illustrator. There are also some informative blog posts in the Water Programming Blog that go over the basics of Illustrator and different tricks for making decent figures in that program (for example, here and here). One issue with Adobe Illustrator is that it’s not free, although there are open-source design alternatives such as Inkscape. A broader problem with using these types of programs for figure design, though, is that figures can be challenging to produce and time-consuming to reproduce with code-free methods.

The ggpubr package in R provides a script-based way to create and combine high-quality figures. The pros of this method are that it is free, the figures are easily reproducible, and combining figures is not very time-consuming if you are comfortable coding in R.

A few drawbacks do exist, however. For one, there are several ways to make figures in R, but, when using ggpubr, all of the figures need to be ggplot objects, which limits the use of other figure formats in R. The other disadvantage is that ggpubr might not be as flexible as Illustrator in customizing the details of figures. In this blog post, I explain some of the main features of ggpubr and show some examples of how figures can be created using this method. Since ggpubr is built on top of ggplot2, if you are not familiar with ggplot, it might be helpful to take a look at some ggplot basics first (for example, here, here, and here).

In this blog post, I used R’s “airquality” dataset to showcase some of the applications of ggpubr. You can use the following code to load the airquality dataset.

# the airquality dataset ships with base R (the datasets package), so no extra package is needed
data(airquality)

# You can use the head() function to look at the first 6 rows of the dataset
head(airquality)

I used this dataset to create seven different ggplot2 figures and then I used ggpubr to merge them into a combined figure.

library(ggplot2)
# uncomment the next line if you haven't installed ggpubr yet 
# install.packages("ggpubr") 
library(ggpubr) 

p1 <- ggplot(airquality, aes(x=as.factor(Month), y=Ozone, fill=as.factor(Month))) + geom_boxplot() + theme_bw() + 
  scale_fill_hue() + theme(legend.title = element_blank()) +
  labs(x="Month", y="Ozone concentration (ppb)", title = "a. Monthly Ozone concentration in New York")

p2 <- ggplot(airquality, aes(x=as.factor(Month), y=Solar.R, fill=as.factor(Month))) + geom_boxplot() + theme_bw() + 
  scale_fill_hue() + theme(legend.title = element_blank()) +
  labs(x="Month", y="Solar Radiation (Langleys)", title = "b. Monthly Solar Radiation in New York")

p3 <- ggplot(airquality, aes(x=as.factor(Month), y=Temp, fill=as.factor(Month))) + geom_boxplot() + theme_bw() + 
  scale_fill_hue() + theme(legend.title = element_blank()) +
  labs(x="Month", y="Temperature (Fahrenheit)", title = "c. Monthly Temperature in New York")

p4 <- ggplot(airquality, aes(x=Solar.R, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() +
  scale_color_hue() + theme(legend.title = element_blank()) +
  labs(x="Solar Radiation (Langleys)", y="Ozone concentration (ppb)", title = "d. Ozone concentration vs. solar radiation")

p5 <- ggplot(airquality, aes(x=Wind, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() + 
  scale_color_hue() + theme(legend.title = element_blank()) +
  labs(x="Wind Speed (Miles per Hour)", y="Ozone concentration (ppb)", title = "e. Ozone concentration vs. wind speed")

p6 <- ggplot(airquality, aes(x=Temp, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() + 
  scale_color_hue() + theme(legend.title = element_blank()) +
  labs(x="Temperature (Fahrenheit)", y="Ozone concentration (ppb)", title = "f. Ozone concentration vs. temperature") 

p7 <- ggplot(airquality, aes(x=Ozone)) + geom_density(fill="gold") + theme_bw() + theme(legend.title = element_blank()) +
  labs(x="Ozone concentration (ppb)", y="Density", title = "g. PDF of Ozone concentration in New York") +
  # annotate() needs a geom type, coordinates, and a label; the position here is arbitrary
  annotate("text", x = 120, y = 0.015, label = "this is an example")

Here is an example of these plots (p1):

The main function of ggpubr (in my opinion, at least) is ggarrange. This function provides a nice and fast way to merge different graphs and tables. Therefore, if you use this function, you don’t need to worry about whether different panels of your plots are properly aligned. If it’s applicable to your plot, you can also include a common legend.

# ggarrange is used to combine p1 to p7 into a single figure
combined_plot <- ggarrange(
  # common.legend = TRUE removes the individual legends and adds a single common legend
  ggarrange(p1, p2, p3, p4, p5, p6, ncol = 2, nrow = 3, common.legend = TRUE),
  p7, ncol = 1, nrow = 2,
  heights = c(2, 0.7))

# ggsave can be used to save ggplot and ggpubr figures.
ggsave("YOUR FAVORITE DIRECTORY/CombinedPlot.png", combined_plot, width = 12, height = 14) 

Keep in mind that ggpubr provides some extra functions for generating plots that are slightly different from ggplot2 functions. For example, instead of geom_density, you can use ggdensity.
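For instance, a rough ggpubr equivalent of the p7 density plot above could be written as follows (the styling options shown here are just illustrative choices):

# density plot of Ozone using ggpubr's ggdensity() instead of ggplot2's geom_density()
ggdensity(airquality, x = "Ozone", fill = "gold",
          xlab = "Ozone concentration (ppb)", ylab = "Density")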

Also, although ggplot2 provides several different plot themes, ggpubr offers its own publication-style theme. The following can be used to apply that theme to a graph. Note that ggpubr’s own plotting functions use this theme as their default.
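For example, theme_pubr() can be added to an existing ggplot object just like any other theme (shown here on the p1 plot defined earlier):

# apply ggpubr's publication-style theme to an existing ggplot
p1 + theme_pubr()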

Sometimes, we need to draw our readers’ attention to a specific part of a graph or add some extra information to the figure. You can use the annotate() function to do that.

p6 <- ggplot(airquality, aes(x=Temp, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() + 
  scale_color_hue() + theme(legend.title = element_blank()) +
  labs(x="Temperature (Fahrenheit)", y="Ozone concentration (ppb)", title = "f. Ozone concentration vs. temperature") 

p6 <- p6 + annotate("rect", xmin = 80, xmax = 95, ymin = 50, ymax = 100, alpha = .3, color="black", fill="grey") +
  annotate("text", x=67, y=80, label="This is how you can add extra \n information to the plot", fontface=2)

Sensitivity Analysis Tools

Sensitivity analysis (SA) is one of the main themes of the Water Programming Blog. There are several decent blog posts that go over the theoretical aspects of sensitivity analysis (for example, here, here, and here). Also, many blog posts explain how to efficiently and elegantly visualize sensitivity analysis results (for example, here and here). In addition, there are many blog posts related to SALib, a widely used Python library developed at Cornell University by former members of Dr. Reed’s research group (for example, here, here, and here).

Recently, I have been trying to put together a comprehensive list of other SA tools, and I thought it might be useful to write a blog post on this topic. I organized the following list based on the platforms I have explored so far: MATLAB, Python, and R. After that, I introduce some other open-source and commercial SA tools.

MATLAB

Many MATLAB packages have been developed to perform sensitivity analysis and uncertainty quantification. As the following table shows, they have been created by a variety of universities and research institutes, and several of them cover different sensitivity analysis methods, such as regression-based SA, variance-based SA (e.g., Sobol), and derivative-based SA. All of them support at least two sampling techniques, such as Latin hypercube sampling. Many of them are generic (discipline-free) and can be used to answer different types of questions; however, a few of them (e.g., PeTTSy and DyGloSA) have been tailored to specific applications, such as biological models. Also, almost all of them include some post-processing and visualization components.

Two of these toolboxes are also available on platforms other than MATLAB: the SAFE package developed by Pianosi et al. (2015) has R and Python versions, and the SaSAT package developed at the University of New South Wales works in Microsoft Excel.

Abbreviation | Full Name | Example of Methods Supported | Institution
GSAT | Global Sensitivity Analysis Toolbox | Sobol and FAST | MATLAB
SAFE | Sensitivity Analysis For Everybody | EET (or Morris method), RSA, Sobol, FAST, and PAWN | University of Bristol
GSUA | Global Sensitivity and Uncertainty Analysis Toolbox | Sobol | MATLAB
GUI-HDMR | Global Sensitivity Analysis Toolbox | Global sensitivity analysis using HDMR | University of Leeds
DyGloSA | Dynamical Global Sensitivity Analysis Toolbox | Dynamical global parameter sensitivity analysis (GPSA) of ODE models | University of Luxembourg
PeTTSy | Perturbation Theory Toolbox for Systems | Perturbation analysis of complex systems biology models | University of Cambridge
SaSAT | Sampling and Sensitivity Analysis Tools | Regression-based (Pearson, Spearman, and partial rank correlation coefficients) | The University of New South Wales
SensSB | Sensitivity Analysis in Systems Biology models | Local SA, derivative- and variance-based global sensitivity analysis | Process Engineering Group at IIM-CSIC (Vigo, Spain)
SobolGSA | Global Sensitivity Analysis and Metamodeling Software | Morris, Sobol, FAST, and derivative-based | Imperial College London
SUMO | SUrrogate Modeling Toolbox | Surrogate models, sensitivity analysis | Ghent University
UQLab | The Framework for Uncertainty Quantification | Morris, Kucherenko, ANCOVA, Borgonovo, Sobol | ETH Zurich

FAST: Fourier Amplitude Sensitivity Testing
EET: Elementary Effects Test
RSA: Regional Sensitivity Analysis

Python

Interestingly, I was not able to find many Python libraries, and most of the ones that I did find were developed for specific applications. Please leave a comment if you are aware of any other packages that are not listed here. Among these packages, SALib seems to be the one that covers the most SA and sampling methods. There are two SA and QU packages that have C++ versions (OpenTURNS and UQTk). Also, uncertainpy was originally developed for neuroscience applications.

Abbreviation | Description | Example of Methods Supported | Institution
SALib | Python sensitivity analysis library | Sobol, Morris, FAST, RBD-FAST, Delta Moment-Independent Measure, derivative-based, Factorial | Cornell University
uncertainpy | Uncertainty quantification and sensitivity analysis library | Sobol | University of Oslo
MATK | Model Analysis ToolKit | FAST, Sobol | Los Alamos National Laboratory
UQTk | Quantification of uncertainty in numerical models | Sobol | Sandia National Lab
OpenTURNS | Open source initiative for the Treatment of Uncertainties | Spearman correlation coefficients, Sobol, ANCOVA, UQ | Technical University of Denmark
varsens | Variance Based Sensitivity Analysis | Sobol | Vanderbilt University

FAST: Fourier Amplitude Sensitivity Testing
QU: Quantification of Uncertainty

R

I was able to find about fifty R packages that have sensitivity analysis features. The following table lists the ones with the most comprehensive SA functionalities. The rest of them seem to have been developed for specific areas of science and have limited SA functionality; some of these are RMut, pksensi, ivmodel, FME, episensr, and pse.

Based on what I found, the sensitivity package seems to cover the widest range of SA methods. Readers can refer to this blog post for more information about the sensitivity package; a minimal usage sketch is also included after the table below.

Name | Example of Methods Supported
sensobol | Third-order Sobol
sensitivity | Sobol, Morris, FAST, RBD-FAST, DELSA, derivative-based, Factorial
ODEsensitivity | Morris, Sobol
multisensi | SA on models with multivariate outputs
konfound | Robustness and sensitivity of empirical models
fast | FAST
BASS | Sobol

FAST: Fourier Amplitude Sensitivity Testing
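As a quick illustration of the kind of workflow the sensitivity package supports, the sketch below runs the Morris elementary-effects screening method on a made-up three-parameter test function; the function itself, the number of trajectories r, and the design settings are arbitrary choices for demonstration only.

library(sensitivity)

# a made-up test function: takes a matrix whose rows are parameter samples
test_fun <- function(X) X[, 1] + 2 * X[, 2]^2 + X[, 1] * X[, 3]

# Morris screening with 10 trajectories over 3 factors on the default [0, 1] ranges
m <- morris(model = test_fun, factors = 3, r = 10,
            design = list(type = "oat", levels = 5, grid.jump = 3))
print(m)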

Other Platforms

There are many other SA tools that have been developed on other platforms, and the following table lists only a few of them. There are also several commercial SA platforms, such as SDI, VISYOND, and SMARTUQ, that seem to have nice graphical user interfaces (GUIs); however, because they are not free and their source code is not available, they might have limited applications in academic research.

Abbreviation | Main Applications | Programming Language | Institution
Dakota | Optimization, QU, SA (Sobol, FAST, Morris) | C++ | Sandia National Laboratory
PSUADE | QU, Spearman and Pearson correlation coefficients, Sobol, Morris, FAST | C++ | Lawrence Livermore National Laboratory
SIMLab | Sobol, FAST, Morris | GUI-based | The European Commission's science and knowledge service

QU: Quantification of Uncertainty
FAST: Fourier Amplitude Sensitivity Testing

Please leave a comment and let me know if you are aware of any other useful tools that I did not list here.