Potential Biases and Pitfalls of Machine Learning

Artificial intelligence has already changed the world in many ways. It is becoming difficult to find a modern digital tool that does not use AI in one way or another. It can be argued that many of these changes have been positive; however, it can also easily be shown that AI has not benefited everybody evenly.

In terms of the application of AI in science, the number of publications on this topic has grown significantly over the last few years. I searched Google Scholar for the words water, machine, and learning, limiting the search to studies with all three words in their titles, and created the following figure from the results. As you can see, the number of publications has increased sharply in recent years. This is obviously not a thorough study, but it illustrates how important artificial intelligence has become in academic research.

However, there are many pitfalls that can degrade the usefulness of AI in academic and non-academic applications. Some people even argue that these pitfalls could potentially lead us to a new AI winter that nobody will enjoy. In this blog post, I decided to go over some of these issues and difficulties. Some of the problems have emerged recently, but others have been known by statisticians for decades. For many reasons, they continue to exist and negatively affect artificial intelligence projects.

Any machine learning project usually includes the following components: 1) sample collection and preparation, 2) ML model generation, and 3) result interpretation and extraction of actionable insights. Here, I will talk about some of the mistakes and biases that can happen in each of these steps.

1- Sample Collection and Preparation

The input dataset is a crucial source of bias in any ML project. The problem basically occurs when the dataset is not a good representation of the real world. Here are a few variants of this issue:

Exclusion Bias

This happens when specific groups that exist in the real world are systematically absent from the dataset. For example, a specific part of the population can be absent from the dataset, or the dataset can only contain information about particular countries, climates, etc. Having a diverse team that represents various segments of society and different worldviews can reduce the consequences of this type of bias.

Measurement Bias

There are different reasons that observations can be faulty and unreliable. For example, observations can be sensed and collected through malfunctioning measurement devices that can bias the entire dataset. Also, human reading errors and mistakes can cause measurement errors. Additionally, there are many things that can go wrong during the post-processing of raw measurements and can lead to measurement bias.

Other Sampling Biases

There are also other types of data collection biases that have been discussed in statistical literature for years. One example of these is self-selection bias. Let’s imagine that you are running an online poll and want to analyze the results, but your participants are not fully random. People who choose to participate in your poll might have specific personalities or worldviews and not represent the population at large. Other biases that have been widely discussed in statistical literature include survivor bias, prejudice bias, observer bias, and recall bias. You can find more information about these biases here, here, and here. If possible, these biases should be avoided. Otherwise, their effects should be estimated and removed from the analysis. Also, acquiring more comprehensive datasets can reduce these types of biases.

Imbalanced Datasets

If the data collection process is fair and reliable, an imbalanced dataset is not really a bias: it’s more of a challenge inherent to specific problems. Some events just don’t happen frequently. Problems and research questions that focus on rare events need their own appropriate treatments. In a previous blogpost, I explained this issue and ways to deal with it; a minimal example of one common treatment is sketched below. You can also refer to this article and this article for more information about imbalanced datasets.
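
As one minimal illustration (a toy dataset and model, not drawn from the linked posts), weighting classes by their inverse frequency in scikit-learn counteracts the dominance of the majority class:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# toy problem where only ~5% of samples belong to the event of interest
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for cw in (None, "balanced"):
    clf = LogisticRegression(class_weight=cw, max_iter=1000).fit(X_tr, y_tr)
    score = balanced_accuracy_score(y_te, clf.predict(X_te))
    print(f"class_weight={cw}: balanced accuracy = {score:.2f}")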

Data Leakage

Data leakage is a situation in which we mistakenly share a portion (or the entirety) of the data between the training and validation datasets. It can happen through different kinds of mistakes during data preparation. Timeseries problems can also include implicit data leakage: if the training and testing datasets are selected in a way that allows the training dataset to gain information about the held-out period, we have implicitly leaked data. For example, if we extract the testing dataset from the middle of the original timeseries, we give the model insight into what happens both before and after the testing period, which is essentially cheating and inflates the apparent accuracy of the model. One way to avoid this is to make sure that our training data period precedes the validation period.
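
As a minimal sketch of the chronological-split idea (a made-up series, using scikit-learn’s TimeSeriesSplit), every training fold ends before its validation fold begins, so no future information leaks backward:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# toy weekly series: 200 time steps of one feature and a target
X = np.arange(200, dtype=float).reshape(-1, 1)
y = np.sin(X[:, 0] / 10.0)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # the training period strictly precedes the validation period -- no peeking
    assert train_idx.max() < test_idx.min()
    print(f"train weeks {train_idx.min()}-{train_idx.max()}, "
          f"test weeks {test_idx.min()}-{test_idx.max()}")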

Another type of leakage, called feature leakage, happens when our feature list includes a variable that is either closely correlated with the data labels or is effectively a duplicate of the label. Obviously, we won’t have that feature at prediction time, so a model trained on it will look deceptively accurate during validation but perform poorly in practice.

More information about data leakage can be found here and here.

2- ML and Statistical Models

Lack of Physical Understanding of the System

In machine learning, we use statistical relationships to explore how a system would behave under specific conditions. However, ML models do not gain any explicit knowledge about the actual physical laws that govern the system. For example, in hydrology, it is entirely possible to violate water and energy conservation laws while training and using machine-learning-driven models. There are, however, studies (here and here) that have tried to incorporate these physical insights into their ML models, and this issue is likely to attract more attention in the future.

Reproducibility

There is a general consensus among ML researchers that many ML-based studies are not reproducible. This means that an independent researcher cannot replicate the ML process and obtain the same results. Many recent studies have shown that ML scientists are often not able to reproduce the results of other studies. While some people refer to this as the reproducibility crisis in ML studies, it is more of a general academic research issue. Based on a 2016 survey published by Nature, 70% of scientists across various disciplines were not able to reproduce the results of other studies, and more than 50% even failed to reproduce their own previously developed results. This is obviously a big challenge and needs to be taken into consideration. More information about the reproducibility challenge can be found in these blog posts (here and here).

Instability of the Model

Model stability is also an important challenge. Models can be very efficient and accurate under specific random conditions (e.g., a particular random seed or data sample) and fail under different conditions. This problem is closely related to the reproducibility challenge. Using different random seeds and developing model ensembles can provide a solution, as sketched below; the right remedy also depends on the inherent nature of the ML algorithm. For example, this blog post nicely explains the issue of stability and its solution in K-means clustering.
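
A quick sketch of the idea (the dataset and model here are toy choices): train the same model under several seeds and report the spread of test accuracy rather than a single number.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for seed in range(10):
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

# a wide spread across seeds is a warning sign of an unstable model
print(f"accuracy = {np.mean(scores):.3f} +/- {np.std(scores):.3f} across seeds")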

Simpson’s Paradox

Simpson’s Paradox describes a situation in which analyzing each group separately leads to a different conclusion than analyzing all of the groups combined. The following figure shows an example of this situation for a simple linear regression. The implications of this phenomenon should be taken into account when we make decisions about labels and when we interpret our results.
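
In place of the figure, here is a minimal numerical sketch with synthetic data: the slope within every group is negative, yet the slope fitted to the pooled data is positive.

import numpy as np

rng = np.random.default_rng(42)
x_all, y_all, within_slopes = [], [], []

# three groups: y decreases with x inside each group, but the group means
# are arranged so that the pooled trend points the other way
for offset in (0.0, 5.0, 10.0):
    x = offset + rng.uniform(0, 2, 50)
    y = 2 * offset - 1.5 * (x - offset) + rng.normal(0, 0.3, 50)
    within_slopes.append(np.polyfit(x, y, 1)[0])
    x_all.append(x)
    y_all.append(y)

pooled_slope = np.polyfit(np.concatenate(x_all), np.concatenate(y_all), 1)[0]
print("within-group slopes:", np.round(within_slopes, 2))  # all about -1.5
print("pooled slope:", round(pooled_slope, 2))              # positive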

Correlation Bias

It is important to check for strong correlations between features, since highly correlated features can lead to unstable and unreliable ML models. This is essentially the widely discussed multicollinearity problem in regression. However, removing feature columns needs to be done carefully in order to avoid a significant loss of information.
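
A minimal screening sketch (synthetic features; the 0.9 cutoff is an arbitrary choice): flag feature pairs whose correlation is high enough to cause multicollinearity trouble, then decide which ones to drop or combine.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # nearly a duplicate of x1
x3 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

corr = X.corr().abs()
# keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_review = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(corr.round(2))
print("candidates to drop or combine:", to_review)   # ['x2'] here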

Other Statistical and ML Biases

There are various ways that machine learning and statistical analysis studies can be biased and misleading, and P-hacking is one of the most widely discussed pathways in hypothesis testing. P-hacking basically means that the researcher changes the definition of the problem or the significance threshold until a positive relationship can be claimed, and then reports only the successful experiment to the scientific community.

Another issue occurs when the researcher works with a big dataset with many features. The researcher keeps exploring different parts of the dataset to find a statistically significant positive relationship and initially fails; after enough digging, though, a positive relationship can be found. When reporting the results through scientific publications, only the significant positive findings get reported, and not the number of attempts it took to find them. This issue is often referred to as the multiple comparisons problem or the look-elsewhere effect.
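
A quick back-of-the-envelope illustration of why this matters: at a 5% significance threshold, the chance of at least one false positive grows rapidly with the number of independent comparisons.

# probability of at least one false positive across k independent tests
alpha = 0.05
for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {1 - (1 - alpha) ** k:.2f}")
# prints 0.05, 0.23, 0.64, 0.99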

There are many studies that show that p-hacking is a common problem and has significantly biased many scientific findings, but other studies exist that argue that, while p-hacking is common in scientific studies, it does not significantly change the big picture of what we eventually learn from scientific literature.

We should also always keep in mind that ML, and data mining analyses more generally, are inherently exploratory, not confirmatory. This means that ML studies can only generate a hypothesis, while separate statistical testing is necessary to confirm the theory (more on this can be found here and here). This can also be seen as another aspect of the reproducibility problem in ML. Machine learning methods do not provide insights into causal relationships. Therefore, ML-driven methods can give us high prediction accuracy under a specific condition, but their accuracy might drop significantly under a different condition, and we need to pay attention to this issue.

Performance Metrics

It is crucial to take into account different performance aspects of an ML model, and, to do that, we often need to consider various metrics (refer to this blog post for more information). As I discussed in a previous blog post, it is also important to think carefully about which metrics are really desirable for the problem at hand; for example, the choice between emphasizing specificity or sensitivity should be determined by the nature of the problem (see the sketch below). The multiple-metrics issue has been widely discussed in the hydrology literature, but it is still ignored in many water-related studies.
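
A small sketch of why a single metric can mislead (the predictions here are hypothetical, for a rare-event problem): accuracy looks excellent while sensitivity, which may be what actually matters, is poor.

import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score

# 100 hypothetical samples: 10 true events, half of which are missed
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 88 + [1, 1] + [0] * 5 + [1] * 5)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"accuracy    = {(tp + tn) / len(y_true):.2f}")   # 0.93 -- looks great
print(f"sensitivity = {tp / (tp + fn):.2f}")            # 0.50 -- half the events missed
print(f"specificity = {tn / (tn + fp):.2f}")            # 0.98
print(f"precision   = {precision_score(y_true, y_pred):.2f}")
print(f"F1          = {f1_score(y_true, y_pred):.2f}")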

3- Results Interpretation and Extraction of Actionable Insights

Overinterpretation and Generalization Bias

ML models are trained and tested on a specific dataset. When they are used for prediction in real-world applications, they may or may not be as reliable, so we should be careful about how far we extrapolate the fidelity of our model to new datasets.

Confirmation Bias

This bias takes various forms. For example, we can be subconsciously biased toward results that confirm what we already believe to be true. We can also unknowingly lean toward results that are shinier and more groundbreaking, or shape our findings to be more interesting for our stakeholders, clients, and the organizations we work for. These are biases that we need to avoid in any type of research, and certainly in ML projects.

Interpretability

ML models can benefit from powerful and sophisticated algorithms built on comprehensive lists of features. However, that does not guarantee a decent level of interpretability; in fact, it might make the model and its results harder to understand. I won’t get into the details here, but there are many approaches that have been used to balance model complexity and interpretability (here and here).

Publication Bias

Scientific journals and media outlets are less interested in publishing results that show non-significant causal relationships. In other words, research efforts that find negative results are less likely to get published. This phenomenon affects academic culture and creates pressure to find and submit positive results, and it also leads to the publication of positive results that have been found by chance. Publication bias has been recognized and talked about for a long time. The issue was first introduced by Theodore Sterling (1959), but it continues to harm the products of academia today.

There is another type of bias that frequently happens in publications. There is always a tendency toward showing that the model developed in the study outperforms popular state-of-the-art algorithms. The issue is that the models developed by the authors often benefit from careful fine-tuning of hyperparameters, while a generic type of the popular model is used in the comparison. This creates an unfair comparison. Refer to this blog post and this publication for more on this type of bias.

Also, ML models can yield a very good response under a specific condition (e.g., a random seed) and fall apart under a different condition (e.g., a different random seed). For example, this problem frequently happens in unsupervised clustering. Running the model with multiple random seeds and starting points is one of the ways to avoid this.

How Can We Address These Limitations?

As discussed in this blog post, ML studies can be negatively affected by various dimensions of these problems, but there are ways to respond to them. For example, biases in data collection can be reduced by promoting diversity in research teams (more on this here and here) or by paying more attention to the design of the experiment. Careful definition of the problem and the scope of the study is also critical. It may be idealistic, but a cultural shift in the academic world toward giving more credit to studies that find negative results would alleviate some of these problems. Finally, making effective use of multiple performance metrics, synthetic data, and visual analytics should also benefit ML models.

MORDM Basics II: Risk of Failure Triggers and Table Generation

Previously, we demonstrated the key concepts, application, and validation of synthetic streamflow generation. A historical inflow timeseries from the Research Triangle region was obtained, and multiple synthetic streamflow scenarios were generated and validated using the Kirsch Method (Kirsch et al., 2013). But why did we generate these hundreds of timeseries? What is their value within the MORDM approach, and how do we use them?

These questions will be addressed in this blog post. Here, we will cover how risk of failure (ROF) triggers use these synthetic streamflow timeseries to dynamically assess a utility’s ability to meet its performance objectives on a weekly basis. Once more, we will be revisiting the Research Triangle test case.

Some clarification

Before proceeding, there are some terms we will be using frequently that require definition:

  1. Timeseries – Observations of a quantity (e.g.: precipitation, inflow) recorded within a pre-specified time span.
  2. Simulation – A set of timeseries (synthetic/historical) that describes the state of the world. In this test case, one simulation consists of a set of three timeseries: historical inflow and evaporation timeseries, and one stationary synthetic demand timeseries.
  3. State of the world (SOW) – The “smallest particle” to be observed, or one fully realized world that consists of the hydrologic simulations, the set of deeply uncertain (DU) variables, and the system behavior under different combinations of simulations and DU variables.
  4. Evaluation – A complete sampling of the SOW realizations. One evaluation can sample all SOWs, or a subset of SOWs.

About the ROF trigger

In the simplest possible description, risk of failure (ROF) is the probability that a system will fail to meet its performance objective(s). The ROF trigger is a measure of a stakeholder’s risk tolerance and its propensity for taking the actions necessary to mitigate failure. The higher the magnitude of the trigger, the more risk the stakeholder is willing to accept, and the less frequently action is taken.

The ROF trigger feedback loop.

More formally, the concept of Risk-of-Failure (ROF) was introduced as an alternative decision rule to the more traditional Days-of-Supply-Remaining (DSR) and Take-or-Pay (TOP) approaches in Palmer and Characklis’ 2009 paper. As opposed to the static nature of DSR and TOP, the ROF method emphasizes flexibility by using rule-based logic and near-term information to trigger actions or decisions about infrastructure planning and policy implementation (Palmer and Characklis, 2009).

Adapted from the economics concept of risk options analysis (Palmer and Characklis, 2009), its flexible, state-aware rules are defined for specific points in time, which helps overcome the curse of dimensionality. This flexibility also presents the possibility of using ROF triggers to account for decisions made by more than one stakeholder, as in regional systems like the Research Triangle.

Overall, the ROF trigger is a state-aware, system-dependent probabilistic decision rule that is capable of reflecting the time dynamics and uncertainties inherent in human-natural systems. This ability is what allows ROF triggers to aid in identifying how short-term decisions affect long-term planning and vice versa. In doing so, the trigger approximates a closed-loop feedback system in which decisions inform actions and the outcomes of those actions inform decisions (shown above). In this way, ROF triggers can provide system-specific alternatives by building rules off historical data to find triggers that are robust to future conditions.

ROF triggers for water portfolio planning

As explained above, ROF triggers are uniquely suited to reflect a water utility’s cyclical storage-to-demand dynamics. Due to their flexible and dynamic nature, these triggers can enable a time-continuous assessment (Trindade et al., 2019) of:

  1. When the risks need to be addressed
  2. How to address the risk

This provides both operational simplicity (as stakeholders only need to determine their threshold of risk tolerance) and system planning adaptability across different timescales (Trindade et al., 2019).

Calculating the ROF trigger value, α

Cary is located in the red box shown in the figure above (source: Trindade et al., 2019).

Let’s revisit the Research Triangle test case – here, we will be looking at data from the town of Cary, which receives its water supply from Jordan Lake. The files needed to describe the hydrology of Cary can be found in ‘water_balance_files’ in the GitHub repository. It is helpful to set up the following hypothetical scenario: the town of Cary would like to assess how its risk tolerance affects how frequently it needs to trigger water use restrictions. The higher its risk tolerance, the fewer restrictions it will need to implement. Fewer restrictions are favored, as deliberately limiting supply has both social and political implications.

We are tasked with determining how different risk tolerance levels, reflected by the ROF trigger value α, will affect the frequency of the utility triggering water use restrictions. Thus, we will need to take the following steps:

  1. The utility determines a suitable ROF trigger value, α.
  2. Evaluate the risk of failure for the current week, m, based on that week’s storage levels. The storage levels are a function of the historical inflow and evaporation rates, as well as projected demands.
  3. If the risk of failure during week m is at least α, water use restrictions are triggered. Otherwise, nothing is done and the storage levels at week m+1 are evaluated.

Now that we have a basic idea of how the ROF triggers are evaluated, let’s dive a little deeper into the iterative process.

Evaluating weekly risk of failure

Here, we will use a simple analogy to illustrate how weekly ROF values are calculated. Bernardo’s post here provides a more thorough, mathematically sound explanation of this method.

For now, we clarify a couple of things: first, we have two synthetically generated datasets for inflow and evaporation rates that are conditioned on historical weekly observations (columns) and SOWs (rows). We also have one synthetically generated demand timeseries conditioned on projected demand growth rates (and yes, this is where we use the Kirsch Method explained previously). We will use these three timeseries to calculate the storage levels at each week in a year.

The weekly ROFs are calculated as follows:

We begin on a path of 52 steps, where each step represents a weekly demand dj, with week j∈[1, 52].

We also have – bear with me, now – a crystal ball that lets us gaze into n different versions of past inflows and evaporation rates.

At step mj:

  1. Using the crystal ball, we look back into n versions of year-long ‘pasts’, where each alternative past is characterized by:
    • One randomly-chosen annual historical inflow timeseries, IH beginning 52 steps prior to week mj
    • One randomly-chosen annual historical evaporation timeseries, EH beginning 52 steps prior to week mj
    • The chosen demand timeseries, DF beginning 52 steps prior to week mj
    • An arbitrary starting storage level 52 weeks prior to mj, S0
  2. Out of all the n year-long pasts that we have gazed into, count the number of pasts, f, in which the storage level dropped below 20% of maximum capacity at least once.
  3. Obtain the probability that you might fail in the future (or ROF), pf = ROF =  f/n
  4. Determine if ROF > α.
  5. Take your next step and move on to week mj+1.

This process is repeated for all k different hydrologic simulations.

Here, the “path” represents the projected demand timeseries, the steps are the individual weekly projected demands, and the “versions of the past” are the n randomly selected hydrologic simulations that we have chosen to look into. It is important that n ≥ 50 for the ROF calculation to have at least 2% precision (Trindade et al., 2019).
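
To make the procedure concrete, here is a minimal Python sketch of the weekly ROF estimate. The variable names, the simple mass balance, and the toy numbers are illustrative assumptions, not the test case’s actual water balance model:

import numpy as np

def weekly_rof(demand, inflows, evaps, s0, s_max, fail_frac=0.2):
    """Estimate the ROF for one week by looking back over n alternative
    year-long 'pasts' and counting how often storage drops below 20% of capacity.

    demand:  (52,) weekly demands for the year ending at the current week
    inflows: (n, 52) one year of weekly inflows for each alternative past
    evaps:   (n, 52) one year of weekly evaporation losses for each past
    s0:      assumed storage level 52 weeks before the current week
    """
    n = inflows.shape[0]
    failures = 0
    for i in range(n):
        storage, failed = s0, False
        for j in range(52):
            # placeholder weekly mass balance (the real model is more detailed)
            storage = min(s_max, max(0.0, storage + inflows[i, j] - evaps[i, j] - demand[j]))
            if storage < fail_frac * s_max:
                failed = True
                break
        failures += failed
    return failures / n   # p_f = ROF = f / n

# toy example with n = 50 alternative pasts, in arbitrary units
rng = np.random.default_rng(0)
demand = np.full(52, 9.0)
inflows = rng.gamma(shape=2.0, scale=5.0, size=(50, 52))
evaps = rng.uniform(0.5, 1.5, size=(50, 52))
print(weekly_rof(demand, inflows, evaps, s0=80.0, s_max=100.0))
# restrictions would be triggered if this value exceeds the trigger alpha

Each call corresponds to one week of one hydrologic simulation; in practice, this calculation is wrapped in loops over weeks and over the k simulations.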

An example

For example, say you (conveniently) have 50 years of historical inflow and evaporation data, so you choose n = 50. You begin your ROF calculation in Week 52. For the first of the n pasts, you:

  1. Select the demands from Week 0-51.
  2. Get the historical inflow and evaporation rates for Historical Year 1.
  3. Calculate the storage for each week, monitoring for failure.
  4. If failure is detected, increment the number of failures and move on to the second past. Else, complete the storage calculations for the demand timeseries.

This is repeated n=50 times, then pf is calculated for Week 52.

You then move on to Week 53 and repeat the previous steps using demands from Weeks 1-52. The whole process is completed once the ROFs for all weeks in the projected demand timeseries have been evaluated.

Potential caveats

However, this process raises two issues:

  1. The number of combinations of simulations and DU variables makes the process computationally expensive
    • For every dj in DF, n simulations of inflows and evaporation rates must be run k times, where k is the total number of hydrologic simulations
    • This results in (n × k) computations
    • Next, this process has to be repeated for as many SOWs as exist (DU reevaluation), resulting in (n × k × number of DU variables) computations
  2. The storage values are dynamic and change as a function of DF, IH, and EH

These problems motivate the following question: can we approximate the weekly ROF values given a storage level?

ROF Tables

To address the issues stated above, we generate ROF tables that store approximate ROF values for a given week and a given storage level. To achieve this approximation, we first define storage tiers (storage levels as a percentage of maximum capacity). These tiers are substituted for S0 during each simulation.

Thus, for each hydrologic simulation, the steps are:

  1. For each storage tier, calculate the ROF for each week in the timeseries.
  2. Store the ROF for a given week and given storage level in an ROF table unique to each of the k simulations
  3. This associates one ROF value with each (dj, S0) pair

The stored values are then used during the DU reevaluation, where the storage level for a given week is approximated by its closest storage tier value, Sapprox, in the ROF table, negating the need for repeated computations of the weekly ROF value.
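
A rough sketch of the table-building loop, reusing the illustrative weekly_rof() helper from the earlier sketch (this is not the repository’s rof_table_generator.py):

import numpy as np

rng = np.random.default_rng(1)
n_pasts, n_weeks, s_max = 50, 52 * 2, 100.0
demand_series = np.full(n_weeks, 9.0)
inflows = rng.gamma(2.0, 5.0, size=(n_pasts, 52))
evaps = rng.uniform(0.5, 1.5, size=(n_pasts, 52))

storage_tiers = np.arange(0.2, 1.01, 0.1) * s_max   # tiers as fractions of capacity

rof_table = np.zeros((n_weeks, len(storage_tiers)))
for w in range(52, n_weeks):
    year_demand = demand_series[w - 52:w]
    for t, tier in enumerate(storage_tiers):
        # substitute the storage tier for S0 and store the approximate ROF
        rof_table[w, t] = weekly_rof(year_demand, inflows, evaps, s0=tier, s_max=s_max)

# during DU reevaluation, a week's simulated storage is rounded to its nearest
# tier and the ROF is looked up in rof_table instead of being recomputed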

The process of generating ROF tables can be found under rof_table_generator.py in the GitHub repository, the entirety of which can be found here.

Conclusion

Previously, we generated synthetic timeseries, which were then applied here to evaluate weekly ROFs. We also explored the origins of the ROF trigger concept and explained how ROF triggers encapsulate the dynamic, ever-changing risks faced by water utilities, thus providing a way to detect those risks and take adaptive, mitigating action.

In the next blog post, we will explore how these ROF tables can be used in tandem with ROF triggers to assess if Cary’s water utility will need to trigger water use restrictions. We will also dabble in varying the value of ROF triggers to assess how different risk tolerance levels, action implementation frequency, and individual values can affect a utility’s reliability by running a simple single-actor, two-objective test.

References

Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management, 139(4), 396-406. doi:10.1061/(asce)wr.1943-5452.0000287

Palmer, R. N., & Characklis, G. W. (2009). Reducing the costs of meeting regional water demand through risk-based transfer agreements. Journal of Environmental Management, 90(5), 1703-1714. doi:10.1016/j.jenvman.2008.11.003

Trindade, B., Reed, P., & Characklis, G. (2019). Deeply uncertain pathways: Integrated multi-city regional water supply infrastructure investment and portfolio management. Advances in Water Resources, 134, 103442. doi:10.1016/j.advwatres.2019.103442

Automate remote tasks with Paramiko

This is a short blog post to demonstrate the Paramiko Python package. Paramiko allows you to establish SSH, SCP, or SFTP connections within Python scripts, which is handy when you’d like to automate some repetitive tasks on a remote server or cluster from your local machine or another cluster you’re running from.

It is often used for server management tasks, but for research applications you could consider situations where you have a large dataset stored at a remote location and are executing a script that needs to transfer some of that data depending on results or new information. Instead of manually establishing SSH or SFTP connections, those processes could be wrapped and automated within your existing Python script.

To begin a connection, all you need is a couple lines:

import paramiko

ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect(hostname='remotehost', username='yourusername', password='yourpassword')

The first line creates a paramiko SSH client object. The second line tells paramiko what to do if the host is not a known host (i.e., whether this host should be trusted or not)—think of when you’re setting up an SSH connection for the first time and get the message:

The authenticity of host ‘name’ can’t be established. RSA key fingerprint is ‘gibberish’. Are you sure you want to continue connecting (yes/no)?

The third line is what makes the connection; the hostname, username, and password are usually the only things you need to define.

Once a connection is established, commands can be executed with exec_command(), which creates three objects:

stdin, stdout, stderr = ssh_client.exec_command("ls")

stdin is a write-only file which can be used for commands requiring input, stdout contains the output of the command, and stderr contains any errors produced by the command—if there are no errors it will be empty.

To print out what’s returned by the command, you can use stdout.readlines(). To add inputs to stdin, use the write() function:

stdin, stdout, stderr = ssh_client.exec_command("sudo ls")
stdin.write('password\n')

Importantly: don’t forget to close your connection, especially if this is an automated script that opens many of them: ssh_client.close().

To transfer files, you need to establish an SFTP or an SCP connection, in much the same manner:

ftp_client=ssh_client.open_sftp()
ftp_client.get('/remote/path/to/file/filename','/local/path/to/file/filename')
ftp_client.close()

get() will transfer a remote file to a local directory; put(), used in the same way, will transfer a local file to a remote directory.
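
Putting the pieces together, a minimal end-to-end sketch might look like the following (the hostname, credentials, and file paths are placeholders); wrapping the work in try/finally guarantees the connection is closed even if a step fails:

import paramiko

ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect(hostname='remotehost', username='yourusername', password='yourpassword')

try:
    # run a command and collect its output
    stdin, stdout, stderr = ssh_client.exec_command('ls results/')
    filenames = [line.strip() for line in stdout.readlines()]

    # upload an input file and download a result file over SFTP
    sftp = ssh_client.open_sftp()
    sftp.put('local_inputs.txt', 'remote_inputs.txt')
    if 'output.csv' in filenames:
        sftp.get('results/output.csv', 'output.csv')
    sftp.close()
finally:
    ssh_client.close()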

Introduction to Convergent Cross Mapping

This post is a follow-on to my previous post on Granger causality. Granger causality has well-known limitations. As previously discussed, the test can only find “predictive causality” and not true causality. Further, a key requirement of Granger causality is separability, meaning the causal variable is independent of the variable that it influences. This separability tends to be characteristic of stochastic and linear systems. For systems in which separability is not satisfied, or in which there are shared driving variables, Granger causality is not applicable. A new approach has been suggested to try to understand causality in dynamical systems where the effects of causal variables cannot be separated or uncoupled from the variables that they influence.

This approach, called Convergent Cross Mapping (CCM), was first tested in 1991 but was later developed by Sugihara et al. (2012) for identifying causation in ecological time series. CCM relies on the fundamental assumption that dynamics in the world aren’t purely stochastic and that, in some applications, there are governing dynamics that can be represented by some underlying manifold, M, shown in Figure 1. If two variables, X and Y, are causally linked, they should share an underlying M as a common attractor manifold. This means that the states of the variables can be used to inform each other. Furthermore, if X is an environmental driver of a population variable Y, information about the states of X can be recovered from Y, but not vice versa. The CCM approach tests how reliably Y can be used to estimate the states of X [1].

Figure 1: Figure 2 from Sugihara et al., 2012 demonstrating CCM on shadow manifolds Mx and My

In essence, the approach seeks to create “shadow” manifolds of M, termed Mx and My, which are constructed using lagged information from each time series. At some time t, the nearest neighbors on each manifold are determined, weights are assigned to each nearest neighbor, and Y is estimated using the weighted nearest neighbors from X. Then, the correlation between Y and its cross-mapped estimate, Y|Mx, is computed. The length of the time series is important; a longer time series results in closer nearest neighbors and better predictability [2].
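
To make this description more concrete, here is a compact Python sketch of the cross-mapping estimate (lagged embedding, nearest neighbors, exponential weights, and the correlation between Y and its estimate Y|Mx). It is a bare-bones illustration, not a substitute for the multispatialCCM package used below:

import numpy as np

def cross_map_skill(x, y, E=2, tau=1):
    """Correlation between y and its estimate from the shadow manifold of x (Mx)."""
    # lagged-coordinate ("shadow") embedding of x
    n = len(x) - (E - 1) * tau
    Mx = np.column_stack([x[i * tau: i * tau + n] for i in range(E)])
    y_aligned = y[(E - 1) * tau:]

    y_hat = np.empty(n)
    for t in range(n):
        d = np.linalg.norm(Mx - Mx[t], axis=1)   # distances to every other point
        d[t] = np.inf                            # exclude the point itself
        nn = np.argsort(d)[: E + 1]              # E+1 nearest neighbors
        w = np.exp(-d[nn] / max(d[nn][0], 1e-12))
        w /= w.sum()
        y_hat[t] = np.sum(w * y_aligned[nn])     # weighted estimate of y at time t
    return np.corrcoef(y_aligned, y_hat)[0, 1]

# toy example: y is a noisy function of a chaotic series x
rng = np.random.default_rng(0)
x = np.zeros(300)
x[0] = 0.4
for t in range(299):
    x[t + 1] = 3.8 * x[t] * (1 - x[t])           # logistic map
y = 0.6 * x + 0.05 * rng.normal(size=300)
print(round(cross_map_skill(x, y), 2))           # close to 1 for this toy pair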

Sugihara et al., 2012 demonstrates situations in which CCM is applicable:

  1. Unidirectional Causality: Species X influences Y, but the reverse is not true.
  2. Transitivity/External Forcing: Two species, X and Y, do not interact but are driven by a common external variable Z. This type of interaction has been studied extensively in the famous anchovy and sardine example, where the two species tend to have linked dynamics (one peaks when the other is in a trough). Some theories suggest that the species are directly related and influence each other’s dynamics, but research has shown that neither species interacts with the other and rather that the population dynamics are shaped by an external driver (temperature). However, using correlation or applying Granger causality tends to imply a causal link. Using CCM, Sugihara et al., 2012 confirms that there is no cross-map signal between the anchovy and sardine time series, but there is a clear mapping between sea surface temperature and anchovies and between sea surface temperature and sardines.

An R package called multispatialCCM is available to implement CCM; the author, Adam Clark, adapts the method to be applicable to ecological time series that are shorter (fewer than the roughly 30 sequential observations usually required by the traditional CCM method) and potentially disjoint [3]. In the documentation of the package, there is a full example demonstrating the application of the method to two synthetic time series, A and B, constructed so that A is causally forced by B but the opposite is not true. In that spirit, I tried out the package with a very simple example, using the same dataset as in the last post. Here I use precipitation and outflow, where precipitation should causally force outflow but the opposite should not be true. The following code demonstrates this:

library(multispatialCCM)

#Import data
setwd("D:/Documents/")
data=read.delim("Minimal_Train.txt")
Accm=as.vector(data$observed_cms)[1:200]
Bccm=as.vector(data$Precip)[1:200]

#In this example, A = observed outflow and B = precipitation; as in the
#package's synthetic example, A should be causally forced by B,
#but the reverse should not be true.
#Calculate optimal E
maxE<-5 #Maximum E to test
#Matrix for storing output
Emat<-matrix(nrow=maxE-1, ncol=2); colnames(Emat)<-c("A", "B")

#Loop over potential E values and calculate predictive ability
#of each process for its own dynamics
for(E in 2:maxE) {
  #Uses defaults of looking forward one prediction step (predstep)
  #And using time lag intervals of one time step (tau)
  Emat[E-1,"A"]<-SSR_pred_boot(A=Accm, E=E, predstep=1, tau=1)$rho
  Emat[E-1,"B"]<-SSR_pred_boot(A=Bccm, E=E, predstep=1, tau=1)$rho
}

#Look at plots to find E for each process at which
#predictive ability rho is maximized
matplot(2:maxE, Emat, type="l", col=1:2, lty=1:2,
        xlab="E", ylab="rho", lwd=2)
legend("bottomleft", c("A", "B"), lty=1:2, col=1:2, lwd=2, bty="n")
E_A<-2
E_B<-3
#Check data for nonlinear signal that is not dominated by noise
#Checks whether predictive ability of processes declines with

#increasing time distance
#See manuscript and R code for details
signal_A_out<-SSR_check_signal(A=Accm, E=E_A, tau=1,
                               predsteplist=1:10)
signal_B_out<-SSR_check_signal(A=Bccm, E=E_B, tau=1,
                               predsteplist=1:10)
#Run the CCM test
#E_A and E_B are the embedding dimensions for A and B.
#tau is the length of time steps used (default is 1)
#iterations is the number of bootstrap iterations (default 100)
# Does A "cause" B?
#Note - increase iterations to 100 for consistent results

CCM_boot_A<-CCM_boot(Accm, Bccm, E_A, tau=1, iterations=100)
# Does B "cause" A?
CCM_boot_B<-CCM_boot(Bccm, Accm, E_B, tau=1, iterations=100)
#Test for significant causal signal
#See R function for details
(CCM_significance_test<-ccmtest(CCM_boot_A,
                                CCM_boot_B))
#Plot results
plotxlimits<-range(c(CCM_boot_A$Lobs, CCM_boot_B$Lobs))
#Plot "A causes B"
plot(CCM_boot_A$Lobs, CCM_boot_A$rho, type="l", col=1, lwd=2,
     xlim=c(plotxlimits[1], plotxlimits[2]), ylim=c(0,1),
     xlab="L", ylab="rho")
#Add +/- 1 standard error
matlines(CCM_boot_A$Lobs,
         cbind(CCM_boot_A$rho-CCM_boot_A$sdevrho,
               CCM_boot_A$rho+CCM_boot_A$sdevrho),
         lty=3, col=1)
#Plot "B causes A"
lines(CCM_boot_B$Lobs, CCM_boot_B$rho, type="l", col=2, lty=2, lwd=2)
#Add +/- 1 standard error
matlines(CCM_boot_B$Lobs,
         cbind(CCM_boot_B$rho-CCM_boot_B$sdevrho,
               CCM_boot_B$rho+CCM_boot_B$sdevrho),
         lty=3, col=2)
legend("topleft",
       c("A causes B", "B causes A"),
       lty=c(1,2), col=c(1,2), lwd=2, bty="n")


First, we read in the appropriate columns. I also decided to use just the first 200 observations from my 5-year-long dataset. While longer time series are supposed to lead to better convergence, I found that for my time series (characterized by a lot of natural variability and noise), it was best to limit the sequence to a cleaner portion of the signal.

The next steps involve determining an appropriate embedding dimension, E, for A and B (Emat in the code simply stores the results for each candidate E). The embedding dimension is similar to the number of lagged time steps that a single process needs to predict its own dynamics.

Figure 2: Predictability (rho) as a function of E

When looking at Figure 2, you can see that rho (predictability) is maximized at E = 2 for A and E = 3 for B (the values assigned to E_A and E_B in the code). The CCM_boot function runs the actual convergent cross mapping and implements a bootstrapping method, in which subsets of the time series are used to determine E and the predictability, both to make sure that the ordering of the sample is not a defining factor of the outcome and to address uncertainty in the prediction. Finally, the results are plotted in Figure 3. There are two main takeaways from this figure: the predictability of A causing B (outflow causing precipitation) is a flat line around 0, while the predictability of B causing A (precipitation causing outflow) is much higher. Furthermore, note how the predictability increases with increasing time series length, which allows for better convergence on the system dynamics. Clark et al. (2015) contains many more well-defined ecological examples demonstrating the package.

Figure 3: Predictability as a function of L for A causes B and B causes A

References:

[1] Sugihara, G., May, R., Ye, H., Hsieh, C. H., Deyle, E., Fogarty, M., & Munch, S. (2012). Detecting causality in complex ecosystems. Science, 338(6106), 496-500.

[2] McCracken, J. M., & Weigel, R. S. (2014). Convergent cross-mapping and pairwise asymmetric inference. Physical Review E, 90(6), 062903.

[3] Clark, A. T., Ye, H., Isbell, F., Deyle, E. R., Cowles, J., Tilman, G. D., & Sugihara, G. (2015). Spatial convergent cross mapping to detect causal relationships from short time series. Ecology, 96(5), 1174-1181.

Special thanks to @ecoquant who suggested to look into this topic!

Writing sharable Python code part II: Unit Testing

When writing Python functions that may be used by others, it is important to demonstrate that your functions work as intended. In my last post I discussed how a proper function specification establishes a sort of contract between developers and users of functions that delineates errors in implementation (user error) from bugs in the code (developer error). While this contract provides an important chain of accountability, it does not actually contain any proof that the function works as advertised. This is where unit testing comes in. A unit test simply runs a function over a suite of test cases (sets of inputs with known expected results) to verify performance. As its name implies, a single unit test is meant to test one basic component of a code. Large or complex codes will have many sets of unit tests, each testing different elements (usually individual functions). Unit testing provides users with proof that your code works as intended and serves as a powerful tool for finding and removing any errors in your code.

In this post I’ll provide guidance on the development of unit tests and demonstrate an example for a simple python function. Material in this post represents my interpretation of content taught by Professor Walker White at Cornell, in a Python Fundamentals course that I proctored in the summer of 2020.

An example script

To illustrate unit testing I’ll examine a function called check_satisficing (in “check_satisficing.py”), which tests whether a set of performance objectives meets a set of satisficing criteria (for background on satisficing, see this post). The function returns a boolean and has three parameters:

  • objs: a numpy array containing the two objective values
  • criteria: a numpy array containing the two satisficing criteria
  • directions: a list of strings that specify whether each criterion is meant to be a lower or upper bound.

Note that the actual code for this function only takes eight lines; the rest is the function specification and precondition checks.

import numpy as np

def check_satisficing(objs, criteria, directions):
    """
    Returns whether a set of objectives meets a set of satisficing criteria
    
    Value return has type bool
    
    Parameters:
        objs: objective values from a given solution
        criteria: satisficing criteria for each objective
        directions: a list of strings containing "ge" or "le" for each 
            criteria, where ge indicates satisfication for values greater than
            or equal to the criteria and le indicates satisfaction for values 
            less than or equal to the criteria
    
    
    Examples:
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['ge', 'ge'] 
            returns True
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['le', 'le'] 
            returns False     
        objs = [.4, .4], criteria = [0.5, 0.5], directions = ['le', 'le'] 
            returns True
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['ge', 'le'] 
            returns False
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['le', 'ge'] 
            returns False
        objs = [.5, .5], criteria = [0.5, 0.5], directions = ['ge', 'ge'] 
            returns True
        objs = [.5, .5], criteria = [0.5, 0.5], directions = ['le', 'le'] 
            returns True
    
    Preconditions:
        objs is a numpy array of floats with length 2
        criteria is a numpy array of floats with length 2
        directions is a list of strings with length 2 containing either 
            "ge" or "le"
    """
    
    # check preconditions
    assert len(objs) == 2, 'objs has length ' + repr(len(objs)) + ', should be 2'
    assert len(criteria) == 2, 'criteria has length ' + repr(len(criteria)) + \
    ', should be 2'
    assert len(directions) == 2, 'directions has length ' + \
    repr(len(directions)) + ', should be 2'
    
    # check that each element has the correct type and a valid value
    for i in range(2):
        assert type(objs[i])== np.float64, 'objs element ' + str(i) + ': ' + \
        repr(objs[i]) + ', is not a numpy array of floats'
        assert type(criteria[i])== np.float64, 'criteria element ' + str(i) + \
        ': ' + repr(criteria[i]) + ', is not a numpy array of floats'
        assert type(directions[i])== str, 'directions element ' + str(i) + \
        ': ' + repr(directions[i]) + ', is not a string'
        assert directions[i] == 'ge' or directions[i] == 'le', 'directions ' + \
        str(i) + ' is ' + repr(directions[i]) + ', should be either "ge" or "le"' 
    
    
    # loop through objectives and check if each meets the criteria
    meets_criteria = True
    for i in range(2):
        if directions[i] == 'ge':
            meets_criteria = meets_criteria and (objs[i] >= criteria[i])
        else:
            meets_criteria = meets_criteria and (objs[i] <= criteria[i])
    
    return meets_criteria

Developing test cases

If you read my last post, you may notice that this specification includes an extra section called “Examples”. These examples show the user how the function is supposed to perform. They also represent the suite of test cases used to validate the function. Creating test cases is more of an art than a science, and test cases will be unique to each function you write. However, there is a set of basic rules you can follow to guide your implementation of unit testing, which I’ll outline below.

  1. Pick the simplest cases first. In this example, the simplest cases are when both objectives are above the criteria or both are below it
  2. Move to more complex cases. In this example, a more complex case could be when one objective is above and the other below, or vice versa
  3. Think about “weird” possibilities. One “weird” case for this code could be when one or both objectives are equal to the criteria
  4. Never test a precondition violation. Precondition violations are outside the scope of the function and should not be included in unit tests

Test cases should be exhaustive, and even simple codes may have many test cases. In my example above I provide seven; can you think of any more that are applicable to this code?

Implementing a testing strategy

After you’ve developed your test cases, it’s time to implement your unit test. For demonstration purposes I’ve written my own unit test code which can be used to test the cases developed above. This function simply uses assert statements to check whether each test input generates the correct output. A danger of writing your own testing function is that the test function itself may have errors. In practice, it’s easiest to use an established tool such as PyTest to perform unit testing (for in-depth coverage of PyTest see Bernardo’s post from 2019); a minimal PyTest version of these cases is sketched after the example below.

def test_check_satisficing():
    """
    Unit test for the function check_satisficing
    """
    import numpy as np
    from check_satisficing import check_satisficing
    
    print("Testing check_satisficing")
    
    
    # test case 1:
    objs1 = np.array([0.5, 0.5])
    criteria1 = np.array([0.4, 0.4])
    directions1 = ['ge','ge']
    result1 = True
    
    assert (check_satisficing(objs1, criteria1, directions1)) == result1, \
    'Test 1 failed ' + repr(objs1) + ', ' + repr(criteria1) + ', ' + \
    repr(directions1) + ' returned False, should be True'
    
    # test case 2:
    objs2 = np.array([0.5, 0.5])
    criteria2 = np.array([0.4, 0.4])
    directions2 = ['ge','le']
    result2 = False
    
    assert (check_satisficing(objs2, criteria2, directions2)) == result2, \
    'Test 2 failed ' + repr(objs2) + ', ' + repr(criteria2) + ', ' + \
    repr(directions2) + ' returned True, should be False'
    
    
     # test case 3:
    objs3 = np.array([0.4, 0.4])
    criteria3 = np.array([0.5, 0.5])
    directions3 = ['le','le']
    result3= True
    
    assert (check_satisficing(objs3, criteria3, directions3)) == result3, \
    'Test 3 failed ' + repr(objs3) + ', ' + repr(criteria3) + ', ' + \
    repr(directions3) + ' returned False, should be True'
    
    
     # test case 4:    
    objs4 = np.array([0.5, 0.5])
    criteria4 = np.array([0.4, 0.4])
    directions4 = ['ge','le']
    result4 = False
    
    assert (check_satisficing(objs4, criteria4, directions4)) == result4, \
    'Test 4 failed ' + repr(objs4) + ', ' + repr(criteria4) + ', ' + \
    repr(directions4) + ' returned True, should be False'
    
    
    # test case 5    
    objs5 = np.array([0.5, 0.5])
    criteria5 = np.array([0.4, 0.4])
    directions5 = ['le','ge']
    result5 = False
    
    assert (check_satisficing(objs5, criteria5, directions5)) == result5, \
    'Test 5 failed ' + repr(objs5) + ', ' + repr(criteria5) + ', ' + \
    repr(directions5) + ' returned True, should be False'
    
    
    # test case 6: 
    objs6 = np.array([0.5, 0.5])
    criteria6 = np.array([0.5, 0.5])
    directions6 = ['ge','ge']
    result6 = True
    
    assert (check_satisficing(objs6, criteria6, directions6)) == result6, \
    'Test 6 failed ' + repr(objs6) + ', ' + repr(criteria6) + ', ' + \
    repr(directions6) + ' returned False, should be True'
    
    # test case 7: 
    objs7 = np.array([0.5, 0.5])
    criteria7 = np.array([0.5, 0.5])
    directions7 = ['le','le']
    result7 = True
    
    assert (check_satisficing(objs7, criteria7, directions7)) == result7, \
    'Test 7 failed ' + repr(objs7) + ', ' + repr(criteria7) + ', ' + \
    repr(directions7) + ' returned False, should be True'
    
    
    print("check_satisficing has passed all tests!")

    return    
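
For comparison, here is a minimal sketch of how a few of these cases could be written with PyTest's parametrize decorator (assuming check_satisficing.py sits next to the test file; it would be run with the pytest command):

import numpy as np
import pytest

from check_satisficing import check_satisficing

# each tuple below is one of the example cases from the function specification
@pytest.mark.parametrize("objs, criteria, directions, expected", [
    (np.array([0.5, 0.5]), np.array([0.4, 0.4]), ['ge', 'ge'], True),
    (np.array([0.5, 0.5]), np.array([0.4, 0.4]), ['le', 'le'], False),
    (np.array([0.4, 0.4]), np.array([0.5, 0.5]), ['le', 'le'], True),
    (np.array([0.5, 0.5]), np.array([0.5, 0.5]), ['ge', 'ge'], True),
])
def test_check_satisficing(objs, criteria, directions, expected):
    assert check_satisficing(objs, criteria, directions) == expected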

Concluding thoughts

When developing a new Python function, it’s good practice to code the test cases while you’re writing the function and to test each component during the development cycle. Coding in this manner will allow you to identify and fix errors early and can save a lot of time and headache. Creating test cases during code development may also improve the quality of your code by helping you conceptualize and avoid potential bugs.