Artificial intelligence has already changed the world in many ways. It is very difficult to find a manmade tool that does not use AI in one way or another. It can be argued that many of these changes have been positive; however, it can also easily be shown that AI has not benefited everybody evenly.
In terms of the application of AI in science, the number of publications on this topic has significantly increased during the last few years. I searched for the words water, machine, and learning on Google Scholar and limited the search to only count the studies with all of these words in their titles. I created the following figure from the results of that simple search. As you can see, the number of publications has significantly increased in the last few years. This is obviously not a thorough study, but it shows how important artificial intelligence has become in academic research.
However, there are many pitfalls that can degrade the usefulness of AI in academic and non-academic applications. Some people even argue that these pitfalls could potentially lead us to a new AI winter that nobody will enjoy. In this blog post, I decided to go over some of these issues and difficulties. Some of the problems have emerged recently, but others have been known by statisticians for decades. For many reasons, they continue to exist and negatively affect artificial intelligence projects.
Any machine learning project usually includes the following components: 1) sample collection and preparation, 2) ML model generation, and 3) result interoperation and extraction of actionable insights. Here, I will talk about some of the mistakes and biases that can happen in each of these steps.
1- Sample Collection and Preparation
The input dataset is a crucial source of bias in any ML project. The problem basically occurs when the dataset is not a good representation of the real world. Here are a few variants of this issue:
This happens when specific groups that exist in the real world are systematically absent in the dataset. For example, a specific part of the population can be absent in the dataset, or the dataset can only contain information about particular countries, climates, etc. Having a diverse group that represents various segments of society and different worldviews can reduce the consequences of this type of bias.
There are different reasons that observations can be faulty and unreliable. For example, observations can be sensed and collected through malfunctioning measurement devices that can bias the entire dataset. Also, human reading errors and mistakes can cause measurement errors. Additionally, there are many things that can go wrong during the post-processing of raw measurements and can lead to measurement bias.
Other Sampling Biases
There are also other types of data collection biases that have been discussed in statistical literature for years. One example of these is self-selection bias. Let’s imagine that you are running an online poll and want to analyze the results, but your participants are not fully random. People who choose to participate in your poll might have specific personalities or worldviews and not represent the population at large. Other biases that have been widely discussed in statistical literature include survivor bias, prejudice bias, observer bias, and recall bias. You can find more information about these biases here, here, and here. If possible, these biases should be avoided. Otherwise, their effects should be estimated and removed from the analysis. Also, acquiring more comprehensive datasets can reduce these types of biases.
If the data collection process is fair and reliable, the imbalanced dataset is not really a bias: it’s more of a challenge that is inherent to specific problems. Some events just don’t happen frequently. The problems and research questions that focus on rare events need their own appropriate treatments. In a previous blogpost, I explained this issue and the ways that we can deal with this problem. You can also refer to this article and this article for information about imbalance datasets.
Data leakage is a situation in which we mistakenly share a portion (or the entirety) of the data between the training and validation datasets. Data leakage can happen due to different types of mistakes during data preparation. Timeseries can also implicitly include data leakage. For example, if training and testing datasets are selected in a way that allows the training dataset to gain information about the removed period, we have implicitly leaked data. For example, if we extract the testing dataset from the middle of the original timeseries, we basically give the model insight about what happens before and after the testing period, which is basically cheating and can reduce the accuracy of the model. One way to avoid this is to make sure that our training data period precedes the validation period.
Another type of leakage is called feature-leakage, and it happens when our feature list includes a variable that is either closely correlated with data labels or is a duplicate of the data label. Obviously, we won’t have that feature during the prediction. Therefore, our trained model will be highly biased.
2- ML and Statistical Models
Lack of Physical Understanding of the System
In machine learning, we use statistical relationships to explore how the system would behave under a specific condition. However, ML models do not gain any explicit knowledge about the actual physical laws that govern the system. For example, in hydrology, it is very possible to violate water and energy conservation laws while training and using machine learning-driven models. However, there are studies (here and here) that have been trying to incorporate these insights into their ML model. This issue is likely to attract more attention in the future.
There is a general consensus among ML researchers that many ML-based studies are not reproducible. This means that an individual researcher cannot replicate the ML process and find the same results. Many recent studies have shown that ML scientists are often not able to reproduce the results of other studies. While some people refer to this as the reproducibility crisis in ML studies, it is more of a general academic research issue. Based on a 2016 survey published by Nature, 70% of scientists across various disciplines were not able to reproduce the results of other studies, and more than 50% even failed to reproduce their own previously developed results. This is obviously a big challenge and needs to be taken into consideration. More information about reproducibility challenge can be find blog posts (here and here).
Instability of the Model
Model stability is also an important challenge. As discussed earlier, models can be very efficient and accurate under specific random conditions (e.g., random seed or data sample) and will fail under different conditions. This problem is also closely related to the reproducibility challenge. Using different random seeds and developing model ensembles can provide a solution to this problem. The solution also depends on the inherent nature of the ML algorithm. For example, this blog post nicely explains the issue of stability and its solution in K-Means clustering.
Simpson’s Paradox describes a situation in which our analysis generates a different conclusion if we investigate each individual group separately while comparing them to the situation where these groups are all combined. The following figure shows an example of this situation under a simple linear regression. The implication of this phenomenon should be taken into account when we make decisions about labels and when we interpret our results.
It is important to make sure that there is no significant correlation between different features, which can lead to the creation of unstable and unreliable ML models. This is the same issue as the widely discussed multicollinearity problem in regression. However, removing feature columns needs to be done carefully in order to avoid significant loss of information.
Other Statistical and ML Biases
There are various ways that machine learning studies and statistical analysis studies can be biased and misleading, and P-hacking is one of the widely discussed pathways of that in hypothesis testing. P-hacking basically means that the researcher changes the definition of the problem or the significance threshold to prove a positive relationship. However, when reporting the results of the analysis, only the successful experiment gets reported to the scientific community.
Another issue occurs when the researcher works on a big dataset with many features. The researcher keeps exploring different parts of the dataset to find a statistically significant positive relationship and fails to do so; after enough digging, though, a positive relationship can be found. However, when reporting the data through scientific publications, only the significant positive results get reported and not the number of times that the research attempted to find them. This issue is often referred to as the multiple comparisons problem or the look-elsewhere effect.
There are many studies that show that p-hacking is a common problem and has significantly biased many scientific findings, but other studies exist that argue that, while p-hacking is common in scientific studies, it does not significantly change the big picture of what we eventually learn from scientific literature.
We should also always keep in mind that ML and generally data mining analyses are inherently exploratory, not confirmatory. It means that ML studies can only generate a hypothesis, while separate statistical testing is necessary to confirm the theory (more on this can be found here and here). This can also be seen as another aspect of the reproducibility problem in ML. Machine learning methods do not provide insights into causal relationships. Therefore, ML-driven methods can give us high predication accuracy under a specific condition, but their accuracy might significantly drop under a different condition, and we need to pay attention to this issue.
It is crucial to take into account different performance aspects of an ML model, and, to do that, we often need to consider various metrics (refer to this blog post for more information). As I discussed in this previous blog post, it is also important to carefully think about which metrics are really desirable for the problem at hand. For example, the decision about focusing on specificity vs. sensitivity should be determined by examining the nature of our problem. The multiple metrics issue has been widely discussed in past hydrology literature, but it has also been ignored in many water-related studies.
3- Results Interoperation and Extraction of Actionable Insights
Overinterpretation and Generalization Bias
The ML models are trained and tested for a specific dataset. Therefore, when they are used in the real-world application for prediction, they may or may not be as reliable. Therefore, we should also be careful about how we extrapolate the fidelity of our model to new datasets.
This bias takes various forms. For example, we are subconsciously biased toward what we already think is the truth that will show up in the final results of our analysis. Also, we can unknowingly lean toward the results that are shinier and more groundbreaking. In addition, we might want to make our results more interesting for our stakeholders, clients, and the organizations we work for. These are some of the biases that we need to avoid in any type of research and certainly in ML projects.
ML models can benefit from powerful and sophisticated algorithms that are built based on comprehensive lists of features. However, that does not guarantee a decent level of interoperability. In fact, it might make the model and its results harder to understand. I won’t get into the details of this here, but there are many approaches that have been used to control model complexity and interpretability (here and here).
Scientific journals and media outlets are less interested in publishing results that show non-significant causal relationships. In other words, research efforts that find negative results are less likely to get published. This phenomenon affects academic culture and creates pressure to find and submit positive results, and it also leads to the publication of positive results that have been found by chance. Publication bias has been recognized and talked about for a long time. The issue was first introduced by Theodore Sterling (1959), but it continues to harm the products of academia today.
There is another type of bias that frequently happens in publications. There is always a tendency toward showing that the model developed in the study outperforms popular state-of-the-art algorithms. The issue is that the models developed by the authors often benefit from careful fine-tuning of hyperparameters, while a generic type of the popular model is used in the comparison. This creates an unfair comparison. Refer to this blog post and this publication for more on this type of bias.
Also, ML models can yield a very good response under a specific condition (e.g., a random seed) and fall apart under a different condition (e.g., a different random seed). For example, this problem frequently happens in unsupervised clustering. Running the model with multiple random seeds and starting points is one of the ways to avoid this.
How Can We Address These Limitations?
As discussed in this blog post, ML studies can be negatively affected by various dimensions of these problems, but there are ways to respond to them. For example, biases in data collection can be improved through promoting diversity in research teams (more on this here, and here) or paying more attention to the design of the experiment. Careful definition of the problem and scope of the study also seems to be critical. It would be idealistic, but a cultural shift in the academic world that results in giving more credit to studies that find negative results would improve some of these problems. Finally, effectively using multiple performance metrics, synthetic data, and visual analytics should also benefit ML models.