This post is meant to be an introduction to the concept of data augmentation. Data augmentation is the process of increasing the size of your data through small modifications to the original dataset. In instances where data availability are small (basically every ML application), this technique is especially useful to create more training data that can lead to a more robust model that isn’t as susceptible to overfitting. Let’s begin with an example that will demonstrate why data augmentation is useful in image classification. Imagine that you have trained your model to distinguish between images of cats and dogs. The figure on the left is of a very good boy named Lincoln and this image resides in the training set. Let’s suppose that the image in the middle is in the test set. To humans, this is very clearly Lincoln (and a dog) once again, but if the algorithm hasn’t seen many images of dogs in this position, there is a chance that it won’t classify this image correctly and may think that Lincoln looks more like this cat in the training set that has a similar orientation and ears.
However, if I were to augment my original image in the training set by rotating, scaling, and shifting it, as shown below, perhaps my model would be more likely to classify Lincoln correctly as a dog having been trained on these variations. Various studies have demonstrated the benefits of this augmentation in image processing applications.
This is a very simple example to demonstrate that limited data availability need not preclude the ability to make robust predictions. It is not a far stretch to wonder how data augmentation may be utilized for regression-based prediction problems, especially in the water resources field where we have limited data. Particularly, it is hard for us to predict extremes because we have such few data points to characterize them. This style of problem is inherently more complicated than classification because time series have a temporal structure and are connected to underlying (sometimes physical) relationships. Thus, this requires that any augmentation does not completely change the fundamental characteristics of the data. Below are some examples of techniques that could be useful, but these are extremely case-specific and require a strong understanding of the behavior of your system. Before you implement any of these techniques, first make sure to split your data into the training and test set. Then feel free to add variations to the training set and test them out!
Block Bootstrapping
Bootstrapping (sampling with replacement) single points your dataset can only be done if each point is independent. This is not the case with time series data that has a temporal structure. Thus, it is more appropriate to utilize block-bootstrapping. This technique involves resampling blocks of continuous data from the original training data to make a new training dataset. By using large continuous blocks, we are preserving the inherent structure in the data, while allowing our algorithm to see new data (the original data in a new order).
Jittering with Noise
A small sample size doesn’t give us the opportunity to map out the rich input-output space that characterizes our system. Often, adding a little bit of random noise to your training data can help expand your understanding of the space. If your system exhibits highly non-linear behavior, you have to be extra careful that the noise that you are adding is realistic. For example, in a rainfall-runoff model, the fluctuations of temperature and precipitation are very different. Small changes in precipitation can result in very large overall streamflow changes, whereas temperature often fluctuates widely during the day with very little effect on streamflow. Therefore, adding the same amount of noise to each feature and the output may not make sense. It is a non-trivial effort, but it could be interesting to determine how to appropriately add noise to features that exhibit different behavior.
Interpolation
If you want to augment training data that has clear trends, interpolation between data points can be a viable option that won’t distort these trends. However, using a linear interpolation method sets the underlying assumption that your data are linear; for instance that a linear change in precipitation and temperature leads to a linear change in streamflow. This is likely not the case, so interpolation may not be a useful data augmentation technique for a rainfall-runoff regression-based model. However, interpolation could be useful in a less sensitive classification-based model.
Decomposition
Decomposition methods generally decompose time series signals by extracting features or underlying patterns from the training data. These features can either be used independently or recombined with noise and the old training data to generate new training data. Decomposition can be preformed in either the time or frequency domain. Within the decomposition domain lies manifold-based techniques as well. A study by Forestier et al., 2017 calculate a weighted average that reflects the manifold of the
original data and use it as new data with high success.
Implementation
All of these techniques have shown success in very specific time series applications: those related to speech, audio, and gait recognition, and specifically for classification-based models. Very little has been published on regression-based models and the use of data augmentation in the water resources community seems nonexistent.
Below, I implemented one of these techniques using the CNN that I fit in my prior post. The results for the baseline prediction of streamflow are shown in the first panel. Then I tried a data augmentation scenario. I took the training set and reversed it and kept the first 1000 data points (which have more extremes). I then took these points and concatenated them to the original training set. The new results are shown in the second panel. The CNN trained with the additional augmented data does a much better job of capturing the extremes in the test set, which is what we often are interested in. There is a lot of work to be done and more complicated methods to explore, but these initial results look interesting and suggest that data augmentation can be useful in our field!
References
Forestier, G., Petitjean, F., Dau, H. A., Webb, G. I., & Keogh, E. (2017, November). Generating synthetic time series to augment sparse datasets. In 2017 IEEE international conference on data mining (ICDM) (pp. 865-870). IEEE.
Oh, C., Han, S., & Jeong, J. (2020). Time-series Data Augmentation based on Interpolation. Procedia Computer Science, 175, 64-71.