CNNs for Time Series Applications

This post is meant to be an introduction to convolutional neural networks (CNNs) and how they can be applied to continuous prediction problems, such as time series predictions. CNNs have historically been utilized in image classification applications. At a high level, CNNs use small kernels (filters) that can slide over localized regions of an image and detect features from edges to faces, much in the same way as the visual cortex of a brain (Hubel and Wiesel, 1968). The basic concepts of a CNN were first introduced by Kunihiko Fukushima in 1980 and the first use of CNNs for image recognition were carried out by Yann LeCun in 1988. The major breakthrough for the algorithm didn’t happen until 2000 with the advent of GPUs and by 2015, CNNs were favored to win image recognition contests over other deep networks.

It is believed that recurrent style networks such as LSTMs are the most appropriate algorithms for time series prediction, but studies have been conducted that suggest that CNNs can perform equivalently (or better) and that appropriate filters can extract features that are coupled across variables and time while being computationally efficient to train (Bai et al., 2018, Rodrigues et al., 2021). Below, I’ll demonstrate some of the key characteristics of CNNs and how CNNs can be used for time series prediction problems.

Architecture

Everything You Need to Know About Convolutional Neural Networks

Figure 1: CNN schematic for image classification (Sharma, 2018)

Figure 1 shows a schematic of a CNN’s architecture. The architecture is primarily comprised of a series of convolution and pooling layers followed by a fully connected network. In each convolution layer are kernel matrices that are convolved with the input into the convolution layer. It is up to the user to define the number of kernels and size of the kernels, but the weights in the kernel are learned using backpropagation. A bias is added to the output of the convolution layer and then passed through an activation function, such as ReLU function to yield feature maps. The feature maps are stacked in a cuboid of a depth that equals the number of filters. If the convolution layer is followed by a pooling layer, the feature maps are down-sampled to produce a lower dimensional representation of the feature maps. The output from the final pooling or convolutional layer is flattened and fed to the fully connected layers.

We will now look at the components of the architecture in more detail. To demonstrate how the convolutional layer works, we will use a toy example shown in Figure 2.

Figure 2: Convolution of a 3×3 kernel with the original image

Let’s say that our input is an image is represented as a 5×5 array and the filter is a 3×3 kernel that will be convolved with the image. The result is the array termed Conv1 which is just another array where each cell is the dot product between the filter and the 3×3 subsections of the image. The numbers in color represent the values that the filter is centered on. Note that the convolution operation will result in an output that is smaller than the input and can result in a loss of information around the boundaries of the image. Zero padding, which constitutes adding border of zeros around the input array, can be used to preserve the input size. The kernel matrices are the mechanisms by which the CNN is able to identify underlying patterns. Figure 3 shows examples of what successive output from convolution layers, or feature maps, can look like.

Figure 3: Convolutional layer output for a CNN trained to distinguish between cats and dogs (Dertat, 2017)

The filters in the first convolutional layer of a CNN retain most of the information of the image, particularly edges. The brightest colors represent the most active pixels. The feature maps tend to become more abstract or focused on specific features as you move deeper into the network (Dertat, 2017). For example, Block 3 seems to be tailored to distinguish eyes.

The other key type of layer is a pooling layer. A pooling layer is added after convolution to reduce dimensionality, which can both reduce computational time to train by reducing parameters but can also reduce the chances of overfitting. The most common type of pooling is max pooling which returns the max value in a NxN matrix pooling filter. This type of pooling retains the most active pixels in the feature map. As demonstrated in Figure 4, max pooling, using a 2×2 filter with a stride (or shift) of 2 pixels, reduces our Conv1 layer into a 2×2 lower dimensional matrix. One can also do average pooling instead of max pooling which would take the average of the values in each 2×2 subsection of the Conv1 layer.

Figure 4: Max pooling example

Application to Regression

CNNs are easiest to understand and visualize for image applications which provide a basis for thinking about how we can use CNNs in a regression or prediction application for time series. Let’s use a very simple example of a rainfall-runoff problem that uses daily precipitation and temperature to predict outflow in an ephemeral sub-basin within the Tuolumne Basin. Because the sub-basin features a creek that is ephemeral, this means that the creek can dry up across the simulation period and there can be extended periods of zero flow. This can make predictions in the basin very difficult. Here, we also implement a lag which allows us to consider the residence time of the basin and that precipitation/temperature from days before likely will contribute to predicting the outflow today. We use a lag of 18, meaning that we use the previous 18 values of precipitation and temperature to predict outflow. The CNN model is implemented within Keras in the code below.

#import modules

import numpy as np
import pandas as pd
from keras.utils import to_categorical
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense
from keras.layers.convolutional import Conv1D, Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers import Dropout, Activation, Flatten
from keras.optimizers import SGD
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook
import seaborn as sns
import os

os.getcwd()
os.chdir("C:/Users/Rohini/Documents/")
df_ge = pd.read_csv("Sub_0_daily.csv", index_col=0) 
print(df_ge.head())

#Check for nulls
print("checking if any null values are present\n", df_ge.isna().sum())

#Specify the training columns by their names
train_cols = ["Precipitation","Temperature"]
label_cols = ["Outflow"]



# This function normalizes the input data
def Normalization_Transform(x):
    x_mean=np.mean(x, axis=0)
    x_std= np.std(x, axis=0)
    xn = (x-x_mean)/x_std
    return xn, x_mean,x_std




# This function reverses the normalization 
def inverse_Normalization_Transform(xn, x_mean,x_std):
    xd = (xn*x_std)+x_mean
    return xd



# building timeseries data with given timesteps (lags)
def timeseries(X, Y, Y_actual, time_steps, out_steps):
    input_size_0 = X.shape[0] - time_steps
    input_size_1 = X.shape[1]
    X_values = np.zeros((input_size_0, time_steps, input_size_1))
    Y_values = np.zeros((input_size_0,))
    Y_values_actual = np.zeros((input_size_0,))
    
    for i in tqdm_notebook(range(input_size_0)):
        X_values[i] = X[i:time_steps+i]
        Y_values[i] = Y[time_steps+i-1, 0]
        Y_values_actual[i] = Y_actual[time_steps+i-1, 0]
        
    print("length of time-series i/o",X_values.shape,Y_values.shape)
    return X_values, Y_values, Y_values_actual


df_train, df_test = train_test_split(df_ge, train_size=0.8, test_size=0.2, shuffle=False)
x_train = df_train.loc[:,train_cols].values
y_train = df_train.loc[:,label_cols].values
x_test = df_test.loc[:,train_cols].values
y_test = df_test.loc[:,label_cols].values    
   
#Normalizing training data
x_train_nor = xtrain_min_max_scaler.fit_transform(x_train)
y_train_nor = ytrain_min_max_scaler.fit_transform(y_train)

# Normalizing test data
x_test_nor = xtest_min_max_scaler.fit_transform(x_test)
y_test_nor = ytest_min_max_scaler.fit_transform(y_test)

# Saving actual train and test y_label to calculate mean square error later after training
y_train_actual = y_train
y_test_actual = y_test

#Building timeseries
X_Train, Y_Train, Y_train_actual = timeseries(x_train_nor, y_train_nor, y_train_actual, time_steps=18, out_steps=1)
X_Test, Y_Test, Y_test_actual = timeseries(x_test_nor, y_test_nor, y_test_actual, time_steps=18, out_steps=1)

#Define CNN model

def make_model(X_Train):
    input_layer = Input(shape=(X_Train.shape[1],X_Train.shape[2]))

    conv1 = Conv1D(filters=16, kernel_size=2, strides=1,
                    padding='same',activation='relu')(input_layer)
    conv2 = Conv1D(filters=32, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv1)
    conv3 = Conv1D(filters=64, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv2)
    flatten = Flatten()(conv3)
    dense1 = Dense(1152, activation='relu')(flatten)
    dense2 = Dense(576, activation='relu')(dense1)
    output_layer = Dense(1, activation='linear')(dense2)
    
    return Model(inputs=input_layer, outputs=output_layer)

model = make_model(X_Train)
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(X_Train, Y_Train, epochs=10)


#Prediction and inverting results 
ypred = model.predict(X_Test)
predict =inverse_Normalization_Transform(ypred,y_mean_train, y_std_train)


#Plot results
plt.figure(figsize=(11, 7))

plt.plot(y_test)
plt.plot((predict))

plt.title('Outflow Prediction (Precipitation+Temperature,Epochs=10, Lag=18 hours)')
plt.ylabel('Outflow (cfs)')
plt.xlabel('Day')
plt.legend(['Actual Values','Predicted Values'], loc='upper right')
plt.show()

    

Just as with any algorithm, we normalize the input data and split it into testing and training sets. The CNN model is implemented in Keras and consists of three convolutional layers with kernel sizes that are explicitly defined to extract patterns that are coupled across variables and time. A schematic of the setup is shown in Figure 5.

Figure 5: Convolution layer setup for the Tuolumne case

Layer 1 uses a 1D convolutional layer with 16 filters of size 1×2 in order to extract features and interactions across the precipitation and temperature time series as demonstrated in the top left of Figure 5. The result of this is an output layer of 1x18x16. The second convolution layer uses 32, 3×1 filters which now will further capture temporal interactions down the output column vector. The third layer uses 64, 3×1 filters to capture more complex temporal trends which is convolved with the output from the Conv2 layer. Note that zero padding is added (padding =”same” in the code) to maintain the dimensions of the layers. The three convolutional layers are followed by a flattening layer and a three-layer dense network. The CNN was run 20 times and the results from the last iteration are shown in Figure 6. We also compare to an LSTM that has an equivalent 3-layer setup and that is also run 20 times. The actual outflow is shown in blue while predictions are shown in red.

Figure 6: CNN vs LSTM prediction

For all purposes, the visual comparison yields that CNNs and LSTMs work equivalently, though the CNN was considerably faster to train. Notably, the CNN does a better job of capturing the large extremes recorded on day 100 and day 900, while still capturing the dynamics of the lower flow regime. While these results are preliminary and largely un-optimized, the CNN shows the ability to outperform an LSTM for a style of problem that it is not technically designed for. Using the specialized kernels, the CNN learns the interactions (both across variables and temporally) without needing a mechanism specifically designed for memory, such as a cell state in an LSTM. Furthermore, CNNs can greatly take advantage of additional speedups from GPUs which doesn’t always produce large gain in efficiency for LSTM training. For now, we can at least conclude that CNNs are fast and promising alternatives to LSTMs that you may not have considered before. Future blog posts will dive more into the capabilities of CNNs in problems with more input variables and complex interactions, particularly if there seems to be a benefit from CNNs in resolving complex relationships that help to predict extremes.

References

Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology195(1), 215-243.

Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Rodrigues, N. M., Batista, J. E., Trujillo, L., Duarte, B., Giacobini, M., Vanneschi, L., & Silva, S. (2021). Plotting time: On the usage of CNNs for time series classification. arXiv preprint arXiv:2102.04179.

Sharma, V. (2018). https://vinodsblog.com/2018/10/15/everything-you-need-to-know-about-convolutional-neural-networks/

Dertat, A. (2017). https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2

Potential Biases and Pitfalls of Machine Learning

Artificial intelligence has already changed the world in many ways. It is very difficult to find a manmade tool that does not use AI in one way or another. It can be argued that many of these changes have been positive; however, it can also easily be shown that AI has not benefited everybody evenly.

In terms of the application of AI in science, the number of publications on this topic has significantly increased during the last few years. I searched for the words water, machine, and learning on Google Scholar and limited the search to only count the studies with all of these words in their titles. I created the following figure from the results of that simple search. As you can see, the number of publications has significantly increased in the last few years. This is obviously not a thorough study, but it shows how important artificial intelligence has become in academic research.

However, there are many pitfalls that can degrade the usefulness of AI in academic and non-academic applications. Some people even argue that these pitfalls could potentially lead us to a new AI winter that nobody will enjoy. In this blog post, I decided to go over some of these issues and difficulties. Some of the problems have emerged recently, but others have been known by statisticians for decades. For many reasons, they continue to exist and negatively affect artificial intelligence projects.

Any machine learning project usually includes the following components: 1) sample collection and preparation, 2) ML model generation, and 3) result interoperation and extraction of actionable insights. Here, I will talk about some of the mistakes and biases that can happen in each of these steps.

1- Sample Collection and Preparation

The input dataset is a crucial source of bias in any ML project. The problem basically occurs when the dataset is not a good representation of the real world. Here are a few variants of this issue:

Exclusion Bias

This happens when specific groups that exist in the real world are systematically absent in the dataset. For example, a specific part of the population can be absent in the dataset, or the dataset can only contain information about particular countries, climates, etc. Having a diverse group that represents various segments of society and different worldviews can reduce the consequences of this type of bias.

Measurement Bias

There are different reasons that observations can be faulty and unreliable. For example, observations can be sensed and collected through malfunctioning measurement devices that can bias the entire dataset. Also, human reading errors and mistakes can cause measurement errors. Additionally, there are many things that can go wrong during the post-processing of raw measurements and can lead to measurement bias.

Other Sampling Biases

There are also other types of data collection biases that have been discussed in statistical literature for years. One example of these is self-selection bias. Let’s imagine that you are running an online poll and want to analyze the results, but your participants are not fully random. People who choose to participate in your poll might have specific personalities or worldviews and not represent the population at large. Other biases that have been widely discussed in statistical literature include survivor bias, prejudice bias, observer bias, and recall bias. You can find more information about these biases here, here, and here. If possible, these biases should be avoided. Otherwise, their effects should be estimated and removed from the analysis. Also, acquiring more comprehensive datasets can reduce these types of biases.

Imbalanced Datasets

If the data collection process is fair and reliable, the imbalanced dataset is not really a bias: it’s more of a challenge that is inherent to specific problems. Some events just don’t happen frequently. The problems and research questions that focus on rare events need their own appropriate treatments. In a previous blogpost, I explained this issue and the ways that we can deal with this problem. You can also refer to this article and this article for information about imbalance datasets.

Data Leakage

Data leakage is a situation in which we mistakenly share a portion (or the entirety) of the data between the training and validation datasets. Data leakage can happen due to different types of mistakes during data preparation. Timeseries can also implicitly include data leakage. For example, if training and testing datasets are selected in a way that allows the training dataset to gain information about the removed period, we have implicitly leaked data. For example, if we extract the testing dataset from the middle of the original timeseries, we basically give the model insight about what happens before and after the testing period, which is basically cheating and can reduce the accuracy of the model. One way to avoid this is to make sure that our training data period precedes the validation period.

Another type of leakage is called feature-leakage, and it happens when our feature list includes a variable that is either closely correlated with data labels or is a duplicate of the data label. Obviously, we won’t have that feature during the prediction. Therefore, our trained model will be highly biased.

More information about data leakage can be found here and here.

2- ML and Statistical Models

Lack of Physical Understanding of the System

In machine learning, we use statistical relationships to explore how the system would behave under a specific condition. However, ML models do not gain any explicit knowledge about the actual physical laws that govern the system. For example, in hydrology, it is very possible to violate water and energy conservation laws while training and using machine learning-driven models. However, there are studies (here and here) that have been trying to incorporate these insights into their ML model. This issue is likely to attract more attention in the future.

Reproducibility

There is a general consensus among ML researchers that many ML-based studies are not reproducible. This means that an individual researcher cannot replicate the ML process and find the same results. Many recent studies have shown that ML scientists are often not able to reproduce the results of other studies. While some people refer to this as the reproducibility crisis in ML studies, it is more of a general academic research issue. Based on a 2016 survey published by Nature, 70% of scientists across various disciplines were not able to reproduce the results of other studies, and more than 50% even failed to reproduce their own previously developed results. This is obviously a big challenge and needs to be taken into consideration. More information about reproducibility challenge can be find blog posts (here and here).

Instability of the Model

Model stability is also an important challenge. As discussed earlier, models can be very efficient and accurate under specific random conditions (e.g., random seed or data sample) and will fail under different conditions. This problem is also closely related to the reproducibility challenge. Using different random seeds and developing model ensembles can provide a solution to this problem. The solution also depends on the inherent nature of the ML algorithm. For example, this blog post nicely explains the issue of stability and its solution in K-Means clustering.

Simpson’s Paradox

Simpson’s Paradox describes a situation in which our analysis generates a different conclusion if we investigate each individual group separately while comparing them to the situation where these groups are all combined. The following figure shows an example of this situation under a simple linear regression. The implication of this phenomenon should be taken into account when we make decisions about labels and when we interpret our results.

Correlation Bias

It is important to make sure that there is no significant correlation between different features, which can lead to the creation of unstable and unreliable ML models. This is the same issue as the widely discussed multicollinearity problem in regression. However, removing feature columns needs to be done carefully in order to avoid significant loss of information.

Other Statistical and ML Biases

There are various ways that machine learning studies and statistical analysis studies can be biased and misleading, and P-hacking is one of the widely discussed pathways of that in hypothesis testing. P-hacking basically means that the researcher changes the definition of the problem or the significance threshold to prove a positive relationship. However, when reporting the results of the analysis, only the successful experiment gets reported to the scientific community.

Another issue occurs when the researcher works on a big dataset with many features. The researcher keeps exploring different parts of the dataset to find a statistically significant positive relationship and fails to do so; after enough digging, though, a positive relationship can be found. However, when reporting the data through scientific publications, only the significant positive results get reported and not the number of times that the research attempted to find them. This issue is often referred to as the multiple comparisons problem or the look-elsewhere effect.

There are many studies that show that p-hacking is a common problem and has significantly biased many scientific findings, but other studies exist that argue that, while p-hacking is common in scientific studies, it does not significantly change the big picture of what we eventually learn from scientific literature.

We should also always keep in mind that ML and generally data mining analyses are inherently exploratory, not confirmatory. It means that ML studies can only generate a hypothesis, while separate statistical testing is necessary to confirm the theory (more on this can be found here and here). This can also be seen as another aspect of the reproducibility problem in ML. Machine learning methods do not provide insights into causal relationships. Therefore, ML-driven methods can give us high predication accuracy under a specific condition, but their accuracy might significantly drop under a different condition, and we need to pay attention to this issue.

Performance Metrics

It is crucial to take into account different performance aspects of an ML model, and, to do that, we often need to consider various metrics (refer to this blog post for more information). As I discussed in this previous blog post, it is also important to carefully think about which metrics are really desirable for the problem at hand. For example, the decision about focusing on specificity vs. sensitivity should be determined by examining the nature of our problem. The multiple metrics issue has been widely discussed in past hydrology literature, but it has also been ignored in many water-related studies.

3- Results Interoperation and Extraction of Actionable Insights

Overinterpretation and Generalization Bias

The ML models are trained and tested for a specific dataset. Therefore, when they are used in the real-world application for prediction, they may or may not be as reliable. Therefore, we should also be careful about how we extrapolate the fidelity of our model to new datasets.

Confirmation Bias

This bias takes various forms. For example, we are subconsciously biased toward what we already think is the truth that will show up in the final results of our analysis. Also, we can unknowingly lean toward the results that are shinier and more groundbreaking. In addition, we might want to make our results more interesting for our stakeholders, clients, and the organizations we work for. These are some of the biases that we need to avoid in any type of research and certainly in ML projects.

Interpretability

ML models can benefit from powerful and sophisticated algorithms that are built based on comprehensive lists of features. However, that does not guarantee a decent level of interoperability. In fact, it might make the model and its results harder to understand. I won’t get into the details of this here, but there are many approaches that have been used to control model complexity and interpretability (here and here).

Publication Bias

Scientific journals and media outlets are less interested in publishing results that show non-significant causal relationships. In other words, research efforts that find negative results are less likely to get published. This phenomenon affects academic culture and creates pressure to find and submit positive results, and it also leads to the publication of positive results that have been found by chance. Publication bias has been recognized and talked about for a long time. The issue was first introduced by Theodore Sterling (1959), but it continues to harm the products of academia today.

There is another type of bias that frequently happens in publications. There is always a tendency toward showing that the model developed in the study outperforms popular state-of-the-art algorithms. The issue is that the models developed by the authors often benefit from careful fine-tuning of hyperparameters, while a generic type of the popular model is used in the comparison. This creates an unfair comparison. Refer to this blog post and this publication for more on this type of bias.

Also, ML models can yield a very good response under a specific condition (e.g., a random seed) and fall apart under a different condition (e.g., a different random seed). For example, this problem frequently happens in unsupervised clustering. Running the model with multiple random seeds and starting points is one of the ways to avoid this.

How Can We Address These Limitations?

As discussed in this blog post, ML studies can be negatively affected by various dimensions of these problems, but there are ways to respond to them. For example, biases in data collection can be improved through promoting diversity in research teams (more on this here, and here) or paying more attention to the design of the experiment. Careful definition of the problem and scope of the study also seems to be critical. It would be idealistic, but a cultural shift in the academic world that results in giving more credit to studies that find negative results would improve some of these problems. Finally, effectively using multiple performance metrics, synthetic data, and visual analytics should also benefit ML models.

Making the Most of Predictor Data for Machine Learning Applications

This blog post intends to demonstrate some machine learning tips that I have acquired through my PhD so far that I hope can be helpful to others, especially if you are working in water resources applications/in basins with complex hydrology/with limited data.

Addressing Correlation and Multicollinearity

The first step in any machine learning problem is often to identify the information that will be most relevant to what you are predicting for your output. If you are in the rare situation where you have a large number of inputs available, the best course of action is not to assume that every single one of those inputs are improving the predictability of your model. If your inputs are highly collinear, then ultimately, you are adding dimensionality to your model without receiving any predicting power in return. A couple suggestions include:

  1. Assess the correlation of your inputs with your outputs in the training set. It’s a good idea to just note which variables are most correlated with your output, because those will likely be extremely important for prediction.
  2. As you add inputs, you should check for multicollinearity among the selected input variables. You can calculate a correlation matrix that will show correlation among the variables and develop a criterion for dropping highly correlated variables. You can also calculate the Variance Inflation Factor (VIF) which represents how the variance of the output is attributed to variance of single inputs.

There are many algorithms available for information selection, including an Iterative Input Selection (IIS) algorithm which uses a regression tree approach to iteratively select candidate inputs based on correlation and returns the most predictive subset of inputs.

Incorporation of Time as an Input Variable

Water resources applications tend to be characterized by some periodicity due to seasonality so it may be obvious that including the day/month of the year as inputs can provide information about the cyclic nature the data. What may be less obvious is making sure that the machine learning model understands the cyclic nature of the data. If raw day or month values are implemented (1-365 or 1-12) as predictor variables, the gives the impression that day 1 is vastly different from day 365 or that January and December are the least related months, which is not actually the case and can send the wrong message to the algorithm. The trick is to create two new features for each of the day and month time series that take the sin and sin and cosine of each value. Both sin and cosine are needed to get a unique mapping for each. If we plot this, you can see that the cyclic nature is preserved compared to the uneven jumps in the first figure.

Aggregation of Variables

When working with rainfall-runoff prediction, you might see peaks in runoff that are somewhat lagged from the precipitation event of interest. This can create difficulties for the machine learning model, which potentially could see a zero value for precipitation on a day with an increased outflow. This can be due to residence time and different types of storage in the basin. Using a memory-based model such as an LSTM and/or an aggregation of past precipitation over various timescales can help to capture these effects.

  1. Immediate Effects: Aggregation of precipitation over the last few hours to a few days will likely capture run-off responses and a few initial ‘buckets’ of storage (e.g. canopy, initial abstraction, etc.)
  2. Medium-Term Effects: Aggregation over a few days to weeks might capture the bulk of soil storage layers (and movement between them and the stream (interflow) or deep groundwater)
  3. Long-Term Effects: Ideally, we would have snow information like SWE or snowpack depth if it is relevant, but in my test cases, this information was not available. Aggregation of precipitation over the past a water year is a proxy that can be used to capture snow storage.
Interaction Variables

Often time, classifiers consider variables to be independent, but often times, the interaction of variables can explain specific events that would not be captured by the variables alone. Think of an instance in which you are trying to predict yield based on water and fertilizer input variables. If you have no water, but you have fertilizer, the yield will be zero. If you have water and no fertilizer, the yield will be some value greater than zero. However, if you have both water and fertilizer, the yield will be greater than what each variable can produce alone. This is a demonstration of interaction effects, and the way to encode this as a variable is to add another input that is amount of water *amount of fertilizer. With this term, the model can attempt to capture these secondary effects. In a snow-driven watershed with limited information, interactions terms can be used to help predict runoff as follows:

  1. For a given amount of precipitation the flow will change based on temperature. For instance, if a watershed receives 100mm of precipitation at 1C there will be lots of runoff. If a watershed receives 100mm of precipitation at -10C, there will be very little runoff because that precipitation will be stored a snow. However, if there is 100mm of precipitation at 10C, this will equate to lots of runoff + lots of snowmelt. (interaction term=precipitation*temperature)
  2. For a given amount of precipitation and temperature the flow will change based on time of year. For instance, 100mm of precipitation at 10C in October will lead to lots of run-off. However, that same amount of precipitation at 10C in February can create flooding events not only from the precipitation but the additional rain-on-snow events. (interaction term=precipitation*temperature*sin(month)*cos(month))
  3. Even independent of precipitation the response of the watershed to various temperatures will be affected by the month of the water year/day of year. A temperature of 30C in April will probably lead to a flood event just from the snowmelt it would trigger. That same temperature in September will likely lead to no change in runoff. (interaction term=temperature*sin(month)*cos(month))

Hopefully these tips can be useful to you on your quest for making the most of your machine learning information!