Making the Most of Predictor Data for Machine Learning Applications

In this post I share some machine learning tips that I have picked up during my PhD so far. I hope they can be helpful to others, especially if you are working on water resources applications, in basins with complex hydrology, or with limited data.

The first step in any machine learning problem is often to identify the information that is most relevant to the output you are predicting. If you are in the rare situation where you have a large number of inputs available, do not assume that every single one of those inputs is improving the predictive ability of your model. If your inputs are highly collinear, then you are adding dimensionality to your model without gaining any predictive power in return. A couple of suggestions:

1. Assess the correlation of your inputs with your outputs in the training set. It’s a good idea to note which variables are most correlated with your output, because those will likely be important for prediction.
2. As you add inputs, check for multicollinearity among the selected input variables. You can calculate a correlation matrix that shows the correlation between each pair of inputs and develop a criterion for dropping highly correlated variables. You can also calculate the Variance Inflation Factor (VIF), which measures how much the variance of an input's estimated regression coefficient is inflated because that input can be predicted from the other inputs: VIF = 1/(1 - R^2), where R^2 comes from regressing that input on all of the others.
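
The two checks above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not a recipe from any particular library: the function name `screen_inputs`, the variable names, and the made-up `temp_dup` column are all assumptions for the example. The VIF is computed using the fact that it equals the diagonal of the inverse of the input correlation matrix.

```python
import numpy as np

def screen_inputs(X, y, names):
    """Return input-output correlations, the input correlation matrix, and VIFs."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pearson correlation of each input column with the output.
    corr_with_y = {
        name: float(np.corrcoef(X[:, j], y)[0, 1]) for j, name in enumerate(names)
    }
    # Correlation matrix among the inputs (each column is one variable).
    corr_matrix = np.corrcoef(X, rowvar=False)
    # VIF_j = 1 / (1 - R_j^2) is the j-th diagonal entry of the inverse
    # correlation matrix. This fails if two inputs are perfectly collinear.
    vif_values = np.diag(np.linalg.inv(corr_matrix))
    vif = {name: float(v) for name, v in zip(names, vif_values)}
    return corr_with_y, corr_matrix, vif

# Synthetic, made-up data purely for illustration.
rng = np.random.default_rng(0)
n = 500
precip = rng.gamma(2.0, 5.0, n)
temp = rng.normal(10.0, 8.0, n)
temp_dup = temp + rng.normal(0.0, 0.5, n)  # hypothetical near-duplicate of temp
flow = 0.6 * precip + 0.2 * temp + rng.normal(0.0, 1.0, n)

X = np.column_stack([precip, temp, temp_dup])
corr_with_y, corr_matrix, vif = screen_inputs(X, flow, ["precip", "temp", "temp_dup"])
```

Here `temp` and `temp_dup` carry almost the same information, so both get a very large VIF while `precip` stays close to 1. A common rule of thumb is to look more closely at inputs with a VIF above roughly 5 to 10, but the right threshold depends on your problem.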

There are many algorithms available for input selection, including the Iterative Input Selection (IIS) algorithm, which uses a regression tree approach to iteratively rank and select candidate inputs and returns the most predictive subset of inputs.

Incorporation of Time as an Input Variable

Water resources applications tend to be characterized by some periodicity due to seasonality, so it may be obvious that including the day or month of the year as an input can provide information about the cyclic nature of the data. What may be less obvious is making sure that the machine learning model understands that cyclic nature. If raw day or month values (1-365 or 1-12) are used as predictor variables, this gives the impression that day 1 is vastly different from day 365, or that January and December are the least related months, which is not actually the case and can send the wrong message to the algorithm. The trick is to create two new features for each of the day and month time series by taking the sine and cosine of each value, scaled so that one full period maps to 2π (for example, sin(2π × month/12) and cos(2π × month/12)). Both sine and cosine are needed to get a unique mapping for each value. If we plot this, you can see that the cyclic nature is preserved compared to the uneven jumps in the first figure.
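
As a small sketch of the sine/cosine encoding described above, using only the Python standard library: the function name `encode_cyclic` is my own choice for the example, and the period of 12 (or 365 for day of year) is an assumption you would adjust to your data.

```python
import math

def encode_cyclic(value, period):
    """Map a cyclic value (e.g. month 1-12) to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# Encode a few months and compare distances between the encoded points.
jan, feb, jul, dec = (encode_cyclic(month, 12) for month in (1, 2, 7, 12))

dist_jan_dec = math.dist(jan, dec)  # neighbours across the year boundary
dist_jan_feb = math.dist(jan, feb)  # neighbours within the year
dist_jan_jul = math.dist(jan, jul)  # opposite sides of the year
```

With the raw month numbers, January and December are 11 units apart. After encoding, January is exactly as close to December as it is to February, and January and July sit on opposite sides of the circle.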

Aggregation of Variables

When working with rainfall-runoff prediction, you might see peaks in runoff that lag the precipitation event that caused them, due to residence time and different types of storage in the basin. This can create difficulties for the machine learning model, which could see a zero value for precipitation on a day with increased outflow. Using a memory-based model such as an LSTM and/or aggregating past precipitation over various timescales can help to capture these effects.

1. Immediate Effects: Aggregation of precipitation over the last few hours to a few days will likely capture runoff responses and a few initial ‘buckets’ of storage (e.g. canopy, initial abstraction, etc.).
2. Medium-Term Effects: Aggregation over a few days to weeks might capture the bulk of soil storage layers, and the movement between them and the stream (interflow) or deep groundwater.
3. Long-Term Effects: Ideally, we would have snow information like SWE or snowpack depth if it is relevant, but in my test cases, this information was not available. Aggregation of precipitation over the past water year is a proxy that can be used to capture snow storage.
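
The aggregations in the list above amount to rolling sums of precipitation over different windows. Below is a minimal sketch in plain Python, assuming a daily series. The function name `rolling_sum`, the window lengths, and the precipitation values are all made up for illustration.

```python
def rolling_sum(series, window):
    """Sum of the current value and the previous `window - 1` values.

    For early entries, where fewer than `window` values exist, the sum
    covers whatever values are available so far.
    """
    totals = []
    running = 0.0
    for i, value in enumerate(series):
        running += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= series[i - window]
        totals.append(running)
    return totals

# Made-up daily precipitation in mm.
daily_precip = [0.0, 12.0, 30.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 1.0]

features = {
    "precip_3d": rolling_sum(daily_precip, 3),  # immediate effects
    "precip_7d": rolling_sum(daily_precip, 7),  # medium-term effects
}
```

On the fifth day (index 4) the raw precipitation is zero, but the 3-day total is still 30 mm, so the model can see the storm that is driving that day's flow. Longer windows, such as 30 or 365 days, follow the same pattern for the medium-term and long-term effects.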

Interaction Variables

Models often treat input variables as independent, but the interaction of variables can explain specific events that would not be captured by the variables alone. Think of an instance in which you are trying to predict yield based on water and fertilizer input variables. If you have no water, but you have fertilizer, the yield will be zero. If you have water and no fertilizer, the yield will be some value greater than zero. However, if you have both water and fertilizer, the yield will be greater than what each variable can produce alone. This is a demonstration of interaction effects, and the way to encode this as a variable is to add another input that is (amount of water) * (amount of fertilizer). With this term, the model can attempt to capture these secondary effects. In a snow-driven watershed with limited information, interaction terms can be used to help predict runoff as follows:

1. For a given amount of precipitation, the flow will change based on temperature. For instance, if a watershed receives 100mm of precipitation at 1C, there will be lots of runoff. If a watershed receives 100mm of precipitation at -10C, there will be very little runoff because that precipitation will be stored as snow. However, if there is 100mm of precipitation at 10C, this will equate to lots of runoff + lots of snowmelt. (interaction term=precipitation*temperature)
2. For a given amount of precipitation and temperature, the flow will change based on time of year. For instance, 100mm of precipitation at 10C in October will lead to lots of runoff. However, that same amount of precipitation at 10C in February can create flooding events, not only from the precipitation but also from the additional rain-on-snow events. (interaction term=precipitation*temperature*sin(month)*cos(month))
3. Even independent of precipitation, the response of the watershed to various temperatures will be affected by the month of the water year/day of year. A temperature of 30C in April will probably lead to a flood event just from the snowmelt it would trigger. That same temperature in September will likely lead to no change in runoff. (interaction term=temperature*sin(month)*cos(month))
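
The interaction terms in the list above can be built as simple products. The sketch below assumes that sin(month) and cos(month) refer to the cyclic month encoding with a period of 12; the function name `interaction_features` and the feature names are my own. One thing to be aware of: sin × cos equals half of sin(2 × angle), so the product repeats every six months. The last two features, which keep sine and cosine as separate interactions, are an optional alternative I am adding for comparison, not something from the list above.

```python
import math

def interaction_features(precip_mm, temp_c, month):
    """Build interaction terms from precipitation, temperature, and month (1-12)."""
    angle = 2.0 * math.pi * month / 12.0
    s, c = math.sin(angle), math.cos(angle)
    return {
        # Terms as written in the list above.
        "precip_x_temp": precip_mm * temp_c,
        "precip_x_temp_x_season": precip_mm * temp_c * s * c,
        "temp_x_season": temp_c * s * c,
        # Optional alternative (an assumption, not from the post):
        # separate sine and cosine interactions keep months six apart distinct.
        "temp_x_sin": temp_c * s,
        "temp_x_cos": temp_c * c,
    }

april = interaction_features(precip_mm=0.0, temp_c=30.0, month=4)
september = interaction_features(precip_mm=0.0, temp_c=30.0, month=9)
october = interaction_features(precip_mm=0.0, temp_c=30.0, month=10)
wet_day = interaction_features(precip_mm=100.0, temp_c=10.0, month=2)
```

For the example in the third item, the same 30C gives a clearly non-zero `temp_x_season` in April and a value of essentially zero in September. April and October, however, get the same `temp_x_season` value, which is where the separate sine and cosine terms may help.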

Hopefully these tips can be useful to you on your quest for making the most of your machine learning information!