CNNs for Time Series Applications

This post is meant to be an introduction to convolutional neural networks (CNNs) and how they can be applied to continuous prediction problems, such as time series prediction. CNNs have historically been utilized in image classification applications. At a high level, CNNs use small kernels (filters) that slide over localized regions of an image and detect features ranging from edges to faces, much in the same way as the visual cortex of the brain (Hubel and Wiesel, 1968). The basic concepts of a CNN were first introduced by Kunihiko Fukushima in 1980, and the first use of CNNs for image recognition was carried out by Yann LeCun in 1988. The major breakthrough for the algorithm didn’t happen until the 2000s with the advent of GPUs, and by 2015 CNNs were favored to win image recognition contests over other deep networks.

Recurrent style networks such as LSTMs are widely believed to be the most appropriate algorithms for time series prediction, but studies suggest that CNNs can perform equivalently (or better) and that appropriate filters can extract features that are coupled across variables and time, all while being computationally efficient to train (Bai et al., 2018; Rodrigues et al., 2021). Below, I’ll demonstrate some of the key characteristics of CNNs and how they can be used for time series prediction problems.

Architecture


Figure 1: CNN schematic for image classification (Sharma, 2018)

Figure 1 shows a schematic of a CNN’s architecture. The architecture is primarily composed of a series of convolution and pooling layers followed by a fully connected network. Each convolution layer contains kernel matrices that are convolved with the input to that layer. It is up to the user to define the number and size of the kernels, but the weights in each kernel are learned using backpropagation. A bias is added to the output of the convolution layer, which is then passed through an activation function, such as the ReLU function, to yield feature maps. The feature maps are stacked in a cuboid whose depth equals the number of filters. If the convolution layer is followed by a pooling layer, the feature maps are down-sampled to produce a lower-dimensional representation. The output from the final pooling or convolutional layer is flattened and fed to the fully connected layers.

We will now look at the components of the architecture in more detail. To demonstrate how the convolutional layer works, we will use the toy example shown in Figure 2.

Figure 2: Convolution of a 3×3 kernel with the original image

Let’s say that our input is an image represented as a 5×5 array and the filter is a 3×3 kernel that will be convolved with the image. The result is the array termed Conv1, which is just another array in which each cell is the dot product between the filter and the 3×3 subsection of the image it overlaps. The numbers in color represent the values that the filter is centered on. Note that the convolution operation results in an output that is smaller than the input and can lose information around the boundaries of the image. Zero padding, which adds a border of zeros around the input array, can be used to preserve the input size. The kernel matrices are the mechanism by which the CNN identifies underlying patterns. Figure 3 shows examples of what successive outputs from convolution layers, or feature maps, can look like.
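Before moving on to the feature maps in Figure 3, here is a minimal NumPy sketch of the toy convolution in Figure 2: a 3×3 kernel slid over a 5×5 input, once without padding (which gives a 3×3 output) and once with zero padding to preserve the 5×5 size. The array values are made up purely for illustration.

import numpy as np

def conv2d(image, kernel, pad=0):
    """2D convolution as used in CNNs (technically cross-correlation), with optional zero padding."""
    if pad > 0:
        image = np.pad(image, pad, mode="constant")
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the kernel and the patch it currently covers
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25).reshape(5, 5).astype(float)   # toy 5x5 "image"
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                   # toy 3x3 edge-style filter

print(conv2d(image, kernel).shape)         # (3, 3) -> smaller than the input
print(conv2d(image, kernel, pad=1).shape)  # (5, 5) -> zero padding preserves the size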

Figure 3: Convolutional layer output for a CNN trained to distinguish between cats and dogs (Dertat, 2017)

The filters in the first convolutional layer of a CNN retain most of the information of the image, particularly edges. The brightest colors represent the most active pixels. The feature maps tend to become more abstract or focused on specific features as you move deeper into the network (Dertat, 2017). For example, Block 3 seems to be tailored to distinguish eyes.

The other key type of layer is a pooling layer. A pooling layer is added after convolution to reduce dimensionality, which both reduces the computational time to train (by reducing the number of parameters) and reduces the chances of overfitting. The most common type of pooling is max pooling, which returns the maximum value within an N×N pooling window. This type of pooling retains the most active pixels in the feature map. As demonstrated in Figure 4, max pooling with a 2×2 filter and a stride (or shift) of 2 pixels reduces our Conv1 layer to a 2×2 lower-dimensional matrix. One can also use average pooling instead of max pooling, which takes the average of the values in each 2×2 subsection of the Conv1 layer.

Figure 4: Max pooling example
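As a quick sanity check on Figure 4, here is a small NumPy sketch of 2×2 max pooling with a stride of 2; the input values are arbitrary.

import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling: keep the largest value in each size x size window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

conv1 = np.array([[1., 3., 2., 0.],
                  [4., 6., 1., 2.],
                  [0., 2., 5., 7.],
                  [1., 3., 2., 4.]])   # arbitrary 4x4 feature map

print(max_pool2d(conv1))  # 2x2 output: [[6., 2.], [3., 7.]]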

Application to Regression

CNNs are easiest to understand and visualize for image applications, which provides a basis for thinking about how we can use CNNs in a regression or prediction application for time series. Let’s use a very simple example of a rainfall-runoff problem that uses daily precipitation and temperature to predict outflow in an ephemeral sub-basin within the Tuolumne Basin. Because the creek in this sub-basin is ephemeral, it can dry up over the simulation period and there can be extended periods of zero flow, which can make predictions in the basin very difficult. Here, we also implement a lag, which allows us to account for the residence time of the basin: precipitation and temperature from previous days likely contribute to predicting today’s outflow. We use a lag of 18, meaning that we use the previous 18 days of precipitation and temperature to predict outflow. The CNN model is implemented in Keras in the code below.

#import modules

import numpy as np
import pandas as pd
from keras.utils import to_categorical
from keras.models import Sequential, load_model, Model
from keras.layers import LSTM, Dense, Input
from keras.layers import Conv1D, Conv2D, MaxPooling2D
from keras.layers import Dropout, Activation, Flatten
from keras.optimizers import SGD
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook
import seaborn as sns
import os

os.getcwd()
os.chdir("C:/Users/Rohini/Documents/")
df_ge = pd.read_csv("Sub_0_daily.csv", index_col=0) 
print(df_ge.head())

#Check for nulls
print("checking if any null values are present\n", df_ge.isna().sum())

#Specify the training columns by their names
train_cols = ["Precipitation","Temperature"]
label_cols = ["Outflow"]



# This function normalizes the input data
def Normalization_Transform(x):
    x_mean=np.mean(x, axis=0)
    x_std= np.std(x, axis=0)
    xn = (x-x_mean)/x_std
    return xn, x_mean,x_std




# This function reverses the normalization 
def inverse_Normalization_Transform(xn, x_mean,x_std):
    xd = (xn*x_std)+x_mean
    return xd



# building timeseries data with given timesteps (lags)
def timeseries(X, Y, Y_actual, time_steps, out_steps):
    input_size_0 = X.shape[0] - time_steps
    input_size_1 = X.shape[1]
    X_values = np.zeros((input_size_0, time_steps, input_size_1))
    Y_values = np.zeros((input_size_0,))
    Y_values_actual = np.zeros((input_size_0,))
    
    for i in tqdm_notebook(range(input_size_0)):
        X_values[i] = X[i:time_steps+i]
        Y_values[i] = Y[time_steps+i-1, 0]
        Y_values_actual[i] = Y_actual[time_steps+i-1, 0]
        
    print("length of time-series i/o",X_values.shape,Y_values.shape)
    return X_values, Y_values, Y_values_actual


df_train, df_test = train_test_split(df_ge, train_size=0.8, test_size=0.2, shuffle=False)
x_train = df_train.loc[:,train_cols].values
y_train = df_train.loc[:,label_cols].values
x_test = df_test.loc[:,train_cols].values
y_test = df_test.loc[:,label_cols].values    
   
#Normalizing training data (the MinMaxScaler objects must be created before use)
xtrain_min_max_scaler = MinMaxScaler()
ytrain_min_max_scaler = MinMaxScaler()
x_train_nor = xtrain_min_max_scaler.fit_transform(x_train)
y_train_nor = ytrain_min_max_scaler.fit_transform(y_train)

# Normalizing test data
xtest_min_max_scaler = MinMaxScaler()
ytest_min_max_scaler = MinMaxScaler()
x_test_nor = xtest_min_max_scaler.fit_transform(x_test)
y_test_nor = ytest_min_max_scaler.fit_transform(y_test)

# Saving actual train and test y_label to calculate mean square error later after training
y_train_actual = y_train
y_test_actual = y_test

#Building timeseries
X_Train, Y_Train, Y_train_actual = timeseries(x_train_nor, y_train_nor, y_train_actual, time_steps=18, out_steps=1)
X_Test, Y_Test, Y_test_actual = timeseries(x_test_nor, y_test_nor, y_test_actual, time_steps=18, out_steps=1)

#Define CNN model

def make_model(X_Train):
    input_layer = Input(shape=(X_Train.shape[1],X_Train.shape[2]))

    conv1 = Conv1D(filters=16, kernel_size=2, strides=1,
                    padding='same',activation='relu')(input_layer)
    conv2 = Conv1D(filters=32, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv1)
    conv3 = Conv1D(filters=64, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv2)
    flatten = Flatten()(conv3)
    dense1 = Dense(1152, activation='relu')(flatten)
    dense2 = Dense(576, activation='relu')(dense1)
    output_layer = Dense(1, activation='linear')(dense2)
    
    return Model(inputs=input_layer, outputs=output_layer)

model = make_model(X_Train)
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(X_Train, Y_Train, epochs=10)


#Prediction and inverse-transforming the results back to flow units
ypred = model.predict(X_Test)
predict = ytrain_min_max_scaler.inverse_transform(ypred.reshape(-1, 1))


#Plot results
plt.figure(figsize=(11, 7))

plt.plot(Y_test_actual)
plt.plot(predict)

plt.title('Outflow Prediction (Precipitation+Temperature, Epochs=10, Lag=18 days)')
plt.ylabel('Outflow (cfs)')
plt.xlabel('Day')
plt.legend(['Actual Values','Predicted Values'], loc='upper right')
plt.show()

    

Just as with any algorithm, we normalize the input data and split it into testing and training sets. The CNN model is implemented in Keras and consists of three convolutional layers with kernel sizes that are explicitly defined to extract patterns that are coupled across variables and time. A schematic of the setup is shown in Figure 5.

Figure 5: Convolution layer setup for the Tuolumne case

Layer 1 is a 1D convolutional layer with 16 filters of size 1×2, which extract features and interactions across the precipitation and temperature time series, as demonstrated in the top left of Figure 5. The result is an output of size 1×18×16. The second convolution layer uses 32 filters of size 3×1, which further capture temporal interactions down the output column vector. The third layer uses 64 filters of size 3×1, convolved with the output from the Conv2 layer, to capture more complex temporal trends. Note that zero padding is added (padding='same' in the code) to maintain the dimensions of the layers. The three convolutional layers are followed by a flattening layer and a three-layer dense network. The CNN was run 20 times and the results from the last iteration are shown in Figure 6. We also compare to an LSTM with an equivalent 3-layer setup, also run 20 times. The actual outflow is shown in blue while the predictions are shown in red.
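Before looking at the results in Figure 6, one quick way to verify these dimensions is to print a layer summary of the model built in the code above; the shapes in the comments are what you should expect with a lag of 18, two input variables, and padding='same'.

model = make_model(X_Train)
model.summary()
# Expected output shapes:
#   Conv1D (16 filters)  -> (None, 18, 16)
#   Conv1D (32 filters)  -> (None, 18, 32)
#   Conv1D (64 filters)  -> (None, 18, 64)
#   Flatten              -> (None, 1152)   # 18 * 64, matching the first Dense layer
#   Dense layers         -> (None, 1152), (None, 576), (None, 1)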

Figure 6: CNN vs LSTM prediction

For all practical purposes, the visual comparison suggests that the CNN and LSTM perform equivalently, though the CNN was considerably faster to train. Notably, the CNN does a better job of capturing the large extremes recorded around day 100 and day 900, while still capturing the dynamics of the lower flow regime. While these results are preliminary and largely un-optimized, the CNN shows the ability to outperform an LSTM on a style of problem that it is not technically designed for. Using the specialized kernels, the CNN learns the interactions (both across variables and in time) without needing a mechanism specifically designed for memory, such as the cell state in an LSTM. Furthermore, CNNs can take great advantage of GPU speedups, which do not always produce large efficiency gains for LSTM training. For now, we can at least conclude that CNNs are fast and promising alternatives to LSTMs that you may not have considered before. Future blog posts will dive more into the capabilities of CNNs on problems with more input variables and complex interactions, particularly to see whether CNNs offer a benefit in resolving the complex relationships that help to predict extremes.

References

Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1), 215-243.

Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Rodrigues, N. M., Batista, J. E., Trujillo, L., Duarte, B., Giacobini, M., Vanneschi, L., & Silva, S. (2021). Plotting time: On the usage of CNNs for time series classification. arXiv preprint arXiv:2102.04179.

Sharma, V. (2018). https://vinodsblog.com/2018/10/15/everything-you-need-to-know-about-convolutional-neural-networks/

Dertat, A. (2017). https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2

Visualizing High Dimensional Data Using the T-SNE Method

In this blog post, I am going to go over a machine learning method that is used to visualize high dimensional datasets. The method is called T-distributed Stochastic Neighbor Embedding, or T-SNE for short, and was developed by Laurens van der Maaten and Geoffrey Hinton (van der Maaten & Hinton, 2008). Below, I describe T-SNE and provide a simple example in R.

Two weeks ago, Dave introduced a few useful dimensionality reduction methods to visualize the Pareto front. Last week, Rohini’s blog post provided some very useful information about self-organizing maps. I guess this blog post will officially make this a series of posts on dimensionality reduction and the visualization of high dimensional datasets. If you haven’t already read these two super informative blog posts, I highly recommend you take a look at them.

T-SNE is an extension of the earlier Stochastic Neighbor Embedding (SNE) technique, and it uses an interesting approach to maintain the local relationships and similarities that exist in the original high-dimensional dataset when projecting it onto a 2-D plane. As van der Maaten and Hinton (2008) argue, other dimensionality reduction techniques (e.g., PCA) usually focus on macro distances in the dataset, which can lead to overlooking short-distance relationships. T-SNE, however, is able to capture both local and macro distances in the data structure. Therefore, it can give more reasonable results when you are dealing with non-linear, high-dimensional datasets.

T-SNE uses the following formula, which creates a Gaussian distribution centered on each data point and computes the density of all other data points with respect to that distribution. The denominator of the equation normalizes the numerator.
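For reference, the similarity that van der Maaten and Hinton (2008) define between high-dimensional points $x_i$ and $x_j$ is the conditional probability

$$p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},$$

which is then symmetrized as $p_{ij} = \left(p_{j|i} + p_{i|j}\right)/(2n)$ for a dataset of $n$ points.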

In this equation, xi is the reference point and xj is a neighboring point. Sigma is the bandwidth, chosen so that each point has the same perplexity. Perplexity is a measure of uncertainty that has a direct relationship with entropy; for more information, you can read this Wikipedia page. Basically, perplexity is a hyperparameter of T-SNE, and the final outcome can be very sensitive to its value. Also, pij is the probability of similarity between two points in the high-dimensional space: if two points are close to each other, pij will be high, and if they are far from each other, the probability of similarity is low.

To make the low dimensional map consistent with its original high dimensional dataset, T-SNE also calculates a similar point-to-point probability of similarity in the 2D map.
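In the 2-D map, with $y_i$ denoting the low-dimensional location of point $i$, the corresponding similarity in van der Maaten and Hinton (2008) is based on a Student t-distribution with one degree of freedom:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}.$$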

Basically, T-SNE moves the points on the 2D plane to find the locations that minimize the Kullback-Leibler divergence between the P distributions of the original data and the Q distributions of the 2D plane.
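Written out, the cost function that is minimized over the 2-D locations is the Kullback-Leibler divergence between the two distributions:

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$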

As you probably noticed, in the low-dimensional space T-SNE uses the Student t-distribution (with one degree of freedom). The reason is that the Student t-distribution is fat-tailed compared to the Gaussian. This allows T-SNE to separate dissimilar points with larger margins, which makes the map easier to interpret.

T-SNE Example in R

Here, I provide a simple R code example that demonstrates how you can use T-SNE to visualize high dimensional datasets. This example only has four dimensions, so it doesn’t really represent a high dimensional problem. As such, keep in mind that T-SNE shows its true capabilities when there are tens or even hundreds of dimensions.

First, you need to install the "Rtsne" package and load it.

install.packages("Rtsne")
library(Rtsne)

# We also need to load ggplot2 and ggpubr libraries

library(ggplot2)
library(ggpubr)

In this example, I use R’s famous “iris” data set. Here is how you can activate it.

# Load the iris dataset
data(iris)

Now you can use the following script to visualize the data using T-SNE.

# Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=30, max_iter=1000, check_duplicates = FALSE) # Run TSNE

# Create a data frame using IRIS values and the T-SNE values
df_to_gg<-as.data.frame(cbind(iris$Species, as.data.frame(y_tsne$Y)))

# Specify column names
names(df_to_gg)<-c("Species", "Y.1", "Y.2")

# Show the objects in the 2D T-SNE representation
ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
  labs(title=paste("T-SNE Visualization- Perplexity Number = 30" )) +
  scale_color_manual(values = c("darkblue", "orange", "pink3"))

Here is the 2-D map that our code generates:

Although T-SNE is a powerful method and has become quite popular in recent years, it might have some pitfalls. This blog post discusses some of these issues, including the sensitivity of the final map to perplexity hyperparameters or the fact that it can be tricky to interpret T-SNE maps because they might not represent the actual relationships between different clusters.

To explore the impacts of the perplexity hyperparameter on the final clusters, I use the following code to create T-SNE maps for different perplexity values.

# Here are the perplexity values that I am taking into account in my T-SNE analysis
perplexity_number_values<-c(2,5,10, 15, 30, 40)

# Initializing a plot list object
plot_list = list()

# This loop explores the effect of perplexity values on the T-SNE results
for (i_prp in 1:6){
  
  # I am setting a seed to makes sure that my results for different perplexities are not sensitive to any random factors
  set.seed(10)
  
  # Perplexity number
  perplexity_number<- perplexity_number_values[i_prp]
  
  # Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
  y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=perplexity_number, max_iter=1000, check_duplicates = FALSE) # Run TSNE
  
  # Create a data frame using IRIS values and the T-SNE values
  df_to_gg<-as.data.frame(cbind(iris$Species, as.data.frame(y_tsne$Y)))
  
  # Specify Column names
  names(df_to_gg)<-c("Species", "Y.1", "Y.2")
  
  # Show the objects in the 2D T-SNE representation
  plt<-ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
    labs(title=paste("T-SNE Visualization- Perplexity Number = ", perplexity_number )) +
    scale_color_manual(values = c("darkblue", "orange", "pink3"))
  
  # Save figure in the plot list object
  plot_list[[i_prp]]=plt
  
  
}
# Combine all the plot objects
plt_combined<-ggarrange(plotlist = plot_list[1:6], ncol = 2, nrow = 3, common.legend = T)

Here is the final result, which underlines the high sensitivity of the location of our clusters on the 2-D map to the choice of perplexity value. There are other hyperparameters that can be explored as well; for example, the maximum number of iterations might change the final output in some cases.

There are many other very good tutorials on T-SNE, such as here and here. This tutorial, for example, provides a nice and easy-to-follow Python implementation of T-SNE. Finally, there are some great YouTube videos that clearly explain the logic behind T-SNE and its algorithm (for example, here and here).

Deeper Dive into Self-Organizing Maps (SOMs)

This blog post focuses on Self-Organizing Maps (SOMs) or Kohonen maps, first created by Teuvo Kohonen in the 1980s, and most recently introduced in Dave Gold’s latest blog post. SOMs are a type of artificial neural network (ANN) that are trained using unsupervised learning to produce a two-dimensional pattern representation of higher-dimensional datasets. SOMs are therefore a good algorithm for dimension reduction, and inputs that are closer in their higher-dimensional input vectors are also mapped closely together in the resulting 2D representation [1].

User-defined neuron grid that represents the basis of SOM  [1]

SOMs can be very effective for clustering/classifying outputs and are advantageous because they “self-organize” to identify clusters. The SOM is first initialized with a user-defined 2D grid of neurons instead of containing layers like traditional ANNs. The weight vectors that characterize the neurons can be initialized randomly or, more favorably, by sampling along the eigenvectors of the two largest principal components of the input dataset. The training process that SOMs use to build maps is called “competitive learning”, which is also fundamentally different from the backpropagation used for ANNs. During competitive learning, an input training point is fed into the algorithm. Its Euclidean distance from each of the neurons is calculated, and the closest neuron is termed the best matching unit, or BMU. The weights of the BMU and its surrounding neurons are adjusted to be closer to the input vector, using an update formula. This process is repeated for every input in the training data and for a large number of iterations until patterns and clusters begin to emerge [3]. The animation below illustrates competitive learning. Note how the locations of the neurons are adjusted as training points get assigned to their respective BMUs and how the whole grid shifts in response to the new samples.

Competitive learning [2]
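The update formula mentioned above is typically written as

$$w_v(t+1) = w_v(t) + \theta(u, v, t)\,\alpha(t)\,\big(x - w_v(t)\big),$$

where $w_v(t)$ is the weight vector of neuron $v$ at iteration $t$, $x$ is the current training point, $u$ is its BMU, $\alpha(t)$ is a learning rate that decays over time, and $\theta(u, v, t)$ is a neighborhood function whose influence shrinks with distance from the BMU and with iteration count.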

At a high level, this is how SOMs are trained. Next, we will go through a clustering example using the kohonen package offered in R.

Test Case: Maryland Trout

For this test case, we will be using a presence/absence dataset of Maryland trout. Each data point is characterized by a feature vector that includes information on agricultural land cover, logarithmic distance to the nearest road, riffle quality, dissolved oxygen, and spring water temperature. The label associated with each point is a 0 or 1, denoting whether trout were observed. We will train a SOM to cluster the feature vectors and then investigate the characteristics of the resulting clusters. Below is a snippet of our dataset. We use columns 3-7 as our feature vector, while column 2 denotes our label. SOMs can also be used for classification, which is not discussed in this blog post; in that case, you would train the SOM on only a subset of the dataset and use the trained SOM to predict the clusters that the testing points are assigned to.


First, we must scale our data to make sure that variables in our feature vector with different magnitudes are treated equally by the algorithm. Then we define the SOM grid. The choice of the number of neurons generally follows the rule of thumb 5*sqrt(# of data points). In our case, we have 84 data points, so a grid of roughly 7×7 or smaller should work well. I decided to use 5×5, which will be explained in the next paragraph. I also specify that I want the grid to have a hexagonal topology. Finally, we train our model by feeding in the input data and the user-defined grid and specifying rlen = 1000, which means that the input data will be presented to the algorithm 1000 times.

require(kohonen)
data=read.csv("Maryland.Trout.csv")
#Create the training dataset
data_train <- data[, c(3:7)]
# Change the data frame with training data to a matrix
# Also center and scale all variables to give them equal importance during
# the SOM training process.
data_train_matrix <- as.matrix(scale(data_train))
# Create the SOM Grid
som_grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")
# Finally, train the SOM
som_model <- som(data_train_matrix,
                 grid = som_grid, rlen = 1000,
                 keep.data = TRUE)

Once the SOM is trained, there are a variety of figures that can be easily created. We can plot the training progress and observe how the distance to the BMU decreases as a function of iteration and starts to level off as we reach many iterations. This is ideal behavior because it means that the training points are associating closely with their BMU.

plot(som_model, type="changes")
Average distance from a test point to its BMU

We can also plot a heat map of the node counts, which shows how many training inputs are associated with each node. Any node that is grey means that no training points were assigned to that BMU. This may indicate that the map is too big, so you can go back and adjust the grid size accordingly until there are no longer any grey nodes; for this dataset, that happened at around 5×5. Furthermore, we want the number of points associated with each node to be as uniform as possible.

plot(som_model, type="count", main="Node Counts")

Next we can view the neighbor distances for each neuron, which is also called the U-matrix. Small distances indicate similarity between neighboring nodes.

plot(som_model, type="dist.neighbours", main = "SOM neighbor distances")

We can also view the weight vectors of the nodes, which are termed codes. The fan in each node indicates the variables of prominence that link the data points assigned to that neuron. In our dataset, you can see that in the upper left corner of the map, the algorithm is assigning data points to these neurons because of similarity in the agricultural area and water temperature variables of the feature vector. In the lower right corner, the data points are linked primarily due to similarities in riffle quality and dissolved oxygen. This type of map is useful for understanding which features the points are actually clustering on.

plot(som_model, type="codes")

In my opinion, some of the most useful standard figures that the kohonen library creates are the heatmaps that show the gradient of input values across the nodes. A heatmap can be created for each feature, and when placed side by side, they can help demonstrate correlation between inputs. We can clearly see that areas characterized by low agricultural land tend to have better riffle quality, lower water temperature, and higher dissolved oxygen. While these patterns are expected and have a strong physical explanation, these maps can also help to discover other, unexpected patterns. Note that the values on the side bar are normalized and can be transformed back to the original scale.

 

#Plot heatmaps for each feature
library(viridis)   # provides the magma color palette used below
plot(som_model, type = "property", property = getCodes(som_model)[,1], main=colnames(getCodes(som_model))[1], palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,2], main=colnames(getCodes(som_model))[2], palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,3], main=colnames(getCodes(som_model))[3], palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,4], main=colnames(getCodes(som_model))[4], palette.name=magma)
plot(som_model, type = "property", property = getCodes(som_model)[,5], main=colnames(getCodes(som_model))[5], palette.name=magma)
#Unscale the variables - do this for each input column in data_train
var_unscaled <- aggregate(as.numeric(data_train[,1]), by=list(som_model$unit.classif), FUN=mean, simplify=TRUE)[,2]
plot(som_model, type = "property", property=var_unscaled, main=colnames(getCodes(som_model))[1], palette.name=magma)

Finally, we can cluster our neuron codes (weight vectors) to add some boundaries to our map. Typically, k-means clustering is used to determine an appropriate number of clusters that, in this case, minimizes the within-cluster sum of squares (WCSS). We can then plot the results and look for an elbow in the scree plot.

 

#k-means clustering code from [4]
mydata = som_model$codes[[1]]
wss = (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) {
  wss[i] = sum(kmeans(mydata, centers=i)$withinss)
}
par(mar=c(5.1, 4.1, 4.1, 2.1))
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares", main="Within cluster sum of squares (WCSS)")

The elbow isn’t so clear in this dataset, but I go with 8 clusters and then use hierarchical clustering to assign each neuron to a cluster. Finally, we can obtain the cluster that each point is assigned to and insert that back as a column in our dataset.

#Final clustering code from [4]
#Define a palette and cluster with the number of clusters determined using k-means
# an eighth color is added since we use eight clusters
pretty_palette <- c("#1f77b4", '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f')
som_cluster <- cutree(hclust(dist(som_model$codes[[1]])), 8)
# plot these results:
plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters")
add.cluster.boundaries(som_model, som_cluster)

# get vector with cluster value for each original data sample
cluster_assignment <- som_cluster[som_model$unit.classif]
# for ease of analysis, add the assignment as a column in the original data:
data$cluster <- cluster_assignment


Post-Processing

We’re now interested in understanding which of our points actually clustered together, so we generate summary statistics for each cluster. The points highlighted in blue represent the best values for the feature while the points in red represent the worst. Note that the clusters with a large amount of trout presence on average are characterized by best values of features that are favorable for aquatic life, while the clusters that exhibit a lower presence of trout do not have good values for these features. You can also see that clusters with similar characteristics are side by side on the clustering map.


Overall, I found the kohonen library to be a really useful tool to start to internalize how SOMs work and to fairly quickly be able to do a nice analysis on both the features and the resulting clusters. I would highly recommend this package as a starting point. There are packages in Python such as SimpSom and minisom that have snappier graphics and will be a good next step to explore.

References

[1]https://towardsdatascience.com/self-organizing-maps-ff5853a118d4

[2] By Chompinha – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=77822988

[3]http://fourier.eng.hmc.edu/e161/lectures/som/node2.html

[4] Much direction for structuring this tutorial comes from https://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/

Getting started with API requests in Python

This is an introductory blog post on how to work with Application Programming Interfaces (APIs) in Python. Seeing as this is our blog’s first post on this topic, I’ll spend some time explaining some basic information I’ve had to learn in the process and why it might be useful in your research. There are several blogs and websites with tutorials online and there’s no point repeating them, but I’ll try to explain this for an audience like me, i.e., one that (a) has no formal training in computer science/web servers/HTTP but is competent with scripting, and (b) is interested in using this for research purposes.

What is an API?

Many sites (e.g., Facebook, Twitter, and many more) make their data available through what are called Application Programming Interfaces, or APIs. APIs basically allow software to interact with and send and receive data from servers, typically to provide additional services to businesses, mobile apps, and the like, or to allow additional analysis of the data for research or commercial purposes (e.g., collecting all trending Twitter hashtags per location to analyze how news and information propagate).

I am a civil/environmental engineer, what can APIs do for me?

APIs are particularly useful for data that changes often or that involves repeated computation (e.g., daily precipitation measurements used to forecast lake levels). There’s a lot of this kind of data relevant for water resources systems analysts, easily accessible through APIs:

I’m interested, how do I do it?

I’ll demonstrate some basic requests using Python and the Requests library in the section below. There are many different kinds of API requests one can perform, but the most common one is GET, which is a request to retrieve data (not to modify it in any way). To retrieve data from an API you need two things: a base URL and an endpoint. The base URL is the static address of the API you’re interested in, and the endpoint is appended to the end of it as the server route used to collect a specific set of data from the API. Collecting different kinds of data from an API is basically a process of manipulating that endpoint to retrieve exactly what is needed for a specific computation. I’ll demo this using the USGS’s API, but be aware that it varies for each API, so you’d need to figure out how to construct your URLs for the specific one you’re trying to access. APIs most often come with a documentation page and example URL generators that help you figure out how to construct them.

The base URL for the USGS API is http://waterservices.usgs.gov/nwis/iv/

I am, for instance, interested in collecting streamflow data from a gage near my house for last May. In Python I would set up the specific URL for these data like so:

import requests

response = requests.get("http://waterservices.usgs.gov/nwis/iv/?format=json&indent=on&sites=04234000&startDT=2020-05-01&endDT=2020-05-31&parameterCd=00060&siteType=ST&siteStatus=all")

For some interpretation of how this was constructed, every parameter option is separated by &, and they can be read like so: data in json format; indented so I can read them more easily; site number is the USGS gage number; start and end dates; a USGS parameter code to indicate streamflow; site type ‘stream’; any status (active or inactive). This is obviously the format that works for this API, for a different API you’d have to figure out how to structure these arguments for the data you need. If you paste the URL in your browser you’ll see the same data I’m using Python to retrieve.

The response object now contains several attributes, the most useful of which are response.content and response.status_code. content contains the content of the URL, i.e., your data and other information, as a bytes object. status_code contains the status of your request, e.g., whether the server couldn’t find what you asked for or you weren’t authenticated to access that data. You can find what the codes mean here.

To access the data contained in your request (most usually in json format) you can use the response’s json() method, which parses the content and returns a dictionary object:

data = response.json()

The Python json library is also very useful here to help you manipulate the data you retrieve.
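As a rough sketch of what that manipulation can look like, the snippet below pulls the time series values into a pandas DataFrame. The exact nesting of keys depends on the service; the structure used here is what the USGS instantaneous values service returns at the time of writing, so treat it as an assumption to verify against your own response.

import pandas as pd

# Drill down to the list of observations; this key structure is specific to
# the USGS waterservices JSON format and may differ for other APIs.
series = data["value"]["timeSeries"][0]["values"][0]["value"]

flows = pd.DataFrame(series)                      # columns include 'value' and 'dateTime'
flows["value"] = flows["value"].astype(float)     # streamflow values come back as strings
flows["dateTime"] = pd.to_datetime(flows["dateTime"])

print(flows.head())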

How can I use this for multiple datasets with different arguments?

So, obviously, the utility of APIs shows when one needs multiple datasets. Using a Python script we can iterate through the arguments needed and generate the respective endpoints to retrieve our data. An easier way of doing this, without manipulating and appending strings, is to set up a dictionary with the parameter values and pass that to the get function:

parameters = {"format": 'json', "indent": 'on',
              "sites": '04234000', "startDT": '2020-05-01',
              "endDT": '2020-05-31', "parameterCd":'00060',
              "siteType": 'ST', "siteStatus":'all'}
response = requests.get("http://waterservices.usgs.gov/nwis/iv/", 
                        params=parameters)

This produces the same data, but we now have an easier way to manipulate the arguments in the dictionary. Using simple loops and lists to iterate through one can retrieve and store multiple different datasets from this API.
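For example, here is a minimal sketch of that idea, looping over a couple of gage numbers; the second site ID is just an illustrative placeholder, not a recommendation.

import requests

base_url = "http://waterservices.usgs.gov/nwis/iv/"
sites = ["04234000", "01646500"]   # second gage number is a placeholder for illustration
results = {}

for site in sites:
    parameters = {"format": "json", "indent": "on",
                  "sites": site, "startDT": "2020-05-01",
                  "endDT": "2020-05-31", "parameterCd": "00060",
                  "siteType": "ST", "siteStatus": "all"}
    response = requests.get(base_url, params=parameters)
    if response.status_code == 200:
        results[site] = response.json()   # store each site's data keyed by gage number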

Other things to be aware of

Multiple other people use these APIs, and some of them carry very large datasets, so querying them takes time. If the server needs to process something before producing it for you, that takes time too. As a result, some APIs limit the number of requests you can make, and it is generally good practice to be as efficient as possible with your requests. This means not requesting large datasets you only need parts of, but being specific about what you need. Depending on the database, this might mean adding several filters to your request URL to be as specific as possible about only the dates and attributes needed, or combining several attributes (e.g., multiple gages in the above example) in a single request.

Using Docker with Virtual Machines

This post is a continuation of my previous post on accessing virtual machines on RedCloud. In this post, you will learn how to run a Docker container on a VM Ubuntu image, but you can also do this tutorial with Ubuntu on a local machine.

Why Containerize Code?

Let’s use an example: say that you have three Python applications that you want to run on your computer, but they all use different versions of Python or its packages. We cannot host these applications at the same time using Python on our computer…so what do we do? Think of another case: you want to share neural network code with a collaborator, but you don’t want to deal with how incredibly difficult it can be for them to get TensorFlow and all its dependencies downloaded on their machine. Containerization allows us to isolate the installation and running of our application without having to worry about the setup of the host machine. In other words, everything someone would need to run your code is provided in the container. As shown in Figure 1, each container shares the host OS’s kernel but has a virtual copy of some of the components of the OS. This allows the containers to be isolated from each other while running on the same host. As a result, containers are exceptionally lightweight and take only seconds to start [1].


Fig 1. Docker Container Example [1]

Docker

Docker started as a project to build single-application LXC (Linux) containers that were more portable and flexible, but it is now its own container runtime environment. It is written in Go and was developed by dotCloud (a PaaS company). A Docker engine is used to build and manage a Docker image, which is just a template that contains the application and all of the dependencies required for it to run. A Docker container is a running instance of a Docker image. A Docker engine is composed of three main parts: a server known as the Docker daemon, a REST API, and a client. The Docker daemon creates and manages images and containers, the REST API links the server and applications, and the client (user) interacts with the Docker daemon through the command line (Figure 2).


Fig.2 Docker Engine [2]

 

Running a Docker Container on Ubuntu

  1. We will be setting up and running a Docker container that contains code to train a rainfall-runoff LSTM model built in Python using Tensorflow. The Github repo with the LSTM code and data can be downloaded here.
  2. Spin up a VM instance (Ubuntu 18.04) or use your own machine if you have a partition.
  3. I use MobaXterm to SSH into the VM instance and drag the HEC_HMS_LSTM folder into my home directory.
  4. Within the directory, you will see 2 additional files that are not part of the main neural network code: requirements.txt and jupyter.dockerfile. A requirements.txt file contains a list of only the packages that are necessary to run your application. You can make this file with just two lines of code: pip install pipreqs and then pipreqs  /path/to/project. The created file will look like this:


requirements.txt file

The jupyter.dockerfile is our Dockerfile. It is a text file that contains all the commands a user could call on the command line to assemble an image.


Dockerfile

Here in our Dockerfile, we are using the latest Ubuntu image and setting our working directory to the HEC_HMS_LSTM folder that contains the neural network code, Hourly_LSTM.py, that we would like to execute. We start by checking for updates and then install Python, pip, and Jupyter. Then we copy our working directory contents into the container: we copy the requirements.txt file and then add the whole HEC_HMS_LSTM folder. We then use pip to install all of the packages listed in requirements.txt. Finally, we instruct the Docker daemon to run the Python script, Hourly_LSTM.py.

5. Next we have to install Docker on Ubuntu using the command line. This is the only thing that anyone will have to download to run your container. The steps to install Docker on Ubuntu through the command line are pretty easy to follow using this link. You may be asked to allow automatic restarts during the installation, and you should choose this option. When Docker is done installing, you can check that the installation was successful by seeing if the service is active.


Successful Docker Installation

6. Now that Docker is downloaded, we can build our Docker image and then run the container. We build the image using: 

sudo docker build -f jupyter.dockerfile -t jupyter:latest .

In this command, the -f flag denotes the name of the dockerfile and -t is the name that we would like to tag our image with. If you’re running multiple containers, tagging is helpful to distinguish between them. The build will go through each step of the dockerfile. Be cognizant of how much space you have on your disk to store the downloaded Docker image, the Python libraries, and the data you will generate; you may have to go back and resize your VM instance. We can then run our image using:

sudo docker run jupyter:latest

You’ll see the neural network training and the results of the prediction.


Successfully training the neural network in a Docker Container on a virtual machine

 

Credits:

[1]https://www.freecodecamp.org/news/docker-simplified-96639a35ff36/

[2]https://docs.docker.com

 

 

 

 

Beginner’s Guide to TensorFlow and Keras

This post is meant to provide a very introductory overview of TensorFlow and Keras and concludes with an example of how to use these libraries to implement a very basic neural network.

TensorFlow

At its core, TensorFlow is a free, open-source symbolic math library that expresses calculations in terms of dataflow graphs composed of nodes and tensors. TensorFlow is particularly adept at handling the operations required to train neural networks and thus has become a popular choice for machine learning applications [1]. All nodes and tensors, as well as the API, are Python-based. However, the actual mathematical operations are carried out efficiently using high-performance C++ binaries [2]. TensorFlow was originally developed by the Google Brain Team for internal use but has since been released for public use. The library has a reputation for being on the more technical side and less intuitive to new users, so based on user feedback from initial versions, TensorFlow added Keras to its core library in 2017.

Keras

Keras is an open-source neural network Python library that has become a popular API to run on top of TensorFlow and other deep learning frameworks. Keras is especially useful for beginners in machine learning because it offers a user-friendly, intuitive, and quick way to build models. In particular, users tend to be most interested in quickly implementing neural networks. In Keras, neural network models are composed of standalone modules of neural network layers, cost functions, optimizers, activation functions, and regularizers that can be built separately and combined [3].

Example: Predicting the age of abalone

In the example below, I will implement a very basic neural network to illustrate the main components of Keras. The problem we are interested in solving is predicting the age of abalone (sea snails) given a variety of physical characteristics. Traditionally, the age of an abalone is determined by cutting the shell, staining it, and counting the number of rings, so the goal is to see if less time-consuming and less intrusive measurements, such as the diameter, length, and gender, can be used to predict the age of the snail. This dataset is one of the most popular regression datasets from the UC Irvine Machine Learning Repository. The full set of attributes and the dataset itself can be found here.

Step 1: Assessing and Processing Data

The first step in approaching this (and any machine learning) problem is to assess the quality and quantity of information provided in the dataset. We first import the relevant basic libraries; more specific libraries will be imported above their respective functions for illustrative purposes. Then we import the dataset, which has 4177 observations of 9 features and no null values.


#Import relevant libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#Import dataset
dataset = pd.read_csv('Abalone_Categorical.csv')
#Print number of observations and features
print('This dataset has {} observations with {} features.'.format(dataset.shape[0], dataset.shape[1]))
#Check for null values
dataset.info()

Note that the dataset is composed of numerical attributes except for the “Gender” column, which is categorical. Neural networks cannot work with categorical data directly, so the gender must be converted to a numerical value. Categorical variables can be assigned a number such as Male=1, Female=2, and Infant=3, but this is a fairly naïve approach that assumes some sort of ordinal nature. This scale would imply to the neural network that a male is more similar to a female than to an infant, which may not be the case. Instead, we use one-hot encoding to represent the variables as binary vectors.

#One hot encoding for categorical gender variable

Gender = dataset.pop('Gender')

dataset['M'] = (Gender == 'M')*1.0
dataset['F'] = (Gender == 'F')*1.0
dataset['I'] = (Gender == 'I')*1.0

Note that we now have 10 features because Male, Female, and Infant are represented as the following:

         M   F   I
Male     1   0   0
Female   0   1   0
Infant   0   0   1

It would be beneficial at this point to look at overall statistics and distributions of each feature and to check for multicollinearity, but further investigation into these topics will not be covered in this post. Also note the difference in the ranges of each feature. It is generally good practice to normalize features that have different units, which likely correspond to different scales. Very large input values can lead to large weights which, in turn, can make the network unstable. It is up to the user to decide on the normalization technique that is appropriate for their dataset, with the condition that the output value that is returned also falls within the range of the chosen activation function. Here, we separate the dataset into a “data” portion and a “label” portion and use the MinMaxScaler from Scikit-learn, which will, by default, transform the data into the range (0,1).

#Reorder Columns

dataset = dataset[['Length', 'Diameter ', 'Height', 'Whole Weight', 'Schucked Weight','Viscera Weight ','Shell Weight ','M','F','I','Rings']]

#Separate input data and labels
X=dataset.iloc[:,0:10]
y=dataset.iloc[:,10].values

#Normalize the data using the min-max scalar

from sklearn.preprocessing import MinMaxScaler
scalar= MinMaxScaler()
X= scalar.fit_transform(X)
y= y.reshape(-1,1)
y=scalar.fit_transform(y)

We then split the data into a training set that is composed of 80% of the dataset. The remaining 20% is the testing set.

#Split data into training and testing 

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Step 2: Building and Training the Model

Then we build the Keras structure. The core of this structure is the model, which is of the “Sequential” form. This is the simplest style of model and is composed of a linear stack of layers.

#Build Keras Model

import keras
from keras import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=10, input_dim=10,activation='relu'))
model.add(Dense(units=1,activation='linear'))
model.compile(optimizer='adam', loss='mean_squared_error',  metrics=['mae','mse'])

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

history=model.fit(X_train,y_train,batch_size=5, validation_split = 0.2, callbacks=[early_stop], epochs=100)

# Model summary for number of parameters use in the algorithm
model.summary()

Stacking layers is executed with model.add(). We stack a dense layer of 10 nodes that has 10 inputs feeding into the layer. The activation function chosen for this layer is a ReLU (Rectified Linear Unit), a popular choice due to better gradient propagation and sparser activation than, for example, a sigmoidal function. A final output layer with a linear activation function is stacked on top to simply return the model output. This network architecture was picked rather arbitrarily but can be tuned to achieve better performance. The model is compiled using model.compile(). Here, we specify the type of optimizer, in this case the Adam optimizer, which has become a popular alternative to the more traditional stochastic gradient descent. A mean-squared-error loss function is specified, and the metrics reported will be mean squared error and mean absolute error. Then we call model.fit() to iterate training in batches. The batch size corresponds to the number of training instances that are processed before the model is updated. The number of epochs is the number of complete passes through the dataset. The more times the model sees the dataset, the more chances it has to learn the patterns, but too many epochs can also lead to overfitting. The number of epochs appropriate for this case is unknown, so I implement a validation set that is 20% of the current training set, set a large value of 100 epochs, and add an early stopping criterion that halts training when the validation score stops improving, which helps prevent overfitting.

Training and validation error can be plotted as a function of epochs [4].

def plot_history(history):
  hist = pd.DataFrame(history.history)
  hist['epoch'] = history.epoch

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Square Error [$Rings^2$]')
  plt.plot(hist['epoch'], hist['mean_squared_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_squared_error'],
           label = 'Val Error')
  plt.legend()
  plt.show()

plot_history(history)

This results in the following figure:


Error as function of epochs

Step 3: Prediction and Assessment of Model

Once our model is trained, we can use it to predict the age of the abalone in the test set. Once the values are predicted, they must be re-scaled, which is performed using the inverse_transform function from Scikit-learn.


#Predict testing labels

y_pred= model.predict(X_test)

#undo normalization 

y_pred_transformed=scalar.inverse_transform(y_pred.reshape(-1,1))
y_test_transformed=scalar.inverse_transform(y_test)

Then the predicted and actual ages are plotted along with a 1-to-1 line to visualize the performance of the model. The R-squared value and RMSE are calculated as 0.554 and 2.20 rings, respectively.


#visualize performance
fig, ax = plt.subplots()
ax.scatter(y_test_transformed, y_pred_transformed)
ax.plot([y_test_transformed.min(), y_test_transformed.max()], [y_test_transformed.min(), y_test_transformed.max()], 'k--', lw=4)
ax.set_xlabel('Measured (Rings)')
ax.set_ylabel('Predicted (Rings)')
plt.show()

#Calculate RMSE and R^2
from sklearn.metrics import mean_squared_error
from math import sqrt
rms = sqrt(mean_squared_error(y_test_transformed, y_pred_transformed))

from sklearn.metrics import r2_score
r_squared=r2_score(y_test_transformed,y_pred_transformed)


Predicted vs. Actual (Measured) Ages

The performance is not ideal, but the results are reasonable given that this dataset is notoriously hard to predict without other relevant information such as weather or location. Ultimately, it is up to the user to decide whether these are acceptable values; otherwise, the neural network hyperparameters can be further fine-tuned, or more input data and features can be added to the dataset to try to improve performance.

Sources:

[1] Dean, Jeff; Monga, Rajat; et al. (November 9, 2015). “TensorFlow: Large-scale machine learning on heterogeneous systems” (PDF). TensorFlow.org, Google Research.

[2] Yegulalp, Serdar. “What Is TensorFlow? The Machine Learning Library Explained.” InfoWorld, InfoWorld, 18 June 2019, http://www.infoworld.com/article/3278008/what-is-tensorflow-the-machine-learning-library-explained.html.

[3]“Keras: The Python Deep Learning Library.” Home – Keras Documentation, keras.io/.

[4] “Basic Regression: Predict Fuel Efficiency  :   TensorFlow Core.” TensorFlow, http://www.tensorflow.org/tutorials/keras/regression.

Dynamic Emulation in Water Resources Applications

The purpose of this blog post is to introduce dynamic emulation in the context of applications to hydrology. Hydrologic modelling involves implementing mathematical equations to represent physical processes such as precipitation, runoff, and evapotranspiration and to characterize energy and water fluxes within the hydrologic system (Abbott et al., 1986). Users of a model might be interested in using it to approach a variety of problems related to, for instance, modeling the rainfall-runoff process in a river basin. The user might try to validate the model, perform sensitivity or uncertainty analysis, determine optimal planning and management of the basin, or test how the hydrology of the basin is affected by different climate scenarios. Each of these problems requires a numerically intensive, Monte Carlo style approach for which a physical model is computationally ill-suited. A solution to this problem is to create an emulator for the physical model. An emulator is a computationally efficient model whose response characteristics mimic those of a complex model as closely as possible. This model can then replace the more complicated model in computationally intensive exercises (Tych & Young, 2012).

There are many different approaches to creating an emulator; one particularly unified approach is Dynamic Emulation Modelling (DEMo) (Castelletti et al., 2012). DEMo seeks to accomplish three goals when creating an emulator:

  1. The emulator is less computationally intensive than the physical model.
  2. The emulator’s input-output behavior approximates as well as possible the behavior of the physical model.
  3. The emulator is credible to users.

DEMo has five main steps:

    1. Design of Computer Experiments: Determine a set of input data to build the emulator off of that will cover the range of responses of the physical model
    2. Variable Aggregation: Reduce dimensionality of the input data
    3. Variable Selection: Select components of the reduced inputs that are most relevant to explaining the output data
    4. Structure Identification and Parameter Estimation: In the case of a rainfall runoff model, choose a set of appropriate black box models that can capture the complex, non-linear process and fit the parameters of these models accordingly.
    5. Evaluation and Physical Interpretation: Evaluate the model on a validation set and determine how well the model’s behavior and structure can be explained or attributed to physical processes.

The next section outlines two data-driven style models that can be used for hydrologic emulation.

Artificial Neural Networks (ANNs)

The rainfall-runoff process is one of the most complex and non-linear hydrologic phenomena to comprehend and model, due to tremendous spatial and temporal variability in watershed characteristics. Because ANNs can mimic high-dimensional non-linear systems, they are a popular choice for rainfall-runoff modeling (Maier et al., 2010). Depending on the time step of interest as well as the complexity of the hydrology of the basin, a simple feedforward network may be sufficient for accurate predictions. However, the model may benefit from having the ability to incorporate memory that might be inherent in the physical processes that dictate the rainfall-runoff relationship. Unlike feedforward networks, recurrent neural networks are designed to capture temporal dynamics by processing inputs in sequential order (Rumelhart et al., 1986) and by storing information obtained from previous outputs. However, RNNs have trouble learning long-term dependencies longer than about 10 time steps (Bengio et al., 1994). The current state of the art is the Long Short-Term Memory (LSTM) network. These RNN-style networks contain a cell state that can learn long-term dependencies, as well as gates that regulate the flow of information into and out of the cell, as shown in Figure 1.

lstm

Figure 1: Long Short-Term Memory Module Architecture1

LSTMs are most commonly used in speech and writing recognition but have just begun to be implemented in hydrology applications, with particular success in modelling rainfall-runoff in snow-influenced catchments. Kratzert et al. (2018) show that an LSTM is able to outperform the Sacramento Soil Moisture Accounting Model (SAC-SMA) coupled with a Snow-17 routine for computing runoff in 241 catchments.
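
To make the idea concrete, below is a minimal sketch (not the setup used by Kratzert et al.) of an LSTM that maps a window of past forcings to the next day's runoff, assuming TensorFlow/Keras is available. The arrays P and Q are hypothetical placeholders for daily forcings (e.g., precipitation and temperature) and observed discharge.

# Minimal LSTM rainfall-runoff sketch; P (forcings) and Q (runoff) are placeholders.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

def make_windows(forcings, runoff, lookback=180):
    # build (samples, lookback, features) windows and next-step runoff targets
    X, y = [], []
    for t in range(lookback, len(runoff)):
        X.append(forcings[t - lookback:t])
        y.append(runoff[t])
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
P = rng.random((2000, 2))              # placeholder daily precipitation and temperature
Q = rng.random(2000)                   # placeholder observed daily discharge
X, y = make_windows(P, Q)

model = Sequential([
    Input(shape=(X.shape[1], X.shape[2])),
    LSTM(20),                          # 20 memory cells carry the catchment "state"
    Dense(1),                          # next-day runoff
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, batch_size=64, verbose=0)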

Support Vector Regression (SVR)

Support vector machines are commonly used for classification but have been successfully implemented for continuous-value prediction in a regression setting. Support vector regression relies on finding a function that lies within a user-specified level of precision, ε, of the true value of every data point, as shown in Figure 2.

lstm.png

Figure 2: Support Vector Regression Model2

It is possible that such a function may not exist, so slack variables, ξ, are introduced that allow errors up to ξ to exist. Errors that lie within the ε bounds are treated as 0, while points that lie outside of the bounds incur a loss equal to the distance between the point and the ε bound. Training an SVR model requires solving the following optimization problem:

\min_{\boldsymbol{w},b,\xi,\xi^*}\frac{1}{2}\|\boldsymbol{w}\|^2+C\sum_{i=1}^n(\xi_i+\xi_i^*)
subject to \quad y_i-\boldsymbol{w}^T\boldsymbol{x}_i-b\leq\epsilon+\xi_i,\quad \boldsymbol{w}^T\boldsymbol{x}_i+b-y_i\leq\epsilon+\xi_i^*,\quad \xi_i,\xi_i^*\geq0

Where w is the learned weight vector and x_i and y_i are the training points. C is a penalty parameter imposed on observations that lie outside the ε margin and also serves as a form of regularization. In the case that linear regression is not sufficient for the problem, the inner products in the dual form of the problem above can be substituted with a kernel function that maps x to a higher-dimensional space, which allows for estimation of non-linear functions. Work by Granata et al. (2016) compares an SVR approach with EPA's Storm Water Management Model (SWMM) and finds comparable results in terms of RMSE and R² values.
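
As a quick illustration with made-up data (not the urban drainage records used by Granata et al.), scikit-learn's SVR exposes the quantities discussed above directly: epsilon sets the width of the tube, C penalizes points that fall outside it, and kernel replaces the linear inner product.

# Minimal epsilon-SVR sketch on synthetic data (placeholders, not SWMM output).
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))                  # e.g., a rainfall-related predictor
y = np.sin(X).ravel() + 0.1 * rng.normal(size=300)     # e.g., a non-linear runoff response

model = SVR(kernel='rbf', C=10.0, epsilon=0.05)        # RBF kernel handles the non-linearity
model.fit(X[:200], y[:200])                            # train on the first 200 points

y_hat = model.predict(X[200:])                         # validate on the remaining points
rmse = mean_squared_error(y[200:], y_hat) ** 0.5
print(rmse)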

References

Abbott, M. B., et al. “An Introduction to the European Hydrological System — Systeme Hydrologique Europeen, ‘SHE’, 1: History and Philosophy of a Physically-Based, Distributed Modelling System.” Journal of Hydrology, vol. 87, no. 1-2, 1986, pp. 45–59., doi:10.1016/0022-1694(86)90114-9.

Bengio, Y., et al. “Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 157–166., doi:10.1109/72.279181.

Castelletti, A., et al. “A General Framework for Dynamic Emulation Modelling in Environmental Problems.” Environmental Modelling & Software, vol. 34, 2012, pp. 5–18., doi:10.1016/j.envsoft.2012.01.002.

Granata, Francesco, et al. “Support Vector Regression for Rainfall-Runoff Modeling in Urban Drainage: A Comparison with the EPA’s Storm Water Management Model.” Water, vol. 8, no. 3, 2016, p. 69., doi:10.3390/w8030069.

Kratzert, Frederik, et al. “Rainfall–Runoff Modelling Using Long Short-Term Memory (LSTM) Networks.” Hydrology and Earth System Sciences, vol. 22, no. 11, 2018, pp. 6005–6022., doi:10.5194/hess-22-6005-2018.

Maier, Holger R., et al. “Methods Used for the Development of Neural Networks for the Prediction of Water Resource Variables in River Systems: Current Status and Future Directions.” Environmental Modelling & Software, vol. 25, no. 8, 2010, pp. 891–909., doi:10.1016/j.envsoft.2010.02.003.

Rumelhart, David E., et al. “Learning Representations by Back-Propagating Errors.” Nature, vol. 323, no. 6088, 1986, pp. 533–536., doi:10.1038/323533a0.

Tych, W., and P. C. Young. “A Matlab Software Framework for Dynamic Model Emulation.” Environmental Modelling & Software, vol. 34, 2012, pp. 19–29., doi:10.1016/j.envsoft.2011.08.008.

Figures:

(1) https://colah.github.io/posts/2015-08-Understanding-LSTMs/

(2) Kleynhans, Tania, et al. “Predicting Top-of-Atmosphere Thermal Radiance Using MERRA-2 Atmospheric Data with Deep Learning.” Remote Sensing, vol. 9, no. 11, 2017, p. 1133., doi:10.3390/rs9111133.

 

 

 

Intro to Boosting

Introduction

Once upon a time, in a machine learning class project, student Michael Kearns asked if it is possible to convert a weak classifier (high bias classifier whose outputs are correct only slightly over 50% of the time) into a classifier of arbitrary accuracy by using an ensemble of such classifiers. This question was posed in 1988 and, two years later, in 1990, Robert Schapire answered that it is possible (Schapire, 1990). And so boosting was born.

The idea of boosting is to train an ensemble of weak classifiers, each of which is an expert in a specific region of the space. The ensemble of classifiers has the form below:
H(\vec{x}) = \sum_{t=1}^T \alpha_th_t(\vec{x})
where H is the ensemble of classifiers, \alpha_t is the weight assigned to the weak classifier h_t added at iteration t, \vec{x} is a sample's feature vector, and T is the number of classifiers in the ensemble.

Boosting creates such an ensemble in a fashion similar to gradient descent. However, instead of:
\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \alpha \nabla \ell(\boldsymbol{x}_t)
as in gradient descent in real space, where \ell is a loss function, Boosting is trained via gradient descent in functional space, so that:
H_{t+1}(\boldsymbol{X}) = H_t(\boldsymbol{X}) + \alpha_t h_t(\boldsymbol{X})
where each weak classifier h_t approximates the negative of the functional gradient of \ell at H_t.

The question then becomes how to find the \alpha_t‘s and the classifiers h_t. Before answering these questions, we should first get a geometric intuition for Boosting. The presentation in the following sections is based on these course notes.

Geometric Intuition

In the example below we have a set of blue crosses and red circles that we would like our ensemble of weak classifiers to correctly classify (panel “a”). Our weak classifier for this example will be a line, orthogonal to one of the axes, dividing the blue-cross region from the red-circle region. Such a classifier can also be called a CART tree (Breiman et al., 1984) with a depth of 1, hereafter called a tree stump. For now, let's assume all tree stumps will have the same weight \alpha_t in the final ensemble.

Boosting.png

The first tree stump, the horizontal divide in panel “b,” classified ten out of thirteen points correctly but failed to classify the remaining three. Since it misclassified a few points in this attempt, we would like the next classifier to correctly classify these points. To make sure that will be the case, Boosting increases the weight of the points that were misclassified earlier before training the new tree stump. The second tree stump, the vertical line in panel “c,” correctly classifies the two blue crosses that were originally misclassified, although it now misclassifies three crosses that were originally classified correctly. For the third classifier, Boosting will again increase the weights of the three bottom misclassified crosses, as well as of the other two misclassified crosses and the circle, because they are still not correctly classified (technically they are tied, since the two classifiers disagree on them, but we treat this here as a wrong classification). The third iteration will then prioritize correcting the high-weight points again, and will end up as the vertical line on the right of panel “d.” Now, all points are correctly classified.

There are different ways to mathematically approach Boosting. But before getting to Boosting, it is a good idea to go over gradient descent, which is a basis of Boosting. Following that, this post will cover, as an example, the AdaBoost algorithm, which assumes an exponential loss function.

Minimizing a Loss Function

Standard gradient descent

Before getting into Boosting per se, it is worth going over the standard gradient descent algorithm. Gradient descent is a minimization algorithm (hence, descent). The goal is to move from an initial x_0 to the value of x with the minimum value of f(x), which in machine learning is a loss function, henceforth called \ell(x). Gradient descent does that by moving one step of length s at a time, starting from x_0, in the direction of the steepest downhill slope at location x_t, the value of x at the t^{th} iteration. This idea is formalized by the Taylor series expansion below:
\ell(x_{t + 1}) = \ell(x_t + s) \approx \ell(x_t) + g(x_t)^Ts
where g(x_t)=\nabla\ell(x_t). Choosing the step in the direction of steepest descent,
s=-\alpha g(x_t)
leads to:
\ell(x_{t+1}) = \ell\left[x_t -\alpha g(x_t)\right] \approx \ell(x_t)-\alpha g(x_t)^Tg(x_t)
where \alpha, called the learning rate, must be positive and can be set as a fixed parameter. The dot product in the last term, g(x_t)^Tg(x_t), is always non-negative, which means that the loss should always decrease; the caveat is that values of \alpha that are too high may make the algorithm diverge, so small values around 0.1 are often recommended.
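
As a tiny, made-up sanity check of this update rule, the sketch below runs gradient descent on the quadratic loss \ell(x) = (x-3)^2, whose gradient is g(x) = 2(x-3), and converges toward the minimizer x = 3.

# A few iterations of gradient descent on l(x) = (x - 3)^2, with gradient g(x) = 2(x - 3).
def gradient(x):
    return 2.0 * (x - 3.0)

x, alpha = 0.0, 0.1                    # initial guess and learning rate
for t in range(25):
    x = x - alpha * gradient(x)        # x_{t+1} = x_t - alpha * g(x_t)

print(x)                               # approximately 3, the minimizer of the loss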

Gradient Descent in Functional Space

What if x is actually a function instead of a real number? This would mean that the loss function \ell(\cdot) would be a function of a function, say \ell(H(\boldsymbol{x})), instead of a function of a real number x. As mentioned, gradient descent in real space works by adding small quantities to x_0 to find the final x_{min}, which amounts to an ensemble of small \Delta x‘s added together. By analogy, gradient descent in functional space works by adding functions to an ever-growing ensemble of functions. Using the definition of a functional gradient, which is beyond the scope of this post, this leads us to:
\ell(H+\alpha h) \approx \ell(H) + \alpha \langle\nabla \ell(H),h\rangle
where H is an ensemble of functions, h is a single function, and the \langle f, g\rangle notation denotes a dot product between f and g. Gradient descent in functional space is an important component of Boosting. Based on that, the next section will talk about AdaBoost, a specific Boosting algorithm.

AdaBoost

Basic Definitions

The goal of AdaBoost is to find an ensemble H of weak classifiers h that minimizes the exponential loss function below for a binary classification problem:
\ell(H)=\sum_{i=1}^ne^{-y(x_i)H(x_i)}
where (x_i, y(x_i)) is the i^{th} data point in the training set. The step size \alpha can be interpreted as the weight of each classifier in the ensemble, and it is optimized for each function h added to the ensemble. AdaBoost is an algorithm for binary classification, meaning the independent variables \boldsymbol{x}_i have a corresponding vector of dependent variables \boldsymbol{y}, in which each label y(\boldsymbol{x}_i) \in \{-1, 1\}, with -1 and 1 representing the two classes to which a point may belong (say, -1 for red circles and 1 for blue crosses). The weak classifiers h in AdaBoost also return h(x) \in \{-1, 1\}.

Setting the Weights of Each Classifier

The weight \alpha_t of each weak classifier h can be found by performing the following minimization:
\alpha=argmin_{\alpha}\ell(H+\alpha h)
Since the loss function is defined as the summation of the exponential loss of each point in the training set, the minimization problem above can be expanded into:
\alpha=argmin_{\alpha}\sum_{i=1}^ne^{-y(\boldsymbol{x}_i)\left[H(\boldsymbol{x}_i)+\alpha h(\boldsymbol{x}_i)\right]}
Differentiating the error w.r.t. \alpha, equating it with zero and performing a few steps of algebra leads to:
\alpha = \frac{1}{2}ln\frac{1-\epsilon}{\epsilon}
where \epsilon is the classification error of weak classifier h_t. The error \epsilon can be calculated for AdaBoost as shown next.

The Classification Error of one Weak Classifier

Following from gradient descent in functional space, the next best classifier h_{t+1} will be the one that minimizes the term \langle\nabla \ell(H),h\rangle; when this term is zero, the slope of the loss is zero and the minimum of the loss function has been reached for a convex problem such as this one. Replacing the dot product by a summation, this minimization problem can be written as:
h=argmin_h\,\epsilon=argmin_h\,\langle\nabla \ell(H),h\rangle
h=argmin_h\sum_{i=1}^n\frac{\partial e^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}}{\partial H(\boldsymbol{x}_i)}h(\boldsymbol{x}_i)
which after some algebra becomes:
h(\boldsymbol{x}_i)=argmin_h\sum_{i:h(\boldsymbol{x}_i)\neq y(\boldsymbol{x}_i)}w_i
(the summation of the weights of misclassified points)

Comparing the last expression with the first, we have that the error \epsilon for iteration t is simply the summation of the weights of the points misclassified by h_t. For example, in panel “b” the error would be the summation of the weights of the two crosses in the upper left corner and of the circle at the bottom right corner. Now let's get to these weights.

The Weights of the Points

There are multiple ways we can think of for setting the weights of each data point at each iteration. Schapire (1990) found a great way of doing so:
\boldsymbol{w}_{t+1}=\frac{\boldsymbol{w}_{t}}{Z}e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}
where \boldsymbol{w} is a vector containing the weights of each point, Z is a normalization factor to ensure the weights will sum up to 1. Be sure not to confuse \alpha_t, the weight of classifier t, with the weight of the points at iteration t, represented by \boldsymbol{w}_t. For the weights to sum up to 1, Z needs to be the sum of their pre-normalization values, which is actually identical to the loss function, so
Z=\sum_{i=1}^ne^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}
Using the definition of the error \epsilon, the update for Z can be shown to be:
Z_{t+1}=Z_t\cdot2\sqrt{\epsilon(1-\epsilon)}
so that the complete update is:
w_{t+1}=w_t \frac{e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}}{2\sqrt{\epsilon(1-\epsilon)}}
Now AdaBoost is ready for implementation following the pseudo-code below.

AdaBoost Pseudo-code

Below is the pseudo-code of AdaBoost. Note that it can be used with any weak (high-bias) classifier; again, shallow decision trees are a common choice for their simplicity and good performance.

Boosting_algorithm.png
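
The pseudo-code translates almost line by line into Python. Below is a minimal sketch of that loop (an illustration rather than a production implementation), assuming scikit-learn decision stumps as the weak learners and using the \alpha, error, and weight updates derived in the previous sections.

# Minimal AdaBoost sketch following the pseudo-code, with decision stumps as weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    # X: (n, d) feature array; y: labels in {-1, +1}; returns the stumps and their alphas
    n = len(y)
    w = np.full(n, 1.0 / n)                    # point weights, initialized to 1/n
    stumps, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w[pred != y])             # weighted error of this weak learner
        if eps == 0.0:                         # perfect stump: keep it and stop
            stumps.append(stump)
            alphas.append(1.0)
            break
        if eps >= 0.5:                         # no longer better than chance
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        w = w * np.exp(-alpha * y * pred)      # up-weight misclassified points
        w = w / w.sum()                        # normalize by Z so the weights sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    # sign of the weighted vote H(x) = sum_t alpha_t * h_t(x)
    H = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(H)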

Boosting Will Converge, and Fast

One fascinating fact about boosting is that any boosted algorithm is guaranteed to converge on the training data regardless of which weak classifier is chosen. Recall that Z, the normalization factor for the point-weight update, equals the loss function. That being the case, we get the following relation:
\ell(H)=Z=n\prod_{t=1}^T2\sqrt{\epsilon_t(1-\epsilon_t)}
where n is the normalizing factor for all weights at step 0 (all of the weights are initially set to 1/n). To derive an expression for the upper bound of the error, let’s assume that the errors at all steps t equal their highest value, \epsilon_{max}. We have that:
\ell(H)\leq n\left[2\sqrt{\epsilon_{max}(1-\epsilon_{max})}\right]^T
Given that necessarily \epsilon_{max} \leq \frac{1}{2}, we have that
\epsilon_{max}(1-\epsilon_{max})<\frac{1}{4}
or
\epsilon_{max}(1-\epsilon_{max})=\frac{1}{4}-\gamma^2
for some \gamma in the interval \left[0,\frac{1}{2}\right]. Therefore, substituting the equation above into the first loss inequality written as a function of \epsilon_{max}, we have that:
\ell(H)\leq n(1-4\gamma^2)^{T/2}
which means that the training error is bound by an exponential decay as you add classifiers to the ensemble. This is a fantastic result and applies to any boosted algorithm!
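
For a concrete sense of how fast this is: if every weak learner achieves an error of at most \epsilon_{max}=0.4 (i.e., \gamma=0.1), the bound becomes \ell(H)\leq n(0.96)^{T/2}, so after T=100 weak learners the training loss is at most roughly 0.96^{50}n \approx 0.13n, and it keeps shrinking geometrically as more classifiers are added.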

(Next to) No Overfitting

Lastly, Boosted algorithms are remarkably resistant to overfitting. According to Murphy (2012), a possible reason is that Boosting can be seen as a form of \ell_1 regularization, which tends to eliminate irrelevant features and thus reduce overfitting. Another explanation is related to the concept of margins: Boosting tends to keep increasing the margin by which training points are correctly classified even after the training error reaches zero, which also helps prevent overfitting.

Final Remarks

In this post, I presented the general idea of boosting a weak classifier, emphasizing its use with shallow CART trees, and used the AdaBoost algorithm as an example. However, other loss functions can be used and boosting can also be used for non-binary classifiers and for regression. The Python package scikit-learn in fact allows the user to use boosting with different loss functions and with different weak classifiers.
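
For instance, the minimal sketch below (with made-up data) fits scikit-learn's AdaBoostClassifier, which boosts shallow decision trees by default, and a GradientBoostingRegressor, which extends the same idea to regression with a choice of loss functions.

# Boosting with scikit-learn: AdaBoost for classification, gradient boosting for regression.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_class = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)       # toy binary labels
y_reg = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)      # toy continuous target

clf = AdaBoostClassifier(n_estimators=50)              # boosts shallow trees by default
clf.fit(X, y_class)

reg = GradientBoostingRegressor(loss='huber', n_estimators=100)  # robust loss instead of squared error
reg.fit(X, y_reg)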

Despite the theoretical arguments that Boosting does not overfit, researchers running it for extremely long times on rather big supercomputers have found that at some point it starts to overfit, although still very slowly. Still, that is not likely to happen in your application. Lastly, Boosting with shallow decision trees is also a great way to have fine control over how much bias there will be in your model, as all you need to do for that is to choose the number of iterations T.

Bibliography

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees (Monterey, California: Wadsworth).
Schapire, R. E. (1990). The strength of weak learnability. Machine learning, 5(2), 197-227.

Intro to Machine Learning Part 6: Gaussian Naive Bayes and Logistic Regression

Machine learning problems often involve binary classification, which seeks to use a data point’s features, x, to correctly predict its label, y. In my last post I discussed binary classification with Support Vector Machines (SVM), which formulates the classification problem as a search for the maximum-margin hyperplane that divides two classes. Today we’ll take a different view on binary classification: we’ll use our training set to construct P(y|x), the probability of class y given a set of features x, and classify each point by determining which class it is more likely to belong to. We’ll examine two algorithms that use different strategies for estimating P(y|x): Naïve Bayes and logistic regression. I’ll demonstrate the two classifiers on an example data set I’ve created, shown in Figure 1 below. The data set contains features X = (X1, X2) and labels Y ∈ {+1, -1}; positive points are shown as blue circles and negative points as red triangles. This example was inspired by an in-class exercise in CS 5780 at Cornell, though I’ve created this data set and the code myself using Python’s scikit-learn package.

raw_points

Figure 1: Example training set

 

Gaussian Naïve Bayes

Naïve Bayes is a generative algorithm, meaning that it uses a set of training data to generate P(x,y) and then uses Bayes Rule to find P(y|x):

P(y|x)=\frac{P(x|y)P(y)}{P(x)}                                (1)

Equation 1 is simply Bayes’ rule; what makes the approach tractable is the Naïve Bayes assumption, which states that feature values are independent given the label, so that P(x|y) factorizes across features. While this is a strong assumption, it turns out that classifiers built on it can be effective even when it is violated.

To use Bayes rule to construct a classifier, we need a second assumption regarding the conditional distribution of each feature x on each label y. Here we’ll use a Gaussian distribution such that:

P(x|y) ~ N(\mu_y, \Sigma_y)                                                                                   (2)

Where \Sigma_y is a diagonal covariance matrix with [\Sigma_y]_{\alpha,\alpha}=\sigma^2_{\alpha, y} for each feature \alpha.

For each feature, \alpha, and each class, c, we can then model P(x_\alpha|y) as:

P(x_\alpha|y=c) ~ N(\mu_{\alpha c},\sigma^2_{\alpha c})=\frac{1}{\sqrt{2\pi}\sigma_{\alpha c}}e^{-\frac{1}{2}\left(\frac{x_\alpha-\mu_{\alpha c}}{\sigma_{\alpha c}}\right)^{2}}                              (3)

We can then estimate model parameters:

\mu_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)x_{i \alpha}                                                                   (4)

\sigma^2_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)(x_{i \alpha}-\mu_{\alpha c})^2                                                  (5)

Where:

n_c = \sum^{n}_{i=1}I(y_i=c)                                                                                (6)

Parameters can be estimated with Maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP).
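
To make equations 4-6 concrete, the snippet below is a minimal sketch that computes the MLE estimates of the per-class, per-feature means and variances with NumPy; the arrays X and y are small, made-up placeholders.

# MLE estimates of the Gaussian Naive Bayes parameters (equations 4-6) on toy data.
import numpy as np

X = np.array([[1., 5.], [2., 8.], [3., 7.], [4., 1.], [5., 1.], [3., 2.]])   # toy features
y = np.array([1, 1, 1, -1, -1, -1])                                          # toy labels

params = {}
for c in np.unique(y):
    Xc = X[y == c]                     # the n_c points with label c (equation 6)
    mu = Xc.mean(axis=0)               # equation 4: per-feature mean for class c
    var = Xc.var(axis=0)               # equation 5: per-feature variance for class c
    prior = len(Xc) / len(X)           # P(y = c), needed when applying Bayes rule
    params[c] = (mu, var, prior)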

Once we have fit the conditional Gaussian model to our data set, we can derive a linear classifier, one whose decision boundary is a hyperplane separating the two classes, and the resulting posterior takes the following form:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                             (7)

Where w is a vector of coefficients that define the separating hyperplane and b is the hyperplane’s intercept. Both w and b are functions of the Gaussian moments derived in equations 4 and 5. For a full derivation of the linear classifier starting from the Naive Bayes assumption, see the excellent course notes from CS 5780.

Logistic Regression

Logistic regression is the discriminative counterpart to Naive Bayes: rather than modeling P(x,y) and using it to estimate P(y|x), logistic regression models P(y|x) directly:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                              (8)

Logistic regression uses MLE or MAP to directly estimate the parameters of the separating hyperplane, w and b, rather than deriving them from the moments of P(x,y). Rather than seeking to fit parameters that best describe the training data, logistic regression seeks to fit a hyperplane that best separates the training data. For derivations of the MLE and MAP estimates of the logistic regression parameters, see the class notes from CS 5780.

Comparing Gaussian Naive Bayes and Logistic Regression

Below I’ve plotted the estimated classifications by the two algorithms using the Scikit-learn package in Python. Results are shown in Figure 2.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import seaborn as sns
sns.set(style='whitegrid')

## create a test data set ##
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3], \
[5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3], \
[4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
all_points = np.concatenate((pos,neg), 0)
labels = np.concatenate((np.ones(len(pos)), -np.ones(len(neg))))

## compare Naive Bayes and Logistic Regression ##

# Fit Naive Bayes
gnb = GaussianNB()
gnb.fit(all_points, labels)

# make NB predictions and plot
x1_mesh, x2_mesh = np.meshgrid(np.arange(0,11,1), np.arange(0,11,1))
Y_NB = gnb.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_NB = Y_NB.reshape(x1_mesh.shape)

fig1, axes = plt.subplots(1,2, figsize=(10,4))

axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1')
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')
#plt.legend(['Positive Points', 'Negative Points'], scatterpoints=1)
#.savefig('NB_classification.png', bbox_inches='tight')

# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(all_points, labels)

# Make predictions and plot
Y_LR = lr.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_LR = Y_LR.reshape(x1_mesh.shape)

axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1'); 
axes[1].set_ylabel('X2'); axes[1].set_title("Logistic Regression")
plt.savefig('compare_classification.png', bbox_inches='tight')

 

 

compare_classification

Figure 2: Example classification with Gaussian Naive Bayes (left) and logistic regression (right). Blue shaded areas represent a prediction of positive labels for the data points, while red shaded areas represent predictions of negative labels.

Figure 2 illustrates an important difference in the treatment of outliers between the two classifiers. Gaussian Naive Bayes assumes that points close to the centroid of a class are likely to be members of that class, which leads it to mislabel the positive training points with features (3,3), (4,4) and (5,5). Logistic regression, on the other hand, is only concerned with correctly classifying points, so the signal from the outliers is more influential on its classification.

So which algorithm should you use? The answer, as usual, is that it depends. In this example, logistic regression is able to correctly classify the outliers with positive labels while Naïve Bayes is not. If these points are indeed an indicator of the underlying structure of positive points, then logistic regression has performed better. On the other hand, if they are truly outliers, then Naïve Bayes has performed better. In general, logistic regression has been found to outperform Naïve Bayes on large data sets but is prone to overfitting small data sets. The two algorithms converge asymptotically if the Naïve Bayes assumption holds.

Visualizing P(y|x)

One advantage of these methods for classification is that they provide estimates of P(y|x), whereas other methods such as SVM only provide a separating hyperplane. These probabilities can be useful in decision-making contexts such as scenario discovery for water resources systems, as demonstrated in Quinn et al., 2018. Below, I use scikit-learn to plot the classification probabilities for both algorithms.

# plot Naive Bayes predicted probabilities
fig2, axes = plt.subplots(1,2, figsize=(12,4))
axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1'); 
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')

# plot Logistic Regression predicted probabilities
LRcont = axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1')
axes[1].set_ylabel('X2'); axes[1].set_title('Logistic Regression')
cb = fig2.colorbar(LRcont, ax=axes.ravel().tolist())
cb.set_label('Probability of Positive Classification')
cb.set_ticks([0, .25, .5, .75, 1])
cb.set_ticklabels(["0", "0.25", "0.5", "0.75", "1.0"])
plt.savefig('compare_probs.png', bbox_inches='tight')

compare_probs

Figure 3: Conditional probabilities P(y|x) generated by Naive Bayes (left) and logistic regression (right).

Further reading

This post has focused on Gaussian Naive Bayes as it is the direct counterpart of logistic regression for continuous data. It’s important to note, however, that Naive Bayes is frequently used on data with binomial or multinomial features. Examples include spam filters and language classifiers. For more information on Naive Bayes in these contexts, see these notes from CS 5780.

As mentioned above, logistic regression has been used for scenario discovery in water resources systems; for more detail and context see Julie’s blog post.

References

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Course Notes from the University of Pennsylvania (CIS 520): https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.Logistic

Course Notes from Cornell: http://www.cs.cornell.edu/courses/cs4780/2018fa/syllabus/index.html

Quinn, J. D., Reed, P. M., Giuliani, M., Castelletti, A., Oyler, J. W., & Nicholas, R. E. (2018). Exploring how changing monsoonal dynamics and human pressures challenge multireservoir management for flood protection, hydropower production, and agricultural water supply. Water Resources Research, 54, 4638–4662. https://doi.org/10.1029/2018WR022743

Intro to Machine Learning Part 5: Bagging

In Part 3 of this series, it was stated that classification error can be due to bias, variance, or noise. There are different methods to address each type of classification error. This post will focus on bootstrap aggregation, or bagging, which aims to improve classification error due to variance.

As a reminder, expected test error can be decomposed as follows:

E_{x,y,D}\left[\left(h_D(x)-y\right)^2\right]=\underbrace{E_{x,D}\left[\left(h_D(x)-\bar{h}(x)\right)^2\right]}_{Variance}+\underbrace{E_{x}\left[\left(\bar{h}(x)-\bar{y}(x)\right)^2\right]}_{Bias^2}+\underbrace{E_{x,y}\left[\left(\bar{y}(x)-y\right)^2\right]}_{Noise}

where h_D is the classifier trained on dataset D, \bar{h} is the ideal expected classifier (the classifier averaged over all possible training sets), and \bar{y}(x) is the expected label at x. Now let's just focus on the part of the testing error that is due to variance. From the equation, it is clear that if we want this error component to approach 0, then h_D(x) must tend to \bar{h}(x). We can use the Weak Law of Large Numbers (WLLN) to understand how this can happen. WLLN states that if you have a sample of independent and identically distributed (i.i.d.) random variables, as the sample size grows larger, the sample mean will tend toward the population mean.

If this is applied in a classification setting:

\hat{h}(x)=\frac{1}{m}\sum_{i=1}^{m}h_{D_i}(x)\;\longrightarrow\;\bar{h}(x)\quad\text{as } m\to\infty

In this equation, a classifier is trained on each of the m datasets and then the results are averaged. By WLLN, \hat{h} tends to \bar{h}… but only as our number of datasets, m, approaches infinity. But we only have one dataset, D, so how can we get more datasets? Through bagging!

Bagging

Picture1

Fig 1: Bootstrapping Process

Let’s say our original dataset D comes from some distribution P, shown in Figure 1. We can simulate drawing from P by bootstrapping, or sampling our original dataset D with replacement. Because we are sampling with replacement, we are essentially creating new datasets out of our original one! Note that when we sample with replacement, we cannot say that the new samples are i.i.d., so the Weak Law of Large Numbers does not quite apply, but error due to variance can still be considerably reduced with this method.
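
The sketch below shows the bootstrapping step directly, with hypothetical arrays X and y: each of the m datasets is a with-replacement sample of the original one, a tree is trained on each, and the ensemble prediction is a majority vote.

# Bagging by hand: bootstrap m datasets, train a tree on each, combine by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # toy features
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)           # toy labels in {-1, +1}

m = 25                                               # number of bootstrapped datasets
trees = []
for _ in range(m):
    idx = rng.integers(0, len(X), size=len(X))       # sample row indices with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# ensemble prediction: majority vote (sign of the average of the +/-1 predictions)
votes = np.mean([tree.predict(X) for tree in trees], axis=0)
y_bagged = np.sign(votes)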

Application: Random Forests

Bagging is most popularly utilized for random forests, which are bagged decision trees. A decision tree is trained on each of the m bootstrapped datasets, and then either the mode label is assigned (in a classification setting) or an average value is calculated (in a regression setting). Random forests are termed “out of the box” algorithms; they are very easy to use because 1) only two hyper-parameters have to be defined, m (the number of trees, i.e., bootstrapped datasets) and k (the number of features considered at each split), and 2) features don’t need to be normalized to have the same units or scale, which is extremely helpful for any kind of heterogeneous data.

Coding up the Classifier

Random Forests can be implemented in every major programming language. In Python, scikit-learn provides a RandomForestClassifier class. Only the number of trees in the forest needs to be specified, but there are many extra parameters that can be fine-tuned, including the criterion used to measure the quality of a split (Gini vs. entropy) and the maximum depth of each tree. The code below outlines a basic implementation of the classifier. A training dataset with “value” and “label” fields is read in, and a RandomForestClassifier with 200 trees is fit to it. The classifier is then used to predict the labels associated with each test-set “value,” and the prediction accuracy is calculated.


#import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
#import training data
training = pd.read_csv('train.csv', delimiter=',')
#fit classifier in training data
rfc1=RandomForestClassifier(n_estimators= 200, max_depth=8, criterion='entropy')
rfc1.fit(training[['value']], training['label'])
#import testing data
test = pd.read_csv('test.csv', delimiter=',')
#predict labels
predicted = rfc1.predict(test[['value']])
#calculate accuracy
acc=np.mean(predicted == test.label)

Random Forests in Water Resources

The water resources field has seen an increasing machine learning presence over the past decade. However, the majority of research has focused on the implementation of artificial neural networks. Though just emerging in the field, random forests are becoming increasingly popular for both classification and regression. Shortridge et al. (2016) test the ability of random forests (among other models such as GLMs and ANNs) to predict monthly streamflow for five rivers in the Lake Tana Basin in Ethiopia. Empirical models that estimate monthly streamflow as a function of climate conditions and agricultural land cover in each basin were developed for the period from 1961 to 2004. The predictive capabilities of each model were tested along with an investigation of their error structures. Random forests were found to better capture non-linear relationships than their ANN counterparts. Furthermore, they were much more parsimonious and physically interpretable. The paper also addresses the ability of these models to make meaningful predictions under uncertainty and a non-stationary climate.

References:

All of the material for the post comes from Kilian Weinberger’s CS4780 class notes and lectures found at: http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote18.html

Shortridge, Julie E., et al. “Machine Learning Methods for Empirical Streamflow Simulation: a Comparison of Model Accuracy, Interpretability, and Uncertainty in Seasonal Watersheds.” Hydrology and Earth System Sciences, vol. 20, no. 7, 2016, pp. 2611–2628., doi:10.5194/hess-20-2611-2016.