CNNs for Time Series Applications

This post is meant to be an introduction to convolutional neural networks (CNNs) and how they can be applied to continuous prediction problems, such as time series prediction. CNNs have historically been used in image classification applications. At a high level, CNNs use small kernels (filters) that slide over localized regions of an image and detect features ranging from edges to faces, much in the same way as the visual cortex of the brain (Hubel and Wiesel, 1968). The basic concepts of a CNN were first introduced by Kunihiko Fukushima in 1980, and the first use of CNNs for image recognition was carried out by Yann LeCun in the late 1980s. The major breakthrough for the algorithm didn't happen until the 2000s with the advent of GPUs, and by 2015 CNNs were favored to win image recognition contests over other deep networks.

It is often believed that recurrent-style networks such as LSTMs are the most appropriate algorithms for time series prediction, but studies suggest that CNNs can perform equivalently (or better), and that appropriate filters can extract features that are coupled across variables and time while being computationally efficient to train (Bai et al., 2018; Rodrigues et al., 2021). Below, I'll demonstrate some of the key characteristics of CNNs and how they can be used for time series prediction problems.

Architecture


Figure 1: CNN schematic for image classification (Sharma, 2018)

Figure 1 shows a schematic of a CNN's architecture. The architecture is primarily composed of a series of convolution and pooling layers followed by a fully connected network. Each convolution layer contains kernel matrices that are convolved with that layer's input. It is up to the user to define the number and size of the kernels, but the weights within each kernel are learned using backpropagation. A bias is added to the output of the convolution layer, and the result is passed through an activation function, such as the ReLU function, to yield feature maps. The feature maps are stacked in a cuboid whose depth equals the number of filters. If the convolution layer is followed by a pooling layer, the feature maps are down-sampled to produce a lower-dimensional representation of the feature maps. The output from the final pooling or convolutional layer is flattened and fed to the fully connected layers.

We will now look at the components of the architecture in more detail. To demonstrate how the convolutional layer works, we will use a toy example shown in Figure 2.

Figure 2: Convolution of a 3×3 kernel with the original image

Let's say that our input is an image represented as a 5×5 array and the filter is a 3×3 kernel that will be convolved with the image. The result is the array termed Conv1, which is just another array where each cell is the dot product between the filter and the 3×3 subsections of the image. The numbers in color represent the values that the filter is centered on. Note that the convolution operation results in an output that is smaller than the input and can lose information around the boundaries of the image. Zero padding, which constitutes adding a border of zeros around the input array, can be used to preserve the input size. The kernel matrices are the mechanism by which the CNN identifies underlying patterns. Figure 3 shows examples of what successive outputs from convolution layers, or feature maps, can look like.
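To make this concrete, here is a minimal NumPy sketch of the operation described above (the array values are illustrative, not the ones in Figure 2; note that, like CNN layers, this computes the dot products without flipping the kernel):

import numpy as np

# Illustrative 5x5 input and 3x3 kernel
image = np.array([[1, 0, 2, 1, 0],
                  [0, 1, 1, 0, 2],
                  [2, 1, 0, 1, 1],
                  [1, 0, 1, 2, 0],
                  [0, 2, 1, 0, 1]])
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image, taking the elementwise
    product-sum at each position (no padding, stride 1)."""
    m, n = kernel.shape
    out = np.zeros((image.shape[0] - m + 1, image.shape[1] - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+m, j:j+n] * kernel)
    return out

conv1 = convolve2d_valid(image, kernel)
print(conv1.shape)  # (3, 3): the 5x5 input shrinks to 3x3

# Zero padding preserves the input size: add a 1-pixel border of zeros
print(convolve2d_valid(np.pad(image, 1), kernel).shape)  # (5, 5)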

Figure 3: Convolutional layer output for a CNN trained to distinguish between cats and dogs (Dertat, 2017)

The filters in the first convolutional layer of a CNN retain most of the information of the image, particularly edges. The brightest colors represent the most active pixels. The feature maps tend to become more abstract or focused on specific features as you move deeper into the network (Dertat, 2017). For example, Block 3 seems to be tailored to distinguish eyes.
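If you would like to produce visualizations like Figure 3 for your own network, one common approach is to build a truncated Keras model that outputs an intermediate layer's activations. Below is a minimal sketch, assuming a trained Keras CNN named cnn and a preprocessed input array image of shape (1, height, width, channels), both of which are hypothetical placeholders:

import matplotlib.pyplot as plt
from keras.models import Model

# Truncated model that returns the activations of an intermediate layer
feature_extractor = Model(inputs=cnn.input,
                          outputs=cnn.layers[2].output)  # pick any conv layer
feature_maps = feature_extractor.predict(image)  # shape (1, rows, cols, n_filters)

# Plot the first 16 feature maps
for i in range(16):
    plt.subplot(4, 4, i + 1)
    plt.imshow(feature_maps[0, :, :, i], cmap='viridis')
    plt.axis('off')
plt.show()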

The other key type of layer is a pooling layer. A pooling layer is added after convolution to reduce dimensionality, which shortens training time by reducing the number of parameters and can also reduce the chances of overfitting. The most common type of pooling is max pooling, which returns the maximum value within an N×N pooling filter. This type of pooling retains the most active pixels in the feature map. As demonstrated in Figure 4, max pooling with a 2×2 filter and a stride (or shift) of 2 pixels reduces our Conv1 layer to a 2×2 lower-dimensional matrix. One can also use average pooling instead of max pooling, which takes the average of the values in each 2×2 subsection of the Conv1 layer.

Figure 4: Max pooling example
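A minimal NumPy sketch of max pooling (shown on an illustrative 4×4 feature map so the 2×2, stride-2 windows tile evenly):

def max_pool(feature_map, size=2, stride=2):
    """Return the maximum value in each size-by-size window."""
    rows = (feature_map.shape[0] - size) // stride + 1
    cols = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()  # swap for window.mean() for average pooling
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 1, 8]])
print(max_pool(fmap))
# [[6. 4.]
#  [7. 9.]]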

Application to Regression

CNNs are easiest to understand and visualize for image applications, which provide a basis for thinking about how we can use CNNs in a regression or prediction application for time series. Let's use a very simple example of a rainfall-runoff problem that uses daily precipitation and temperature to predict outflow in an ephemeral sub-basin within the Tuolumne Basin. Because the creek in this sub-basin is ephemeral, it can dry up across the simulation period, producing extended periods of zero flow that make predictions in the basin very difficult. Here, we also implement a lag, which allows us to account for the residence time of the basin: precipitation and temperature from previous days likely contribute to predicting the outflow today. We use a lag of 18, meaning that we use the previous 18 days of precipitation and temperature to predict outflow. The CNN model is implemented within Keras in the code below.

#import modules

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
from keras.models import Model
from keras.layers import Input, Conv1D, Dense, Flatten
from sklearn.model_selection import train_test_split

os.getcwd()
os.chdir("C:/Users/Rohini/Documents/")
df_ge = pd.read_csv("Sub_0_daily.csv", index_col=0) 
print(df_ge.head())

#Check for nulls
print("checking if any null values are present\n", df_ge.isna().sum())

#Specify the training columns by their names
train_cols = ["Precipitation","Temperature"]
label_cols = ["Outflow"]



# This function normalizes the input data
def Normalization_Transform(x):
    x_mean=np.mean(x, axis=0)
    x_std= np.std(x, axis=0)
    xn = (x-x_mean)/x_std
    return xn, x_mean,x_std




# This function reverses the normalization 
def inverse_Normalization_Transform(xn, x_mean,x_std):
    xd = (xn*x_std)+x_mean
    return xd



# building timeseries data with given timesteps (lags)
def timeseries(X, Y, Y_actual, time_steps, out_steps):
    input_size_0 = X.shape[0] - time_steps
    input_size_1 = X.shape[1]
    X_values = np.zeros((input_size_0, time_steps, input_size_1))
    Y_values = np.zeros((input_size_0,))
    Y_values_actual = np.zeros((input_size_0,))
    
    for i in tqdm_notebook(range(input_size_0)):
        X_values[i] = X[i:time_steps+i]
        Y_values[i] = Y[time_steps+i-1, 0]
        Y_values_actual[i] = Y_actual[time_steps+i-1, 0]
        
    print("length of time-series i/o",X_values.shape,Y_values.shape)
    return X_values, Y_values, Y_values_actual


df_train, df_test = train_test_split(df_ge, train_size=0.8, test_size=0.2, shuffle=False)
x_train = df_train.loc[:,train_cols].values
y_train = df_train.loc[:,label_cols].values
x_test = df_test.loc[:,train_cols].values
y_test = df_test.loc[:,label_cols].values    
   
#Normalizing training data (keeping the training statistics to invert predictions later)
x_train_nor, x_mean_train, x_std_train = Normalization_Transform(x_train)
y_train_nor, y_mean_train, y_std_train = Normalization_Transform(y_train)

# Normalizing test data with the training statistics to avoid information leakage
x_test_nor = (x_test - x_mean_train) / x_std_train
y_test_nor = (y_test - y_mean_train) / y_std_train

# Saving actual train and test y_label to calculate mean square error later after training
y_train_actual = y_train
y_test_actual = y_test

#Building timeseries
X_Train, Y_Train, Y_train_actual = timeseries(x_train_nor, y_train_nor, y_train_actual, time_steps=18, out_steps=1)
X_Test, Y_Test, Y_test_actual = timeseries(x_test_nor, y_test_nor, y_test_actual, time_steps=18, out_steps=1)

#Define CNN model

def make_model(X_Train):
    # Input: 18 lagged time steps x 2 variables (precipitation, temperature)
    input_layer = Input(shape=(X_Train.shape[1], X_Train.shape[2]))

    # Three stacked 1D convolutions; padding='same' preserves the 18-step length
    conv1 = Conv1D(filters=16, kernel_size=2, strides=1,
                   padding='same', activation='relu')(input_layer)
    conv2 = Conv1D(filters=32, kernel_size=3, strides=1,
                   padding='same', activation='relu')(conv1)
    conv3 = Conv1D(filters=64, kernel_size=3, strides=1,
                   padding='same', activation='relu')(conv2)

    # Flatten the stacked feature maps and pass them through the dense layers
    flatten = Flatten()(conv3)
    dense1 = Dense(1152, activation='relu')(flatten)
    dense2 = Dense(576, activation='relu')(dense1)
    output_layer = Dense(1, activation='linear')(dense2)  # linear output for regression

    return Model(inputs=input_layer, outputs=output_layer)

model = make_model(X_Train)
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(X_Train, Y_Train, epochs=10)


#Prediction and inverting results
ypred = model.predict(X_Test)
predict = inverse_Normalization_Transform(ypred, y_mean_train, y_std_train)


#Plot results
plt.figure(figsize=(11, 7))

plt.plot(Y_test_actual)
plt.plot(predict)

plt.title('Outflow Prediction (Precipitation+Temperature, Epochs=10, Lag=18 days)')
plt.ylabel('Outflow (cfs)')
plt.xlabel('Day')
plt.legend(['Actual Values','Predicted Values'], loc='upper right')
plt.show()

    

Just as with any algorithm, we normalize the input data and split it into testing and training sets. The CNN model is implemented in Keras and consists of three convolutional layers with kernel sizes that are explicitly defined to extract patterns that are coupled across variables and time. A schematic of the setup is shown in Figure 5.

Figure 5: Convolution layer setup for the Tuolumne case

Layer 1 uses a 1D convolutional layer with 16 filters of size 1×2 in order to extract features and interactions across the precipitation and temperature time series, as demonstrated in the top left of Figure 5. The result is an output of size 1×18×16. The second convolution layer uses 32 3×1 filters, which further capture temporal interactions down the output column vector. The third layer uses 64 3×1 filters, convolved with the output from the Conv2 layer, to capture more complex temporal trends. Note that zero padding is added (padding='same' in the code) to maintain the dimensions of the layers. The three convolutional layers are followed by a flattening layer and a three-layer dense network. The CNN was run 20 times and the results from the last iteration are shown in Figure 6. We also compare to an LSTM that has an equivalent 3-layer setup and that is also run 20 times. The actual outflow is shown in blue while predictions are shown in red.
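For reference, below is a sketch of what a comparable 3-layer LSTM looks like in Keras. The layer widths here are illustrative guesses on my part, not necessarily the exact configuration behind Figure 6:

from keras.models import Model
from keras.layers import Input, LSTM, Dense

def make_lstm_model(X_Train):
    input_layer = Input(shape=(X_Train.shape[1], X_Train.shape[2]))
    # return_sequences=True lets LSTM layers stack; the last layer
    # returns only its final hidden state
    lstm1 = LSTM(16, return_sequences=True)(input_layer)
    lstm2 = LSTM(32, return_sequences=True)(lstm1)
    lstm3 = LSTM(64)(lstm2)
    output_layer = Dense(1, activation='linear')(lstm3)
    return Model(inputs=input_layer, outputs=output_layer)

lstm_model = make_lstm_model(X_Train)
lstm_model.compile(optimizer='adam', loss='mean_squared_error')
lstm_model.fit(X_Train, Y_Train, epochs=10)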

Figure 6: CNN vs LSTM prediction

For all intents and purposes, the visual comparison suggests that the CNN and LSTM perform equivalently, though the CNN was considerably faster to train. Notably, the CNN does a better job of capturing the large extremes recorded around day 100 and day 900, while still capturing the dynamics of the lower flow regime. While these results are preliminary and largely un-optimized, the CNN shows the ability to outperform an LSTM on a style of problem that it was not technically designed for. Using the specialized kernels, the CNN learns the interactions (both across variables and in time) without needing a mechanism specifically designed for memory, such as the cell state in an LSTM. Furthermore, CNNs can take great advantage of additional speedups from GPUs, which don't always produce large efficiency gains for LSTM training. For now, we can at least conclude that CNNs are fast and promising alternatives to LSTMs that you may not have considered before. Future blog posts will dive more into the capabilities of CNNs in problems with more input variables and complex interactions, particularly whether CNNs offer a benefit in resolving complex relationships that help to predict extremes.

References

Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1), 215-243.

Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Rodrigues, N. M., Batista, J. E., Trujillo, L., Duarte, B., Giacobini, M., Vanneschi, L., & Silva, S. (2021). Plotting time: On the usage of CNNs for time series classification. arXiv preprint arXiv:2102.04179.

Sharma, V. (2018). https://vinodsblog.com/2018/10/15/everything-you-need-to-know-about-convolutional-neural-networks/

Dertat, A. (2017). https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2

Do The (Schaake) Shuffle

This post is an introduction to the Schaake Shuffle, a method that can be used to reconstruct space-time variability in forecasted and synthetic variables. The Schaake Shuffle was originally introduced in a synthetic weather generation post by Julie Quinn almost 5 years ago. Lately, the importance (and difficulty) of reproducing spatial and temporal variability in forecasts and synthetically generated variables across multiple correlated sites has been a prominent topic in our group. The goal of this post is to just "bump" this topic back into discussion and to make readers aware of its existence as a nifty post-generation way to build spatial and temporal variability back into synthetically generated data or forecasts. In the fundamental paper that establishes the method, Clark et al. (2004), the authors apply the method to forecasts of precipitation and temperature. For weather variables such as temperature and precipitation, it is common to create forecasts for individual stations from a Numerical Weather Prediction (NWP) model. These variables serve as predictor variables in regression models that can be used to generate forecasts. The problem with these styles of approaches is that neither spatial correlation between stations nor temporal persistence is preserved, both of which are very important for hydrologic applications with memory.

The Schaake Shuffle is a method that reorders ensemble forecasts of precipitation and temperature to better reconstruct the space-time variability using a rank-ordering approach constructed from a historical record. The basic steps are as follows:

  1. Gather appropriate data: The NWP model outputs forecasts of accumulated precipitation, air temperature, relative humidity at 700 hPa, wind speed, total column precipitable water, and mean sea level pressure, which are used as predictors in the forecast equations. Further, the authors acquire historical precipitation and temperature data for stations within four basins across the United States.
  2. Create Forecasts: The next step involves creating the precipitation and temperature ensemble forecasts. Multiple linear regression is used to develop model output statistics (MOS) equations. The forecast variables taken from the NWP model are ultimately filtered down to keep only the variables that explain the highest variance in the response variable (in this example, the response variables are precipitation, minimum temperature, and maximum temperature). A separate regression equation is fit for each variable, station, and forecast lead time. The residuals of the regression equation are modeled stochastically to generate an ensemble of forecasts. Alternatively, one can apply the Schaake Shuffle to synthetically generated ensembles (not limited to forecasts).
  3. Reorder Forecasts: The reordering method can best be described by an example. For a given time, assume you have an ensemble of possible forecasts that you align in a 3D matrix X_{i,j,k}, where i = ensemble member, j = station, and k = variable of interest (precipitation or temperature). From the historical record, you must construct an equally sized matrix Y_{i,j,k} which contains historical station observations for the same date. For this matrix, i = an index of dates in the historical time period, j = station, and k = variable of interest (precipitation or temperature).

Using this same notation, the authors create a toy example to demonstrate the process. For some time t, imagine we have forecasts of maximum temperature from 10 ensemble members, for a given date and station.

Let X be the 10-member ensemble under consideration.

X=[15.3,11.2,8.8,11.9,7.5,9.7,8.3,12.5,10.3,10.1]

We can sort the vector X to create χ

χ=[7.5,8.3,8.8,9.7,10.1,10.3,11.2,11.9,12.5,15.3].

Then we go to the historical record and choose 10 dates that reside in a 7-day window around the date being forecasted. This is our Y vector.

Y=[10.7,9.3,6.8,11.3,12.2,13.6,8.9,9.9,11.8,12.9]

We can sort this vector to create γ.

γ=[6.8,8.9,9.3,9.9,10.7,11.3,11.8,12.2,12.9,13.6]

We also create a vector B, which records the position of each value of the sorted historical vector within the unsorted vector Y.

B=[3,7,2,8,1,4,9,5,10,6]

The key now is to reorder the ensemble forecast in the same order as the B vector. The rank-order 1 value is in position 5 of the B vector, so we take the 5th value from χ (10.1). Rank order 2 is in position 3, so we take the 3rd value from χ (8.8). We continue doing this until we have

Xss=[10.1, 8.8, 7.5, 10.3, 11.9, 15.3, 8.3, 9.7, 11.2, 12.5]
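To make the mechanics concrete, here is a minimal NumPy sketch of this single-station reordering (my own implementation of the steps above, not the authors' code), which reproduces the toy example:

import numpy as np
from scipy.stats import spearmanr

def schaake_shuffle_1d(forecast, historical):
    """Reorder a forecast ensemble so its rank structure matches
    the historical observations for the same dates."""
    chi = np.sort(forecast)         # sorted ensemble (chi)
    order = np.argsort(historical)  # position of each sorted historical value in Y (B, 0-indexed)
    shuffled = np.empty_like(chi)
    # The n-th smallest forecast goes where the n-th smallest historical value occurred
    shuffled[order] = chi
    return shuffled

X = np.array([15.3, 11.2, 8.8, 11.9, 7.5, 9.7, 8.3, 12.5, 10.3, 10.1])
Y = np.array([10.7, 9.3, 6.8, 11.3, 12.2, 13.6, 8.9, 9.9, 11.8, 12.9])

X_ss = schaake_shuffle_1d(X, Y)
print(X_ss)  # [10.1  8.8  7.5 10.3 11.9 15.3  8.3  9.7 11.2 12.5]
rho, _ = spearmanr(X_ss, Y)
print(rho)   # 1.0: the shuffled ensemble now has the same rank ordering as the historical record

Extending this to multiple stations and variables amounts to repeating the reordering for each station and variable while drawing on the same set of historical dates throughout, which is what imposes the observed space-time rank structure on the whole ensemble.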

These are the basic fundamentals of the reordering algorithm, and it can be extended to forecasting at multiple stations, as demonstrated in the figure below. Table A shows 10 ensemble members for a weather forecast on January 14, 2004, ranked from lowest to highest value for three stations. Table B shows the historical record, where the black and light gray ellipses represent the 1st and 2nd ensemble members, respectively. Table C shows the sorted historical record and where the selected historical observations lie in the sorted list. Finally, Table A can be reordered accordingly to form Table D.

Schaake Shuffle for a 10-member ensemble and 3 stations (Figure 2 from Clark et al., 2004)

It's important to remember that the Schaake Shuffle is only meant to capture the Spearman rank correlation of the observations, not to reconstruct the actual (Pearson) correlations. The results from the paper, however, are quite remarkable and show how well the method captures spatial and temporal properties. The figure below shows an example of how the method preserves spatial correlation between two selected stations. The top set of figures shows raw ensemble output, while the bottom figures show results after the ensemble is reordered. The black lines denote the target observed correlation. Clearly, the reordered output approximates the observed correlation across lead times better than the raw ensemble output.

Preservation of spatial correlation for raw (top) and reordered (bottom) forecast ensembles (Figure 6 from Clark et al., 2004)

One basic limitation of this approach is the assumption of stationarity, i.e., that the structure in the historical record will be applicable to the forecasted data. While other methods exist that can potentially preserve space-time variability well, the advantage of the Schaake Shuffle is the ability to reconstruct these patterns after the fact, as a post-processing step. If readers are interested in implementing the Schaake Shuffle, basic pseudocode is included at the end of the paper, and the R package depPPR can automate the reordering process; the steps to download the package and run the algorithm are shown below. Note that this code only works for the single-station case: each column in the X matrix is an ensemble member, and the rows correspond to the days in the forecast. Implementing the example in Figure 2 for one station requires X and Y to each be a single row vector. Of course, one can manually extend this process to multiple stations.

install.packages("devtools")

devtools::install_github("katerobsau/depPPR")

library(depPPR)

forecast_example=as.matrix(read.delim("C:/Users/Rohini/Documents/synthetic.txt",header=FALSE))
climate_example=as.matrix(read.delim("C:/Users/Rohini/Documents/historical.txt",header=FALSE))

schaake_shuffle(X = forecast_example, Y = climate_example)

References:

All material is derived from the fundamental paper that introduces the Schaake Shuffle:

Clark, M., Gangopadhyay, S., Hay, L., Rajagopalan, B., & Wilby, R. (2004). The Schaake shuffle: A method for reconstructing space–time variability in forecasted precipitation and temperature fields. Journal of Hydrometeorology, 5(1), 243-262.

Time series forecasting in Python for beginners


This semester I am teaching Engineering Management Methods here at Cornell University. The course is aimed at introducing engineering students to systems thinking and a variety of tools and analyses they can use to analyze data. The first chapter has been on time series forecasting, where we discussed some of the simpler models one can use and apply for forecasting purposes, including Simple and Weighted Moving Averages, Single and Double Exponential Smoothing, Additive and Multiplicative Seasonal Models, and the Holt-Winters Method.

The class applications as well as the homework are primarily performed in Excel, but I have been trying, with limited success, to encourage the use of programming languages for the assignments. One comment I've received from a student has been that it takes significantly more time to perform the calculations by coding; they feel that it's a waste of time. I initially attributed the comment to the fact that the student was new to coding and it takes time in the beginning, but on later reflection I realized that, in fact, the student was probably simply repeating the same Excel operations manually in code: take a set of 30 observations, create an array to store forecasts, loop through every value and calculate the forecast using the model formula, calculate error metrics, print results, repeat for the next set of data. It occurred to me that of course they think it's a waste of time, because doing it that way completely negates what programming is all about: designing and building an executable program or function to accomplish a specific computing task. In this instance, the task is to forecast using each of the models we learn in class, and the advantage of coding comes with the development of a program or function that performs these operations for us, given a set of data as input. Simply going through the steps of performing a set of calculations for a problem using code is not much different than doing so manually or in Excel. What is different (and beneficial) is designing the code so that it can then be effortlessly applied to all similar problems without having to re-perform all calculations. I realize this is obvious to the coding virtuosos frequenting this blog, but it's not immediately obvious to the uninitiated, who are rather confused about why Dr. Hadjimichael is asking them to waste so much time for a meager bonus on the homework.

So this blog post is aimed at demonstrating to coding beginners how one can transition from one way of thinking to the other, and at providing a small time-series-forecasting toolkit for users who simply want to apply the models to their data.

The code and data for this example can be found on my GitHub page and are discussed below. I will be using a dataset of Australian wine sales (in kiloliters) from January 1980 to October 1991. The data looks like this:

Date Sales
1/1/80 464
2/1/80 675
3/1/80 703
4/1/80 887
5/1/80 1139
6/1/80 1077
7/1/80 1318
8/1/80 1260
9/1/80 1120
10/1/80 963
11/1/80 996
12/1/80 960

And this is what the time series looks like:

Figure: Monthly Australian wine sales, January 1980 to October 1991

We first need to import the packages we’ll be using and load the data. I will be using Pandas in this example (but there’s other ways). I’m also defining the number of seasonal periods in a cycle, in this case 12.

import numpy as np #Package we'll use for numerical calculations
import matplotlib.pyplot as plt #From matplotlib package we import pyplot for plots
import pandas #Package for data manipulation
import scipy.optimize #Package we'll use to optimize
plt.style.use('seaborn-colorblind') #This is a pyplot style (optional)

'''Load the data into a pandas series with the name time_series'''
time_series = pandas.read_csv("wine_sales.csv", header=0, index_col=0,
                              parse_dates=True).squeeze("columns")

P=12 #number of seasonal periods in a cycle
In class, I've always mentioned that one should use a training and a validation set for model development, primarily to avoid overfitting our model to the specific training set. In this example, the functions are written as they apply to the training set. Should you choose to apply the functions listed here, you should apply them to the training set, extract forecasts, and then use those to initialize your validation period. To divide the observations, you would do something like this:
training = time_series[0:108] # Up to December '88
validation = time_series[108:] # From January '89 until end

Now, if, say, we wanted to apply the Naive model, where the next step's forecast is equal to the current observation, i.e., \hat{y}_{t+1}=y_t, we'd do something like:

 
y_hat=pandas.Series().reindex_like(time_series) # Create a series to store forecasts
y_hat[0]= time_series[0] # Initialize forecasting array with first observation
''' Loop through every month using the model to forecast y_hat'''
for t in range(len(y_hat)-1): # Set a range for the index to loop through
    y_hat[t+1]= time_series[t] # Apply model to forecast time t+1

Now if we’d like to use this for any time series, so we don’t have to perform our calculations every time, we need to reformat this a bit so it’s a function:

 
def naive(time_series):
    y_hat=pandas.Series().reindex_like(time_series)
    y_hat[0]= time_series[0] # Initialize forecasting array with first observation
    ''' Loop through every month using the model to forecast y'''
    #This sets a range for the index to loop through
    for t in range(len(y_hat)-1): 
        y_hat[t+1]= time_series[t] # Apply model to forecast time t+1
    return y_hat

Now we can just define this function at the top of our code and call it with any time series as an input. The function as I've defined it returns a pandas.Series with all our forecasts. We can then do the same for all the other modeling methods (below). Some things to note:

  • The data we read at the top, outside the functions, as well as any parameters defined (P in this case), are global variables and do not need to be passed as inputs to the functions. The functions below only need a list of parameter values as inputs.
  • For the models with seasonality and/or trend we need to create separate series to store the estimates of E, S, and T.
  • Each model has its own initialization formulas and if we wanted to apply them to the validation set that follows our training set, we’d need to initialize with the last values of our training.
'''SIMPLE MOVING AVERAGE
Using this model, y_hat(t+1)=(y(t)+y(t-1)...+y(t-k+1))/k (i.e., the predicted 
next value is equal to the average of the last k observed values).'''
def SMA(params):
    k=int(np.array(params))
    y_hat=pandas.Series().reindex_like(time_series)
    y_hat[0:k]=time_series[0:k]
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(k-1,len(y_hat)-1): #This sets a range for the index to loop through
        y_hat[t+1]= np.sum(time_series[t-k+1:t+1])/k # Apply model to forecast time t+1
    return y_hat

'''WEIGHTED MOVING AVERAGE
Using this model, y_hat(t+1)=w(1)*y(t)+w(2)*y(t-1)...+w(k)*y(t-k+1) (i.e., the 
predicted next value is equal to the weighted average of the last k observed 
values).'''
def WMA(params):
    weights = np.array(params)
    k=len(weights)
    y_hat=pandas.Series().reindex_like(time_series)
    y_hat[0:k]=time_series[0:k] # Initialize values
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(k-1,len(y_hat)-1): #This sets a range for the index to loop through
        y_hat[t+1]= np.sum(time_series[t-k+1:t+1].multiply(weights)) # Apply model to forecast time t+1
    return y_hat
'''This model includes the constraint that all our weights should sum to one. 
To include this in our optimization later, we need to define it as a function of our 
weights.'''
def WMAcon(params):
    weights = np.array(params)
    return np.sum(weights)-1

'''SINGLE EXPONENTIAL SMOOTHING
Using this model, y_hat(t+1)=y_hat(t)+a*(y(t)-y_hat(t)) (i.e., the 
predicted next value is equal to the last forecast plus a weighted correction 
for its error relative to the observed value).'''
def SES(params):
    a = np.array(params)
    y_hat=pandas.Series().reindex_like(time_series)
    y_hat[0]=time_series[0] # Initialize values
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(len(y_hat)-1): #This sets a range for the index to loop through
        y_hat[t+1]= y_hat[t]+a*(time_series[t]-y_hat[t]) # Apply model to forecast time t+1
    return y_hat

'''DOUBLE EXPONENTIAL SMOOTHING (Holt's Method)
Using this model, y_hat(t+1)=E(t)+T(t) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
trend).'''
def DES(params):
    a,b = np.array(params)
    y_hat=pandas.Series().reindex_like(time_series)
    '''We need to create series to store our E and T values.'''
    E = pandas.Series().reindex_like(time_series)
    T = pandas.Series().reindex_like(time_series)
    y_hat[0]=E[0]=time_series[0] # Initialize values
    T[0]=0
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*time_series[t]+(1-a)*(E[t]+T[t])
        T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
        y_hat[t+1] = E[t] + T[t] # Apply model to forecast time t+1
    return y_hat

'''ADDITIVE SEASONAL
Using this model, y_hat(t+1)=E(t)+S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
appropriate seasonal factor). We first need to create an array to store our 
forecast values.'''
def ASM(params):
    a,b = np.array(params)
    p = P
    y_hat=pandas.Series().reindex_like(time_series)
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    E[:p] = np.sum(time_series[:p])/p
    S[:p] = time_series[:p]-E[:p]
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]-S[t+1-p])+(1-a)*E[t]
        S[t+1] = b*(time_series[t]-E[t])+(1-b)*S[t+1-p]
        y_hat[t+1] = E[t] + S[t+1-p] # Apply model to forecast time t+1
    return y_hat

'''MULTIPLICATIVE SEASONAL
Using this model, y_hat(t+1)=E(t)*S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series times 
the appropriate seasonal factor). We first need to create an array to store our 
forecast values.'''
def MSM(params):
    a,b = np.array(params)
    p = P
    y_hat=pandas.Series().reindex_like(time_series)
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    E[:p] = np.sum(time_series[:p])/p
    S[:p] = time_series[:p]/E[:p]
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]/S[t+1-p])+(1-a)*E[t]
        S[t+1] = b*(time_series[t]/E[t])+(1-b)*S[t+1-p]
        y_hat[t+1] = E[t]*S[t+1-p] # Apply model to forecast time t+1
    return y_hat

'''ADDITIVE HOLT-WINTERS METHOD
Using this model, y_hat(t+1)=E(t)+T(t)+S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
trend, plus the appropriate seasonal factor). We first need to create an array 
to store our forecast values.'''
def AHW(params):
    a, b, g = np.array(params)
    p = P
    y_hat=pandas.Series().reindex_like(time_series)
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    T = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    E[:p] = np.sum(time_series[:p])/p
    S[:p] = time_series[:p]-E[:p]
    T[:p] = 0
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]-S[t+1-p])+(1-a)*(E[t]+T[t])
        T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
        S[t+1] = g*(time_series[t]-E[t])+(1-g)*S[t+1-p]
        y_hat[t+1] = E[t]+T[t]+S[t+1-p] # Apply model to forecast time t+1
    return y_hat

'''MULTIPLICATIVE HOLT-WINTERS METHOD
Using this model, y_hat(t+1)=(E(t)+T(t))*S(t-p) (i.e., the 
predicted next value is equal to the expected level of the time series plus the 
trend, times the appropriate seasonal factor). We first need to create an array 
to store our forecast values.'''
def MHW(params):
    a, b, g = np.array(params)
    p = P
    y_hat=pandas.Series().reindex_like(time_series)
    '''We need to create series to store our E and S values.'''
    E = pandas.Series().reindex_like(time_series)
    S = pandas.Series().reindex_like(time_series)
    T = pandas.Series().reindex_like(time_series)
    y_hat[:p]=time_series[0] # Initialize values
    '''We need to initialize the first p number of E and S values'''
    S[:p] = time_series[:p]/(np.sum(time_series[:p])/p)
    E[:p] = time_series[:p]/S[:p]
    T[:p] = 0
    ''' Loop through every month using the model to forecast y.
    Be careful with Python indexing!'''
    for t in range(p-1, len(y_hat)-1): #This sets a range for the index to loop through
        E[t+1] = a*(time_series[t]/S[t+1-p])+(1-a)*(E[t]+T[t])
        T[t+1] = b*(E[t+1]-E[t])+(1-b)*T[t]
        S[t+1] = g*(time_series[t]/E[t])+(1-g)*S[t+1-p]
        y_hat[t+1] = (E[t]+T[t])*S[t+1-p] # Apply model to forecast time t+1
    return y_hat

Having defined these functions, I can then, for example, call the Multiplicative Holt-Winters method by simply typing:

MHW([0.5,0.5,0.5])

This will produce a forecast using the Multiplicative Holt-Winters method with those default parameters, but we would like to calibrate them to get the "best" forecasts from our model. To do so, we need to define what we mean by "best"; in this example I'm choosing to use Mean Squared Error as my performance metric. I define it below as a function that receives the parameters and some additional arguments as inputs. I only need to set it up this way because my optimization function will be minimizing the MSE function over those parameters. I'm using the "args" list simply to tell the function which model to use to forecast.

def MSE(params, args):
    model, = args
    t_error = np.zeros(len(time_series))
    forecast = model(params)
    for t in range(len(time_series)):
        t_error[t] = time_series[t]-forecast[t]
    MSE = np.mean(np.square(t_error))
    return MSE

To perform the optimization in Excel, we'd use Solver, but in Python we have other options. SciPy is a Python package that allows us, among many other things, to optimize such single-objective problems. What I'm doing here is defining a list of all the models I want to optimize, their default parameters, and the parameters' bounds. I then use a loop to go through the list of models and run the optimization. To store the minimized MSE values as well as the parameter values that produce them, we create an array for the MSEs and a list for the parameter values of each model. The optimization function returns a dictionary-like object that contains the minimized MSE value (under 'fun'), the parameter values that produce it (under 'x'), and other information.

''' List of all the models we will be optimizing'''
models = [SES, DES, ASM, MSM, AHW, MHW]
''' This is a list of all the default parameters for the models we will be 
optimizing. '''
                      #SES,  DES,     ASM
default_parameters = [[0.5],[0.5,0.5],[0.5,0.5],
                      #MSM,        AHW,            MHW
                      [0.5,0.5],[0.5,0.5,0.5],[0.5,0.5,0.5]]
''' This is a list of all the bounds for the default parameters we will be 
optimizing. All the a,b,g's are weights between 0 and 1. '''
bounds = [[(0,1)],[(0,1)]*2, [(0,1)]*2,
           [(0,1)]*2,[(0,1)]*3,[(0,1)]*3]
min_MSEs = np.zeros(len(models)) # Array to store minimized MSEs
opt_params = [None]*len(models) # Empty list to store optim. parameters
for i in range(len(models)):
    res = scipy.optimize.minimize(MSE, # Function we're minimizing (MSE in this case)
                                  default_parameters[i], # Default parameters to use
                                  # Additional arguments that the optimizer 
                                  # won't be changing (model in this case)
                                  args=[models[i]], 
                                  method='L-BFGS-B', # Optimization method to use
                                  bounds=bounds[i]) # Parameter bounds
    min_MSEs[i] = res['fun'] #Store minimized MSE value
    opt_params[i] = res['x'] #Store parameter values identified by optimizer

Note: For the WMA model, the weights should sum to 1, and this should be passed to our optimization as a constraint. To do so, we define the constraint function (WMAcon above) as a dictionary and include the following in our minimization call: constraints=[{'type':'eq','fun': WMAcon}]. Keep in mind that equality constraints are only supported by some of SciPy's methods (e.g., SLSQP), not by L-BFGS-B. The number of periods to consider is an integer and cannot be optimized by this type of optimizer. A sketch of such a call is shown below.
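Under those assumptions, a hedged sketch of the constrained WMA calibration (with an example choice of k=3 weights) could look like this:

k = 3  # example: calibrate a 3-period weighted moving average
res_wma = scipy.optimize.minimize(MSE,
                                  [1.0/k]*k,        # start from equal weights
                                  args=[WMA],
                                  method='SLSQP',   # supports equality constraints
                                  bounds=[(0,1)]*k,
                                  constraints=[{'type':'eq','fun': WMAcon}])
print(res_wma['x'], res_wma['fun'])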

Finally, we’d like to present our results. I’ll do so by plotting the observations and all my models as well as their minimized MSE values:

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1) # Create figure
ax.set_title("Australian wine sales (kilolitres)") # Set figure title
l1 = ax.plot(time_series, color='black', linewidth=3.0, label='Observations') # Plot observations
for i in range(len(models)):
    ax.plot(time_series.index,models[i](opt_params[i]), label = models[i].__name__)
ax.legend() # Activate figure legend
plt.show()
print('The estimated MSEs for all the models are:')
for i in range(len(models)):
    print(models[i].__name__ +': '+str(min_MSEs[i]))

This snippet of code should produce a figure of all our forecasts, as well as a report of all MSEs:

The estimated MSEs for all the models are:
SES: 133348.78
DES: 245436.67
ASM: 80684.00
MSM: 64084.48
AHW: 72422.34
MHW: 64031.19

The Multiplicative Holt-Winters method appears to give the smallest MSE when applied to these data.