This semester, I had the opportunity to take Dr. Scott Steinschneider’s new class, Hydrologic Engineering in a Changing Climate, that is offered here at Cornell. In this class, we covered time series analysis, extreme value modeling, and trend tests. I chose to do a final project which focused on using a time series approach to forecast electricity demand in California. As I worked through my project, it became apparent to me that data are rarely in the form where a time series model can be applied directly. Consequently, multiple transformations are usually necessary before a model can be fit to the data set. In this post, I outline some of the inherent characteristics of data that might warrant a transformation as well as the steps that can be taken to address these problems.
Given a data set that you would like to fit any Box-Jenkins model to, you should ask yourself the following two questions:
- Is normality a reasonable assumption for the residuals?
- Are the data stationary?
Normality
Normality can be checked before fitting the model because if the original data are not normal, then there is a good chance that the residuals won’t be either. If you plot a histogram of the data and it looks like Figure 1, you probably need to apply some form of normalizing transformation.
Figure 1: Histogram of Monthly Flow
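Two traditional transformations that you can try are a log transform or a Box-Cox transform, shown in the following two equations, where $x_t$ is the original data point and $y_t$ is the transformed value.
Log Transform: $y_t = \log(x_t)$
Box-Cox Transform: $y_t = \dfrac{x_t^{\lambda} - 1}{\lambda}$ for $\lambda \neq 0$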
Sometimes a log transformation can be too drastic and skew the data the opposite way. The Box-Cox transform is effectively a less intense transformation that one can try if the log transform is not suitable. Note that when λ=0, the Box-Cox transform reduces to a simple log transform.
The powerTransform function in the R package car can be used to estimate the λ that brings the transformed data as close to normal as possible.
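To make the two transforms concrete, here is a minimal Python sketch (standard library only) rather than the R workflow described above. The function names `box_cox` and `inv_box_cox` and the flow values are made up for illustration, and λ is simply supplied by hand instead of being estimated the way powerTransform would.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of strictly positive values; lam = 0 gives the natural log."""
    if any(v <= 0 for v in x):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]

def inv_box_cox(y, lam):
    """Invert box_cox to get back to the original space."""
    if lam == 0:
        return [math.exp(v) for v in y]
    return [(lam * v + 1.0) ** (1.0 / lam) for v in y]

# Hypothetical right-skewed flow values (not the data behind Figure 1)
flows = [12.0, 15.5, 40.2, 310.0, 95.3, 22.1]

logged = box_cox(flows, 0)          # plain log transform
milder = box_cox(flows, 0.5)        # less drastic than the log
nearly_log = box_cox(flows, 1e-6)   # approaches the log as lambda shrinks to 0
recovered = inv_box_cox(milder, 0.5)
```

Comparing `logged` with `nearly_log` shows the λ=0 case numerically, and `recovered` shows that the transform can be undone once a model has been fit.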
Stationarity
For a data set to be stationary, the following three conditions must hold; we need them to be confident that our model will represent our data well:
For any lag term, $s$,
- $E[x_t] = E[x_{t+s}]$ (The mean of the data set does not change with time)
- $\mathrm{Var}[x_t] = \mathrm{Var}[x_{t+s}]$ (The variance of the data set does not change with time)
- $\mathrm{Cov}[x_t, x_{t+s}] = \gamma_s$ (The covariance between data points depends only on the lag, $s$, and not on time)
Outlined below are some of the characteristics of a data set that can cause a violation of one or more of these principles.
Seasonality
Seasonality in data can exist if a time series pattern repeats over a fixed and known period. Figure 2 shows monthly inflow into the Schoharie Creek Reservoir. Periodicity is apparent, but it isn’t until we look at the autocorrelation function (ACF) of the data, shown in Figure 3, that we see that there is a clear repetition occurring every 12 months.
Figure 2: Monthly Inflow for the Schoharie Creek
Figure 3: ACF of Monthly Inflow
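For readers who want to see how an ACF exposes a 12-month cycle, below is a small Python sketch using only the standard library. The `acf` helper and the synthetic sine-wave series are assumptions for illustration; they are not the Schoharie Creek data or the function used to make Figure 3.

```python
import math

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares used for normalization
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]

# Synthetic "monthly" series: 10 years of a pure annual cycle
series = [100 + 50 * math.sin(2 * math.pi * t / 12) for t in range(120)]
r = acf(series, 24)

# Lags 12 and 24 are strongly positive, lag 6 strongly negative
peaks = (r[6], r[12], r[24])
```

A real inflow record would be noisier, but the same pattern of peaks at multiples of 12 is what signals monthly seasonality.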
One effective way to get rid of this monthly seasonality is to use the following de-seasonalizing equation:
$z_t = \dfrac{x_t - \bar{x}_{m_t}}{s_{m_t}}$
The seasonality is removed from each data point by subtracting the corresponding monthly mean ($\bar{x}_{m_t}$) and dividing by that month’s standard deviation ($s_{m_t}$), where $m_t$ is the calendar month of observation $t$. The same equation can be used to account for daily or yearly seasonality.
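The de-seasonalizing step can be sketched in a few lines of Python. This is only an illustration: `deseasonalize` is a hypothetical helper, it assumes the series starts at the first season and covers each season at least twice, and the quarterly numbers are invented so the result is easy to check by hand.

```python
import statistics

def deseasonalize(x, period=12):
    """Standardize each value by the mean and standard deviation of its season.

    Returns the standardized series plus the per-season means and standard
    deviations, which are needed later to back-transform.
    """
    means, sds = [], []
    for m in range(period):
        season = x[m::period]                  # every observation from season m
        means.append(statistics.mean(season))
        sds.append(statistics.stdev(season))   # sample standard deviation
    z = [(v - means[t % period]) / sds[t % period] for t, v in enumerate(x)]
    return z, means, sds

# Three "years" of invented quarterly data (period = 4)
data = [10.0, 20.0, 30.0, 40.0,
        12.0, 24.0, 36.0, 48.0,
        14.0, 28.0, 42.0, 56.0]
z, means, sds = deseasonalize(data, period=4)
```

Here every quarter ends up on the same scale (each year maps to -1, 0, and 1 in turn), even though the raw quarters differ by a factor of four.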
Differencing is another way to address seasonality in data. A seasonal difference is the difference between an observation and the corresponding observation from the previous year:
$x'_t = x_t - x_{t-m}$
where m=12 for monthly data, m=4 for quarterly data, and so on (1).
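As a quick illustration of a seasonal difference, here is a Python sketch. The helper name `seasonal_difference` and the quarterly values are made up; the series is a fixed within-year pattern that grows by 2 units each year.

```python
def seasonal_difference(x, m):
    """Subtract the observation from m steps earlier (one season ago)."""
    return [x[t] - x[t - m] for t in range(m, len(x))]

# Invented quarterly series: pattern [5, 9, 3, 7] shifted up by 2 each year
quarterly = [5, 9, 3, 7,
             7, 11, 5, 9,
             9, 13, 7, 11]
sdiff = seasonal_difference(quarterly, m=4)
```

The repeating pattern cancels out and only the year-over-year change remains. Note that the differenced series is m observations shorter than the original.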
Trend
A trend, shown in the first panel of Figure 4, is a clear violation of the first requirement for stationarity. There are a couple of options that one can implement to deal with trends: differencing and model fitting.
Figure 4: De-trending process (1)
From Figure 4, it is clear that differencing can be used to account for seasonality but can also be used to dampen a trend. A first difference is performed by subtracting the observation in the previous time step from the current observation. It can be applied as follows:
$x'_t = x_t - x_{t-1}$
If the transformed data are plotted and still show a trend, a second difference can be applied to the differenced series:
$x''_t = x'_t - x'_{t-1}$
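Here is a short Python sketch of first and second differencing. The `difference` helper and the two toy series are assumptions chosen so the arithmetic is easy to follow: a straight line needs one difference to become flat, while a quadratic needs two.

```python
def difference(x, order=1):
    """Apply a lag-1 difference `order` times."""
    for _ in range(order):
        x = [later - earlier for earlier, later in zip(x, x[1:])]
    return list(x)

linear = [3 + 2 * t for t in range(6)]   # 3, 5, 7, 9, 11, 13
quadratic = [t * t for t in range(6)]    # 0, 1, 4, 9, 16, 25

d1_linear = difference(linear)                # constant after one difference
d1_quadratic = difference(quadratic)          # 1, 3, 5, 7, 9: still trending
d2_quadratic = difference(quadratic, order=2) # constant after two differences
```

Each round of differencing shortens the series by one observation, which matters for short records.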
It is important to note the distinction between seasonal and first differences. Seasonal differencing is the difference from one year to the next, while first differencing is the difference between one observation and the next. Seasonal and trend differencing can both be applied, but sometimes, if seasonal differencing is performed first, it will remove the need for further differencing (1).
In Figure 4, note how a log transform, seasonal differencing, and second differencing are all necessary to ultimately remove the trend.
Figure 5: Model Fitting with Ordinary Least Squares (2)
If a monotonic trend is observed, such as the one in Figure 5, a model can be fit to the trend. In this example, a linear model is fit by choosing the coefficients that minimize the sum of squared errors. The fitted model is then subtracted from the original data to give residuals. The goal is for the resulting residuals to be stationary. Note that a polynomial model can also be fit to the trend if appropriate (2).
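The linear case can be sketched in Python with the closed-form least squares solution. This is a minimal illustration, not the code behind Figure 5: `fit_linear_trend` and `detrend` are hypothetical helpers, time is assumed to be the index 0, 1, 2, ..., and the series is invented.

```python
def fit_linear_trend(x):
    """Ordinary least squares fit of x_t = intercept + slope * t."""
    n = len(x)
    t_mean = (n - 1) / 2
    x_mean = sum(x) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - x_mean) for t, v in enumerate(x))
    slope = sxy / sxx
    intercept = x_mean - slope * t_mean
    return intercept, slope

def detrend(x):
    """Subtract the fitted line and return the residuals."""
    intercept, slope = fit_linear_trend(x)
    return [v - (intercept + slope * t) for t, v in enumerate(x)]

# Invented series: a line plus a small repeating wiggle
series = [1.0 + 0.5 * t + (0.3 if t % 3 == 0 else -0.15) for t in range(12)]
residuals = detrend(series)
```

The residuals are what would be passed on to the time series model, and the fitted intercept and slope need to be kept so the trend can be added back to any predictions.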
Heteroscedasticity
Heteroscedasticity describes data that do not exhibit a constant variance. This is a violation of the second principle. Heteroscedasticity tends to appear in financial time series (e.g. prices of stocks and bonds), which can be very volatile, but it appears less often in hydrological data (3). I did not have to address heteroscedasticity in the electricity load data for my project, and some statisticians suggest that one doesn’t have to deal with it unless it is very severe, as weak heteroscedasticity tends to be taken care of by normalization and de-seasonalization.
One way to check for heteroscedasticity in a time series is with the McLeod-Li test for conditional heteroscedasticity. If heteroscedasticity is present, consider using an ARCH model if an AR model can be fit to the data, a GARCH model if an ARMA model can be fit, or a hybrid ARCH-ARIMA model if neither of those is appropriate.
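To give a feel for what the McLeod-Li test measures, here is a Python sketch of its statistic, which is the Ljung-Box Q statistic computed on the squared series. This is a simplified illustration under several assumptions: `mcleod_li_statistic` is a made-up helper, it is applied to a toy series rather than the residuals of a fitted model, and no p-value is computed. Under the null hypothesis of no conditional heteroscedasticity, Q is approximately chi-square distributed with `max_lag` degrees of freedom, so for 5 lags a value above about 11.07 would be significant at the 5% level.

```python
def mcleod_li_statistic(resid, max_lag):
    """Ljung-Box Q statistic on squared, mean-removed values."""
    n = len(resid)
    mean = sum(resid) / n
    sq = [(v - mean) ** 2 for v in resid]
    sq_mean = sum(sq) / n
    dev = [s - sq_mean for s in sq]
    c0 = sum(d * d for d in dev)
    if c0 == 0:
        raise ValueError("squared series is constant; statistic is undefined")
    total = 0.0
    for k in range(1, max_lag + 1):
        # lag-k autocorrelation of the squared series
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total

# Toy series: calm first half, volatile second half (a change in variance)
calm = [0.1 * (-1) ** t for t in range(50)]
volatile = [5.0 * (-1) ** t for t in range(50)]
q = mcleod_li_statistic(calm + volatile, max_lag=5)
```

For this deliberately extreme example the statistic is far above the 5% critical value, which is the kind of result that would point toward an ARCH/GARCH-type model.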
Choosing a Time Series Model
Once the necessary transformations have been performed, you are ready to fit a time series model to your data. R has some useful packages for this, including forecast and stats. Some helpful functions in these packages include:
auto.arima (forecast) – This function searches over candidate ARIMA models and returns the one with the best information criterion (AICc by default), along with the coefficients for the lag terms, the variance of the errors, and other useful information.
arima.sim (stats) – This function allows you to simulate a set of data from your time series model.
predict (stats) – This function provides predictions for n time steps into the future (set with the n.ahead argument) based on the chosen time series model. Keep in mind that it is most reliable when used to predict just the next few time steps.
Finally, remember that back-transformations must be performed on all simulations or predictions to get them back into the original space.
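The back-transformation simply undoes each step in reverse order. Below is a Python sketch for one possible pipeline (log transform followed by seasonal standardization). The helper names `forward` and `backward`, the quarterly period, and the numbers are all assumptions for illustration, not the steps from my project.

```python
import math
import statistics

def forward(x, period):
    """Log transform, then standardize each value by its season."""
    logged = [math.log(v) for v in x]
    means = [statistics.mean(logged[m::period]) for m in range(period)]
    sds = [statistics.stdev(logged[m::period]) for m in range(period)]
    z = [(v - means[t % period]) / sds[t % period] for t, v in enumerate(logged)]
    return z, means, sds

def backward(z, means, sds, period, start=0):
    """Undo the steps in reverse: re-seasonalize, then exponentiate.

    `start` is the time index of the first value in z, so that predictions
    beyond the end of the record pick up the right season.
    """
    out = []
    for i, v in enumerate(z):
        m = (start + i) % period
        out.append(math.exp(v * sds[m] + means[m]))
    return out

# Invented quarterly record (period = 4)
record = [10.0, 20.0, 30.0, 40.0,
          12.0, 26.0, 33.0, 48.0,
          15.0, 24.0, 39.0, 52.0]
z, means, sds = forward(record, period=4)

round_trip = backward(z, means, sds, period=4)

# Pretend a model predicted z = 0 for the next four quarters
prediction_z = [0.0, 0.0, 0.0, 0.0]
prediction = backward(prediction_z, means, sds, period=4, start=len(record))
```

The key design point is that the seasonal means and standard deviations (and any λ or trend coefficients) have to be saved during the forward pass, since the back-transformation cannot be done without them.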
*For a really helpful explanation of different time series notation, check this previous post.
References
*All information or figures not specifically cited came from class notes and homework from Dr. Scott Steinschneider’s class
(1) Stationarity and Differencing: https://www.otexts.org/fpp/8/1
(2) Removal of Trend and Seasonality, UC Berkeley: https://www.stat.berkeley.edu/~gido/Removal%20of%20Trend%20and%20Seasonality.pdf
(3) Heteroscedasticity: http://www.math.canterbury.ac.nz/~m.reale/econ324/Topic2.pdf