Preparing Data for a Time Series Analysis

This semester, I had the opportunity to take Dr. Scott Steinschneider’s new class, Hydrologic Engineering in a Changing Climate, which is offered here at Cornell. In this class, we covered time series analysis, extreme value modeling, and trend tests. I chose to do a final project focused on using a time series approach to forecast electricity demand in California. As I worked through my project, it became apparent to me that data are rarely in a form to which a time series model can be applied directly. Consequently, multiple transformations are usually necessary before a model can be fit to the data set. In this post, I outline some of the inherent characteristics of data that might warrant a transformation, as well as the steps that can be taken to address these problems.

Given a data set that you would like to fit any Box-Jenkins model to, you should ask yourself the following two questions:

  1. Is normality a reasonable assumption for the residuals?
  2. Are the data stationary?

Normality

Normality can be checked before fitting the model: if the original data are not normal, there is a good chance that the residuals won’t be either. If a histogram of the data looks like Figure 1, you probably need to apply some form of normalizing transformation.


Figure 1: Histogram of Monthly Flow
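A quick way to pair the visual check with a statistical one is to plot a histogram and run a normality test such as Shapiro-Wilk. Below is a minimal Python sketch (the post’s own workflow used R); the flows array here is synthetic, stand-in data rather than the actual record used in the project.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# stand-in, right-skewed "flow" data; replace with your own series
rng = np.random.default_rng(42)
flows = rng.lognormal(mean=3.0, sigma=0.8, size=600)

plt.hist(flows, bins=30)
plt.xlabel('Monthly flow')
plt.ylabel('Count')
plt.show()

# Shapiro-Wilk test: a small p-value indicates the data are unlikely to be normal
stat, p = stats.shapiro(flows)
print(f'Shapiro-Wilk p-value: {p:.3g}')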

Two traditional transformations that you can try are a log transform and a Box-Cox transform, shown in the following two equations, where x_t is the original data point and y_t is the transformed value.

Log Transform:

y_t = \log(x_t)

Box-Cox Transform:

y_t = \frac{x_t^{\lambda} - 1}{\lambda}

Sometimes a log transformation can be too drastic and skew the data the opposite way. The Box-Cox transform is effectively a less intense transformation that one can try if the log transform is not suitable. Note that when λ=0, the Box-Cox transform reduces to a simple log transform.

The powerTransform function in the R package car can be used to find a value of λ that maximizes the normality of the transformed data.
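For readers working in Python instead of R, the same idea can be sketched with scipy, which chooses λ by maximizing the Box-Cox log-likelihood (the data must be strictly positive); this is an analogue of, not a substitute for, car::powerTransform.

from scipy import stats

# estimate lambda for the flows array from the sketch above
transformed, lam = stats.boxcox(flows)
print(f'Estimated lambda: {lam:.2f}')  # lambda near 0 suggests a simple log transform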

Stationarity

For a data set to exhibit stationarity, the following three principles must be true for us to be confident that our model will represent our data well:

For some lag term, s,

  1. E[x_t] = E[x_{t+s}] (the mean of the data set does not change with time)
  2. Var[x_t] = Var[x_{t+s}] (the variance of the data set does not change with time)
  3. Cov[x_t, x_{t+s}] = γ (the covariance between data points separated by lag s is some constant value, γ)

Outlined below are some of the characteristics of a data set that can cause a violation of one or more of these principles.

Seasonality

Seasonality in data can exist if a time series pattern repeats over a fixed and known period. Figure 2 shows monthly inflow into the Schoharie Creek Reservoir. Periodicity is apparent, but it isn’t until we look at the autocorrelation function (ACF) of the data, shown in Figure 3, that we see that there is a clear repetition occurring every 12 months.


Figure 2: Monthly Inflow for the Schoharie Creek


Figure 3: ACF of Monthly Inflow
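An ACF plot like Figure 3 can be produced in Python with statsmodels (again a sketch using the stand-in flows array; the original figures were made in R):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# spikes at lags 12, 24, 36 would indicate annual seasonality in monthly data
plot_acf(flows, lags=36)
plt.show()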

One effective way to get rid of this monthly seasonality is to use the following de-seasonalizing equation:

z_t = \frac{x_t - \bar{x}_{m_t}}{s_{m_t}}

The seasonality is removed from each data point by subtracting the corresponding monthly mean (\bar{x}_{m_t}) and dividing by that month’s standard deviation (s_{m_t}). This equation can also be used to account for daily or yearly seasonality.
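A pandas sketch of this monthly standardization, assuming the stand-in flows array is wrapped in a Series with a monthly DatetimeIndex (the dates are hypothetical):

import pandas as pd

flow_series = pd.Series(flows,
                        index=pd.date_range('1970-01', periods=len(flows), freq='MS'))

# z-score each observation against its own calendar month
monthly = flow_series.groupby(flow_series.index.month)
deseasonalized = (flow_series - monthly.transform('mean')) / monthly.transform('std')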

Differencing is another way to address seasonality in data. A seasonal difference is the difference between an observation and the corresponding observation from the previous year.

y_t = x_t - x_{t-m}

where m = 12 for monthly data, m = 4 for quarterly data, and so on [1].

Trend

A trend, shown in the first panel of Figure 4, is a clear violation of the first requirement for stationarity. There are a couple options that one can implement to deal with trends: differencing and model fitting.


Figure 4: De-trending process [1]

From the above figures, it is clear that differencing can be used to account for seasonality but can also be used to dampen a trend. A first difference is computed by subtracting the previous observation from the current one:

y_t = x_t - x_{t-1}

If the differenced data still show a trend when plotted, a second difference can be applied, where y_t denotes the first difference:

y'_t = y_t - y_{t-1} = x_t - 2x_{t-1} + x_{t-2}

It is important to note the distinction between seasonal and first differences. Seasonal differencing is the difference from one year to the next, while first differencing is the difference between one observation and the next. Seasonal and trend differencing can both be applied, but sometimes, if seasonal differencing is performed first, it will remove the need for further differencing [1].
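The distinction is easy to see with pandas’ diff method (using the hypothetical flow_series from the de-seasonalizing sketch above):

# seasonal difference: compare to the same month one year earlier
seasonal_diff = flow_series.diff(12)

# first difference: compare to the previous observation
first_diff = flow_series.diff(1)

# second difference: difference the first difference again
second_diff = flow_series.diff(1).diff(1)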

In Figure 4, note how a log transform, seasonal differencing, and a second difference are all needed to ultimately remove the trend.

 


Figure 5: Model Fitting with Ordinary Least Squares [2]

If a monotonic trend is observed, such as the one in Figure 5, model fitting can be performed. In this example, a linear model is fit to the trend by choosing coefficients that minimize the sum of squares. This model is then subtracted from the original data to give residuals. The goal is for the resulting residuals to be stationary. Note that a polynomial model can also be fit to the trend if appropriate [2].
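A linear de-trend of this kind can be sketched in Python with numpy’s least-squares polynomial fit (variable names are hypothetical):

import numpy as np

t = np.arange(len(flow_series))
# fit a line by ordinary least squares (coefficients returned highest order first)
slope, intercept = np.polyfit(t, flow_series.values, deg=1)
trend = intercept + slope * t

# residuals = data minus fitted trend; ideally these are close to stationary
residuals = flow_series.values - trend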

Heteroscedasticity

Heteroscedasticity describes the phenomenon in which the data do not exhibit a constant variance. This is a violation of the second principle. Heteroscedasticity tends to appear in financial time series (i.e., prices of stocks and bonds), which can be very volatile, but it appears less often in hydrological data [3]. I did not have to address heteroscedasticity in the electricity load data for my project, and some statisticians suggest that one doesn’t have to deal with it unless it is very severe, as weak heteroscedasticity tends to be taken care of by normalization and de-seasonalization.

One way to check for heteroscedasticity in a time series is with the McLeod-Li test for conditional heteroscedasticity. If heteroscedasticity is present, consider an ARCH or GARCH model (when an AR or ARMA model, respectively, can be fit to the data), or a hybrid ARCH-ARIMA model if those are not appropriate.
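Since the McLeod-Li test is essentially the Ljung-Box test applied to squared values, one way to approximate it in Python is with statsmodels (a sketch using the hypothetical residuals from the de-trending example above):

from statsmodels.stats.diagnostic import acorr_ljungbox

# significant autocorrelation in the squared residuals points to ARCH-type effects
print(acorr_ljungbox(residuals**2, lags=[12]))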

Choosing a Time Series Model

Once the necessary transformations have been performed, you are ready to fit a time series model to your data. R has some useful packages for this: forecast and stats. Some helpful functions in these packages include:

auto.arima (forecast) – This function tells you which model is the best fit for your data, the coefficients for the lag terms, and the variance of the errors (along with other useful information).

arima.sim (stats) – This function allows you to simulate a set of data from your time series model.

predict (stats) – This function will provide a prediction for n time steps into the future based on the chosen time series model. Keep in mind that it is best used to predict just the next few time steps.

Finally, remember that back-transformations must be performed on all simulations or predictions to get them back into the original space.
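For those working in Python rather than R, a rough analogue of this workflow (automatic model selection, prediction, and back-transformation) can be sketched with the pmdarima package; the package choice and settings are my assumptions, not part of the original project.

import numpy as np
import pmdarima as pm

y = np.log(flows)                              # normalizing transform
model = pm.auto_arima(y, seasonal=True, m=12)  # automatic order selection, like auto.arima
forecast_log = model.predict(n_periods=12)     # forecast the next 12 time steps
forecast = np.exp(forecast_log)                # back-transform to the original space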

*For a really helpful explanation of different time series notation, check this previous post.

References

*All information or figures not specifically cited came from class notes and homework from Dr. Scott Steinschneider’s class

[1] Stationarity and Differencing: https://www.otexts.org/fpp/8/1

[2] Removal of Trend and Seasonality, UC Berkeley: https://www.stat.berkeley.edu/~gido/Removal%20of%20Trend%20and%20Seasonality.pdf

[3] Heteroscedasticity: http://www.math.canterbury.ac.nz/~m.reale/econ324/Topic2.pdf

 

Directed search with the Exploratory Modeling workbench

This is the third blog in a series showcasing the functionality of the Exploratory Modeling workbench. Exploratory modeling entails investigating the way in which uncertainty and/or policy levers map to outcomes. To investigate these mappings, we can either use sampling-based strategies (open exploration) or optimization-based strategies (directed search). In the first blog, I gave a general overview of the workbench and showed briefly how both investigation strategies can be done. In the second blog, I demonstrated the use of the workbench for open exploration in substantially more detail. In this third blog, I will demonstrate in more detail how to use the workbench for directed search. As in the previous two blog posts, I will use the DPS version of the lake problem.

For optimization, the workbench relies on platypus. You can easily install the latest version of platypus from GitHub using pip:

pip install git+https://github.com/Project-Platypus/Platypus.git

By default, the workbench will use ε-NSGA-II, but all the other algorithms available within platypus can be used as well.

Within the workbench, optimization can be used in three ways:
* Search over decision levers for a reference scenario
* Robust search: search over decision levers for a set of scenarios
* Worst case discovery: search over uncertainties for a reference policy

The search over decision levers or over uncertainties relies on the specification of the direction for each outcome of interest defined on the model. It is only possible to use ScalarOutcome objects for optimization.
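For example, the four lake problem outcomes might be declared with explicit directions along the following lines (a sketch of what the outcome definitions could look like; in practice they are set when the model object is built):

from ema_workbench import ScalarOutcome

model.outcomes = [ScalarOutcome('max_P', kind=ScalarOutcome.MINIMIZE),
                  ScalarOutcome('reliability', kind=ScalarOutcome.MAXIMIZE),
                  ScalarOutcome('inertia', kind=ScalarOutcome.MAXIMIZE),
                  ScalarOutcome('utility', kind=ScalarOutcome.MAXIMIZE)]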

Search over levers

Directed search is most often used to search over the decision levers in order to find good candidate strategies. This is for example the first step in the Many Objective Robust Decision Making process. This is straightforward to do with the workbench using the optimize method.

from ema_workbench import MultiprocessingEvaluator, ema_logging

ema_logging.log_to_stderr(ema_logging.INFO)

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.optimize(nfe=10000, searchover='levers', 
                                 epsilons=[0.1,]*len(model.outcomes),
                                 population_size=50)

The result from optimize is a DataFrame with the decision variables and outcomes of interest. The latest version of the workbench comes with a pure Python implementation of the parallel coordinates plot, built on top of matplotlib and designed with the matplotlib and seaborn APIs in mind. We can use this to quickly visualize the optimization results.

import matplotlib.pyplot as plt
import seaborn as sns

from ema_workbench.analysis import parcoords

paraxes = parcoords.ParallelAxes(parcoords.get_limits(results), rot=0)
paraxes.plot(results, color=sns.color_palette()[0])
paraxes.invert_axis('max_P')
plt.show()

Note how we can flip an axis using the invert_axis method. This eases interpretation of the figure because the ideal solution in this case would be a straight line for the four outcomes of interest at the top of the figure.


Specifying constraints

In the previous example, we showed the most basic way of using the workbench to perform many-objective optimization. However, the workbench also offers support for constraints and for tracking convergence. Constraints are an attribute of the optimization problem, rather than an attribute of the model as in Rhodium. Thus, we can pass a list of constraints to the optimize method. A constraint can be applied to the model input parameters (both uncertainties and levers) and/or the outcomes. A constraint is essentially a function that should return the distance from the feasibility threshold; the distance should be 0 if the constraint is met.

As a quick demonstration, let’s add a constraint on the maximum pollution. This constraint applies to the max_P outcome. The example below specifies that the maximum pollution should be below 1.

from ema_workbench import MultiprocessingEvaluator, ema_logging, Constraint

ema_logging.log_to_stderr(ema_logging.INFO)

constraints = [Constraint("max pollution", outcome_names="max_P",
                          function=lambda x:max(0, x-1))]

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.optimize(nfe=1000, searchover='levers', 
                                 epsilons=[0.1,]*len(model.outcomes),
                                 population_size=25, constraints=constraints)

Tracking convergence

To track convergence, we need to specify which metric(s) we want to use and pass these to the optimize method. At present, the workbench comes with three options: hypervolume, epsilon progress, and a class that writes the archive at each iteration to a separate text file for later processing. If convergence metrics are specified, optimize will return both the results and the convergence information.

from ema_workbench import MultiprocessingEvaluator, ema_logging, Constraint
from ema_workbench.em_framework.optimization import (HyperVolume,
                                                     EpsilonProgress)

ema_logging.log_to_stderr(ema_logging.INFO)

# because of the constraint on pollution, we can specify the 
# maximum easily
convergence_metrics = [HyperVolume(minimum=[0,0,0,0], maximum=[1,1,1,1]),
                       EpsilonProgress()]
constraints = [Constraint("max pollution", outcome_names="max_P",
                          function=lambda x:max(0, x-1))]

with MultiprocessingEvaluator(model) as evaluator:
    results_ref1, convergence1 = evaluator.optimize(nfe=25000, searchover='levers', 
                                    epsilons=[0.05,]*len(model.outcomes),
                                    convergence=convergence_metrics,
                                    constraints=constraints,
                                    population_size=100)

We can visualize the results using parcoords as before, while the convergence information is in a DataFrame, making it easy to plot as well.

fig, (ax1, ax2) = plt.subplots(ncols=2, sharex=True)
ax1.plot(convergence1.epsilon_progress)
ax1.set_xlabel('nr. of generations')
ax1.set_ylabel(r'$\epsilon$ progress')
ax2.plot(convergence1.hypervolume)
ax2.set_ylabel('hypervolume')
sns.despine()
plt.show()


Changing the reference scenario

Up till now, we have performed the optimization for an unspecified reference scenario. Since the lake model function takes default values for each of the deeply uncertain factors, these values have been implicitly assumed. It is however possible to explicitly pass a reference scenario that should be used instead. In this way, it is easy to apply the extended MORDM approach suggested by Watson and Kasprzyk (2017).

To see the effects of changing the reference scenario on the values for the decision levers found through the optimization, as well as to ensure a fair comparison with the previous results, we use the same convergence metrics and constraints as in the previous optimization. Because the constraints are in essence only a function and don’t retain optimization-specific state, we can simply reuse them. The convergence metrics, in contrast, retain state, so we need to re-instantiate them.

from ema_workbench import Scenario

reference = Scenario('reference', **dict(b=.43, q=3,mean=0.02, 
                                         stdev=0.004, delta=.94))
convergence_metrics = [HyperVolume(minimum=[0,0,0,0], maximum=[1,1,1,1]),
                       EpsilonProgress()]

with MultiprocessingEvaluator(model) as evaluator:
    results_ref2, convergence2 = evaluator.optimize(nfe=25000, searchover='levers', 
                                  epsilons=[0.05,]*len(model.outcomes),
                                  convergence=convergence_metrics,
                                  constraints=constraints,
                                  population_size=100, reference=reference)

Comparing results for different reference scenarios

To demonstrate the parcoords plotting functionality in some more detail, let’s combine the results from the optimizations for the two different reference scenarios and visualize them in the same plot. To do this, we need to first figure out the limits across both optimizations. Moreover, to get a better sense of which part of the decision space is being used, let’s set the limits for the decision levers on the basis of their specified ranges instead of inferring the limits from the optimization results.

import pandas as pd

columns = [lever.name for lever in model.levers]
columns += [outcome.name for outcome in model.outcomes]
limits = {lever.name: (lever.lower_bound, lever.upper_bound) for lever in 
           model.levers}
limits = dict(**limits, **{outcome.name: (0, 1) for outcome in model.outcomes})
limits = pd.DataFrame.from_dict(limits)
# reorder the columns to match the order produced by the optimization
limits = limits[columns]

paraxes = parcoords.ParallelAxes(limits, rot=0)
paraxes.plot(results_ref1, color=sns.color_palette()[0], label='ref1')
paraxes.plot(results_ref2, color=sns.color_palette()[1], label='ref2')
paraxes.legend()
paraxes.invert_axis('max_P')
plt.show()


Robust Search

The workbench also comes with support for many-objective robust optimization. In this case, each candidate solution is evaluated over a set of scenarios, and the robustness of the performance over this set is calculated. This requires specifying two new pieces of information:
* the robustness metrics
* the scenarios over which to evaluate the candidate solutions

The robustness metrics are simply a collection of ScalarOutcome objects. For each one, we have to specify which model outcome(s) it uses, as well as the actual robustness function. For demonstrative purposes, let’s use robustness functions based on descriptive statistics: we want to maximize the 10th percentile of performance for reliability, inertia, and utility, while minimizing the 90th percentile of performance for max_P.

We can specify our scenarios in various ways. The simplest would be to pass the number of scenarios to the robust_optimize method. In this case for each generation a new set of scenarios is used. This can create noise and instability in the optimization. A better option is to explicitly generate the scenarios first, and pass these to the method. In this way, the same set of scenarios is used for each generation.

If we want to specify a constraint, this can easily be done. Note, however, that in the case of robust optimization, the constraints apply to the robustness metrics instead of the model outcomes. They can of course still apply to the decision variables as well.

import functools

import numpy as np

from ema_workbench import (Constraint, MultiprocessingEvaluator,
                           ScalarOutcome, ema_logging)
from ema_workbench.em_framework.optimization import (HyperVolume,
                                                     EpsilonProgress)
from ema_workbench.em_framework.samplers import sample_uncertainties

ema_logging.log_to_stderr(ema_logging.INFO)

percentile10 = functools.partial(np.percentile, q=10)
percentile90 = functools.partial(np.percentile, q=90)

MAXIMIZE = ScalarOutcome.MAXIMIZE
MINIMIZE = ScalarOutcome.MINIMIZE
robustness_functions = [ScalarOutcome('90th percentile max_p', kind=MINIMIZE, 
                              variable_name='max_P', function=percentile90),
                        ScalarOutcome('10th percentile reliability', kind=MAXIMIZE, 
                              variable_name='reliability', function=percentile10),
                        ScalarOutcome('10th percentile inertia', kind=MAXIMIZE, 
                              variable_name='inertia', function=percentile10),
                        ScalarOutcome('10th percentile utility', kind=MAXIMIZE, 
                              variable_name='utility', function=percentile10)]

def constraint(x):
    return max(0, percentile90(x)-10)

constraints = [Constraint("max pollution", 
                          outcome_names=['90th percentile max_p'],
                          function=constraint)]
convergence_metrics = [HyperVolume(minimum=[0,0,0,0], maximum=[10,1,1,1]),
                       EpsilonProgress()]
n_scenarios = 10
scenarios = sample_uncertainties(model, n_scenarios)

nfe = 10000

with MultiprocessingEvaluator(model) as evaluator:
    robust_results, convergence = evaluator.robust_optimize(robustness_functions, 
                            scenarios, nfe=nfe, constraints=constraints,
                            epsilons=[0.05,]*len(robustness_functions),
                            convergence=convergence_metrics,)

fig, (ax1, ax2) = plt.subplots(ncols=2)
ax1.plot(convergence.epsilon_progress.values)
ax1.set_xlabel('nr. of generations')
ax1.set_ylabel(r'$\epsilon$ progress')
ax2.plot(convergence.hypervolume)
ax2.set_ylabel('hypervolume')
sns.despine()
plt.show()


paraxes = parcoords.ParallelAxes(parcoords.get_limits(robust_results), rot=45)
paraxes.plot(robust_results)
paraxes.invert_axis('90th percentile max_p')
plt.show()


Search over uncertainties: worst case discovery

Up till now, we have focused on optimizing over the decision levers. The workbench, however, can also be used for worst case discovery (Halim et al., 2016). In essence, the only change is to specify that we want to search over uncertainties instead of over levers. Constraints and convergence work just as in the previous examples.

If we reuse the foregoing code, however, we should change the direction of optimization for the outcomes: we are no longer interested in finding the best possible outcomes, but instead want to find the worst possible ones.

# change outcomes so direction is undesirable
minimize = ScalarOutcome.MINIMIZE
maximize = ScalarOutcome.MAXIMIZE

for outcome in model.outcomes:
    if outcome.kind == minimize:
        outcome.kind = maximize
    else:
        outcome.kind = minimize

We can reuse the reference keyword argument to perform worst case discovery for one of the policies found before. Below, we select solution number 9 from the Pareto-approximate set, turn it into a dict, and instantiate a Policy object.

from ema_workbench import Policy

policy = Policy('9', **{k:v for k, v in results_ref1.loc[9].items()
                        if k in model.levers})

with MultiprocessingEvaluator(model) as evaluator:
    results = evaluator.optimize(nfe=1000, searchover='uncertainties', 
                                 epsilons=[0.1,]*len(model.outcomes),
                                 reference=policy)

Visualizing the results is straightforward using parcoords.

paraxes = parcoords.ParallelAxes(parcoords.get_limits(results), rot=0)
paraxes.plot(results)
paraxes.invert_axis('max_P')
plt.show()


Closing remarks

This blog showcased the functionality of the workbench for applying search based approaches to exploratory modelling. We specifically looked at the use of many-objective optimization for searching over the levers or uncertainties, as well as the use of many-objective robust optimization. This completes the overview of the functionality available in the workbench. In the next blog, I will put it all together to show how the workbench can be used to perform Many Objective Robust Decision Making.

Nondimensionalization of differential equations – an example using the Lotka-Volterra system of equations

I decided to write about nondimensionalization today since it’s something I only came across recently and found very exciting. It’s apparently a trivial process for a lot of people, but it wasn’t something I was taught during my education, so I thought I’d put a short guide out on the interwebs for other people. Nondimensionalization (also referred to as rescaling) refers to the process of transforming an equation to a dimensionless form by rescaling its variables. There are a couple of benefits to this:

  • All variables and parameters in the new system are unitless, i.e., scales and units are not an issue in the new system, and the dynamics of systems operating on different time and/or spatial scales can be compared;
  • The number of model parameters is reduced to a smaller set of fundamental parameters that govern the dynamics of the system, which also means
  • The new model is simpler and easier to analyze, and
  • The computational time becomes shorter.

I will now present an example using a predator-prey system of equations. In my last blog post, I used the Lotka-Volterra system of equations for describing predator-prey interactions. Towards the end of that post I talked about the logistic Lotka-Volterra system, which has the following form:

\frac{dx}{dt} = bx\left(1 - \frac{x}{K}\right) - axy

\frac{dy}{dt} = caxy - dy

where x is prey abundance, y is predator abundance, b is the prey growth rate, d is the predator death rate, c is the rate with which consumed prey is converted to predator abundance, a is the rate with which prey is killed by a predator per unit of time, and K is the carrying capacity of the prey given its environmental conditions.

The first step is to define the original model variables as products of new dimensionless variables (e.g., x*) and scaling parameters (e.g., X) that carry the same units as the original variables:

x = Xx^*, \quad y = Yy^*, \quad t = Tt^*

The rescaled variables are then substituted into the original model:

\frac{X}{T}\frac{dx^*}{dt^*} = bXx^*\left(1 - \frac{Xx^*}{K}\right) - aXYx^*y^*

\frac{Y}{T}\frac{dy^*}{dt^*} = caXYx^*y^* - dYy^*

Carrying out all cancellations and obvious simplifications:

\frac{dx^*}{dt^*} = Tbx^*\left(1 - \frac{Xx^*}{K}\right) - TaYx^*y^*

\frac{dy^*}{dt^*} = TcaXx^*y^* - Tdy^*

Our task now is to define the rescaling parameters X, Y, and T so as to simplify our model; remember that they have to have the same units as our original variables.

 

Variable/parameter    Unit
x                     mass [1]
y                     mass [1]
t                     time
b                     1/time
d                     1/time
a                     1/(mass∙time) [2]
c                     mass/mass [3]
K                     mass

There’s no single correct way of going about this, but using the units for guidance and choosing wisely, we can simplify the structure of our model. For example, setting X = K will eliminate K from the prey equation (notice that this way X has the same units as our original x variable):

\frac{dx^*}{dt^*} = Tbx^*(1 - x^*) - TaYx^*y^*

\frac{dy^*}{dt^*} = TcaKx^*y^* - Tdy^*

The choice of Y is not very obvious, so let’s look at T first. We could go with either T = 1/b or T = 1/d. Unit-wise they both work, but one would serve to eliminate a parameter from the first equation and the other from the second. The decision here depends on what dynamics we’re most interested in, so for the purposes of demonstration, let’s go with T = 1/b:

\frac{dx^*}{dt^*} = x^*(1 - x^*) - \frac{aY}{b}x^*y^*

\frac{dy^*}{dt^*} = \frac{caK}{b}x^*y^* - \frac{d}{b}y^*

We’re now left with defining Y, which only appears in the second term of the first equation. Looking at that term, the obvious substitution is Y = b/a, resulting in this set of equations:

\frac{dx^*}{dt^*} = x^*(1 - x^*) - x^*y^*

\frac{dy^*}{dt^*} = \frac{caK}{b}x^*y^* - \frac{d}{b}y^*

Our system of equations is still not dimensionless, as we still have the model parameters to worry about. We can now define aggregate parameters using the original parameters in such a way that they will not carry any units and they will further simplify our model.

By setting p_1 = caK/b and p_2 = d/b, we can transform our system to:

\frac{dx^*}{dt^*} = x^*(1 - x^*) - x^*y^*

\frac{dy^*}{dt^*} = p_1 x^*y^* - p_2 y^*

a system of equations with no units and just two parameters.
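To see how compact the rescaled model is, here is a small sketch that integrates the dimensionless system with scipy for illustrative values of p_1 and p_2 (the parameter values and initial conditions are assumptions, not taken from the original analysis):

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

p1, p2 = 1.5, 0.5  # illustrative dimensionless parameters

def rhs(t, state):
    # dimensionless predator-prey system:
    #   dx*/dt* = x*(1 - x*) - x*y*
    #   dy*/dt* = p1*x*y* - p2*y*
    x, y = state
    return [x * (1 - x) - x * y, p1 * x * y - p2 * y]

sol = solve_ivp(rhs, t_span=(0, 100), y0=[0.5, 0.2], dense_output=True)
tstar = np.linspace(0, 100, 1000)
x, y = sol.sol(tstar)

plt.plot(tstar, x, label='prey (x*)')
plt.plot(tstar, y, label='predator (y*)')
plt.xlabel('dimensionless time t*')
plt.legend()
plt.show()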

 

[1] Prey and predator abundance don’t necessarily have to be measured in mass units; it could be volume, density, or something else. The units for parameters a, c, and K would change equivalently, and the rescaling still holds.

[2] This is the death rate per encounter with a predator per time t.

[3] This is the converted predator (mass) per prey (mass) consumed.

3 This is the converted predator (mass) per prey (mass) consumed.