Power Spectral Density Estimation in Matlab

Power Spectral Density Estimation in Matlab

Hi all,

This post is about power spectral density estimation (PSDE) using Matlab.  PSDE is often used in signal processing and fluid mechanics, so you may or may not be familiar with it.  The idea is to determine the frequency content of sampled time series data.  This means that we can determine the amount of the variance in the time series that is contained at different frequencies (or periodicities).  Let’s take a look at a simple example.

Suppose we are sampling a signal (in this case a sign wave) for about 3 minutes with a frequency of 100 hertz, meaning we take 100 measurements every second.  Let’s generate our signal and plot our observations using Matlab.

x = (0:0.01:60*pi);
y = sin(x);

figure(1)
plot(x,sin(x))
title('Signal with no Noise')
xlabel('Seconds')
ylabel('Observed Signal')
ylim([-1.5,1.5])
untitled

Sine wave, with no noise, sampled at 100 Hertz

From observing the data directly it’s pretty clear what the frequency and period of the data are here (1/(2*pi) and pi respectively): no need for PSDE.  However, real data is rarely so well behaved.  To simulate this, let’s add random noise to our original signal.  To make it interesting, let’s add a lot of noise: normal distributed noise with standard deviation 10 times larger than the magnitude of the signal!

for i = 1:max(size(x))
 y(i) = y(i)+10*norminv(rand);
end

figure(2)
plot(x,y)
title('Signal with Noise')
xlabel('Seconds')
ylabel('Observed Signal')
Sine wave, with random normal noise, sampled at 100 Hertz

Sine wave, with random normal noise, sampled at 100 Hertz

Now the noise completely washes out the signal, and it is not at all clear what the period or frequency of the underlying data are.  So let’s compute and plot the power spectrum of the data.  I leave it to the reader to review all of the math underlying computing the spectrum, but I’ll provide a very brief description of what we are doing here.  PSDE involves taking the discrete Fourier transform of the data.  This transforms our data from the temporal or spatial domain to the frequency domain.  The power spectral density (PSD) function is simply the magnitude of the squared Fourier transformed data (often scaled). It tells us how much of the variance in our signal is explained in each discrete frequency band.  A spike in the PSD function tells us that there’s a lot of variance explained in that frequency band.  The frequency resolution of our analysis (width of the frequency bands) is dictated by the length of our recorded observations.  Interestingly, if we integrate the PSD function over the frequency domain, we obtain the sample variance of our data.  Now, let’s look at the code.

fs = (0.01)^-1;
N =size(y,2);

Y = fft(y);
f=fs*linspace(0,1,N);
G = abs(Y.^2)*(1/(fs*N));

figure(3)
loglog(f(2:floor(N/2)),G(2:floor(N/2)))
title('Amplitude of Spectrum of y(t)')
xlabel('Frequency (Hz)')
ylabel('S(f)')

The first two lines are defining the sampling frequency (fs), the number of observations (N).  Next we use Matlab’s fft command to take the discrete Fourier transformation of the data.  We then compute the frequency bins we are considering, then we compute and plot the PSD function.  Let’s take a look at the resultamplitude_of_spectrum

 

There is a lot of noise, but there is a distinct peak at 0.1592, which is 1/(2*pi), indicates the frequency of the underlying signal.  This is neat, even though the signal contains a huge amount of noise, we’ve distinguished the correct frequency (and inversely the period) of the signal!

So how might we as systems analysts use this?

In my PhD thesis we used PSDE to determine the critical time-dynamics of a hydropower system’s operation.  This turned out to be a critical diagnostic tool.  In the hydropower system I considered, both inflows and energy prices are stochastic, so it can be difficult to determine if  the derived ‘optimal’ policy makes sense by only observing simulation results: the release in any one time step (here 6-hrs) can appear noisy and erratic in response to price or hydrologic shocks.  The power of PSDE is the ability to cut through the noise and reveal the underlying signal.

As an example, I’ve derived an implicit ‘optimal’ policy using sampling stochastic dynamic programming.  To evaluate the quality of the derived policy, I’ve simulated the system operation with this policy over the range of historical inflow series.  For simplicity, we’ll assume that the energy price profile (ranging from high on-peak to low off-peak prices) is identical on every day.  Now, let’s look at the PSD function for the reservoir release.

error_hydro

We can immediately pick out elements of the PSD function that make sense.  First, there is a lot of variability at very low frequencies, corresponding to periodicity on the order of months or seasons.  This is likely due to seasonal hydrologic variability between summer, spring, and winter inflows: releases are higher in the spring and early summer because more water is available, and generally lower in the winter and late summer when inflows to the reservoir are lower.  Similarly the peak at frequency 1/day makes sense, because we still have a daily price cycle that the system is responding to, put we are harder pressed to explain the other peaks between frequencies 0.1 and 1 (roughly corresponding to weekly and within week periodicities).  We have fixed the price profile, but hydrology shouldn’t be cycling during the week, so what is going on?

As it turns out, I had a bug in my algorithm (bad ending conditions for the SSDP) which was causing the ‘optimal’ policy to behave a bit myopically on a weekly basis.  This error was propagating through the model, but was not easy to see by simply observing the simulated results (see chapter 6 of the thesis).  Correcting the problem, I got the following spectrum.  The unexplainable spikes have gone away, and we now see that daily price cycling and seasonal hydrology explain the majority of the systems variability (as we expect).

correct_hydro

For our Cornell-based readers our new CEE faculty member Gregory McLaskey is offering a new class on time series analysis (including PSDE) which is highly recommended.  PSDE is also used extensively (with real data) in Todd Cowen‘s experimental fluid mechanics class which also recommend because you learn a lot and get to play with lasers which, let’s be honest, is why we all became engineers.

Speed up your Python code with basic shared memory parallelization

Part 1 – The simplest case

In many situations we have for-loops in our Python codes in which the iteration i+1 does not depend on iteration i. In these situations, chances are your code can be easily modified for smaller run time by making it use multiple cores of your computer to calculate different iterations at the same time. What will be presented here works great for codes to be run on a desktops or on individual nodes in a cluster (different nodes do not share a memory).

Suppose you ran 3,000 simulations with a given model and each simulation returns a csv file with 24 columns and 10,000 values per column. Suppose that for each file you want to get the summation of all values in each column. One approach would be to use a code like the following one:

from glob import glob
from numpy import genfromtxt

list_of_files = glob("*.csv")

results = []
for f in list_of_files:
    data = genfromtxt(f, delimiter=' ')
    data_transposed = [list(x) for x in zip(*data)] # transposed data
    totals = [sum(line) for line in data_transposed] # sums all lines (former columns, vector of length 24)
    results.append(totals)

print results

This code works great but it takes 11 minutes to run on a Intel® Xeon(R) CPU E5-1620 v2 @ 3.70GHz × 8 (8 cores, 16 with hyper-threading). A way of making this code run faster is to parallelize the for-loop with the Pool class from the multiprocessing Python library.

A Pool object creates a pool of workers (threads) to handle different parts (in our case, files) of the for-loop. A Pool object uses the map(function, input_list) method to distribute the work among all workers and, when all workers are done, it merges the results in a python list in the right order, as if you had ran the for-loop above.

Here is the basic structure of a code that was modified to use the Pool class:

  1. The first step is to create a function based on the code inside the for-loop to be called by the map method.
  2. A pool object is then created with the desired number of threads (between 1 and 2 times the number of cores)
  3. The pool object distributes the work among the workers by making each one of them call the function with a different part of the input list and puts the returned values together in a single list.

The parallelized code would then look like this:

from glob import glob
from numpy import genfromtxt
from multiprocessing import Pool

def process_file(f): 
    data = genfromtxt(f, delimiter=' ') 
    data_transposed = [list(x) for x in zip(*data)] # transposed 
    data totals = [sum(line) for line in data_transposed] # sums all lines (former columns, vector of length 24) 
    return totals

list_of_files = glob("*.rdm")

p = Pool(16) # create a pool of 16 workers

results = p.map(process_file, list_of_files) # perform the calculations

print results 

The code above too 3 minutes to run, as opposed to the 11 minutes taken by its serial version.

Part 2 – More complex cases: parallelization with multiple lists and parameters

Suppose now you want to calculate the weighted average of each column and each row has a different weight. You would need to pass two different arguments to the process_file function, namely the file name and the list of weights. This can be done by using the partial(function, parameter1, parameter2, parameter 3, …) function of the functools library. The partial function combines a user defined function with arguments that will be passed to this function every time it is called. Here is how the code would look like:

from glob import glob
from numpy import genfromtxt
from multiprocessing import Pool
from functools import partial
import random

def process_file(weights, f):
    print f
    data = genfromtxt(f, delimiter=' ')
    n = len(data)
    data_transposed = [list(x) for x in zip(*data)] # transposed data
    averages = [sum(l*w for l, w in zip(line, weights)) / n for line in data_transposed] # sums all lines (former columns, vector of length 24)
    return averages

list_of_files = glob("*.rdm")

p = Pool(16) # creates the pool of workers

weight_factors = [random.random() for _ in range(0, 10000)] # creates random list of weights between 0 and 1

partial_process_file = partial(process_file, weight_factors) # adds weight_factors as an argument to be always passed to the process_file
results = p.map(partial_process_file, list_of_files) # perform the calculations

print results

Keep in mind that the function input parameter related to the list your code will iterate over (f in the process_file function above) must be the last parameter in the function definition.

You may also want to iterate over two lists as opposed to just one, which would be normally done with a code of the type:

for i, j in zip(list1, list2):
    print i + j

In this case, you can zip both lists together and pass them as one combined list to your function. As an example, suppose you want to add a certain value to all values in each file before calculating the columns’ weighted averages in the files processing example. Here is how the previous code could be modified to take care of this task:

from glob import glob
from numpy import genfromtxt
from multiprocessing import Pool
from functools import partial
import random
 
def process_file(weights, file_and_adding_factor_tuple):
    f, adding_factor = file_and_adding_factor_tuple
    data = genfromtxt(f, delimiter=' ') 
    n = len(data) 
    data_transposed = [list(x) for x in zip(*data)] # transposed data
    averages = [sum((l + adding_factor) * w for l, w in zip(line, weights)) / n for line in data_transposed] # sums all lines (former columns, vector of length 24) 
    return averages 
 
list_of_files = glob("*.rdm") 

p = Pool(16) # creates the pool of workers
  
weight_factors = [random.random() for _ in range(0, 10000)] # creates random list of weights between 0 and 1
adding_factors = [random.random() for _ in range(0, len(list_of_files))] # creates random list of numbers between 0 and 1
 
partial_process_file = partial(process_file, weight_factors) # adds weight_factors as an argument to be always passed to the process_file results = p.map(partial_process_file, list_of_files) # perform the calculations print results 
results = p.map(partial_process_file, zip(list_of_files, adding_factors)) # perform the calculations
 
print results

There are other ways of doing code parallelization in Python as well as other reasons why you would do so. Some of those will be addressed in future posts.

MOEA diagnostics for a simple test case (Part 2/3)

In part 1 of this tutorial we evaluated the DTLZ2 3 objective test problem with 100 parameter samples of NSGAII and Borg replicated for 15 random seeds.  We also generated the best known reference set across all parameter samples for these two algorithms which we named:  DTLZ2_3_combined.pf.    The following metrics will be calculated relative to this reference set.

1) Metric Calculation

Option 1. Calculate average metrics

Before you type the following command, remember to place a #  at the end of your DTLZ2_3_combined.pf file

$ for i in {1..15}; do java -cp MOEAFramework-2.0-Executable.jar org.moeaframework.analysis.sensitivity.ResultFileEvaluator --b DTLZ2_3 --i NSGAII_DTLZ2_3_$i.obj --r DTLZ2_3_combined.pf --o NSGAII_DTLZ2_3_$i.metrics; done

Once you run the previous command, replace NSGAII  with Borg  in the –i flag.  In your .metrics files you will see 6 columns, one column per metric.  For this tutorial we will only be interested in the 1st, 2nd and 5th columns, corresponding to hypervolume, generational distance and epsilon-indicator.

The following command averages the metrics across the 15 random seeds:

$ java -cp MOEAFramework-2.0-Executable.jar org.moeaframework.analysis.sensitivity.SimpleStatistics --mode average --output NSGAII_DTLZ2_3.average NSGAII_DTLZ2_3_*.metrics

Again, repeat the previous command by pressing the up arrow and replace NSGAII with Borg to obtain Borg’s average metrics.

Option 2.  Calculate reference set metrics for each parameterization.

In part 1 of this tutorial we generated a reference set for each parameter sample, in this step we will calculate the metrics of each of these sets relative to our reference set.

</pre>
 java -cp MOEAFramework-2.0-Executable.jar org.moeaframework.analysis.sensitivity.ResultFileEvaluator --b DTLZ2_3 --i NSGAII_DTLZ2_3_LHS.reference --r DTLZ2_3_combined.pf --o NSGAII_DTLZ2_3.metrics

<pre>

2) Calculating reference set hypervolume

Copy the following script into a text editor and save it in your working directory as HypervolumeEval.java


/*Written by Dave Hadka */
import java.io.File;
import java.io.IOException;

import org.moeaframework.core.NondominatedPopulation;
import org.moeaframework.core.PopulationIO;
import org.moeaframework.core.indicator.Hypervolume;
import org.moeaframework.analysis.sensitivity.*;

public class HypervolumeEval {

public static void main(String[] args) throws IOException {
NondominatedPopulation refSet = new NondominatedPopulation(
PopulationIO.readObjectives(new File(args[0])));

System.out.println(new Hypervolume(
new ProblemStub(refSet.get(0).getNumberOfObjectives()), refSet)
.evaluate(refSet));
}

}


Compile it using the following command:

$ javac -cp MOEAFramework-2.0-Executable.jar HypervolumeEval.java

Once you have compiled the HypervolumeEval, you should have a new file on your directory called HypervolumeEval.class.  Now you can compute the reference set hypervolume with the following command:

$ java -cp MOEAFramework-2.0-Executable.jar HypervolumeEval DTLZ2_3_combined.pf

It can take a while to compute hypervolume, you can either stare at your Cygwin terminal until it outputs a value or you can do >  whatever.txt at the end of the previous command and the value will be printed in a separate file.  Either way,  keep track of this value because we will use it later.

Great, we finished calculating metrics.  Now we can generate our control maps, attainment plots and calculate our reference set contribution in part 3 of this tutorial.

Go to → MOEA diagnostics for a simple test case (Part 3/3)

MOEA diagnostics for a simple test case (Part 1/3)


In this tutorial you will learn how to do  the following:

👓 Control maps to evaluate the MOEA’s sensitivity to different combinations of crossover, mating and mutation parameters, and to test how quickly  MOEAs can reach an acceptable approximation to the Pareto front.

👓 Attainment plots to evaluate the reliability of each MOEA (how likely can MOEAs attain high levels of performance) and how effectively they can approximate the best known Pareto front.

👓 Reference set contribution to test the percent contribution of each MOEA when calculating the approximation to the best known Pareto front.

We will use three metrics: generational distance, epsilon-indicator and hypervolume to measure proximity, consitency and diversity, respectively (read more).

Prerequisites:

✓ Cygwin terminal

✓  Complete the basic Borg tutorial part 1  and part 2

✓ The MOEAFramework all in one executable available at moeaframework.org

Java

Objective:

Perform a diagnostic study with an MOEAFramework algorithm (for this tutorial I selected NSGAII) and with the Borg MOEA  using the 3 objective DTLZ2 problem.


1) Parameter description files

Please refer to the parameter description files for a full explanation.  You can copy the corresponding parameters for NSGAII and Borg into a text editor and save it as  NSGAII_Params.txt and Borg_Params.txt.

2) Generating Latin Hypercube Samples

You will need a copy of the Borg_Params.txt, NSGAII_Params.txt and the MOEAFramework executable in your working directory.   From your terminal, navigate your working directory, in my case its something like this:


jz583@jz583-pc ~
$ cd C:

jz583@jz583-pc /cygdrive/c
$ cd Users/jz583/Desktop

jz583@jz583-pc /cygdrive/c/Users/jz583/Desktop
$ mkdir DTLZ2_3_example

jz583@jz583-pc /cygdrive/c/Users/jz583/Desktop
$ cd DTLZ2_3_example/

jz583@jz583-pc /cygdrive/c/Users/jz583/Desktop/DTLZ2_3_example
$ ls
MOEAD_Params.txt MOEAFramework-2.0-Executable.jar NSGAII_Params.txt


In lines 2 and 5, I access my Desktop.  In line 8, I create a directory called DTLZ2_3_example, where I will store all the files generated in this example.  After I copy the required files into the DTLZ2_3_example folder, I use the cd or “change directory” command in line 11 to navigate the folder.  I finally use the  ls command in line 14 to list the folder’s contents and verify that I have the correct files in it, and I do ;).

Now you can type the following command:


java -cp MOEAFramework-2.0-Executable.jar org.moeaframework.analysis.sensitivity.SampleGenerator --method latin --numberOfSamples 100 --parameterFile NSGAII_Params.txt --output NSGAII_Latin

The previous command generates 100 latin hypercube samples (LHS) for the NSGAII algorithm.  After you run the command, simply press the ↑ key and  replace NSGAII with Borg in the –parameterFile flag  and in the –output flag.

Afterwards, there should be two new files in your directory, one named NSGAII_Latin and the other one named Borg_Latin.  Open the NSGAII_Latin file and verify that it has 100 rows, one row per parameter combination, and 6 columns, one column per parameter. Likewise, open the Borg_Latin file,  it should have 100 rows and 21 columns.

3) Running each parameter sample for multiple random seeds

The main difference between evaluating MOEAFramework algorithms and evaluating Borg, lies in this step.  The rest of the analysis is supported by the MOEAFramework and the commands will be the same for Borg and for NSGAII.

For NSGAII

The following command will evaluate the DTLZ2_3 problem for 100 parameterizations replicated for 15 seeds each using NSGAII.  A numerical precision of 0.01 is specified for each objective.   Since NSGAII is a point-dominance algorithm, we must specify an epsilon precision for consistency when comparing with Borg which is an epsilon-dominance algorithm.


for i in {1..15}; do java -cp MOEAFramework-2.0-Executable.jar org.moeaframework.analysis.sensitivity.Evaluator --parameterFile NSGAII_Params.txt --input NSGAII_Latin --seed $i --problem DTLZ2_3 --algorithm NSGAII --epsilon 0.01,0.01,0.01 --output NSGAII_DTLZ2_3_$i.set; done

For Borg

Before running the following code, you may want to complete the borg basics tutorial.



for i in {1..15}; do ./borg.exe -v 12 -o 3 -l 0,0,0,0,0,0,0,0,0,0,0,0 -u 1,1,1,1,1,1,1,1,1,1,1,1 -e 0.01,0.01,0.01 -p Borg_params.txt -i Borg_Latin -s $i -f Borg_DTLZ2_3_$i.set python dtlz2.py; done

The -v flag indicates the number of variables, -o number of objectives, the -l and -u flags are the lower and upper bounds for the decision variables, respectively, -e is the epsilon precision, -p parameter description file, -i latin hypercube samples, -s indicates the random seed number and -f specifies the name of the output file.

4) Selecting objectives

The .set files generated in step 3 contain 15 columns.  The first 12 columns are the decision variables, and the last 3 columns are the objectives, we are interested on the latter.  To select the objectives, run the following command:


for i in {1..15}; do awk 'Begin{FS=" "};/^#/{print$0};/[-]?[0-9]/{printf("%s %s %s\n", $13,$14,$15)}' NSGAII_DTLZ2_3_$i.set > NSGAII_DTLZ2_3_$i.obj; done

To select the objectives for Borg, replace all the instances of NSGAII in the previous command with Borg.

5) Generating a reference set for each parameterization across all random seeds

The following command merges the results across the 15 random seeds for each parameterization:


java -cp MOEAFramework-2.0-Executable.jar org.moeaframework.analysis.sensitivity.ResultFileSeedMerger -d 3 -e 0.01,0.01,0.01 -o Borg_DTLZ2_3_LHS.reference Borg*.obj

Once you run the previous command, press the ↑ key and replace all instances of Borg with NSGAII.   The files generated should have 100 reference sets, one per parameterization, separated by a #.  Keep track of these files, we will use them later in part 2 of this tutorial.

6) Generate reference set for each MOEA (across all parameters and seeds)

The following command generates a reference set for Borg across its 100 LHS parameterizations and 15 seeds.


java -cp MOEAFramework-2.0-Executable.jar org.moeaframework.analysis.sensitivity.ResultFileMerger -d 3 -e 0.01,0.01,0.01 -o Borg_DTLZ2_3.reference Borg*.obj

Run the same command for NSGAII.

Once I generated Borg_DTLZ2_3.reference and NSGAII_DTLZ2_3.reference files, I inserted a # at the end of each file.  The reason is that I combine these two sets in the following step using  the MOEAFramework’s result file merger, which uses the # as an indicator that the file has reached to an end.

7) Generate a combined reference set across all MOEAs tested

The DTLZ2_3 problem has a well-known analytical solution but for learning purposes we will pretend that we don’t know this solution.  In real world applications, an analytical solution is hardly known; therefore, the best approximation to the Pareto front is generated across all algorithms, all seeds and all parameterizations.   In this case, we will generate a combined set merging NSGAII and Borg.  Note that in the following command I used the Borg_DTLZ2_3.reference and NSGAII_DTLZ2_3.reference because these reference sets are already merged across their seeds and parameters.  What we are doing now is merging across algorithms:


java -cp MOEAFramework-2.0-Executable.jar org.moeaframework.analysis.sensitivity.ResultFileMerger --problem DTLZ2_3 --output DTLZ2_3_combined.pf Borg_DTLZ2_3.reference NSGAII_DTLZ2_3.reference

At this point, we generated the DTLZ2_3_combined.pf, which represents our best known Pareto front approximation, at least for this exercise. Now we are ready to calculate metrics!!

Synthetic Weather Generation: Part II

Non-Parametric Generators

This is the second part of a blog series on synthetic weather generators. In Part I, I described common parametric methods of weather generation. Here I am going to discuss a subset of non-parametric methods. There are several techniques for non-parametric weather generation (Apipattanavis et al., 2007): simulating from empirical distributions of wet and dry spells and precipitation amounts (Semenov and Porter, 1995); using simulated annealing for generating precipitation (Bárdossy, 1998) and neural networks for generating temperature (Trigo and Palutikof, 1999); fitting multivariate PDFs to several weather variables using Kernel-based methods (Rajagopalan et al, 1997); and resampling historical observations (Young, 1994). For brevity, all of the methods I will discuss employ the nearest neighbor, or k-nn, bootstrap resampling technique introduced by Lall and Sharma (1996). Many of you are probably familiar with this method from streamflow generation, as our group has been using it to disaggregate monthly and annual flows to daily values using an adaptation of Lall and Sharma’s method described by Nowak, et al. (2010).

The original purpose of the nearest-neighbor bootstrap was to reproduce the autocorrelation in simulated time series without assuming a parametric generating process (Lall and Sharma, 1996). Traditionally, auto-correlated time series are modeled using autoregressive (AR) or autoregressive moving average (ARMA) models, as described in Part I for simulating non-precipitation variables in parametric weather generators. Recall that in the AR model, the values of the non-precipitation variables at any given time step are linear combinations of the values of the non-precipitation variables at the previous time step(s), plus some Gaussian-distributed random error. Hence, ARMA models assume a linear generating process with independent, normally distributed errors. Non-parametric methods do not require that these assumptions be met.

The most common non-parametric method is the bootstrap, in which each simulated value is generated by independently sampling (with replacement) from the historical record (Efron, 1979; Efron and Tibshirani, 1993). This is commonly used to generate confidence intervals. If simulating time series, though, this method will not reproduce the auto-correlation of the observations. For this reason, the moving-blocks bootstrap has often been used, where “blocks” of consecutive values are sampled jointly with replacement (Kunsch, 1989; Vogel and Shallcross, 1996). This will preserve the correlation structure within each block, but not between blocks. The k-nn bootstrap, however, can reproduce the auto-correlation in the simulated time series without discontinuities between blocks (Lall and Sharma, 1996).

In the simplest case of the nearest neighbor bootstrap, the historical record is divided into overlapping blocks of length d, where d represents the dependency structure of the time series. So if d=2, then each value in the time series depends on the preceding two values. To simulate the next value in the time series, the most recent block of length d is compared to all historical blocks of length d to find the k “nearest” neighbors (hence k-nn). How “near” two blocks are to one another is determined using some measure of distance, such as Euclidean distance. The k nearest neighbors are then sorted and assigned a rank j from 1 to k, where 1 is the closest and k the furthest. The next simulated value will be that which succeeds one of the k nearest neighbors, where the probability of selecting the successor of each neighbor is inversely proportional to the neighbor’s rank, j.

P(j) = (1/j) / Σi=1,…,k (1/i)

To simulate the second value, the most recent block of length d will now include the first simulated value, as well as the last d-1 historical values, and the process repeats.

Lall and Sharma recommend that d and k be chosen using the Generalized Cross Validation score (GCV) (Craven and Wahba, 1979). However, they also note that choosing k = √n, where n is the length of the historical record, works well for 1 ≤ d ≤ 6 and n ≥ 100. Using d=1, k = √n and selecting neighbors only from the month preceding that which is being simulated, Lall and Sharma demonstrate the k-nn bootstrap for simulating monthly streamflows at a snowmelt-fed mountain stream in Utah. They find that it reproduces the first two log-space moments and auto-correlation in monthly flows just as well as an AR(1) model on the log-transformed flows, but does a much better job reproducing the historical skew.

Because of its simplicity and successful reproduction of observed statistics, Rajagopalan and Lall (1999) extend the k-nn bootstrap for weather generation. In their application, d is again set to 1, so the weather on any given day depends only on the previous day’s weather. Unlike Lall and Sharma’s univariate application of the nearest neighbor bootstrap for streamflow, Rajagopalan and Lall simulate multiple variables simultaneously. In their study, a six dimensional vector of solar radiation, maximum temperature, minimum temperature, average dew point temperature, average wind speed and precipitation is simulated each day by resampling from all historical vectors of these variables.

First, the time series of each variable is “de-seasonalized” by subtracting the calendar day’s mean and dividing by the calendar day’s standard deviation. Next, the most recent day’s vector is compared to all historical daily vectors from the same season using Euclidean distance. Seasons were defined as JFM, AMJ, JAS, and OND. The k nearest neighbors are then ranked and one probabilistically selected using the same Kernel density estimator given by Lall and Sharma. The vector of de-seasonalized weather variables on the day following the selected historical neighbor is chosen as the next simulated vector, after “re-seasonalizing” by back-transforming the weather variables. Rajagopalan and Lall find that this method better captures the auto- and cross-correlation structure between the different weather variables than a modification of the popular parametric method used by Richardson (1981), described in Part I of this series.

One shortcoming of the nearest neighbor bootstrap is that it will not simulate any values outside of the range of observations; it will only re-order them. To overcome this deficiency, Lall and Sharma (1996) suggest, and Prairie et al. (2006) implement, a modification to the k-nn approach. Using d=1 for monthly streamflows, a particular month’s streamflow is regressed on the previous month’s streamflow using locally-weighted polynomial regression on the k nearest neighbors. The order, p, of the polynomial is selected using the GCV score. The residuals of the regression are saved for future use in the simulation.

The first step in the simulation is to predict the expected flow in the next month using the local regression. Next, the k nearest neighbors to the current month are found using Euclidean distance and given the probability of selection suggested by Lall and Sharma. After a neighbor has been selected, its residual from the locally weighted polynomial regression is added to the expected flow given by the regression, and the process is repeated. This will yield simulated flows that are outside of those observed in the historical record.

One problem with this method is that adding a negative residual could result in the simulation of negative values for strictly positive variables like precipitation. Leander and Buishand (2009) modify this approach by multiplying the expected value from the regression by a positive residual factor. The residual factor is calculated as the quotient when the observation is divided by the expected value of the regression, rather than the difference when the expected value of the regression is subtracted from the observation.

Another noted problem with the k-nn bootstrap for weather generation is that it often under-simulates the lengths of wet spells and dry spells (Apipattanavis et al., 2007). This is because precipitation is intermittent, while the other weather variables are not, but all variables are included simultaneously in the k-nn resampling. Apipattanavis et al. (2007) developed a semi-parametric approach to solve this problem. In their weather generator, they adopt the Richardson approach to modeling the occurrence of wet and dry days through a lag-1 Markov process. Once the precipitation state is determined, the k-nn resampling method is used, but with only neighbors of that precipitation state considered for selection.

Steinschneider and Brown (2013) expand on the Apipattanavis et al. (2007) generator by incorporating wavelet decomposition to better capture low-frequency variability in weather simulation, as both parametric and non-parametric weather generators are known to under-simulate inter-annual variability (Wilks and Wilby, 1999). Their generator uses a method developed by Kwon et al. (2007) in which the wavelet transform is used to decompose the historical time series of annual precipitation totals into H orthogonal component series, zh (plus a residual noise component), that represent different low-frequency signals of annual precipitation totals. Each signal, zh, is then simulated separately using auto-regressive models. The simulated precipitation time series is the sum of these H component models. This approach is called WARM for wavelet auto-regressive model.

The time series of annual precipitation totals produced by the WARM simulation model is used to condition the sampling of daily weather data using Apipattanavis et al.’s method. First, the precipitation total in each year of the WARM simulation is compared to the precipitation in all historical years using Euclidean distance. As in the k-nn generator, the k nearest neighbors are ranked and given a probability of selection according to Lall and Sharma’s Kernel density function. Next, instead of sampling one of these neighbors, 100 of them are selected with replacement, so there will be some repeats in the sampled set. The Apipattanavis generator is then used to generate daily weather data from this set of 100 sampled years rather than the entire historical record. Effectively, this method conditions the simulation such that data from years most similar to the WARM-simulated year are given a greater probability of being re-sampled.

Another characteristic of the Apipattanavis et al. (2007) and Steinschneider and Brown (2013) weather generators is their ability to simulate correlated weather across multiple sites simultaneously. Part III discusses methods used by these authors and others to generate spatially consistent weather using parametric or non-parametric weather generators.

Works Cited

Apipattanavis, A., Podestá, G., Rajagopalan, B., & Katz, R. W. (2007). A semiparametric multivariate and multisite weather generator. Water Resources Research, 43.

Bárdossy, A. (1998). Generating precipitation time series using simulated annealing. Water Resources Research34(7), 1737-1744.

Craven, P., & Wahba, G. (1978). Smoothing noisy data with spline functions. Numerische Mathematik31(4), 377-403.

Efron, B. (1979). Bootstrap methods: another look at the jackknife. The annals of Statistics, 1-26.

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.

Kunsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 1217-1241.

Kwon, H. H., Lall, U., & Khalil, A. F. (2007). Stochastic simulation model for nonstationary time series using an autoregressive wavelet decomposition: Applications to rainfall and temperature. Water Resources Research43(5).

Lall, U., & Sharma, A. (1996). A nearest neighbor bootstrap for resampling hydrologic time series. Water Resources Research, 32(3), 679-693.

Leander, R., & Buishand, T. A. (2009). A daily weather generator based on a two-stage resampling algorithm. Journal of Hydrology, 374, 185-195.

Nowak, K., Prairie, J., Rajagopalan, B., & Lall, U. (2010). A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow. Water Resources Research, 46.

Prairie, J. R., Rajagopalan, B., Fulp, T. J., & Zagona, E. A. (2006). Modified K-NN model for stochastic streamflow simulation. Journal of Hydrologic Engineering11(4), 371-378.

Rajagopalan, B., & Lall, U. (1999). A k‐nearest‐neighbor simulator for daily precipitation and other weather variables. Water Resources Research35(10), 3089-3101.

Rajagopalan, B., Lall, U., Tarboton, D. G., & Bowles, D. S. (1997). Multivariate nonparametric resampling scheme for generation of daily weather variables. Stochastic Hydrology and Hydraulics11(1), 65-93.

Semenov, M. A., & Porter, J. R. (1995). Climatic variability and the modelling of crop yields. Agricultural and forest meteorology73(3), 265-283.

Steinschneider, S., & Brown, C. (2013). A semiparametric multivariate, multisite weather generator with low-frequency variability for use in climate risk assessments. Water Resources Research, 49, 7205-7220.

Trigo, R. M., & Palutikof, J. P. (1999). Simulation of daily temperatures for climate change scenarios over Portugal: a neural network model approach.Climate Research13(1), 45-59.

Vogel, R. M., & Shallcross, A. L. (1996). The moving blocks bootstrap versus parametric time series models. Water Resources Research, 32(6), 1875-1882.

Wilks, D. S., & Wilby, R. L. (1999). The weather generation game: a review of stochastic weather models. Progress in Physical Geography, 23(3), 329-357.

Young, K. C. (1994). A multivariate chain model for simulating climatic parameters from daily data. Journal of Applied Meteorology33(6), 661-671.

Synthetic Weather Generation: Part I

Parametric Generators

As water resources systems engineers, we spend a lot of time evaluating and comparing the performance of different water management policies. This is often achieved by modeling the system behavior under each policy when subjected to historical or synthetically generated hydrology. Testing performance under synthetic hydrology allows us to see how well the policy generalizes across plausible futures. One can generate synthetic hydrologies by directly simulating future streamflows, or by indirectly simulating them through a hydrologic model fed by real or synthetically generated weather conditions. Weather generation methods are also useful for predicting how hydrology will change given seasonal climate forecasts or long-term climate change projections.

Our group seems to be fairly well-versed in direct streamflow generation methods, but less familiar with indirect, weather generation methods. There are several similarities, though. First, both streamflow and weather (precipitation, temperature, solar radiation, etc.) can be simulated through “parametric” or “non-parametric” methods. In parametric methods, the streamflow and weather characteristics are assumed to come from some known distribution, whose parameters are fit with the historical record. In non-parametric methods, no such distribution is assumed, and instead future realizations are simulated by resampling from the historical record. Some methods, called “semi-parametric,” combine elements of each of these.

This will be the first of a series of blog posts on weather generators. In this blog, I will describe a handful of parametric weather generation techniques. In Part II, I will describe several non-parametric and semi-parametric weather generators. In Part III, I will cover different techniques for correlating weather generated at multiple sites. Finally, in Parts IV and V, I will discuss different methods of conditioning weather generation on climate change projections and seasonal climate forecasts, respectively. These blog posts will by no means be an exhaustive review of the literature, but a quick glimpse into commonly used weather generation techniques.

Generating Precipitation

Most parametric weather generation methods stem from the weather generator developed by Richardson (1981). In Richardson’s simulator, daily minimum temperature, maximum temperature and solar radiation are generated simultaneously, conditional on the occurrence or not of precipitation. Since both temperature and solar radiation are conditioned on whether or not precipitation has occurred, the first step is to generate a sequence of wet and dry days. This is done by assuming the occurrence of precipitation is a first-order Markov process, meaning whether or not it rains today depends solely on whether or not it rained yesterday.

Using the method of moments (MM), one can easily estimate the probability that today will be wet given that yesterday was wet or dry. Denoting a wet day at time t as Wt and a dry day at time t as Dt,

P(Wt+1|Dt) = (# of dry days followed by a wet day) / (# of dry days)

P(Wt+1|Wt) = (# of wet days followed by a wet day) / (# of wet days)

And by the law of total probability:

P(Dt+1|Dt) = 1 – P(Wt+1|Dt)

P(Dt+1|Wt) = 1 – P(Wt+1|Wt)

These estimates of the transition probabilities are the same as the Maximum Likelihood Estimates (MLE).

Wilks (1999) finds that the first-order Markov process often works well, except in the case of really dry climates, where it tends to under-simulate the lengths of dry spells. In this case, one could instead fit higher (k-)order Markov models (Chin, 1977; Gates and Tong, 1976), where the occurrence of rainy days depends on whether or not it rained on each of the previous k days. Unfortunately, the number of parameters that need to be estimated for a Markov model of order k is 2k, so a linear increase in the order of the process requires an exponential increase in the number of estimated parameters. Since it is often only dry spell lengths that are under-simulated, one could instead fit a hybrid-order Markov model where the occurrence of a wet or dry day depends on the previous min(l, k) days, where l is the number of days since it last rained, and k is the order of the hybrid model (Stern and Coe, 1984).

Another way to better capture persistence is to simulate the occurrence of wet and dry days by alternating between simulating the lengths of wet spells (consecutive wet days) and dry spells. A first order Markov process will create wet and dry spells whose lengths are geometrically distributed with parameter p, where p = P(Wt+1|Dt) for dry spells, and P(Dt+1|Wt) for wet spells. Racsko et al. (1991) found that they could better simulate dry spells in Hungary by using a mixed geometric distribution, where the probability of getting a dry spell, X, of length x is:

P(X = x) = αp1(1-p1)x-1 + (1-α)p2(1-p2)x-1.

So in this model, (100×α)% of dry spells are generated from a geometric distribution with parameter p1, and the rest of them are generated from a geometric distribution with parameter p2, thus requiring the estimation of three parameters. Since only dry spells were being under-simulated by the one-parameter geometric distribution, Racsko et al. (1991) still simulated wet spells from the one-parameter geometric distribution, requiring a total of four parameter estimates. Alternatively, one could simulate wet and dry spells from the negative binomial distribution (Wilby et al., 1998), which requires four total parameters if used for both wet and dry spells, and three if used only for dry spells while the geometric is used for wet spells.

Regardless of how precipitation occurrences are determined, once a wet day has been simulated, the amount of rainfall on that day must also be generated. It is often assumed that the amount of rainfall on any given day is independent of the amount of rainfall on any other day. While the auto-correlation in precipitation amounts is statistically significant, the consequences of assuming independence are not (Wilks, 1999). Richardson assumes that rainfall amounts are generated from an exponential distribution, which only has one parameter, λ, that must be estimated. More commonly (Katz, 1977; Wilks, 1989, 1992) a Gamma distribution is assumed, which has two parameters, α and β. Some authors (Foufoula-Georgiou and Lettenmaier, 1987; Wilks, 1998; 1999; Woolhiser and Roldan, 1982) find that a mixed exponential distribution works best. Like the mixed geometric distribution, a mixed exponential distribution is a weighted average of two exponential distributions:

f(x) = αλ1exp(-λ1x) + (1-α) λ2exp(-λ2x)

So on (100×α)% of wet days, rainfall amounts are generated from an exponential distribution with parameter λ1, and the rest of the time they are generated from an exponential distribution with parameter λ2.

Each of these models has different numbers of parameters that must be estimated; 1 for exponential, 2 for gamma, and 3 for mixed exponential. Using additional parameters should improve the fit, but at the risk of overfitting the data. One method of choosing between these models is to use the Akaike Information Criterion (AIC) which compares the log likelihood function of the different fitted models but with a penalty for the number of parameters estimated. The model with the lowest AIC is said to lose the least information in representing the generating process of the data. Another option is the Bayesian Information Criterion (BIC), which more harshly penalizes the use of additional parameters (Wilks and Wilby, 1999).

In Richardson’s model, he lumps the data from 26 14-day periods in each year of the historical record and calculates the transition probabilities and parameters of the amounts distribution over each of those 2-week periods. He then fits a harmonic function to the time series of transition probabilities and parameter estimates to capture their seasonality. So each day, the transition probabilities and distribution parameters are determined by this continuous function (see Figure 4 from his paper, below). It is more common to simply calculate transition probabilities and parameters of the amounts distribution each month, and use those for precipitation generation each day of that month (Wilks and Wilby, 1999). Despite the discontinuity from month to month, this method does a good job reproducing the seasonal pattern of observed weather.

ppt_interpolation

Generating Temperature and Solar Radiation

Once the daily rainfall occurrence and amounts have been generated, the daily minimum temperature, maximum temperature and solar radiation can be simulated. These must be conditioned on the occurrence of rainfall because they are not independent; a rainy day in July is likely to be cooler than a dry day in July due to, among other reasons, increased cloud cover blocking incident solar radiation. Clearly, then, temperature and radiation are not only correlated with precipitation, but with each other. Additionally, like precipitation occurrence, there is great persistence in daily temperature and daily radiation time series. One model that can account for the persistence in the daily time series of each variable, as well as the cross-correlation between them, is a vector auto-regression (VAR) (Richardson, 1981). Often, a VAR(1) model is used, meaning the auto-regression is of order 1. This is similar to a Markov model of order 1, where the values of the weather variables on a particular day depend only on the values of those variables on the previous day.

To fit the vector auto-regression, one must first calculate the historical time series of “residual” minimum temperature, maximum temperature and solar radiation (Richardson, 1981). Outside of the tropics, temperature and solar radiation tend to vary sinusoidally over the course of a year, with peaks in the summer and troughs in the winter. One must first “remove” this signal before fitting the auto-regressive model. Removing the signal will yield “residuals” to which the VAR(1) model will be fit. Using minimum temperature as an example, the average minimum temperature of every January 1st in the historical record that was wet will be calculated, as well as the standard deviation. This will be done every single day, and a Fourier series fit to the resulting time series of means and standard deviations. The same will be done for the minimum temperature on dry days.

To remove the signal, for each observation of minimum temperature on a wet day in the historical record, that day’s mean wet-day minimum temperature (predicted by the Fourier model) will be subtracted, and the resulting difference divided by that day’s standard deviation of wet-day minimum temperature (predicted by the Fourier model) to produce a residual. The analogous process is performed for each observation of minimum temperature on a dry day. What will result is a historical record of “residuals”, also called “standardized anomalies”; the “anomaly” is the difference between the observation and the mean, which is “standardized” by dividing by the standard deviation. See Figure 1 from Richardson (1981) for a visual depiction of this process for solar radiation. Note that if the residuals are not normal, it may be appropriate to first transform the raw data to normality using a Box-Cox transformation, as Parlange and Katz (2000) do for radiation and wind speed in their extension of the VAR(1) model to include daily mean wind speed and dew point.

radiation_residuals

Sometimes, one might not have a long enough or complete enough historical record to calculate reliable means and standard deviations of wet day and dry day temperature and solar radiation. In this case, the daily data can be lumped to calculate average monthly wet day and dry day temperature and solar radiation data over the historical record. Fourier series can be fit to the mean and standard deviation of these time series instead, with a few caveats. Consider fitting a harmonic function to mean monthly wet day temperature. If one is going to interpolate daily values from this function, several of the daily values within each month should be above the mean and several below. However, oftentimes the peak and trough of the harmonic will lie on the interpolated value. In this case, what is supposed to be the mean daily value for that month will actually be the minimum or maximum. To account for this, one can adjust the amplitudes of the harmonics using a method described by Epstein (1991). A second caveat is that the standard deviation in average monthly wet day temperature will be much smaller than the standard deviation in average daily wet day temperature. If the variance of some random variable, X, is σ2, the variance in its average is σ2/n, where n is the number of observations being averaged. For this reason, the standard deviation of the average monthly wet day temperature must first be multiplied by √n before fitting a Fourier series to the monthly values, which again should be adjusted using Epstein’s method.

The residual time series resulting from this process should be auto-correlated within a particular variable’s time series, and cross-correlated between the three variables of minimum temperature, maximum temperature and solar radiation. This is captured by a VAR(1) process:

z(t) = [A] z(t-1) + [B] ε(t)

where z(t) and z(t-1) are 3×1 vectors of the residuals of the three variables at times t and t-1­, respectively, ε(t) is a 3×1 vector of independent standard normal random variables at time t, and [A] and [B] are 3×3 matrices that together capture the auto- and cross-correlation between the time series of the three variables. [A] and [B] are determined by the following two equations:

[A] = [S1][S]-1

[B][B]T = [S] – [A][S1]T

where [S] is the covariance matrix of the three variables, and [S1] the matrix of lagged (by one time step) covariances of the three variables. [B] can be found using either Cholesky decomposition or spectral decomposition (Wilks, 2011, pp. 470-481, 500-504). Sometimes [A] and [B] will be similar throughout the year, while seasonal changes in the autocovariance structure may require different estimates for different times of the year (Wilks and Wilby, 1999).

Once all of these parameters have been fit, you can start generating synthetic weather!

Works Cited

Chin, E. H. (1977). Modeling daily precipitation occurrence process with Markov chain. Water Resources Research, 13, 949-956.

Epstein, E. S. (1991). On obtaining daily climatological values from monthly means. Journal of Climate, 4, 365-368.

FouFoula-Georgiou, E., & Lettenmaier, D. P. (1987). A Markov renewal model for rainfall occurrences. Water Resources Research, 23, 875-884.

Gates, P., & Tong, H. (1976). On Markov chain modeling to some weather data. Journal of Applied Meteorology, 15, 1145-1151.

Katz, R. W. (1977). Precipitation as a chain-dependent process. Journal of Applied Meteorology, 16, 671-676.

Parlange, M. B., & Katz, R. W. (2000). An extended version of the Richardson model for simulating daily weather variables. Journal of Applied Meteorology, 39, 610-622.

Racsko, P., Szeidl, L., & Semenov, M. (1991). A serial approach to local stochastic weather models. Ecological Modelling, 57, 27-41.

Richardson, C. W. (1981). Stochastic simulation of daily precipitation, temperature and solar radiation. Water Resources Research, 17, 182-190.

Stern, R. D., & Coe, R. (1984). A model fitting analysis of daily rainfall data. Journal of the Royal Statistical Society, A147, 1-34.

Wilby, R. L., Wigley, T. M. L., Conway, D., Jones, P. D., Hewitson, B. C., Main, J., & Wilks, D. S. (1998). Statistical downscaling of general circulation model output: a comparison of methods. Water Resources Research, 34, 2995-3008.

Wilks, D. S. (1989). Conditioning stochastic daily precipitation models on total monthly precipitation. Water Resources Research, 25, 1429-1439.

Wilks, D. S. (1992). Adapting stochastic weather generation algorithms for climate change studies. Climatic Change, 22, 67-84.

Wilks, D. S. (1998). Multisite generalization of a daily stochastic precipitation generation model. Journal of Hydrology, 210, 178-191.

Wilks, D. S. (1999). Interannual variability and extreme-value characteristics of several stochastic daily precipitation models. Agricultural and Forest Meteorology, 93, 153-169.

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences (Vol. 100). Academic press.

Wilks, D. S., & Wilby, R. L. (1999). The weather generation game: a review of stochastic weather models. Progress in Physical Geography, 23(3), 329-357.

Woolhiser, D. A., & Roldan, J. (1982). Stochastic daily precipitation models. 2. A comparison of distribution amounts. Water Resources Research, 18, 1461-1468.

Scenario discovery in Python

The purpose of this blog post is to demonstrate how one can do scenario discovery in python. This blogpost will use the exploratory modeling workbench available on github. I will demonstrate how we can perform both PRIM in an interactive way, as well as briefly show how to use CART, which is also available in the exploratory modeling workbench. There is ample literature on both CART and PRIM and their relative merits for use in scenario discovery. So I won’t be discussing that here in any detail. This blog was first written as an ipython notebook, which can be found here

The workbench is mend as a one stop shop for doing exploratory modeling, scenario discovery, and (multi-objective) robust decision making. To support this, the workbench is split into several packages. The most important packages are expWorkbench that contains the support for setting up and executing computational experiments or (multi-objective) optimization with models; The connectors package, which contains connectors to vensim (system dynamics modeling package), netlogo (agent based modeling environment), and excel; and the analysis package that contains a wide range of techniques for visualization and analysis of the results from series of computational experiments. Here, we will focus on the analysis package. It some future blog post, I plan to demonstrate the use of the workbench for performing computational experimentation and multi-objective (robust) optimization.

The workbench can be found on github and downloaded from there. At present, the workbench is only available for python 2.7. There is a seperate branch where I am working on making a version of the workbench that works under both python 2.7 and 3. The workbench is depended on various scientific python libraries. If you have a standard scientific python distribution, like anaconda, installed, the main dependencies will be met. In addition to the standard scientific python libraries, the workbench is also dependend on deap for genetic algorithms. There are also some optional dependencies. These include seaborn and mpld3 for nicer and interactive visualizations, and jpype for controlling models implemented in Java, like netlogo, from within the workbench.

In order to demonstrate the use of the exploratory modeling workbench for scenario discovery, I am using a published example. I am using the data used in the original article by Ben Bryant and Rob Lempert where they first introduced scenario discovery. Ben Bryant kindly made this data available for my use. The data comes as a csv file. We can import the data easily using pandas. columns 2 up to and including 10 contain the experimental design, while the classification is presented in column 15

import pandas as pd

data = pd.DataFrame.from_csv('./data/bryant et al 2010 data.csv',
                             index_col=False)
x = data.ix[:, 2:11]
y = data.ix[:, 15]

The exploratory modeling workbench is built on top of numpy rather than pandas. This is partly a path dependecy issue. The earliest version of prim in the workbench is from 2012, when pandas was still under heavy development. Another problem is that the pandas does not contain explicit information on the datatypes of the columns. The implementation of prim in the exploratory workbench is however datatype aware, in contrast to the scenario discovery toolkit in R. That is, it will handle categorical data differently than continuous data. Internally, prim uses a numpy structured array for x, and a numpy array for y. We can easily transform the pandas dataframe to either.

x = x.to_records()
y = y.values

the exploratory modeling workbench comes with a seperate analysis package. This analysis package contains prim. So let’s import prim. The workbench also has its own logging functionality. We can turn this on to get some more insight into prim while it is running.

from analysis import prim
from expWorkbench import ema_logging
ema_logging.log_to_stderr(ema_logging.INFO);

Next, we need to instantiate the prim algorithm. To mimic the original work of Ben Bryant and Rob Lempert, we set the peeling alpha to 0.1. The peeling alpha determines how much data is peeled off in each iteration of the algorithm. The lower the value, the less data is removed in each iteration. The minimium coverage threshold that a box should meet is set to 0.8. Next, we can use the instantiated algorithm and find a first box.

prim_alg = prim.Prim(x, y, threshold=0.8, peel_alpha=0.1)
box1 = prim_alg.find_box()

Let’s investigate this first box is some detail. A first thing to look at is the trade off between coverage and density. The box has a convenience function for this called show_tradeoff. To support working in the ipython notebook, this method returns a matplotlib figure with some additional information than can be used by mpld3.

import matplotlib.pyplot as plt

box1.show_tradeoff()
plt.show()

fig1

The notebook contains an mpld3 version of the same figure with interactive pop ups. Let’s look at point 21, just as in the original paper. For this, we can use the inspect method. By default this will display two tables, but we can also make a nice graph instead that contains the same information.

box1.inspect(21)
box1.inspect(21, style='graph')
plt.show()

This first displays two tables, followed by a figure

coverage    0.752809
density     0.770115
mass        0.098639
mean        0.770115
res dim     4.000000
Name: 21, dtype: float64

                            box 21
                               min         max     qp values
Demand elasticity        -0.422000   -0.202000  1.184930e-16
Biomass backstop price  150.049995  199.600006  3.515113e-11
Total biomass           450.000000  755.799988  4.716969e-06
Cellulosic cost          72.650002  133.699997  1.574133e-01

fig 2

If one where to do a detailed comparison with the results reported in the original article, one would see small numerical differences. These differences arise out of subtle differences in implementation. The most important difference is that the exploratory modeling workbench uses a custom objective function inside prim which is different from the one used in the scenario discovery toolkit. Other differences have to do with details about the hill climbing optimization that is used in prim, and in particular how ties are handled in selecting the next step. The differences between the two implementations are only numerical, and don’t affect the overarching conclusions drawn from the analysis.

Let’s select this 21 box, and get a more detailed view of what the box looks like. Following Bryant et al., we can use scatter plots for this.

box1.select(21)
fig = box1.show_pairs_scatter()
fig.set_size_inches((12,12))
plt.show()

fig3

We have now found a first box that explains close to 80% of the cases of interest. Let’s see if we can find a second box that explains the remainder of the cases.

box2 = prim_alg.find_box()

The logging will inform us in this case that no additional box can be found. The best coverage we can achieve is 0.35, which is well below the specified 0.8 threshold. Let’s look at the final overal results from interactively fitting PRIM to the data. For this, we can use to convenience functions that transform the stats and boxes to pandas data frames.

print prim_alg.stats_to_dataframe()
print prim_alg.boxes_to_dataframe()
       coverage   density      mass  res_dim
box 1  0.752809  0.770115  0.098639        4
box 2  0.247191  0.027673  0.901361        0
                             box 1              box 2
                               min         max    min         max
Demand elasticity        -0.422000   -0.202000   -0.8   -0.202000
Biomass backstop price  150.049995  199.600006   90.0  199.600006
Total biomass           450.000000  755.799988  450.0  997.799988
Cellulosic cost          72.650002  133.699997   67.0  133.699997

For comparison, we can also use CART for doing scenario discovery. This is readily supported by the exploratory modelling workbench.

from analysis import cart
cart_alg = cart.CART(x,y, 0.05)
cart_alg.build_tree()

Now that we have trained CART on the data, we can investigate its results. Just like PRIM, we can use stats_to_dataframe and boxes_to_dataframe to get an overview.

print cart_alg.stats_to_dataframe()
print cart_alg.boxes_to_dataframe()
       coverage   density      mass  res dim
box 1  0.011236  0.021739  0.052154        2
box 2  0.000000  0.000000  0.546485        2
box 3  0.000000  0.000000  0.103175        2
box 4  0.044944  0.090909  0.049887        2
box 5  0.224719  0.434783  0.052154        2
box 6  0.112360  0.227273  0.049887        3
box 7  0.000000  0.000000  0.051020        3
box 8  0.606742  0.642857  0.095238        2
                       box 1                  box 2               box 3  \
                         min         max        min         max     min
Cellulosic yield        80.0   81.649994  81.649994   99.900002  80.000
Demand elasticity       -0.8   -0.439000  -0.800000   -0.439000  -0.439
Biomass backstop price  90.0  199.600006  90.000000  199.600006  90.000   

                                         box 4                box 5  \
                               max         min         max      min
Cellulosic yield         99.900002   80.000000   99.900002   80.000
Demand elasticity        -0.316500   -0.439000   -0.316500   -0.439
Biomass backstop price  144.350006  144.350006  170.750000  170.750   

                                      box 6                  box 7  \
                               max      min         max        min
Cellulosic yield         99.900002  80.0000   89.050003  89.050003
Demand elasticity        -0.316500  -0.3165   -0.202000  -0.316500
Biomass backstop price  199.600006  90.0000  148.300003  90.000000   

                                         box 8
                               max         min         max
Cellulosic yield         99.900002   80.000000   99.900002
Demand elasticity        -0.202000   -0.316500   -0.202000
Biomass backstop price  148.300003  148.300003  199.600006

Alternatively, we might want to look at the classification tree directly. For this, we can use the show_tree method. This returns an image that we can either save, or display.

fig3

If we look at the results of CART and PRIM, we can see that in this case PRIM produces a better description of the data. The best box found by CART has a coverage and density of a little above 0.6. In contrast, PRIM produces a box with coverage and density above 0.75.