Power Spectral Density Estimation in Matlab

Hi all,

This post is about power spectral density estimation (PSDE) using Matlab.  PSDE is often used in signal processing and fluid mechanics, so you may or may not be familiar with it.  The idea is to determine the frequency content of sampled time series data.  This means that we can determine the amount of the variance in the time series that is contained at different frequencies (or periodicities).  Let’s take a look at a simple example.

Suppose we are sampling a signal (in this case a sine wave) for about 3 minutes at a frequency of 100 hertz, meaning we take 100 measurements every second.  Let's generate our signal and plot our observations using Matlab.

x = (0:0.01:60*pi);
y = sin(x);

figure(1)
plot(x,sin(x))
title('Signal with no Noise')
xlabel('Seconds')
ylabel('Observed Signal')
ylim([-1.5,1.5])

Sine wave, with no noise, sampled at 100 Hertz

From observing the data directly it's pretty clear what the frequency and period of the data are here (1/(2*pi) Hz and 2*pi seconds, respectively): no need for PSDE.  However, real data are rarely so well behaved.  To simulate this, let's add random noise to our original signal.  To make it interesting, let's add a lot of noise: normally distributed noise with a standard deviation 10 times larger than the amplitude of the signal!

for i = 1:max(size(x))
 y(i) = y(i)+10*norminv(rand);
end

figure(2)
plot(x,y)
title('Signal with Noise')
xlabel('Seconds')
ylabel('Observed Signal')

Sine wave, with random normal noise, sampled at 100 Hertz

Now the noise completely washes out the signal, and it is not at all clear what the period or frequency of the underlying data are.  So let's compute and plot the power spectrum of the data.  I leave it to the reader to review all of the math underlying the computation of the spectrum, but I'll provide a very brief description of what we are doing here.  PSDE involves taking the discrete Fourier transform of the data, which moves our data from the temporal or spatial domain to the frequency domain.  The power spectral density (PSD) function is simply the squared magnitude of the Fourier-transformed data (often scaled).  It tells us how much of the variance in our signal is explained in each discrete frequency band.  A spike in the PSD function tells us that there's a lot of variance explained in that frequency band.  The frequency resolution of our analysis (the width of the frequency bands) is dictated by the length of our recorded observations.  Interestingly, if we integrate the PSD function over the frequency domain, we obtain the sample variance of our data.  Now, let's look at the code.

fs = (0.01)^-1;
N = size(y,2);

Y = fft(y);
f = fs*(0:N-1)/N;
G = abs(Y).^2/(fs*N);

figure(3)
loglog(f(2:floor(N/2)),G(2:floor(N/2)))
title('Amplitude of Spectrum of y(t)')
xlabel('Frequency (Hz)')
ylabel('S(f)')

The first two lines define the sampling frequency (fs) and the number of observations (N).  Next we use Matlab's fft command to take the discrete Fourier transform of the data.  We then compute the frequency bins we are considering, and finally we compute and plot the PSD function.  Let's take a look at the result.

 

There is a lot of noise, but there is a distinct peak at 0.1592 Hz, which is 1/(2*pi) and indicates the frequency of the underlying signal.  This is neat: even though the signal contains a huge amount of noise, we've recovered the correct frequency (and, inversely, the period) of the signal!
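As an aside, if you would rather work in Python than Matlab, here is a minimal sketch of the same experiment using NumPy's FFT. The frequency bins and scaling mirror the Matlab code above, and the peak again shows up near 0.1592 Hz:

import numpy as np
import matplotlib.pyplot as plt

fs = 100.0                                       # sampling frequency (Hz)
t = np.arange(0, 60 * np.pi, 1.0 / fs)           # about 3 minutes of samples
y = np.sin(t) + 10.0 * np.random.randn(t.size)   # sine wave buried in noise

N = y.size
Y = np.fft.fft(y)
f = np.arange(N) * fs / N                        # frequency bins
G = np.abs(Y) ** 2 / (fs * N)                    # periodogram estimate of the PSD

plt.loglog(f[1:N // 2], G[1:N // 2])
plt.xlabel('Frequency (Hz)')
plt.ylabel('S(f)')
plt.title('Amplitude of Spectrum of y(t)')
plt.show()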

So how might we as systems analysts use this?

In my PhD thesis we used PSDE to determine the critical time dynamics of a hydropower system's operation.  This turned out to be a valuable diagnostic tool.  In the hydropower system I considered, both inflows and energy prices are stochastic, so it can be difficult to determine whether the derived 'optimal' policy makes sense by only observing simulation results: the release in any one time step (here 6 hours) can appear noisy and erratic in response to price or hydrologic shocks.  The power of PSDE is its ability to cut through the noise and reveal the underlying signal.

As an example, I’ve derived an implicit ‘optimal’ policy using sampling stochastic dynamic programming.  To evaluate the quality of the derived policy, I’ve simulated the system operation with this policy over the range of historical inflow series.  For simplicity, we’ll assume that the energy price profile (ranging from high on-peak to low off-peak prices) is identical on every day.  Now, let’s look at the PSD function for the reservoir release.

Power spectral density of the simulated reservoir releases under the derived policy

We can immediately pick out elements of the PSD function that make sense.  First, there is a lot of variability at very low frequencies, corresponding to periodicity on the order of months or seasons.  This is likely due to seasonal hydrologic variability between summer, spring, and winter inflows: releases are higher in the spring and early summer because more water is available, and generally lower in the winter and late summer when inflows to the reservoir are lower.  Similarly, the peak at a frequency of 1/day makes sense, because we still have a daily price cycle that the system is responding to.  But we are harder pressed to explain the other peaks between frequencies of 0.1 and 1 (roughly corresponding to weekly and within-week periodicities).  We have fixed the price profile, and hydrology shouldn't be cycling during the week, so what is going on?

As it turns out, I had a bug in my algorithm (bad ending conditions for the SSDP) which was causing the 'optimal' policy to behave a bit myopically on a weekly basis.  This error was propagating through the model, but was not easy to see by simply observing the simulated results (see chapter 6 of the thesis).  Correcting the problem, I got the following spectrum.  The unexplainable spikes have gone away, and we now see that daily price cycling and seasonal hydrology explain the majority of the system's variability, as we expect.

Power spectral density of the simulated reservoir releases after correcting the SSDP ending conditions

For our Cornell-based readers, our new CEE faculty member Gregory McLaskey is offering a new class on time series analysis (including PSDE), which is highly recommended.  PSDE is also used extensively (with real data) in Todd Cowen's experimental fluid mechanics class, which I also recommend because you learn a lot and get to play with lasers which, let's be honest, is why we all became engineers.

Algorithm Diagnostics Walkthrough using the Lake Problem as an example (Part 1 of 3: Generate Pareto approximate fronts)

This three part series is an overview of the algorithm diagnostics I performed in my Lake Problem study with the hope that readers may apply the steps to any problem of interest. All of the source code for my study, including the scripts used for the diagnostics can be found at https://github.com/VictoriaLynn/Lake-Problem-Diagnostics.

The first step to using the MOEAFramework for comparative algorithm diagnostics was to create the simulation model on which I would be assessing algorithm performance. The Lake Problem was written in C++. The executable alone could be used for optimization with Borg, and I created a Java stub to connect the problem to the MOEAFramework (https://github.com/VictoriaLynn/Lake-Problem-Diagnostics/blob/master/Diagnostic-Source/myLake4ObjStoch.java).  Additional information on this aspect of a comparative study can be found in examples 4 and 5 for the MOEAFramework (http://moeaframework.org/examples.html) and in Chapter 5 of the manual. I completed the study using version 2.1, which was the newest at the time. I used the all-in-one executable instead of the source code, although I compiled my simulation code within the examples subfolder of the source code.

Once I had developed an appropriate simulation model to represent my problem, I could begin the diagnostic component of my study. I first chose algorithms of interest and determined the range of parameters from which I would like to sample. To determine parameter ranges, I consulted Table 1 of the 2013 AWR article by Reed et al.

Reed, P., et al. Evolutionary Multiobjective Optimization in Water Resources: The Past, Present, and Future. (Editor Invited Submission to the 35th Anniversary Special Issue), Advances in Water Resources, 51:438-456, 2013.

Example parameter files and the ones I used for my study can be found at https://github.com/VictoriaLynn/Lake-Problem-Diagnostics/tree/master/Diagnostic-Source/params. Once I had established parameter files for sampling, I found chapter 8 of the MOEAFramework manual to be incredibly useful.  Below I walk through the steps I took in generating approximations of the Pareto optimal front for my problem across multiple seeds, algorithms, and parameterizations.   All of the commands have been consolidated into the file Lake_Problem_Comparative_Study.sh on Github, but I had many separate files during my study, which will be separated into steps here. It may have been possible to automate the whole process, but I liked breaking it up into separate scripts to make sure I checked that the output made sense after each step.

Step 1: Generate Parameter Samples

To generate parameter samples for each algorithm, I used the following code, which I kept in a file called sample_parameters.sh. I ran all .sh scripts using the general command sh script_name.sh.

NSAMPLES=500
METHOD=Latin
PROBLEM=myLake4ObjStoch
ALGORITHMS=(Borg MOEAD eMOEA NSGAII eNSGAII GDE3)
JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar"

# Generate the parameter samples
echo -n "Generating parameter samples..."
for ALGORITHM in ${ALGORITHMS[@]}
do
java ${JAVA_ARGS} \
org.moeaframework.analysis.sensitivity.SampleGenerator \
--method ${METHOD} --n ${NSAMPLES} --p ${ALGORITHM}_params.txt \
--o ${ALGORITHM}_${METHOD}
done

Step 2: Optimize the problem using algorithms of interest

This step had two parts: optimization with Borg and optimization with the MOEAFramework algorithms. To optimize using Borg, one needs to request Borg at http://borgmoea.org/. This is the only step that needs to be completed outside of the MOEAFramework. I then used the following script to generate approximations to the Pareto front for all 500 samples and 50 random seeds. The -l and -u flags indicate lower and upper bounds for decision variable values. Fortunately, it should soon be possible to type one value and specify the number of variables with that bound instead of typing all 100 values as shown here.

#!/bin/bash
#50 random seeds

NSEEDS=50
PROBLEM=myLake4ObjStoch
ALGORITHM=Borg

SEEDS=$(seq 1 ${NSEEDS})

for SEED in ${SEEDS}
do
NAME=${ALGORITHM}_${PROBLEM}_${SEED}
PBS="\
#PBS -N ${NAME}\n\
#PBS -l nodes=1\n\
#PBS -l walltime=96:00:00\n\
#PBS -o output/${NAME}\n\
#PBS -e error/${NAME}\n\
cd \$PBS_O_WORKDIR\n\
./BorgExec -v 100 -o 4 -c 1 \
-l 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 \
-u 0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1 \
-e 0.01,0.01,0.0001,0.0001 -p Borg_params.txt -i Borg_Latin -s ${SEED} -f ./sets/${ALGORITHM}_${PROBLEM}_${SEED}.set -- ./LakeProblem4obj_control "
echo -e $PBS | qsub
done

Optimization with the MOEAFramework allowed me to submit jobs for all remaining algorithms and seeds with one script, as shown below. In my study, I actually submitted epsilon dominance algorithms (which include the -e flag) and point dominance algorithms (which do not include the -e flag) separately; however, it is my understanding that it would have been fine to submit jobs for all algorithms with the epsilon flag, especially since I converted all point dominance approximations to the Pareto front to epsilon dominance when generating reference sets.


#!/bin/bash

NSEEDS=50
PROBLEM=myLake4ObjStoch
ALGORITHMS=(MOEAD GDE3 NSGAII eNSGAII eMOEA)

SEEDS=$(seq 1 ${NSEEDS})
JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar"
set -e

for ALGORITHM in ${ALGORITHMS[@]}
do
for SEED in ${SEEDS}
do
NAME=${ALGORITHM}_${PROBLEM}_${SEED}
PBS="\
#PBS -N ${NAME}\n\
#PBS -l nodes=1\n\
#PBS -l walltime=96:00:00\n\
#PBS -o output/${NAME}\n\
#PBS -e error/${NAME}\n\
cd \$PBS_O_WORKDIR\n\
java ${JAVA_ARGS} \
org.moeaframework.analysis.sensitivity.Evaluator \
-p ${ALGORITHM}_params.txt -i ${ALGORITHM}_Latin -b ${PROBLEM} \
-a ${ALGORITHM} -e 0.01,0.01,0.0001,0.0001 -s ${SEED} -o ./sets/${NAME}.set"
echo -e $PBS | qsub
done

done

Step 3: Generate combined approximation set for each algorithm and global reference set

Next, I generated a reference set for each algorithm's performance. This was useful as it made it easier to generate the global reference set for all algorithms across all seeds and parameterizations, and it allowed me to calculate a percent contribution for each algorithm to the global reference set. Below is the script for the algorithm reference sets:

#!/bin/bash

NSAMPLES=50
NSEEDS=50
METHOD=Latin
PROBLEM=myLake4ObjStoch
ALGORITHMS=( NSGAII GDE3 eNSGAII MOEAD eMOEA Borg)

JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar"
set -e

# Generate the combined approximation sets for each algorithm
for ALGORITHM in ${ALGORITHMS[@]}
do
echo -n "Generating combined approximation set for
${ALGORITHM}..."
java ${JAVA_ARGS} \
org.moeaframework.analysis.sensitivity.ResultFileMerger \
-b ${PROBLEM} -e 0.01,0.01,0.0001,0.0001 -o ./SOW4/reference/${ALGORITHM}_${PROBLEM}.combined \
./SOW4/sets/${ALGORITHM}_${PROBLEM}_*.set
echo "done."
done

In the same file, I added the following lines to generate the global reference set while running the same script.
# Generate the reference set from all combined approximation sets
echo -n "Generating reference set..."
java ${JAVA_ARGS} org.moeaframework.util.ReferenceSetMerger \
-e 0.01,0.01,0.0001,0.0001 -o ./SOW4/reference/${PROBLEM}.reference ./SOW4/reference/*_${PROBLEM}.combined > /dev/null
echo "done."

If one wants to keep the decision variables associated with the reference set solutions, it is possible to use org.moeaframework.analysis.sensitivity.ResultFileMerger on all of the pertinent .set files.

A final option for reference sets is to generate local reference sets for each parameterization of each algorithm. This was done with the following script:

#!/bin/bash
NSEEDS=50
ALGORITHMS=( GDE3 eMOEA Borg NSGAII eNSGAII MOEAD)
PROBLEM=myLake4ObjStoch

SEEDS=$(seq 1 ${NSEEDS})

# Evaluate all algorithms for all seeds
for ALGORITHM in ${ALGORITHMS[@]}
do
java -cp MOEAFramework-2.1-Demo.jar org.moeaframework.analysis.sensitivity.ResultFileSeedMerger -d 4 -e 0.01,0.01,0.0001,0.0001 \
--output ./SOW4/${ALGORITHM}_${PROBLEM}.reference ./SOW4/objs/${ALGORITHM}_${PROBLEM}*.obj
done

Part 2 of this post walks through my calculation of metrics.

Determining Whether Differences in Diagnostic Results are Statistically Significant (Part 1)

While working on my diagnostic assessment of the difficulty of the “Lake Problem”, I decided that I would like to check that the differences in performance I observed were statistically significant. This part of the blog post is a broad overview of the steps leading up to the statistical analysis and of the analysis itself; Part 2 includes a tutorial. Looking through prior group papers, I drew initial inspiration from the following articles by Dave, although I think my final method differs slightly.

Hadka, D. and P. Reed. Diagnostic Assessment of Search Controls and Failure Modes in Many-Objective Evolutionary Optimization. Evolutionary Computation, 20(3):423-452, 2012.

Hadka, D. and P. Reed. Borg: An Auto-Adaptive Many-Objective Evolutionary Computing Framework. Evolutionary Computation, 21(2):231-259, 2013.

Below is a short summary of the steps preceding my statistical analysis with links to group resources for those who may need more background.  For everyone else, feel free to skip ahead.

Before I could determine the statistical significance of my results, I needed to perform a comparative study of algorithm performance. Sections 8.1 to 8.8 and section 8.10 of the MOEAFramework manual (available at the MOEAFramework website) walk through how to generate approximate fronts for the problem for multiple random seeds (I used 50) using algorithms found within the MOEAFramework, calculate metrics for each seed across all parameterizations, and determine the best value attained for each seed across all parameterizations as well as the probability of attaining a specified threshold (I used 75%) for each seed.  I skipped section 8.9 and performed the analysis described in 8.10 on each seed's metrics, as many samples are necessary to perform statistics on algorithm performance.  Please note, I also included the Borg evolutionary algorithm in this study, which requires optimization outside of the framework. All other steps for Borg can be performed the same as for the MOEAFramework algorithms.

In this case, I used Latin Hypercube Sampling to generate 500 samples of the feasible parameter space for each algorithm. I chose parameter ranges from Table 1 of the following paper.
Reed, P., et al. Evolutionary Multiobjective Optimization in Water Resources: The Past, Present, and Future. (Editor Invited Submission to the 35th Anniversary Special Issue), Advances in Water Resources, 51:438-456, 2013.

Here is the shell script I used for calculating the best attained values and probability of attaining the 75% threshold for all algorithms and all seeds.

NSEEDS=50
METHOD=Latin
PROBLEM=myLake4ObjStoch
ALGORITHMS=( NSGAII GDE3 eNSGAII MOEAD eMOEA Borg)

SEEDS=$(seq 1 ${NSEEDS})
JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar"
set -e

# Perform the analysis
echo ""
echo "Hypervolume Analysis:"
for ALGORITHM in ${ALGORITHMS[@]}
do
	for SEED in ${SEEDS}
	do
    NAME=${ALGORITHM}_${PROBLEM}_${SEED}_Hypervolume

    PBS="\
    #PBS -N ${NAME}\n\
    #PBS -l nodes=1\n\
    #PBS -l walltime=96:00:00\n\
    #PBS -o output/${NAME}\n\
    #PBS -e error/${NAME}\n\
    cd \$PBS_O_WORKDIR\n\
        java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.Analysis \
        -p ${ALGORITHM}_params.txt -i ${ALGORITHM}_${METHOD} -t 0.75 -o ./SOW6/Hypervolume/${ALGORITHM}_${PROBLEM}_${SEED}_Hypervolume.analysis \
        -c -e -m 0 --hypervolume 0.7986 ./SOW6/metrics/${ALGORITHM}_${PROBLEM}_${SEED}.metrics"
    echo -e $PBS | qsub
    done

done

The above shell script is written for Hypervolume, but other metrics can be specified easily by indicating a different number after the -m flag and removing the --hypervolume flag.  If one wants to analyze Generational Distance, type 1 after the -m flag.  For the Additive Epsilon Indicator, type 4.  Typically, these are the three metrics of interest, as explained in this blog post: https://waterprogramming.wordpress.com/2013/06/25/moea-performance-metrics/.  The --hypervolume flag allows you to specify the hypervolume of your reference set. That way the individual reference set hypervolume values are normalized to that of the best known approximation to the Pareto front.

Hypervolume can be calculated for a given approximation set using the following command line argument.

java -cp MOEAFramework-2.1-Demo.jar org.moeaframework.analysis.sensitivity.SetHypervolume name_of_reference_set_file 

Finally, I calculated statistics on these values.  I used the Kruskal-Wallis and Mann-Whitney U (also called the Wilcoxon rank-sum) tests that Dave used.  One advantage of these tests is that they are non-parametric, meaning that no assumptions are made about the distributions from which the data come.  Another advantage I found is that they are built into Matlab as the kruskalwallis and ranksum functions.  I began with the Kruskal-Wallis test, as it tests the null hypothesis that independent samples from two or more groups come from distributions with equal medians and returns the p-value for that hypothesis given the data.  This test is a useful screen, as there would not have been much point in continuing if it had indicated that the values for all algorithms were statistically similar.

I received very low p-values using Kruskal-Wallis, so I continued to the Mann-Whitney U test in every case.  The Mann-Whitney U test provides a pairwise comparison.  Matlab's ranksum function has options to test one of three alternative hypotheses:

  1. medians are not equal (default, two-tailed test)
  2. median of X is greater than median of Y (right-tailed test)
  3. median of X is less than median of Y (left-tailed test)

I used one-tailed tests in order to determine if one algorithm's performance was superior to another algorithm's performance.  I was then able to rank algorithm performance on any given metric by counting the number of algorithms each algorithm outperformed according to this test. A detailed walkthrough is provided in Part 2.
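As an aside, the same screen can be run outside of Matlab: SciPy provides both tests (the alternative keyword for the one-sided Mann-Whitney U test requires a reasonably recent SciPy). The sketch below is illustrative only; it assumes the per-seed best attained hypervolume values for each algorithm have already been extracted to plain text files, and the file names are made up.

import numpy as np
from scipy import stats

# Illustrative file names: one best attained hypervolume value per seed.
hv = {
    'Borg':   np.loadtxt('Borg_hypervolume.txt'),
    'NSGAII': np.loadtxt('NSGAII_hypervolume.txt'),
    'MOEAD':  np.loadtxt('MOEAD_hypervolume.txt'),
}

# Kruskal-Wallis screen: could all groups plausibly share the same median?
h_stat, p_kw = stats.kruskal(*hv.values())
print('Kruskal-Wallis p-value:', p_kw)

# Pairwise one-sided Mann-Whitney U tests: is A's median greater than B's?
for a in hv:
    for b in hv:
        if a != b:
            u_stat, p = stats.mannwhitneyu(hv[a], hv[b], alternative='greater')
            print(a, '>', b, 'p =', p)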

Runtime dynamics with serial Borg

written by Riddhi Singh

Purpose:

The purpose of this write up is to provide step-by-step instructions for generating a runtime dynamics file using the serial version of Borg.

Pre-requisites:

Familiarity with running the serial version of Borg is NOT needed. We will give a brief overview of setting up a Borg problem. The user should preferably have a pre-defined problem to test this method. Otherwise, the dtlz2 test function that is provided with the serial release of Borg (dtlz2.py) can be used.   Serial Borg can be requested here: http://borgmoea.org/

Step-by-step instructions for obtaining a runtime dynamics file when calling Borg from the command line

1.    Create a Borg executable – use the makefile that comes with the Borg-1.6 release.

Use the following commands in a makefile and then run the ‘make’ command- 

CC = gcc

build:
	${CC} -o borgExec frontend.c borg.c mt19937ar.c -lm

Alternatively run the following on command line:

gcc -o borgExec frontend.c borg.c mt19937ar.c -lm

2.    Create a problem executable – This step can be skipped if using the dtlz2.py executable that comes with the Borg package. If compiling your own problem, you essentially need to create a bridge between Borg and your problem. For each function evaluation, Borg needs a way to pass variables to your problem, and after evaluation your problem needs to return the associated objective function values back to Borg. You can use the template in dtlz2.py to parse Borg's input stream. Sample code in C++ is provided below. Note that you cannot use this code directly; it is only the interface, and your own test function should fill in the body of test_function at the top.

#include <cstdio>
#include <cstdlib>

/* Your test function goes here. It takes decision variables in vars and
   returns the objective function values in objs. */
void test_function(double *vars, double *objs, int nvars, int nobjs)
{
      /* ...evaluate the objectives from the decision variables... */

      /* Once the objectives are evaluated, pass them back to Borg on the
         output stream and flush so Borg receives them immediately. */
      for (int cols = 0; cols < nobjs; cols++)
      {
            printf("%lg ", objs[cols]);
      }
      printf("\n");
      fflush(stdout);
}

int main(int argc, char *argv[])
{
      // Initialize the number of decision variables and objective functions
      int nvars = 10;
      int nobjs = 1;

      /* Arrays that will store the incoming Borg variables and the outgoing
         objective function values */
      double *vars = new double[nvars];
      double *objs = new double[nobjs];

      // Buffer for reading the input stream from Borg
      const int maxSize = 10000;
      char buffer[maxSize];

      /* Read the input stream from Borg until the end of the stream is reached.
         Borg terminates the stream with an endline character. Borg passes the
         variables as text, so they need to be converted to double before they
         are sent out for evaluation. */
      while (fgets(buffer, maxSize, stdin) != NULL && buffer[0] != '\n')
      {
            char *pEnd = buffer;
            for (int cols = 0; cols < nvars; cols++)
            {
                  vars[cols] = strtod(pEnd, &pEnd);
            }

            // Evaluate the test problem with this vector; it prints the objectives
            test_function(vars, objs, nvars, nobjs);
      }

      delete[] vars;
      delete[] objs;
      return 0;
}

Now compile this executable. Use the following commands in a makefile and then run the ‘make’ command-

CC2 = g++

build:
	${CC2} -o TestExec TestFunction.cpp

Alternatively run the following on command line:

g++ -o TestExec TestFunction.cpp
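Alternatively, the bridge can be written in Python, which is essentially what dtlz2.py does. The sketch below mirrors the C++ template above and is only illustrative: evaluate() is a stand-in for your own test function, the file name TestFunction.py is made up, and nvars and nobjs must match the -v and -o options passed to Borg. Since there is nothing to compile, you would point Borg at it in step 3 as 'python TestFunction.py' instead of './TestExec'.

import sys

nvars = 10   # must match the -v option passed to Borg
nobjs = 1    # must match the -o option passed to Borg

def evaluate(variables):
    # Stand-in for your own test function: return a list of nobjs objective values.
    return [sum(v * v for v in variables)]

# Read decision variables from Borg on stdin, one candidate solution per line,
# until a blank line signals the end of the stream (as in the C++ template above).
for line in sys.stdin:
    line = line.strip()
    if not line:
        break
    variables = [float(token) for token in line.split()][:nvars]
    objectives = evaluate(variables)
    # Write the objective values back to Borg on stdout and flush immediately.
    print(' '.join('%.17g' % obj for obj in objectives))
    sys.stdout.flush()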

3.    Get the runtime dynamics file – Once both Borg and problem executables are ready, obtaining runtime dynamics from the command line is straightforward. Use the following command to generate a runtime dynamics file –

./BorgExec -v 10 -o 1 -n 1000 -e 0.01 -R RuntimeDynamicsFile.txt -F 10 ./TestExec > BorgOutput.txt

The options are elaborated in the README file that is provided with the Borg-1.6 package. The ‘RuntimeDynamicsFile.txt’ contains the runtime dynamics and the ‘BorgOutput.txt’ file contains the final result.

Alternatively, if using the dtlz2.py function, execute the following command –

./BorgExec -v 11 -o 2 -n 1000 -e 0.01,0.01 -R RuntimeDynamicsFile.txt -F 10 python dtlz2.py > BorgOutput.txt

Python Data Analysis Part 1c: Borg Runtime Metrics Plots

Borg Runtime Metrics

Have you done Part 1a yet?  How about Part 1b? Go do those first, and come back with metrics.txt, which has all of the data in a single file.  Afterwards, please check out Part 2.

We’re interested in Borg’s runtime metrics because they tell us interesting things about the problem we’re optimizing. See the Borg paper for more details.

The Script

Please consider this one released under the MIT license.

  1 import pandas
  2 import matplotlib
  3 import matplotlib.backends.backend_svg as svg
  4 import matplotlib.backends.backend_agg as agg
  5
  6 metrics = pandas.read_table('metrics.txt')
  7
  8 models = ["gasp", "response"]
  9 colors = ['b', 'r']
 10 seeds = range(50)
 11 toplot = [
 12     "Elapsed Time", "Population Size", "Archive Size",
 13     "GenerationalDistance", "AdditiveEpsilonIndicator",
 14     "SBX+PM", "DifferentialEvolution+PM",
 15     "PCX", "SPX", "UNDX", "UM"
 16     ]
 17 titles = {
 18     "GenerationalDistance": "Generational Distance",
 19     "AdditiveEpsilonIndicator": "Additive Epsilon Indicator",
 20     "SBX+PM": "Simulated Binary Crossover",
 21     "DifferentialEvolution+PM": "Differential Evolution",
 22     "PCX": "Parent Centric Crossover", "SPX": "Simplex Crossover",
 23     "UNDX": "Unimodal Normally Distributed Crossover",
 24     "UM": "Uniform Mutation"
 25     }
 26 filenames = {
 27     "Elapsed Time": "time", "Population Size": "popsize",
 28     "Archive Size": "archive",
 29     "GenerationalDistance": "gd",
 30     "AdditiveEpsilonIndicator":"aei",
 31     "SBX+PM": "sbx", "DifferentialEvolution+PM":"de",
 32     "PCX":"pcx", "SPX":"spx", "UNDX":"undx", "UM":"um"
 33     }
 34 axis_limits = {
 35     "SBX+PM": (0.0, 1.0), "DifferentialEvolution+PM": (0.0, 1.0),
 36     "PCX": (0.0, 1.0), "SPX": (0.0, 1.0), "UNDX": (0.0, 1.0),
 37     "UM": (0.0, 1.0)
 38     }
 39 for column in toplot:
 40     fig = matplotlib.figure.Figure(figsize=(10,6))
 41     svg.FigureCanvasSVG(fig) # for SVG
 42     # agg.FigureCanvasAgg(fig) # for PNG
 43     ax = fig.add_subplot(1,1,1)
 44     for model, color in zip(models, colors):
 45         for seed in seeds:
 46             filtered = metrics[(metrics['model'] == model) &
 47                                (metrics['seed'] == seed)]
 48             line = ax.plot(filtered['NFE'], filtered[column],
 49                     color=color)[0]
 50             line.set_label('_nolegend_')
 51         line.set_label({"gasp":"GASP","response":"RSM"}[model])
 52
 53     ax.set_xlim(0, 100000)
 54     limits = axis_limits.get(column, None)
 55     if limits:
 56         ax.set_ylim(limits[0], limits[1])
 57
 58     ax.legend(bbox_to_anchor=(1.0, 1.0))
 59     ax.set_title(titles.get(column, column))
 60     fig.savefig(filenames.get(column, column))

Reading in the Data

Remember all of the text manipulation we had to do in Part 1a to deal with our data? Pandas gives us a very convenient interface to our table of data that makes Part 1a look like banging rocks together.

Line 6 shows how I read in a table. No need to parse the header or tell Pandas that we used tabs as field separators. Pandas does all of that.

  6 metrics = pandas.read_table('metrics.txt')

metrics is now a pandas DataFrame object. Users of R will recognize the DataFrame concept. It’s basically a table of data with named columns, where the data in each column are basically homogeneous (e.g. all floating-point numbers or all text).

Setting Things Up

Line 8 identifies the models for which I expect to find data in metrics.txt, and Line 9 indicates what color to use for plotting, for each model. 'r' is red, and 'b' is blue. Line 10 identifies the seeds I expect to find in the data table. range(50) is a shorthand way of saying “integers 0 through 49”.

Lines 11 through 16 make a list of the columns I want to plot. Lines 17 through 25 set up a dictionary relating those column names to the names I want to use as plot titles. (Generational Distance instead of GenerationalDistance, for example). Likewise, Lines 26 through 33 make a dictionary relating column names to the names of the files where I want to save the plots, and Lines 34 through 38 specify Y-axis limits for the plots.

A dictionary (dict) in Python is an associative data structure. Every value stored in a dictionary is attached to a key, which is used to look it up. Wikipedia has a pretty good explanation.

Making the Plots

Lines 39 through 60 make a plot for each of the columns we specified in toplot and save it to a file.

Setting up the Axes

When using Matplotlib, the axes object provides most of the plotting interface. Lines 40 through 43 set up an axes instance.

 40     fig = matplotlib.figure.Figure(figsize=(10,6))
 41     svg.FigureCanvasSVG(fig) # for SVG
 42     # agg.FigureCanvasAgg(fig) # for PNG
 43     ax = fig.add_subplot(1,1,1)

An axes object belongs to a figure. Line 40 sets up a figure, specifying that it should be 10 inches by 6 inches (this is a nominal size since we’re dealing with screen graphics).

Line 41 creates a canvas, which is a backend for drawing. This is not what the Matplotlib tutorials would have you do. They use pyplot.figure() to create a figure which then keeps a whole bunch of state in the background. The pyplot approach is apparently designed to ease the transition for Matlab users. Since I’m not a Matlab user it just seems weird to me, so I create a canvas explicitly. Commenting out Line 41 and switching to Line 42 instead switches between SVG and PNG output.

Line 43 creates the axes object for plotting the data in a column. add_subplot tells a figure where to put the axes. Matplotlib is designed from the ground up to support figures with multiple plots in them. The arguments 1, 1, 1 tell the figure it has a 1×1 array of subplots, and the one we’re interested in is in the first position.

Plotting the Runtime Metrics

Lines 44 through 51 do the plotting:

 44     for model, color in zip(models, colors):
 45         for seed in seeds:
 46             filtered = metrics[(metrics['model'] == model) &
 47                                (metrics['seed'] == seed)]
 48             line = ax.plot(filtered['NFE'], filtered[column],
 49                     color=color)[0]
 50             line.set_label('_nolegend_')
 51         line.set_label({"gasp":"GASP","response":"RSM"}[model])

The call to zip packs up models and colors into an array of tuples. It would look like this if you were to create it explicitly:

[("gasp", "b"), ("response", "r")]

So on the first iteration through the for loop, model is "gasp" and color is "b". Then on the second iteration, model is "response" and color is "r".

Line 45 opens a for loop that iterates through all 50 seeds (range(50) if you’ll recall.)

Lines 46 and 47 use a few excellent Pandas features. This bit:

metrics["model"] == model

returns an array of booleans (True/False values) that is True wherever the condition (== model) holds. The indices of that array correspond to the rows of the metrics table.

(metrics['model'] == model) & (metrics['seed'] == seed)

is an array of booleans where both conditions are true.
Pandas lets you filter a table using an array of booleans, so

metrics[(metrics['model'] == model) & (metrics['seed'] == seed)]

is the subset of rows in the full table where the model and seed are as specified. Pause for a second and think about how you would do that if you didn’t have Pandas.
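For comparison, without Pandas you would end up writing something like the following loop for every filter (a rough sketch, assuming the rows had been read into a hypothetical list of plain dictionaries called rows):

# rows is a hypothetical list of dicts, one per line of metrics.txt
filtered = []
for row in rows:
    if row['model'] == model and row['seed'] == seed:
        filtered.append(row)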

Lines 48 and 49 then call the plotting routine itself. It puts NFE on the X axis and whichever metric you’re plotting on the Y axis, and it uses the color we specified. The plot method returns a list of the lines created (it can make more than one at a time, although we aren’t.) So the subscript [0] at the end means that we just assign to line the single line created by our call to plot, rather than a 1-element list containing that line.

Line 50 excludes every line from the legend, since we don’t want 100 items in the legend. (50 seeds times two models). Line 51 then selectively enables the legend for one line from each model (and it doesn’t matter which one because they all look the same.) Note that Line 51 is outside the loop that starts on Line 45, so it only gets executed twice.
This part:

{"gasp":"GASP","response":"RSM"}[model]

is probably just me getting excessively clever. I could have written this instead:

if model == "gasp":
    line.set_label("GASP")
else:
    line.set_label("RSM")

But I got lazy.

Wrapping Up

Lines 53 through 56 fix up the plotting limits, add a title and a legend, and write the figure out to a file.

 53     ax.set_xlim(0, 100000)
 54     limits = axis_limits.get(column, None)
 55     if limits:
 56         ax.set_ylim(limits[0], limits[1])
 57
 58     ax.legend(bbox_to_anchor=(1.0, 1.0))
 59     ax.set_title(titles.get(column, column))
 60     fig.savefig(filenames.get(column, column))

I already know that I ran 100,000 function evaluations, so I set the X-axis limits accordingly for each plot on Line 53. Line 54 checks to see if I wanted to set limits for the Y-axis for the column I’m plotting. I only want to do this for the operator probabilities, because I want to scale them all between 0 and 1. (See Lines 34 through 38.) This means that some of the columns don’t have limits specified. Ordinarily, asking a dictionary for the value associated with a key that’s not in the dictionary raises an exception. However, I’m using the get method with a default of None, so if I didn’t specify any limits in axis_limits, limits gets set to None. Line 55 tests whether that is the case, so Line 56 sets the Y-axis limits only if I actually specified limits when I declared axis_limits.
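To make the get behavior concrete, here is what the two styles of lookup do with the axis_limits dictionary defined on Lines 34 through 38:

axis_limits["SBX+PM"]                  # (0.0, 1.0)
axis_limits["Elapsed Time"]            # raises KeyError: key not in the dictionary
axis_limits.get("Elapsed Time", None)  # returns None instead of raising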

Line 58 makes a legend and puts it where I specify (bbox_to_anchor). Line 59 sets the title for the plot, and Line 60 writes it to a file. Note that the filenames specified on Lines 26 to 33 don’t include extensions (.svg or .png). Matplotlib decides which one to use based on whether we chose the SVG canvas or the Agg canvas on Lines 41 and 42.

The Result

If you run this script, these files should appear in your directory.

aei.svg  
archive.svg  
de.svg  
gd.svg  
pcx.svg  
popsize.svg  
sbx.svg  
spx.svg  
time.svg  
um.svg  
undx.svg

Here’s what sbx.svg looks like.

sbx

(WordPress doesn’t accept SVGs, so this is actually a PNG rasterization.)

Your Homework

If you’re in the Pat Reed group, please give this (and Part 1a) a try before our meeting on the 13th. If you have trouble, please post comments to the blog so I can clarify in public.

After all this, please proceed to part 2!

Python Data Analysis Part 1a: Borg Runtime Metrics Plots (Preparing the Data)

Introduction

Welcome to a series of posts discussing making Borg runtime metrics plots using Python, with Pandas and Matplotlib.  To set up Python, see this post.  The first post is broken into three parts.  This part sets up the data, whereas the next part sets up the libraries, and the final part puts it all together.  There is also a second companion post going through a “live demo” of how to do everything.  Also search “python” on the right side for other blog posts relating to using Python!

Generating Runtime Metrics Data

First, I’ll assume that you’ve collected metrics during your optimization runs.

Here’s what I had at the end of my main function (I had used Java MOEAFramework for optimization).

// write the runtime dynamics 
Accumulator accumulator = instrumenter.getLastAccumulator();

String[] metrics = {
        "NFE",
        "Elapsed Time",
        "Population Size",
        "Archive Size",
        "CV",
        "GenerationalDistance",
        "AdditiveEpsilonIndicator",
        "SBX+PM",
        "UM",
        "DifferentialEvolution+PM",
        "PCX",
        "SPX",
        "UNDX"
};

for(String metric : metrics) {
    System.out.print(metric);
    System.out.print("\t");
}
System.out.print("\n");
for(int ii=0; ii<accumulator.size("NFE"); ii++) {
    for(String metric : metrics) {
        System.out.print(accumulator.get(metric, ii));
        System.out.print("\t");
    }
    System.out.print("\n");
}

Resulting Data Files

Here’s a sample of what was inside the resulting data files. They’re tab-delimited text, and unfortunately our WordPress theme will only let you see the first few columns, but you get the idea.

NFE Elapsed Time Population Size Archive Size CV Generational Distance Additive Epsilon Indicator SBX+PM UM Differential Evolution+PM PCX SPX UNDX
3000 77.359 308 93 0.0 0.048 0.639 0.617 0.021 0.042 0.148 0.148 0.021
6111 151.339 900 244 0.011 0.023 0.589 0.543 0.005 0.021 0.059 0.365 0.005
9111 222.376 1216 275 0.0 0.017 0.572 0.544 0.009 0.009 0.127 0.303 0.004
12112 293.615 952 310 0.0 0.013 0.578 0.494 0.003 0.003 0.083 0.411 0.003

Merging the Data

For my study, I did 50 optimization runs for each of two versions of my model. (GASP and a response surface model, or RSM.) This means I don’t have one metrics file, I have 50. Here’s what my directory listing looks like.

metrics_gasp_0.txt   metrics_gasp_40.txt      metrics_response_26.txt
metrics_gasp_10.txt  metrics_gasp_41.txt      metrics_response_27.txt
metrics_gasp_11.txt  metrics_gasp_42.txt      metrics_response_28.txt
metrics_gasp_12.txt  metrics_gasp_43.txt      metrics_response_29.txt
metrics_gasp_13.txt  metrics_gasp_44.txt      metrics_response_2.txt
metrics_gasp_14.txt  metrics_gasp_45.txt      metrics_response_30.txt
metrics_gasp_15.txt  metrics_gasp_46.txt      metrics_response_31.txt
metrics_gasp_16.txt  metrics_gasp_47.txt      metrics_response_32.txt
metrics_gasp_17.txt  metrics_gasp_48.txt      metrics_response_33.txt
metrics_gasp_18.txt  metrics_gasp_49.txt      metrics_response_34.txt
metrics_gasp_19.txt  metrics_gasp_4.txt       metrics_response_35.txt
metrics_gasp_1.txt   metrics_gasp_5.txt       metrics_response_36.txt
metrics_gasp_20.txt  metrics_gasp_6.txt       metrics_response_37.txt
metrics_gasp_21.txt  metrics_gasp_7.txt       metrics_response_38.txt
metrics_gasp_22.txt  metrics_gasp_8.txt       metrics_response_39.txt
metrics_gasp_23.txt  metrics_gasp_9.txt       metrics_response_3.txt
metrics_gasp_24.txt  metrics_response_0.txt   metrics_response_40.txt
metrics_gasp_25.txt  metrics_response_10.txt  metrics_response_41.txt
metrics_gasp_26.txt  metrics_response_11.txt  metrics_response_42.txt
metrics_gasp_27.txt  metrics_response_12.txt  metrics_response_43.txt
metrics_gasp_28.txt  metrics_response_13.txt  metrics_response_44.txt
metrics_gasp_29.txt  metrics_response_14.txt  metrics_response_45.txt
metrics_gasp_2.txt   metrics_response_15.txt  metrics_response_46.txt
metrics_gasp_30.txt  metrics_response_16.txt  metrics_response_47.txt
metrics_gasp_31.txt  metrics_response_17.txt  metrics_response_48.txt
metrics_gasp_32.txt  metrics_response_18.txt  metrics_response_49.txt
metrics_gasp_33.txt  metrics_response_19.txt  metrics_response_4.txt
metrics_gasp_34.txt  metrics_response_1.txt   metrics_response_5.txt
metrics_gasp_35.txt  metrics_response_20.txt  metrics_response_6.txt
metrics_gasp_36.txt  metrics_response_21.txt  metrics_response_7.txt
metrics_gasp_37.txt  metrics_response_22.txt  metrics_response_8.txt
metrics_gasp_38.txt  metrics_response_23.txt  metrics_response_9.txt
metrics_gasp_39.txt  metrics_response_24.txt
metrics_gasp_3.txt   metrics_response_25.txt

Python Data-Merging Script

To plot everything, it’s most convenient for me if all the data are in a single file.  There are many ways to combine all 100 files together, and even the Unix shell-scripting version is pretty straightforward. But since this post is about data analysis in Python anyway, I’ll give you a Python version.

1   def append_data(accumulator, model, seed):
2       filename = "metrics_{0}_{1}.txt".format(model, seed)
3       with open(filename, 'rb') as metrics:
4           header = metrics.readline().strip()
5   
6           for line in metrics:
7               line = line.strip()
8               line = "{0}\t{1}\t{2}\n".format(
9                       model, seed, line)
10              accumulator.append(line)
11
12      return header
13
14  models = ("response", "gasp")
15  seeds = range(50)
16  accumulator = []
17  for model in models:
18      for seed in seeds:
19          header = append_data(accumulator, model, seed)
20
21  with open("metrics.txt", 'wb') as accumulated:
22      header = "{0}\t{1}\t{2}\n".format(
23                      "model", "seed", header)
24      accumulated.write(header)
25      for line in accumulator:
26          accumulated.write(line)

This is a bit of a throwaway script (consider it released under the MIT license. Go do what you like with it.) It treats the data in the individual files like text, rather than converting to floating-point numbers and back again. It gathers every line from the individual data files, prepending the model and seed number, then writes them all back out as one file.

Let’s walk through the details…

Merging Loop

14  models = ("response", "gasp")
15  seeds = range(50)
16  accumulator = []
17  for model in models:
18      for seed in seeds:
19          header = append_data(accumulator, model, seed)

The action starts on line 14. This bit takes a list called accumulator and stuffs every line of every metrics file into it, except for the headers. The append_data function returns the header separately, and since I’m being lazy, I assume the header is always the same and let it get overwritten each time. When the loop exits, header is actually just the header (i.e. first line) of the last file.

Now, I’m not assuming you know Python, so I’ll make a couple of notes about the syntax here.

  • Parentheses, like on line 14, make a tuple object. This is like a list, but generally a tuple is short and immutable, while a list (made with brackets [ ]) is any length and mutable. (A list in Python is a general-purpose ordered collection of data.)
  • Indentation is meaningful to the Python interpreter. Everything inside a for loop is at a deeper level of indentation, and the loop ends when a shallower level of indentation (or the end of the script) is reached. So lines 17-19 define a nested for loop that covers every seed for both models.

Function to append data

1   def append_data(accumulator, model, seed):
2       filename = "metrics_{0}_{1}.txt".format(model, seed)
3       with open(filename, 'rb') as metrics:
4           header = metrics.readline().strip()
5   
6           for line in metrics:
7               line = line.strip()
8               line = "{0}\t{1}\t{2}\n".format(
9                       model, seed, line)
10              accumulator.append(line)
11
12      return header

The append_data function appends to accumulator (a list) every line in the file identified by model and seed. Line 2 puts together a file name based on the model and seed number.

The with block starting on line 3 and ending on line 10 is a way of automatically releasing the metrics file when I’m done with it. The call to open gives us a file object I’m calling metrics. Because I used a with block, when the block ends, the file gets closed automatically. It’s an easy way to prevent your program from leaking open files.
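If you haven’t seen with blocks before, the one on line 3 is roughly equivalent to the following try/finally pattern, just with less typing:

metrics = open(filename, 'rb')
try:
    header = metrics.readline().strip()
    # ... process the remaining lines here ...
finally:
    metrics.close()  # runs even if an exception was raised above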

On Line 4, I read the first line of the metrics file, its header, which will get returned at the end of the function. That’s not the main thing this function does, though. Lines 6-10 loop through the remaining lines in the file, prepend model and seed number to them, then append them to the accumulator list.

Writing out the data

21  with open("metrics.txt", 'wb') as accumulated:
22      header = "{0}\t{1}\t{2}\n".format(
23                      "model", "seed", header)
24      accumulated.write(header)
25      for line in accumulator:
26          accumulated.write(line)

Here I use another with block to take care of the file I’m writing data out to. Lines 22-23 create the header by prepending “model” and “seed” to it, then Line 24 writes it to the output file. Lines 25 and 26 loop through the accumulator list and write each line out to the metrics file.

Comment

I want to repeat that, although the metrics files were full of numbers, I treated them like text here. All this merging script does is add some more text (model and seed number) to each line before writing the same text it read in right back out to a file. Once I get to plotting things, I’ll need to treat numbers like numbers, which is where pandas comes in.
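As an aside, once you are comfortable with pandas, the same merge can be done in a few lines by letting pandas parse the files for you. This is just a sketch, assuming the same metrics_<model>_<seed>.txt naming convention used above:

import pandas

frames = []
for model in ("response", "gasp"):
    for seed in range(50):
        frame = pandas.read_table("metrics_{0}_{1}.txt".format(model, seed))
        frame.insert(0, "seed", seed)    # prepend the seed column
        frame.insert(0, "model", model)  # prepend the model column
        frames.append(frame)

# Stack all 100 tables and write them back out as one tab-delimited file.
pandas.concat(frames).to_csv("metrics.txt", sep="\t", index=False)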

Resulting Data File

Here’s a sample of what the resulting data file (metrics.txt) looks like:

model seed NFE Elapsed Time Population Size Archive Size CV Generational Distance Additive Epsilon Indicator SBX + PM UM Differential Evolution + PM PCX SPX UNDX
response 0 3001 2.852 184 57 0.0 0.032 0.863 0.086 0.017 0.017 0.172 0.672 0.034
response 0 6001 5.163 368 123 0.0 0.020 0.785 0.223 0.008 0.016 0.074 0.669 0.008
response 0 9001 7.581 512 151 0.0 0.017 0.764 0.178 0.006 0.046 0.026 0.735 0.006
response 0 12002 9.947 512 133 0.0 0.017 0.758 0.148 0.007 0.037 0.074 0.725 0.007

Play Along at Home!

If you’re lucky enough to be a member of the Pat Reed group, I’m making my metrics files available on our group Dropbox, in Presentations/py_data_analysis/metrics.zip. I encourage you to type in my code, save it as accumulate.py (or whatever name strikes your fancy) in the directory where you put the metrics files, and then run it by typing python accumulate.py. If you don’t have Python on your machine, you can run it on the cluster if you first load the Python module: module load python/2.7.3

(This is part of your homework for Wednesday February 13!)

C++: Vectors vs. Arrays

In many of our programs we are using C-style arrays to store data. These are easy to write, but can be difficult to work with—they cannot be resized, they don’t know their own size, and most array operations require the use of pointers and memory management.

For many applications, you may want or need a data structure that is simpler to use. Check out the vector library in C++. The syntax for vectors may be new to you, but they make many operations a whole lot easier. A brief introduction to using the vector class is here.

Personally I like vectors because they make C++ operations more Matlab-esque, and they help prevent the memory management errors that are so common with C-style arrays. The downside, according to the C purists, is that vectors are slower than arrays. This can be true, depending on how you use them. But the vector class is really only an abstraction of the C-style array, so basic operations like indexed storage and retrieval are still completed in constant time. The main time vectors slow down is when they grow: if you append an item and the vector has run out of capacity, it allocates a larger array behind the scenes and copies every existing element over, which runs in O(N) time. Because the capacity grows geometrically, appending is still fast on average, but repeated reallocations inside a tight loop can really slow things down. (Source: the tutorial above, and here).

So: vectors can make your life a lot easier, and they are the same speed as arrays for indexing operations. Just be careful to make vectors reallocate as few times as possible, for example by calling reserve() or resize() up front when you know how many elements you will need, to preserve the efficiency of your program.

P.S. I’m sure there are dissenting opinions on the array/vector debate … please add your two cents.