September 18, 2023 by vyildiz65

How ChatGPT Helped Me To Convert a MATLAB Toolbox to Python and Learn Python Coding

This post is a guest post by Veysel Yildiz from the University of Sheffield. Veysel is a third-year PhD student in Civil Engineering working with Dr. Charles Rouge (website). Veysel’s research contributes to scientific and technical advances in hydropower plant design. If you are interested in contributing a guest post to this blog, please contact Lillian Lau (lbl59@cornell.edu).

Like many engineers with a graduate education, I learned MATLAB first before realising Python could be a better language for code I want to make freely available. But contrary to most, I had already written a 2500-line piece of software in MATLAB before realising other programming languages would be better. My software, the HYPER toolbox is the first to simultaneously optimise plant capacity and turbine design parameters for new run-of-river hydropower plants (for more information, please follow this link). In my PhD, I have been improving this toolbox to also evaluate the financial robustness of hydropower plant design in a changing world, i.e., is a plant still a sensible investment when climate and socio-economic conditions are different? This is a key question in countries such as Turkey, where I am from, that are getting drier. I have also been making my toolbox faster through algorithmic improvements, and want to release this fast version in a license-free language.

While I was proficient in MATLAB, my knowledge of Python was limited. Fortunately, with the help of ChatGPT, I was able to successfully navigate the transition from MATLAB to Python, both for my toolbox and for myself. Indeed, I was not only able to convert the code but also gain valuable insights into Python programming. In this blog post, I will share my ChatGPT experience, its benefits, and offer some practical tips for using ChatGPT effectively in Python coding.

The challenge of transitioning from MATLAB to Python

MATLAB is a sophisticated numerical computing environment known for its ease of use and extensive libraries for a wide range of scientific and engineering applications. Python, on the other hand, has grown in popularity due to its adaptability, open-source nature, and diverse set of libraries and frameworks. Given the changing environment of software development and user demands, it was vital to make the switch from MATLAB to Python. The main challenges I faced during this transition are:

Syntax Differences: MATLAB and Python have different syntax and conventions, making it challenging to convert code directly.
Lack of Proficiency: I was not proficient in Python, which added to the complexity of the task.
Code Size: With 2500 lines of code, the change could take several months.
Performance: Python can be slower than MATLAB for certain tasks such as matrix computation.

How ChatGPT accelerated the process

ChatGPT, is a flexible large language model that can answer queries, generate code samples, and explain concepts in a variety of programming languages, including Python.

Here’s how ChatGPT helped me through this process:

Code Conversion Guidance

I started asking for basic functions including how to load/read a dataset. Then, I asked for guidance on how to convert specific functions and syntax from MATLAB to Python. ChatGPT provided clear explanations and Python code examples, helping me understand the nuances of Python syntax. For example:

ChatGPT guided me through the conversion process step by step. It assisted me in identifying and addressing syntactic differences, data structure changes, and library replacements needed to make the code run in Python. The conversion was not a one-time event. It involved continuous iterations and refinements. After implementing ChatGPT’s suggestions for one part of the code, I would return with additional questions as I progressed to the next steps. This iterative approach allowed me to build a solid foundation in Python while converting the MATLAB code. Here is an example:

Debugging and Error Handling

During the conversion process, I encountered errors and unexpected behaviour in the Python code. Whenever it happened, I input the error messages to ChatGPT to identify the causes of the errors. ChatGPT described Python’s error messages and traceback information, allowing me to handle difficulties on my own. In other words, it assisted in debugging by offering tricks for error detection and correction. Here is an example:

ValueError: Unable to parse string “EKİM” at position 0

Code Optimization

When I successfully create a script that runs, I often ask ChatGPT to suggest methods to optimize the Python code for performance and readability. Here is an example:

Learning Python Concepts

As I progressed in converting the code with the assistance of ChatGPT, I discovered that the journey was not just about code translation; it was an invaluable opportunity to learn essential Python concepts and best practices. ChatGPT was quite beneficial in taking me through this learning process, providing explanations on Python’s data types, control structures, and how to use libraries efficiently.

A Word of Caution

ChatGPT is a powerful AI language model, but it, like any other, has limits. It might sometimes offer inaccurate or unsatisfactory code suggestions or descriptions. As a user, you should take caution and systematically validate the information given by ChatGPT. Here is an example:

Several Key Takeaway Messages of the Journey

Use clear instructions: When interacting with ChatGPT, provide clear and concise instructions. Describe the problem you’re trying to solve, the desired outcome, and any constraints or requirements. The more specific you are, the better the model’s response is likely to be.
Understand its limitations: ChatGPT may not always produce perfect or optimal solutions. It is important to evaluate and verify the code created to ensure it meets your requirements.
Review and validate generated code: Carefully review the code generated by ChatGPT. Check for accuracy, readability, and efficiency. Debug any issues that arise, and test the code with different inputs to ensure it works as expected.
Iterative refinement: If the initial code generated by ChatGPT doesn’t meet your requirements, provide feedback and iterate. Ask for specific improvements or clarifications, and guide the model towards a better solution.

Conclusion

Through the guidance and support of ChatGPT, I successfully converted 2500 lines of MATLAB code to Python in 10 days. This accomplishment was a significant milestone in my programming journey. Along the way, I gained proficiency in Python, a skill that opened up new opportunities in data science, and more. This experience demonstrates the power of AI-driven tools in supporting programmers with challenging tasks while supporting continuous learning.

If you find yourself in a similar transition or coding challenge, try using AI-powered language models like ChatGPT to assist facilitate your journey and enrich your programming skills. It is not just about solving issues, but also about learning and growing as a programmer in a dynamic technological environment.

ChatGPT is a powerful tool to assist you with coding tasks, but it does not replace developing your own coding skills and actively engage in the learning process.

September 13, 2023 by raffaelegiuseppecestari

Legacy Code Reborn: Wrapping C++ into MATLAB Simulink

Did you ever have the need to include legacy code in your own projects without rephrasing it into your new framework? In Matlab Simulink, there is a way to include C++ code within your system. This post will teach you how to fit legacy C++ code into your Simulink projects.

Let’s assume you have been given a C++ code in which neural networks are implemented and already trained (as an example, but this guide extends to any generic class/function). Let’s also assume that your goal is to include them within a larger control system. Recreating them from scratch and trying to translate the original code into Matlab could be an unwise choice, as the implementation error could be around the corner. The best solution is to include the neural networks directly as they are, wrapping them in a nice box. How to proceed?

Requirements
To fully appreciate the following post, you should have a basic understanding of object-oriented programming, C++, and Matlab Simulink.

Disclaimer
The one I will show you differs from the fastest solution you could think about. Indeed, there are easier ways to wrap your C++ code if it fulfills certain limitations declared in the C Function Block documentation. Still, this solution is the most complete you can think of! And always works.
Here below I recall the Matlab Simulink limitations from the official documentation.
The C Function block cannot be used to directly access all C++ classes. These types of classes are not supported:

Template classes
Standard Template Library (STL) containers
Classes with private constructors and destructors

If your code falls in one of the above cases, you should be interested in reading this post; otherwise, you can think about following the more straightforward way.

1. Installing the C++ compiler
To execute C++ code, you need a C++ compiler. If you already have one, skip this section. Otherwise, you have two possibilities, in both the cases, you will have a functioning C++ compiler on your device:

Installing the standard open-source C++ compiler.
In this case, I recommend MINGW (even if it is pretty old). You can install it simply by following the software installation guidelines.
Installing MINGW as Matlab adds-on.
Another (even easier) possibility is installing MINGW directly from Matlab as a standard Matlab toolbox.

To let Matlab know that you will need to compile C++ code, you need to specify (each time you execute the code) the following command in Matlab terminal:

mex -setup C++

This, if the compiler has been correctly installed (in my case through Microsoft Visual Studio 2022), should prompt out:

MEX configured to use 'Microsoft Visual C++ 2022' for C++ language compilation.

or, if you installed through Matlab adds-on:

MEX configured to use 'MinGW64 Compiler (C++)' for C++ language compilation.

2. The C Function Block

Here is where things start to be more challenging.
The starting point is understanding which Simulink block we are going to use to wrap around our C++ executed code. In this case, it is represented by the C Function Block.

The block has the following structure:

Output code: here is where you will generate your output. The following is a candidate example.
output = predict(myNN,input1,input2);
Here we invoke the predict method passing as arguments two possible inputs and returning the output (our hopefully prediction). The special guest is myNN. This is the key of the solution to the problem and represents an instance of the class of the object we are going to wrap.
Start code: here, you instantiate the myNN object. It is the equivalent of the constructor in the C++ class.
myNN = createNN();
Terminate code: here is where you destroy your object (equivalent to the destructor).
deleteNN(myNN);
Symbols: here, you declare your input, output and myNN object. You can specify names (input1,input2, myNN) and data types (int32,double…). Remark: it is fundamental that the type of the myNN object is VoidPointer. This depends on the fact that we are building a wrapper.

To conclude the inspection of the C Function block, we must look at the settings. Here we must specify all the source files (.cpp) and header files (.h) inside the corresponding sections. The files must be in the same folder of the Simulink file, unless you specify the directory in the corresponding field.

An important remark: the settings are common for all the C Function blocks you will have inside the project. Thus, you can’t structure your application thinking that each C Function block can have separate and independent source files. For this reason, even if you need to instantiate multiple classes of different natures inside your scheme, in the end, you will need one unique wrapper (with corresponding .cpp and .h files).

3. Writing your C++ Wrapper

After exploring the C Function block, we are ready to start writing our wrapper. Let’s assume you have legacy code that implements your neural network in source/header files. What you must do is:

Write your NN class, whose methods will be invoked by the wrapper.
Write the wrapper.

3a. Writing the class

We will call the files uniqueNN.cpp/uniqueNN.h.

#include "uniqueNN.h"
#include "ann.h"
// include here all additional dependences

myNN::myNN() {
}

myNN::~myNN() {
    delete ANN;
}

double myNN::predict(double input1, double input2) {
// copy-paste here your legacy code
}

In this class, you can see constructor, destructor, and predict method. Inside that method, you would copy-paste your legacy code. In there, you can do whatever you want. Remark: you don’t need to fulfill the limitations of the C Function Block because this class will be invoked by a “protective layer” represented by our wrapper!

Suppose your system has multiple components, and a different network with a different structure represents each. You can still use the very same class and implement a corresponding method for each of your original networks to produce the corresponding network’s prediction.

Clearly, you must specify all your needed files for executing your legacy code in source and header files.

3b. Implementing the wrapper

#include "uniqueNN.h"
// include here all additional standard libraries 

// Neural Networks

void* createNN()
{
    return new myNN();
}

void deleteNN(void* obj)
{
    delete static_cast<myNN*>(obj);
}

double predict(void* obj, double input1, double input2)
{
    myNN* nn = static_cast<myNN*>(obj);
    return nn-> predict(input1, input2);
}

Here we have three methods. These are the ones we have seen in the C Function Block! They will be invoked by Simulink.

As you can see, we find again the constructor, the destructor, and the predict method. For the sake of conciseness, I suggest using the same name for the method implemented in the class and the method implemented in the wrapper, but it is not mandatory. Notice that you must pass as an argument for the predict method in the wrapper the void pointer (your class object you are instantiating, that is the one you declared in the C Function Block!).

Remark: as you might have imagined, for each method implemented in the class (that in our example represented one different legacy neural network), you should create one corresponding method in the wrapper.

A final remark: the wrapper needs only the header of your class (for us uniqueNN.h) (not all the ones from legacy code, so do not put them because it is redundant!)

Here below the quality of the reconstruction of the legacy code through the C++ wrapper. The small differences that you might experience are due to internal data type differences between the legacy code simulator and the new one, absolutely negligible.

The figure above shows that the predictions from the neural network are nearly identical for both the wrapper-based simulink and original C++ implementations.

That’s it! Thank you for reading and if you have further questions, please feel free to contact me, here is my LinkedIn profile!

November 30, 2022 by Lillian Lau

The Building Blocks of Time Series Analysis (plus a demo!)

Recently I’ve been learning to view the world in frequencies and amplitudes instead of time and magnitudes, also known as Fourier-domain time series analysis. It’s become a rather popular method in the field of water supply planning, with applications in forecasting demand and water quality control. While this post will not delve into advanced applications, we will walk through some fundamental time series concepts to ease you into the Fourier Domain. I will then demonstrate how time series analysis can be used to identify long- and short-term trends in inflow data.

Important frequencies

When reviewing time series analysis, there are three fundamental frequencies to remember:

The sampling frequency, $f_{s}$ is the frequency at which data for a time series is collected. It is equivalent to the inverse of the time difference between two consecutive data points, $\frac{1}{\delta t}$ . It has units of Hertz ( $Hz$ ). Higher values of $f_{s}$ indicate a higher number of data points recorded per second, which means better data resolution.
The Nyquist frequency, $f_{Ny}$ is equal to $\frac{1}{2} f_{s}$ . As a rule of thumb, you should sample your time series at a frequency $f < f_{Ny}$ , as sampling at a higher frequencies will result in the time series signals overlapping. This is a phenomenon called aliasing, where the sampled points do not sufficiently represent the input signal, resulting in higher frequencies being incorrectly observed at lower frequencies (Ellis, 2012)
The cutoff frequency, $f_{cut}$ . The value of $f_{cut}$ defines the signal frequency that passes through or is stopped by a filter. This is a nice segue into the next section:

Filtering

Filtering does exactly what its name implies it does. It allows certain signal frequencies to pass through, filtering out the rest. There are three main types of filters:

Low-pass filter: A low-pass filter eliminates all frequencies above $f_{cut}$ , allowing lower frequencies to pass through. It behaves like a smoothing operator and works well for signal noise reduction.
High-pass filter: Unlike the low-pass filter, the high-pass filter eliminates all frequencies below $f_{cut}$ . This is a useful filter when you have discontinuous data, as high-pass filtering a signal maintains its discontinuities and does not artificially smooth it over.
Band-pass filter: This filter has both the properties of a low- and high-pass filter. It eliminates all frequencies that do not lie within a range of frequencies.

Windowing

The core idea behind windowing is that is is possible to obtain more detailed insight into your signal’s properties by dividing it into ‘windows’ of time. Windowing is useful to reduce artifacts in a signal caused by periodic extension, and the distinction between signal and noise becomes unclear. For particularly noisy signals, windowing can also be useful to amplify the signal-to-noise ratio, thereby making it easier to distinguish between the actual signal and white noise. This is possible as the large ‘main lobe’ of the window (both in the time and frequency domain), amplifies the lower-frequency, higher-amplitude signal, but dampens the higher-frequency, lower-amplitude noise.

Spectrograms

There is a key issue when analyzing a time series in the Fourier domain, and vice versa – there is no way (as of right now) to view when or where a specific frequency occurs, or which frequencies dominate at any specific time. A spectrogram (or the Short-Time Fourier Transform, STFT) of a signal address this issue by consolidating three types of information into one visualization:

The frequencies inherent to the signal
How the frequencies change over time
The relative strength of the frequencies (the frequency content or “power” of the frequency).

It allows us to view changes in the signal across both time and frequency, overcoming the limitation of only being applicable to stationary time series. However, it still applies a fixed window size, as we will see in a little bit, and cannot fully capture all the different frequencies that characterize a signal. Here, using wavelets is useful. Please refer to Rohini’s post on using wavelets on ENSO El Niño data here if you’re interested! On that note, let’s see how we can bring it all in with a quick demonstration!

Demonstrating basic time series analysis methods on an inflow timeseries

This demonstration will use MATLAB, and we begin by first loading and plotting a time series of weekly inflow data. Feel free to replace the inflow file with any time series you have on hand!

%%
% Read in and plot inflows of Jordan Lake in the Research Triangle

inflows_jl = readmatrix('jordan_lake_inflows.csv');
inflows_jl = inflows_jl(1,:);

% define sampling rate, Nyquist frequency, number of samples, sampling period, and frequency series
fs_inflow = 1/(7*24*3600);  
N = length(inflows_jl);
delta_t = 1/fs_inflow;
freqseries_inflow = (1/(N*delta_t))*(0:(N-1));
fNy = fs_inflow/2;

%% Plot the inflow time series
figure(1)
plot(inflows_jl, 'Color','#104E8B');
grid on
xlabel('Time (weeks)');
ylabel('Inflow (MGD)');
xlim([0 5200]); ylim([0 15e4]);
title('Jordan Lake inflow timeseries');

The output figure should look similar to this:

Figure 2: Jordan Lake weekly inflow time series over 95 years.

Next, we Fourier Transform the inflow time series using the MATLAB fft() function, then convert it’s amplitude to decibels by first applying the abs() function, and then the mag2db() function to the output of the Fourier Transform.

%% Perform and plot the FFT of inflow
fft_jl = abs(fft(inflows_jl));

figure(2)
loglog(freqseries_inflow, mag2db(fft_jl), '-ko', 'Color','#104E8B', 'MarkerSize',2);
grid on
xlabel('Frequency (Hz)');
ylabel('Amplitude (dB)');
title("FFT of Jordan Lake");

This returns the following plot:

Figure 3: The inflow time series in the Fourier domain.

Note the large spikes in Figure 3. Those indicate frequencies that result in large inflow events. In this blog post, we will focus on the two leftmost spikes: the four-year cycle and the annual cycle, shown below. These cycles were identified by first converting their frequencies into seconds, and then into weeks and years.

Figure 4: The inflow time series in the Fourier domain with the 4-year and annual cycle indicated.

We also want to confirm that these frequencies characterize the time series. To verify this, we apply a low-pass Butterworth filter:

%% Number of weeks in a year
num_weeks = 52.1429;

% Record the low-pass cutoff frequencies 
fcut1_jl = 8.40951e-9;
years1_jl = conv2weeks(1/fcut1_jl)/num_weeks;  

fcut2_jl = 3.16974e-8;
years2_jl = conv2weeks(1/fcut2_jl)/num_weeks;

% Set up the filters and apply to the inflow time series
[b1_fl, a1_fl] = butter(6,fcut1_fl/(fs_inflow/2));
inflows_filtered_fl1 = filter(b1_fl, a1_fl, inflows_jl);

[b2_fl, a2_fl] = butter(6,fcut2_fl/(fs_inflow/2));
inflows_filtered_fl2 = filter(b2_fl, a2_fl, inflows_jl);

%% Plot the filtered inflow time series 
figure(3)
plot(inflows_jl, 'Color','#104E8B',  'DisplayName', 'Original', 'LineWidth',2);
xlim([0 5200]);
ylim([0 15e4]);
xlabel('Time (Years)');
ylabel('Inflows (MGD)');
title('Jordan Lake 4-year cycle')
legend();
hold on
plot(inflows_filtered_jl1,'Color','#D88037', 'MarkerSize',2, 'DisplayName', '4-year cycle', 'LineWidth',2);
hold off
ticklocations = 0:520:5112;
xticks(ticklocations);
ticklabels = string(ticklocations);
ticklabels(mod(ticklocations, 520) ~= 0) = "";
xticklabels({'0', '10', '20', '30', '40', '50', '60', '70', '80', '90'});

% Plot their Fourier counterparts
fft_win1_jl = abs(fft(inflows_filtered_jl1));

figure(4)
loglog(freq_series(1:length(fft_jl)),fft_jl, 'Color','#104E8B', 'DisplayName','Original');
legend({'Original','4-year cycle'},'Location','southeast');
title("Low-pass filter (f_{cut}=", string(fcut1_jl));
xlabel("Frequency (Hz)");
ylabel("Amplitude");
hold on
loglog(freq_series(1:length(fft_win1_jl)),fft_win1_jl, 'Color','#D88037','DisplayName','4-year cycle');
hold off

Replacing inflows_filtered_jl1 and fft_win1_jl (the 4-year cycle) with that of the annual cycle will result in Figure 5 below.

Figure 5: The filtered inflow time series in both the time and Fourier domains.

As previously mentioned, the low-pass filter effectively attenuated all frequencies higher than their respective $f_{cut}$ values. These frequencies identified in Figures 3 and 4 can also be said to characterize this inflow time series as the peaks of their filtered counterparts align with that of the original time series.

Next, we construct a spectrogram for this inflow time series. Here, we will use the Blackman-Harris window, which is a window that has a relatively wide main lobe and small side lobes. This enables it to amplify frequencies that more significantly characterize inflow, while suppressing less-significant ones. This window can be generated using the MATLAB blackmanharris() function, as shown below:

%% Define window size
Noverlap1_jl = N1_jl/2;   % 4-year cycle
Noverlap2_jl = N2_jl/2;   % Annual cycle

% Generate window
win1_jl = blackmanharris(N1_jl, 'periodic');  % 4-year cycle
win2_jl = blackmanharris(N2_jl, 'periodic');  % Annual cycle

Finally, we are ready to apply the spectrogram() function to generate the spectrogram for this inflow time series! In the code below, we set up the necessary values and generate the figure:

%% Define overlap size
Noverlap1_jl = N1_jl/2;  % 4-year cycle
Noverlap2_jl = N2_jl/2;  % Annual cycle

%% Plot inflow spectrogram for annual sampling (Jordan Lake)
spectrogram(inflows_jl, win1_jl, Noverlap1_jl, N1_jl, fs_inflow, 'yaxis');
colormap(flipud(parula));  
clim([100,150]);
title('Jordan Lake 4-year cycle (Blackman-Harris)');

By replacing the values associated with the 4-year cycle with that of the annual cycle in the above code, the following figures can be generated:

Figure 6: Spectrograms generated using the 4-year and annual window respectively.

In Figure 6, the darker blue indicates more powerful fluctuations (high-inflow events). However, their low frequency implies that, in this time series, such event are fairly rare. On the other hand, a lighter blue to yellow indicates inflow events of higher frequencies, but are less powerful. While they occur more frequently, they do not influence the signals in the time series significantly. We can also see from these two figures, that, in the long- and mid-term, both low and high frequency events become less powerful, implying a decrease in inflow, even more rare rainfall events that historically result in large inflows.

Conclusion

In this post, I first introduced a few basic concepts and methods related to time series analysis and the Fourier Domain. Next, we applied these concepts to the simple analysis of an inflow time series using MATLAB. Nevertheless, this post and its contents barely scratch the surface of Fourier Transforms and time series analysis as a whole. I hope this gave you a flavor of what these tools enable you to do, and hope that it piqued your interest into exploring this topic further!

June 6, 2022 by Rohini Gupta

Introduction to Drought Metrics

In this blog post, we’re going to discuss meteorological and hydrological drought metrics. Droughts are an important but difficult hazard to characterize. Whereas hurricanes or tornadoes occur suddenly and persist for a short and clearly defined interval, a drought is usually thought of as a “creeping disaster”, in which it is often difficult to pinpoint exactly when a drought begins and when it ends. Droughts can last for decades (termed a megadrought) where dryness is chronic but still can be interspersed with brief wet periods.

Droughts are expected to become more prominent globally in the upcoming decades due to the warming of the climate and even now, many locations in the Western US are experiencing droughts that exceed or rival some of the driest periods in the last millennium (California and the Southwest US). Thus, substantial literature is focused on investigating drought through observed gauge data, reconstructions using paleodata, and projections of precipitation and streamflow from climate models to better understand what future droughts may look like. However, many studies differ in their characterization of drought risk. Different GCMs can create contrasting representations of drought risk which often is caused by differences in the underlying representations of future warming and precipitation. Further, it is important to consider that a drought does not have a universal definition. It manifests due to the complex interactions across the atmosphere, land and hydrology, and the human system [1]. To account for this complexity, different definitions and metrics for droughts have been developed to account for single or combinations of these sectors (meteorological, hydrological, agricultural, etc.). Though a single definition has not been agreed upon, it has been widely suggested that using any one measure of drought can be limiting. For example, Swann (2018) suggests that ignoring plant response in any metric will overestimate drought. Furthermore, Hao et al. (2013) explains that a meteorological (precipitation) drought may not lead to an agricultural (soil moisture) drought in tropical areas where there is high soil moisture content. In this blog post, we’re going to define and discuss some of these metrics that are used to represent droughts to aid in deciding what metrics are appropriate for your case study.

SPI and SSI

The Standardized Precipitation Index (SPI) and Standardized Streamflow Index (SSI) are the most common means of characterizing a meteorological or hydrologic drought respectively. These metrics are highly popular because they only require a monthly time series (ideally of length > 30 years) of either precipitation or streamflow. These metrics quantify how much a specific month’s precipitation or streamflow deviates from the long-term monthly average. The steps to calculate SPI are as follows (SSI is the same):

Determine a moving window or averaging period. This is usually a value from the following: 3, 6, 12, or 24 months. Any moving window from 1-6 months is better for capturing meteorological and soil moisture conditions whereas 6-24 month averaging windows will better represent droughts affecting streamflow, reservoir storage, and groundwater. If calculating a 6-month SPI, the way to factor in the windows is as follows: the aggregate precipitation attributed to month j is accumulated over month j-5 to j [4]. Repeat this procedure for the whole time series.
Next an appropriate probability density function is fit to the accumulated precipitation. Generally a gamma or Pearson Type III distribution works well for precipitation data, but be sure to check for goodness of fit.
The associated cumulative probability distribution from Step 2 is then estimated and subsequently transformed to a normal distribution. Now each data point, j, can be worked through this framework to ultimately calculate a precipitation deviation for a normally distributed probability density with a mean of zero and standard deviation of 1. Because the SPI is normalized, it means that you can use this metric to characterize both dry and wet periods.

The range that the SPI values can take are summarized below [5]:

SPI	Classification
> 2.00	Extremely Wet
1.50 to 1.99	Very Wet
1.00 to 1.49	Moderately Wet
0.00 to 0.99	Near Normal
0 to -0.99	Mild Drought
-1.00 to -1.49	Moderate Drought
-1.50 to -1.99	Severe Drought

SPI Values and Classifications

From the SPI, you can define various drought metrics of interest. We’ll use an SPI of <= -1 to define the consequential drought threshold.

Percentage of Time in Drought: How many months in the time series have an SPI less than -1 divided by the total months in the time period in question?

Drought Duration: What is the number of consecutive months with an SPI below -1?

Drought Intensity or Peak: What is the minimum value of a drought’s period’s SPI?

Drought Recovery: How many months does it take to get from the peak of the drought to an SPI above -1?

There are lots of packages that exist to calculate SPI and SSI and it’s not difficult to code up these metrics from scratch either. Today, I’m going to demonstrate one of the easiest toolboxes to use for MATLAB, the Standardized Drought Analysis Toolbox (SDAT). We’ll use the provided example time series, precip.txt, which is 500 months long. I specify that I want to utilize a window of 6 months (line 50). Running the script creates the SPI index and plots it for you.

Let’s investigate these results further. I’ll change the window to 12 months and plot the results side by side. In the plotting window, I isolate a piece of the time series that is of length 25 months. The instances of drought are denoted by the black squares and the peak of each drought is denoted by the circle. We can then calculate some of those metrics defined above.

SPI results across different averaging windows for a portion of the SPI index

Metric	6 Month SPI	12-month SPI
# of Drought Events	2	1
Duration	1,2 months	2 months
Recovery Time	1 month	1 month
Intensity	-1.78	-1.36

Metrics derived from SPI time series created from 2 different windows

Note how different the SPI metric and perception of drought is depending on your chosen window. The window will greatly influence your perception of the number of droughts experienced and their intensity which is worth addressing. As much as SPI and SSI are easy metrics to calculate, that simplicity comes with limitations. Because these metrics only use the precipitation or streamflow time series, there is no consideration of evapotranspiration which means that any information about how much warming plays a role in these drought conditions is neglected, which can be a large piece of the puzzle. These metrics tend to be very sensitive to the length of the record as well [6].

SPEI

Given the limitations of SPI, the Standardized Precipitation Evapotranspiration Index (SPEI) was proposed by Vicente-Serrano et al. (2010) that is based on both temperature and precipitation data. Mathematically, the SPEI is calculated similarly to the SPI, thought instead of using precipitation only, the aggregated value to generate the index on is now precipitation minus PET. PET can be calculated using a more physically-based equation such as the Penman-Montieth equation or the simpler Thornthwaite equation (which only uses air temperature to derive PET). After the aggregate data is created, one can simply outlined in the SPI section. Vicente-Serrano et al. (2010) compare SPI and SPEI in Indore, India and demonstrate that only SPEI can identify an increase in drought severity associated with higher water demand as a result of evapotranspiration. The key limitation of SPEI is mostly based on the fact that it requires a complete temperature and precipitation dataset which may limit its use due to insufficient data being available.

MSDI

Another metric to be aware of is the Multivariate Standardized Drought Index (MSDI) from Hao et al. (2013). This metric probabilistically combines SPI and SSI for a more comprehensive drought characterization that covers both meteorological and agricultural drought conditions. The SPI and SSI metrics are determined as usual and then a copula is fit to derive a joint probability distribution across the two indices. The MSDI index is ultimately found by pushing the joint probabilities into an inverse normal distribution, thus placing it in the same space as the original SPI and SSI. The authors demonstrate that (1) the difference between SPI and SSI decreases as drought duration increases and (2) the MDSI can capture the evolution of both meteorological and hydrological droughts and particularly show that the onset of droughts is dominated by the SPI index and drought persistence is dominated more by the SSI index.

Decadal Drought

If you are interested more in capturing drought at longer time scales, such as decadal to multi-decadal, a similar style of metric is proposed in Ault et al. (2014). The authors define a decadal drought, such as the 1930s dustbowl or 1950s drought in the Southwest as a precipitation value that is more than ½ sigma below the 11-year running mean. A multi-decadal drought is characterized as a precipitation value that is more than ½ sigma below a 35-year running mean. This definition is somewhat arbitrary and more of a worse-case planning scenario because it corresponds to drought events that are worse than what has been reconstructed over the past millennium.

I hope this post is useful to you! If you use other drought metrics that are not listed here or have certain preferences for metrics based on experiences, please feel free to comment below!

References

[1] Touma, D., Ashfaq, M., Nayak, M. A., Kao, S. C., & Diffenbaugh, N. S. (2015). A multi-model and multi-index evaluation of drought characteristics in the 21st century. Journal of Hydrology, 526, 196-207.

[2] Swann, A. L. (2018). Plants and drought in a changing climate. Current Climate Change Reports, 4(2), 192-201.

[3] Hao, Z., & AghaKouchak, A. (2013). Multivariate standardized drought index: a parametric multi-index model. Advances in Water Resources, 57, 12-18.

[4] Guenang, G. M., & Kamga, F. M. (2014). Computation of the standardized precipitation index (SPI) and its use to assess drought occurrences in Cameroon over recent decades. Journal of Applied Meteorology and Climatology, 53(10), 2310-2324.

[5] McKee, T. B., Doesken, N. J., & Kleist, J. (1993, January). The relationship of drought frequency and duration to time scales. In Proceedings of the 8th Conference on Applied Climatology (Vol. 17, No. 22, pp. 179-183).

[6] https://climatedataguide.ucar.edu/climate-data/standardized-precipitation-index-spi

[7] Vicente-Serrano S.M., Santiago Beguería, Juan I. López-Moreno, (2010) A Multi-scalar drought index sensitive to global warming: The Standardized Precipitation Evapotranspiration Index – SPEI. Journal of Climate 23: 1696-1718.

[8] Ault, T. R., Cole, J. E., Overpeck, J. T., Pederson, G. T., & Meko, D. M. (2014). Assessing the risk of persistent drought using climate model simulations and paleoclimate data. Journal of Climate, 27(20), 7529-7549.

February 15, 2021 by Lillian Lau

MORDM Basics I: Synthetic Streamflow Generation

In this post, we will break down the key concepts underlying synthetic streamflow generation, and how it fits within the Many Objective Robust Decision Making (MORDM) framework (Kasprzyk, Nataraj et. al, 2012). This post is the first in a series on MORDM which will begin here: with generating and validating the data used in the framework. To provide some context as to what we are about to attempt, please refer to this post by Jon Herman.

What is synthetic streamflow generation?

Synthetic streamflow generation is a non-parametric, direct statistical approach used to generate synthetic streamflow timeseries from a reasonably long historical record. It is used when there is a need to diversify extreme event scenarios, such as flood and drought, or when we want to generate flows to reflect a shift in the hydrologic regime due to climate change. It is favored as it relies on a re-sampling of the historical record, preserving temporal correlation up to a certain degree, and results in a more realistic synthetic dataset. However, its dependence on a historical record also implies that this approach requires a relatively long historical inflow data. Jon Lamontagne’s post goes into further detail regarding this approach.

Why synthetic streamflow generation?

An important step in the MORDM framework is scenario discovery, which requires multiple realistic scenarios to predict future states of the world (Kasprzyk et. al., 2012). Depending solely on the historical dataset is insufficient; we need to generate multiple realizations of realistic synthetic scenarios to facilitate a comprehensive scenario discovery process. As an approach that uses a long historical record to generate synthetic data that has been found to preserve seasonal and annual correlation (Kirsch et. al., 2013; Herman et. al., 2016), this method provides us with a way to:

Fully utilize a large historical dataset
Stochastically generate multiple synthetic datasets while preserving temporal correlation
Explore many alternative climate scenarios by changing the mean and the spread of the synthetic datasets

The basics of synthetic streamflow generation in action

To better illustrate the inner workings of synthetic streamflow generation, it is helpful to use a test case. In this post, the historical dataset is obtained from the Research Triangle Region in North Carolina. The Research Triangle region consists of four main utilities: Raleigh, Durham, Cary and the Orange County Water and Sewer Authority (OWASA). These utilities are receive their water supplies from four water sources: the Little River Reservoir, Lake Wheeler, Lake Benson, and the Jordan Lake (Figure 1), and historical streamflow data is obtained from ten different stream gauges located at each of these water sources. For the purpose of this example, we will be using 81 years’ worth of weekly streamflow data available here.

Figure 1: The Research Triangle region (Trindade et. al., 2019).

The statistical approach that drive synthetic streamflow generation is called the Kirsch Method (Kirsch et. al., 2013). In plain language, this method does the following:

Converts the historical streamflows from real space to log space, and then standardize the log-space data.
Bootstrap the log-space historical matrix to obtain an uncorrelated matrix of historical data.
Obtain the correlation matrix of the historical dataset by performing Cholesky decomposition.
Impose the historical correlation matrix upon the uncorrelated matrix obtained in (2) to generate a standardized synthetic dataset. This preserves seasonal correlation.
De-standardize the synthetic data, and transform it back into real space.
Repeat steps (1) to (5) with a historical dataset that is shifted forward by 6 months (26 weeks). This preserves year-to-year correlation.

This post by Julie Quinn delves deeper into the Kirsch Method’s theoretical steps. The function that executes these steps can be found in the stress_dynamic.m Matlab file, which in turn is executed by the wsc_main_rate.m file by setting the input variable p = 0 as shown on Line 27. Both these files are available on GitHub here.

However, this is simply where things get interesting. Prior to this, steps (1) to (6) would have simply generated a synthetic dataset based on only historical statistical characteristics as validated here in Julie’s second blog post on a similar topic. Out of the three motivations for using synthetic streamflow generation, the third one (exploration of multiple scenarios) has yet to be satisfied. This is a nice segue into out next topic:

Generating multiple scenarios using synthetic streamflow generation

The true power of synthetic streamflow generation lies in its ability to generate multiple climate (or in this case, streamflow) scenarios. This is done in stress_dynamic.m using three variables:

Input variable	Data type
p	The lowest x% of streamflows
n	A vector where each element n_i is the number of copies of the p-lowest streamflow years to be added to the bootstrapped historical dataset.
m	A vector where each element m_i is the number of copies of the (1-p)-highest streamflow years to be added to the bootstrapped historical dataset.

Table 1: The input variables to the stress_dynamic function.

These three variables bootstrap (increase the length of) the historical record while allow us to perturb the historical streamflow record streamflows to reflect an increase in frequency or severity of extreme events such as floods and droughts using the following equation:

new_hist_years = old_historical_years + [(p*old_historical_years)*n_i ] + (old_hist_years – [(p*old_historical_years)m_i])

The stress_dynamic.m file contains more explanation regarding this step.

This begs the question: how do we choose the value of p? This brings us to the topic of the standardized streamflow indicator (SSI₆).

The SSI₆ is the 6-month moving average of the standardized streamflows to determine the occurrence and severity of drought on the basis of duration and frequency (Herman et. al., 2016). Put simply, this method determines the occurrence of drought if the the value of the SSI₆ < 0 continuously for at least 3 months, and SSI₆ < -1 at least once during the 6-month interval. The periods and severity (or lack thereof) of drought can then be observed, enabling the decision on the length of both the n and m vectors (which correspond to the number of perturbation periods, or climate event periods). We will not go into further detail regarding this method, but there are two important points to be made:

The SSI₆ enables the determination of the frequency (likelihood) and severity of drought events in synthetic streamflow generation through the values contained in p, n and m.
This approach can be used to generate flood events by exchanging the values between the n and m vectors.

A good example of point (2) is done in this test case, in which more-frequent and more-severe floods was simulated by ensuring that most of the values in m where larger than those of n. Please refer to Jon Herman’s 2016 paper titled ‘Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments’ for further detail.

A brief conceptual letup

Now we have shown how synthetic streamflow generation satisfies all three factors motivating its use. We should have three output folders:

synthetic-data-stat: contains the synthetic streamflows based on the unperturbed historical dataset
synthetic-data-dyn: contains the synthetic streamflows based on the perturbed historical dataset

Comparing these two datasets, we can compare how increasing the likelihood and severity of floods has affected the resulting synthetic data.

Validation

To exhaustively compare the statistical characteristics of the synthetic streamflow data, we will perform two forms of validation: visual and statistical. This method of validation is based on Julie’s post here.

Visual validation

Done by generating flow duration curves (FDCs) . Figure 2 below compares the unperturbed (left) and perturbed (right) synthetic datasets.

Figure 2: (Above) The FDC of the unperturbed historical dataset (pink) and its resulting synthetic dataset (blue). (Below) The corresponsing perturbed historical and synthetic dataset.

The bottom plots in Figure 2 shows an increase in the volume of weekly flows, as well as an smaller return period, when the the historical streamflows were perturbed to reflect an increasing frequency and magnitude of flood events. Together with the upper plots in Figure 2, this visually demonstrates that the synthetic streamflow generation approach (1) faithfully reconstructs historical streamflow patterns, (2) increases the range of possible streamflow scenarios and (3) can model multiple extreme climate event scenarios by perturbing the historical dataset. The file to generate this Figure can be found in the plotFDCrange.py file.

Statistical validation

The mean and standard deviation of the perturbed and unperturbed historical datasets are compared to show if the perturbation resulted in significant changes in the synthetic datasets. Ideally, the perturbed synthetic data would have higher means and similar standard deviations compared to the unperturbed synthetic data.

Figure 3: (Above) The unperturbed synthetic (blue) and historical (pink) streamflow datasets for each of the 10 gauges. (Below) The perturbed counterpart.

The mean and tails of the synthetic streamflow values of the bottom plots in Figure 3 show that the mean and maximum values of the synthetic flows are significantly higher than the unperturbed values. In addition, the spread of the standard deviations of the perturbed synthetic streamflows are similar to that of its unperturbed counterpart. This proves that synthetic streamflow generation can be used to synthetically change the occurrence and magnitude of extreme events while maintaining the periodicity and spread of the data. The file to generate Figure 3 can be found in weekly-moments.py.

Synthetic streamflow generation and internal variability

The generation of multiple unperturbed realizations of synthetic streamflow is vital for characterizing the internal variability of a system., otherwise known as variability that arises from natural variations in the system (Lehner et. al., 2020). As internal variability is intrinsic to the system, its effects cannot be eliminated – but it can be moderated. By evaluating multiple realizations, we can determine the number of realizations at which the internal variability (quantified here by standard deviation as a function of the number of realizations) stabilizes. Using the synthetic streamflow data for the Jordan Lake, it is shown that more than 100 realizations are required for the standard deviation of the 25% highest streamflows across all years to stabilize (Figure 4). Knowing this, we can generate sufficient synthetic realizations to render the effects of internal variability insignificant.

Figure 4: The highest 25% of synthetic streamflows for the Jordan Lake gauge.

The file internal-variability.py contains the code to generate the above figure.

How does this all fit within the context of MORDM?

So far, we have generated synthetic streamflow datasets and validated them. But how are these datasets used in the context of MORDM?

Synthetic streamflow generation lies within the domain of the second part of the MORDM framework as shown in Figure 5 above. Specifically, synthetic streamflow generation plays an important role in the design of experiments by preserving the effects of deeply uncertain factors that cause natural events. As MORDM requires multiple scenarios to reliably evaluate all possible futures, this approach enables the simulation of multiple scenarios, while concurrently increasing the severity or frequency of extreme events in increments set by the user. This will allow us to evaluate how coupled human-natural systems change over time given different scenarios, and their consequences towards the robustness of the system being evaluated (in this case, the Research Triangle).

Figure 4: The taxonomy of robustness frameworks. The bold-outlined segments highlight where MORDM fits within this taxonomy (Herman et. al., 2015).

Typically, this evaluation is performed in two main steps:

Generation and evaluation of multiple realizations of unperturbed annual synthetic streamflow. The resulting synthetic data is used to generate the Pareto optimal set of policies. This step can help us understand how the system’s internal variability affects future decision-making by comparing it with the results in step (2).
Generation and evaluation of multiple realizations of perturbed annual synthetic streamflow. These are the more extreme scenarios in which the previously-found Pareto-optimal policies will be evaluated against. This step assesses the robustness of the base state under deeply uncertain deviations caused by the perturbations in the synthetic data and other deeply uncertain factors.

Conclusion

Overall, synthetic streamflow generation is an approach that is highly applicable in the bottom-up analysis of a system. It preserves historical characteristics of a streamflow timeseries while providing the flexibility to modify the severity and frequency of extreme events in the face of climate change. It also allows the generation of multiple realizations, aiding in the characterization and understanding of a system’s internal variability, and a more exhaustive scenario discovery process.

This summarizes the basics of data generation for MORDM. In my next blog post, I will introduce risk-of-failure (ROF) triggers, their background, key concepts, and how they are applied within the MORDM framework.

References

Herman, J. D., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2015). How should robustness be defined for water systems planning under change? Journal of Water Resources Planning and Management, 141(10), 04015012. doi:10.1061/(asce)wr.1943-5452.0000509

Herman, J. D., Zeff, H. B., Lamontagne, J. R., Reed, P. M., & Characklis, G. W. (2016). Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments. Journal of Water Resources Planning and Management, 142(11), 04016050. doi:10.1061/(asce)wr.1943-5452.0000701

Kasprzyk, J. R., Nataraj, S., Reed, P. M., & Lempert, R. J. (2013). Many objective robust decision making for complex environmental systems undergoing change. Environmental Modelling & Software, 42, 55-71. doi:10.1016/j.envsoft.2012.12.007

Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management, 139(4), 396-406. doi:10.1061/(asce)wr.1943-5452.0000287

Mankin, J. S., Lehner, F., Coats, S., & McKinnon, K. A. (2020). The value of initial condition large ensembles to Robust Adaptation Decision‐Making. Earth’s Future, 8(10). doi:10.1029/2020ef001610

Trindade, B., Reed, P., Herman, J., Zeff, H., & Characklis, G. (2017). Reducing regional drought vulnerabilities and multi-city robustness conflicts using many-objective optimization under deep uncertainty. Advances in Water Resources, 104, 195-209. doi:10.1016/j.advwatres.2017.03.023

April 17, 2019 by David Gold

Performing random seed analysis and runtime diagnostics with the serial Borg Matlab wrapper

Search with Multiobjective Evolutionary Algorithms (MOEAs) is inherently stochastic. MOEAs are initialized with a random population of solutions that serve as the starting point for the multiobjective search, if the algorithm gets “lucky”, the initial population may contain points in an advantageous region of the decision space that give the algorithm a head start on the search. On the other hand, the initial population may only contain solutions in difficult regions of the decision space, which may slow the discovery of quality solutions. To overcome the effects of initial parameterization, we perform a random seed analysis which involves running an ensemble of searches, each starting with a randomly sampled set of initial conditions which we’ll here on refer to as a “random seed”. We combine search results across all random seeds to generate a “reference set” which contains only the best (Pareto non-dominated) solutions across the ensemble.

Tracking the algorithm’s performance during search is an important part of a random seed analysis. When we use MOEAs to solve real world problems (ie. problems that don’t have analytical solutions), we don’t know the true Pareto set a priori. To determine if an algorithm has discovered an acceptable approximation of the true Pareto set, we must measure it’s performance across the search, and only end our analysis if we can demonstrate the search has ceased improving (of course this is not criteria for true convergence as it is possible the algorithm has simply failed to find better solutions to the problem, this is why performing rigorous diagnostic studies such as Zatarain et al., 2016 is important for understanding how various MOEAs perform in real world problems). To measure MOEA search performance, we’ll use hypervolume , a metric that captures both convergence and diversity of a given approximation set (Knowles and Corne, 2002; Zitzler et al., 2003). Hypervolume represents the fraction of the objective space that is dominated by an approximation set, as shown in Figure 1 (from Zatarain et al., 2017). For more information on MOEA performance metrics, see Joe’s post from 2013.

Figure 1: A 2 objective example of hypervolume from Zatarain et al,. 2017. To calculate hypervolume, an offset, delta, is taken from the bounds of the approximation set to construct a “reference point”. The hypervolume is a measure of the volume of the objective space between the approximation set and the reference point. A larger hypervolume indicates a better approximation set.

This post will demonstrate how to perform a random seed analysis and runtime diagnostics using the Matlab wrapper for the serial Borg MOEA (for background on the Borg MOEA, see Hadka and Reed, 2013). I’ll use the DTLZ2 3 objective test problem as an example, which tasks the algorithm with approximating a spherical Pareto-optimal front (Deb et al,. 2002). I’ve created a Github repository with relevant code, you can find it here.

In this demonstration, I’ll use the Matlab IDE and Bash shell scripts called from a Linux terminal (Window’s machines can use Cygwin, a free Linux emulator). If you are unfamiliar with using a Linux terminal, you can find a tutorial here. To perform runtime diagnostics, I’ll use the MOEAFramework, a Java library that you can download here (the demo version will work just fine for our purposes).

A modified Matlab wrapper that produces runtime files

In order to track search performance across time, we need snapshots of Borg’s archive during the search. In the parallel “master-worker” and “multi-master” versions of Borg, these snapshots are generated by the Borg C library in the form of “runtime” files. The snapshots provided by the runtime files contain information on the number of function evaluations completed (NFE), elapsed time, operator probabilities, number of improvements, number of restarts, population size, archive size and the decision variables and objectives within the archive itself.

To my knowledge, the current release of the serial Borg Matlab wrapper does not print runtime files. To perform runtime diagnostics, I had to modify the wrapper file, nativeborg.cpp. I’ve posted my edited version to the aforementioned Github repository.

Performing random seed analysis and runtime diagnostics

To perform a random seed analysis and runtime diagnostics with the Matlab wrapper, follow these steps:

1) Download the Borg MOEA and compile the Matlab wrapper

To request access to the Borg MOEA, complete step 2 of Jazmin’s introduction to Borg, found here . To run Borg with Matlab you must compile a MEX file, instructions for compiling for Windows can be found here, and here for Linux/Mac.

Once you’ve downloaded and compiled Borg for Matlab, clone the Github repository I’ve created and replace the nativeborg.cpp file from the Borg download with the edited version from the repository. Next, create three new folders in your working directory, one called “Runtime” and another called “Objectives” and the third called “metrics”. Make sure your working directory contains the following files:

borg.c
borg.h
mt19937ar.c
mt19937ar.h
nativeborg.cpp (version from the Git repository)
borg.m
DTLZ2.m (test problem code, supplied from Github repository)
calc_runtime_metrics.sh
append_hash.sh
MOEAFramework-2.12-Demo.jar

2) Use Matlab to run the Borg MOEA across an ensemble of random seeds

For this example we’ll use 10 seeds with 30,000 NFE each. We’ll print runtime snapshots every 500 NFE.

To run DTLZ2 across 10 seeds, run the following script in Matlab:

for i = [1:10]
    [vars, objs, runtime] = borg(12,3,0, @DTLZ2, 30000, zeros(1,12),ones(1,12), [0.01, 0.01, 0.01], {'frequency',500, 'seed', i});
    objFile = sprintf('Objectives/DTLZ2_3_S%i.obj',i);
    dlmwrite(objFile, objs, 'Delimiter', ' ');
end

The for loop above iterates across 10 random initialization of the algorithm. The first line within the for loop calls the Borg MOEA and returns decision variables (vars), objective values (objs) and a struct with basic runtime information. This function will also produce a runtime file, which will be printed in the Runtime folder created earlier (more on this later).

The second line within the for loop creates a string containing the name of a file to store the seed’s objectives and the third line prints the final objectives to this file.

3) Calculate the reference set across random seeds using the MOEAFramework

The 10 .obj files created in step two containing the final archives from each random seed. For our analysis, we want to generate a “reference set” of the best solutions across all seeds. To generate this set, we’ll use built in tools from the MOEAFramework. The MOEAFramework requires that all .obj files have “#” at the end of the file, which is annoying to add in Matlab. To get around this, I’ve written a simple Bash script called “append_hash.sh”.

In your Linux terminal navigate to the working directory with your files (the folder just above Objectives) and run the Bash script like this:

 ./append_hash.sh

Now that the hash tags have been appended to each .obj files, create an overall reference set by running the following command in your Linux Terminal.

java -cp MOEAFramework-2.12-Demo.jar org.moeaframework.analysis.sensitivity.ResultFileSeedMerger -d 3 -e 0.01,0.01,0.01 -o Borg_DTLZ2_3.reference Objectives/*.obj

This command calls the MOEAFramework’s ResultFileMerger tool, which will merge results across random seeds. The -d flag specifies the number of objectives in our problem (3), the -e flag specifies the epsilons for each objective (.01 for all 3 objectives), the -o flag specifies the name of our newly created reference set file and the Objectives/*.obj tells the MOEAFramework to merge all files in the Objectives folder that have the extension “.obj”. This command will generate a new file named “Borg_DTLZ2_3.reference”, which will contain 3 columns, each corresponding to one objective. If we load this file into matlab and plot, we get the following plot of our Pareto approximate set.

Figure 2: The reference set generated by the Borg Matlab wrapper using 30,000 NFE.

4) Calculate and visualize runtime hypervolumes

We now have a reference set representing the best solutions across our random seeds. A final step in our analysis is to examine runtime data to understand how the search progressed across function evaluations. We’ll again use the MOEAFramework to examine each seed’s hypervolume at the distinct runtime snapshots provided in the .runtime files. I’ve written a Bash script to call the MOEAFramework, which is provided in the Git repository as “calc_runtime_metrics.sh” and reproduced below:

#/bin/bash

NSEEDS=10
SEEDS=$(seq 1 ${NSEEDS})
JAVA_ARGS="-cp MOEAFramework-2.12-Demo.jar"

for SEED in ${SEEDS}
do
	java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.ResultFileEvaluator -d 3 -i ./Runtime/runtime_S${SEED}.runtime -r Borg_DTLZ2_3.reference -o ./metrics/Borg_DTLZ2_3_S${SEED}.metrics
done

To execute the script in your terminal enter:

./calc_runtime_metrics.sh

The above command will generate 10 .metrics files inside the metrics folder, each .metric file contains MOEA performance metrics for one randome seed, hypervolume is in the first column, each row represents a different runtime snapshot. It’s important to note that the hypervolume calculated by the MOEAFramework here is the absolute hypervolume, but what we really want in this situation is the relative hypervolume to the reference set (ie the hypervolume achieved at each runtime snapshot divided by the hypervolume of the reference set). To calculate the hypervolume of the reference set, follow the steps presented in step 2 of Jazmin’s blog post (linked here), and divide all runtime hypervolumes by this value.

To examine runtime peformance across random seeds, we can load each .metric file into Matlab and plot hypervolume against NFE. The runtime hypervolume for the DTLZ2 3 objective test case I ran is shown in Figure 3 below.

Figure 3: Runtime results for the DTLZ2 3 objective test case

Figure 3 shows that while there is some variance across the seeds, they all approach the hypervolume of the reference set after about 10,000 NFE. This leveling off of our search across many initial parameterizations indicates that our algorithm has likely converged to a final approximation of our Pareto set. If this plot had yielded hypervolumes that were still increasing after the 30,000 NFE, it would indicate that we need to extend our search to a higher number of NFE.

References

Deb, K., Thiele, L., Laumanns, M. Zitzler, E., 2002. Scalable multi-objective optimization test problems, Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02, (1), 825-830

Hadka, D., Reed, P., 2013. Borg: an auto-adaptive many-objective evolutionary computing
framework. Evol. Comput. 21 (2), 231–259.

Knowles, J., Corne, D., 2002. On metrics for comparing nondominated sets. Evolutionary
Computation, 2002. CEC’02. Proceedings of the 2002 Congress on. 1. IEEE, pp. 711–716.

Zatarain Salazar, J., Reed, P.M., Herman, J.D., Giuliani, M., Castelletti, A., 2016. A diagnostic assessment of evolutionary algorithms for multi-objective surface water
reservoir control. Adv. Water Resour. 92, 172–185.

Zatarain Salazar, J. J., Reed, P.M., Quinn, J.D., Giuliani, M., Castelletti, A., 2017. Balancing exploration, uncertainty and computational demands in many objective reservoir optimization. Adv. Water Resour. 109, 196-210

Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C.M., Da Fonseca, V.G., 2003. Performance
assessment of multiobjective optimizers: an analysis and review. IEEE Trans. Evol.
Comput. 7 (2), 117–132.

March 27, 2019 by Peter Storm

From MATLAB to Julia: Insights from Translating an Opensource Kirsch-Nowak Streamflow Generator to Julia

A quick look into translating code: speed comparisons, practicality, and comments

As I am becoming more and more familiar with Julia—an open-source programming language—I’ve been attracted to translate code to not only run it on an opensource and free language but also to test its performance. Since Julia was made to be an open source language made to handle matrix operations efficiently (when compared to other high-level opensource languages), finding a problem to utilize these performance advantages only makes sense.

As with any new language, understanding how well it performs relative to the other potential tools in your toolbox is vital. As such, I decided to use a problem that is easily scalable and can be directly compare the performances of MATLAB and Julia—the Kirsch-Nowak synthetic stationary streamflow generator.

So, in an effort to sharpen my understanding of the Kirsch-Nowak synthetic stationary streamflow generator created by Matteo Giuliani, Jon Herman and Julianne Quinn, I decided to take on this project of converting from this generator from MATLAB. This specific generator takes in historical streamflow data from multiple sites (while assuming stationarity) and returns a synthetically generated daily timeseries of streamflow. For a great background on synthetic streamflow generation, please refer to this post by Jon Lamontagne.

Model Description

The example is borrowed from Julie’s code utilizes data from the Susquehanna River flows (cfs) at both Marietta (USGS station 01576000) and Muddy Run along with lateral inflows (cfs) between Marietta and Conowingo Damn (1932-2001). Additionally, evaporation rates (in/day) over the Conowingo and Muddy Run Dams (from an OASIS model simulation) utilized. The generator developed by Kirsch et al. (2013) utilizes a Cholesky decomposition to create a monthly synthetic record which preserves the autocorrelation structure of the historical data. The method proposed by Nowak et al. (2010) is then used to disaggregate to daily flows (using a historical month +/- 7 days). A full description of the methods can be found at this link.

Comparing Julia and MATLAB

Comparison of Performance between Julia and MATLAB

To compare the speeds of each language, I adapted the MATLAB code into Julia (shown here) on as nearly of equal basis as possible. I attempted to keep the loops, data structures, and function formulation as similar as possible, even calling similar libraries for any given function. julia_matlab_streamflow

When examining the performance between Julia (solid lines) and MATLAB (dashed lines), there is only one instance where MATLAB(x) outperformed Julia(+)—in the 10-realization, 1000-year simulation shown in the yellow dots in the upper left. Needless to say, Julia easily outperformed MATLAB in all other situations and required only 53% of the time on average (all simulations considered equal). However, Julia was much proportionally faster at lower dimensions of years (17-35% of the time required) than MATLAB. This is likely because I did not handle arrays optimally—the code could likely be sped up even more.

Considerations for Speeding Up Code

Row- Versus Column-Major Array Architecture

It is worth knowing how a specific language processes its arrays/matrices. MATLAB and Julia are both column-major languages, meaning the sequential indexes and memory paths are grouped by descending down row by row through a column then going through the next column. On the other hand, Numpy in Python specifically uses row-major architecture. The Wikipedia article on this is brief but well worthwhile for understanding these quirks.

This is especially notable because ensuring that proper indexing and looping methods are followed can substantially speed up code. In fact, it is likely that the reason Julia slowed down significantly on a 10-realization 1000-year simulation when compared to both its previous performances and MATLAB because of how the arrays were looped through. As a direct example shown below, when exponentiating through a [20000, 20000] array row-by-row took approximately 47.7 seconds while doing the same operation column-by-column only took 12.7 seconds.

5555

Dealing with Arrays

Simply put, arrays and matrices in Julia are a pain compared to MATLAB. As an example of the bad and the ugly, unlike in MATLAB where you can directly declare any size array you wish to work with, you must first create an array and then fill the array with individual array in Julia. This is shown below where an array of arrays is initialized below. However, once an array is established, Julia is extremely fast in loops, so dealing with filling a previously established array makes for a much faster experience.

# initialize output
qq = Array{Array}(undef, num_sites) #(4, 100, 1200)

for i = 1:num_sites
     qq[i] = Array{Float64}(undef, nR, nY * 12)
end

Once the plus side when creating arrays, Julia is extremely powerful in its ability to assign variable types to the components of a given array. This can drastically speed up your code during the day. Shown below, it is easy to the range of declarations and assignments being made to populate the array. There’s an easy example of declaring an array with zeros, and another where we’re populating an array using slices of another. Note the indexing structure for Qd_cg in the second loop–it is not technically a 3-D array but rather a 2-D array nested within a 1-D array–showing the issues mentioned prior.

delta = zeros(n_totals)
for i = 1:n_totals
     for j = 1:n_sites
          delta[i] += (Qtotals[month][j][i] - Z[j]) ^ 2
     end
end

q_ = Array{Float64, 2}(undef, num_realizations[k], 365 * num_years[k])
for i = 1: Nsites
     # put into array of [realizations, 365*num_yrs]
     for j = 1: num_realizations[k]
          q_[j, :] = Qd_cg[j][:, i]'
     end
end

Code Profiling: Order of Experiments

An interesting observation I’ve noticed is that Julia’s first run on a given block of code is substantially slower than every other attempt. Thus, it is likely worthwhile to run a smaller-scale array through to initialize the code if there are plans to move on to substantially more expensive operations (i.e. scaling up).

In the example below, we can see that the second iteration of the same exact code was over 10% faster when calling it a second time. However, when running the code without the function wrapper (in the original timed runs), the code was 10% faster (177 seconds) than the second sequential run shown below. This points to the importance of profiling and experimenting with sections of your code.

4444

Basic profiling tools are directly built into Julia, as shown in the Julia profiling documentation. This can be visualized easily using the ProfileView library. The Juno IDE (standard with Julia Pro) allegedly has a good built-in profile as well. However, it should be expected that most any IDE should do the trick (links to IDEs can be found here).

Syntax and Library Depreciation

While Julia is very similar in its structure and language to MATLAB, much of the similar language has depreciated as Julia has been rapidly upgraded. Notably, Julia released V1.0 in late 2018 and recently released V1.1, moving further away from similarities in function names. Thus, this stands as a lesson for individuals wishing to translate all of their code between these languages. I found a useful website that assists in translating general syntax, but many of the functions have depreciated. However, as someone who didn’t have any experience with MATLAB but was vaguely familiar with Julia, this was a godsend for learning differences in coding styles.

For example, creating an identity matrix in MATLAB utilizes the function eye(size(R)) to create an nxn matrix the size of R. While this was initially the language used in Julia, this specific language was depreciated in V0.7. To get around this, either ‘I’ can be used to create a scalable identity matrix or Matrix{Float64}(I, size(R), size(R)) declare an identity matrix of size(R) by size(R) for a more foolproof and faster operation.

When declaring functions, I have found Julia to be relatively straightforward and Pythonic in its declarations. While I still look to insert colons at the ends of declarations while forgetting to add ‘end’ at the end of functions, loops, and more, the ease of creating, calling, and interacting with functions makes Julia very accessible. Furthermore, its ability to interact with matrices in without special libraries (e.g. Numpy in Python) allows for more efficient coding without having to know specific library notation.

Debugging Drawbacks

One of the most significant drawbacks I run into when using Julia is the lack of clarity in generated error codes for common mistakes, such as adding extra brackets. For example, the following error code is generated in Python when adding an extra parenthesis at the end of an expression. 3333

However, Julia produces the follow error for an identical mistake:

One simple solution to this is to simply upgrade my development environment from Jupyter Notebooks to a general IDE to more easily root out issues by running code line-by-line. However, I see the lack of clarity in showing where specific errors arise a significant drawback to development within Julia. However, as shown in the example below where an array has gone awry, an IDE (such as Atom shown below) can make troubleshooting and debugging a relative breeze.

1111

Furthermore, when editing auxiliary functions in another file or module that was loaded as a library, Julia is not kind enough to simply reload and recompile the module; to get it to properly work in Atom, I had to shut down the Julia kernel then rerun the entirety of the code. Since Julia takes a bit to initially load and compile libraries and code, this slows down the debugging process substantially. There is a specific package (Revise) that exists to take care of this issue, but it is not standard and requires loading this specific library into your code.

GitHub Repositories: Streamflow Generators

PyMFGM: A parallelized Python version of the code, written by Bernardo Trindade

Kirsch-Nowak Stationary Generator in MATLAB: Developed by Matteo Giuliani, Jon Herman and Julianne Quinn

Kirsch-Nowak Stationary Generator in Julia: Please note that the results are not validated. However, you can easily access the Jupyter Notebook version to play around with the code in addition to running the code from your terminal using the main.jl script.

Full Kirsch-Nowak Streamflow Generator: Also developed by Matteo Giuliani, Jon Herman and Julianne Quinn and can handle rescaling flows for changes due to monsoons. I would highly suggest diving into this code alongside the relevant blog posts: Part 1 (explanation), Part 2 (validation).

July 19, 2018 by Felipe Cordera

Update on setting up the Borg Matlab Wrapper on Windows and tips for its use

It has been a long time since the last post about the use of Borg in Matlab, so this post pretends to be an update of Compiling the Borg Matlab Wrapper (Windows), and in an easier way.

The following steps were tested in Windows 10 and MATLAB R2017a, but should be valid for different versions too.

Step 0: Get Borg source code from Bitbucket repository

Check step 2 in the post Basic Borg MOEA use for the truly newbies (Part 1/2) (also check step 1 for a simplified explanation about Borg).

You will need to have the following files in the same directory:

borg.c
borg.h
mt19937ar.c
mt19937ar.h
nativeborg.cpp
borg.m
DTLZ2.m (test problem code)

You can find the first four files in the main directory you downloaded from Bitbucket, and to get the last three you have to navigate to the “plugins\Matlab” directory.

Step 1: Check compiler in Matlab

At Matlab command line run:

mex -setup

to check if there is a compiler already installed. If there is not, install the one that Matlab suggests (depends on the version).

In 2017a and previous versions there could be a bug when trying to download the suggested compiler (e.g. MIN GW in 2017a). Follow Matlab instructions. They will take you through some steps, which basically are: downloading a folder and pasting it in the installation directory of Matlab, replacing an existing one.

Alternatively (outside Matlab), you could download and install Visual Studio (Community version for free). Once installed, Matlab should recognize it immediately (check with mex -setup command anyway).

Step 2: Compile Borg in Matlab

Make sure you are in the directory with the required files from Step 0 and run (again, at Matlab command line):

mex nativeborg.cpp borg.c mt19937ar.c

The file nativeborg.mexw64 should be created (the extension should vary according to your system). Once you do that you are done, and you are ready to run the test problem.

Step 3: Run a test problem (DTLZ2)

Inside Matlab, make sure you are in the directory that contains:

borg.m
nativeborg.mexw64
DTLZ2.m

The file DTLZ2.m is an example of how an objective function should be formatted for use with the Borg Matlab Wrapper. It should take a vector of decision variables as input, and return a vector of objectives (and optionally, constraints) as output.

From the command line, you can optimize the DTLZ2 function as follows:

[vars, objs, runtime] = borg(11, 2, 0, @DTLZ2, 100000, 0.01*ones(1,2), zeros(1,11), ones(1,11));

which returns the decision variables and objectives of the Pareto optimal solutions (i.e. the Pareto-approximate set), and runtime statistics. The first three inputs to the borg function are: the number of decision variables, number of objectives, and number of constraints. After that comes the handle for the objective function, the number of function evaluations to run in the optimization, the epsilon precision values for the objectives, and finally the lower and upper bounds of the decision variables. For a more detailed description of inputs and outputs you can run:

help borg

Finally, to see the resulting Pareto frontier run:

scatter(objs(:,1), objs(:,2))

it should be a very good approximation of the first quadrant of the unit circle.

Now you just need to replace DTLZ2 with an objective function or model of your choice, and you’ll be ready to go!

Note: to successfully use the Borg Matlab Wrapper in the future make sure to have borg.m and nativeborg.mexw64 in the same directory you are working on, or in a directory whose path is saved in Matlab (you can manage your paths with addpath and savepath functions).

Some tips:

Tip 1:

When calling borg(…), one of the most important inputs is the objective function (the “@DTLZ2” function handle in the test problem) since it contains the structure of the problem you are solving (i.e. the simulation model of your problem).

As mentioned before, this function should take the decision variables as input and return the value of the objectives as output. However, in some cases your function could need to take as inputs also other information in addition to the decision variables, for example:

% objective function example 
obj_values = simModel(decisions, params)

where params could be some data read from a text file or you had calculated before (e.g. streamflow scenarios) and remain unchanged during the optimization procedure. If that is your case, you can first define params and then define a function handle where params is fixed:

params = whatever_your_function_needs;

% handle of simModel function that has the variable “params” fixed, so its input is just “decisions”
simModelFixed = @(decisions) simModel(decisions, params);

% call Borg
[vars, objs, runtime] = borg(..., simModelFixed, ...);

Maybe you could be thinking on reading or generating your additional parameters inside your simulation model function, but doing that, the process will be repeated in every function evaluation when running Borg, which could be highly inefficient.

Tip 2:

Be careful with NaN and Inf values in your objective function when using Borg as they will cause Matlab to crash and close with no error message. These values could arise accidentally if you forgot to condition some particular cases in your calculations (e.g. division by zero).

Tip 3:

When you run help borg you can observe that an optional input to the function is parameters. This input lets you modify internal parameters of the algorithm. For details on how to do that you can refer to the post Setting Borg parameters from the Matlab wrapper, but make sure to understand how Borg works before thinking of tuning it.

To learn about Borg features and full details please refer to “Hadka, D., and Reed, P.M. “Borg: An Auto-Adaptive Many-Objective Evolutionary Computing Framework.” Evolutionary Computation, 21(2):231-259, 2013.” and references therein.

March 13, 2018 by Antonia Hadjimichael

A completely non-exhaustive list of tutorial resources for scientific computing

This is a short blog post to put together a list of resources with tutorials (similar to what’s usually found on this blog) for various programming languages. It is by no means exhaustive, so please comment if you feel there’s an important one I left out.

Matlab:

https://www.arnevogel.com/

Kinda new blog with Matlab tutorials on numerical methods

https://blogs.mathworks.com/loren/

One of the MathWorks blogs, great tutorials.

https://www.mathworks.com/support/learn-with-matlab-tutorials.html

Matlab tutorials from MathWorks

matlab

More Matlab tutorials, a lot of material on many topics

Python:

https://glowingpython.blogspot.com/

Not updated very frequently, but good data analysis and visualization tutorials are posted

https://pythonprogramming.net/

Updated regularly, some great data visualization and analysis tutorials. The tutorials come with videos. I really like this site.

http://treyhunner.com/

Updated regularly (about once a month) with python tutorials and general python coding insights. I really like the writing style of this author.

https://docs.scipy.org/doc/scipy/reference/tutorial/

Tutorials on using SciPy

C and C++:

https://www.cprogramming.com/

C and C++ programming tutorials, tips and tricks

http://www.cplusplus.com/articles/

Not really updated anymore but some good basic tutorials are listed

https://blog.knatten.org/

Hadn’t been updated in a while, but it looks like it’s been picked up again. Good for general C++ programming and good practice.

http://www.bfilipek.com/

C++ tutorials

General:

https://towardsdatascience.com/

This is a great general resource not devoted to a particular language. They cover topics in data science, machine learning and programming in general. Some great tutorials for Python and R.

https://projecteuler.net/

Mathematical programming problems to help with learning any language

https://github.com/EbookFoundation/free-programming-books/blob/master/free-programming-books.md

Free programming books repository

Reddits (some of the bigger ones):

/r/matlab (General on matlab, community provides help with coding)

/r/programming (General on programming)

/r/learnprogramming (Community provides help with debugging questions)

/r/python (General on python, community provides help with coding)

/r/learnpython (Community provides help with python questions, smaller than /r/python)

/r/cpp (General on C++)

/r/cpp_questions (Community provides help with C++ questions)

I’ve also recently made /r/sci_comp which has very little activity for the moment, but the aim is to create a community with general resources on coding for scientific applications.

August 29, 2017 by Julie

Open Source Streamflow Generator Part II: Validation

This is the second part of a two-part blog post on an open source synthetic streamflow generator written by Matteo Giuliani, Jon Herman and me, which combines the methods of Kirsch et al. (2013) and Nowak et al. (2010) to generate correlated synthetic hydrologic variables at multiple sites. Part I showed how to use the MATLAB code in the subdirectory /stationary_generator to generate synthetic hydrology, while this post shows how to use the Python code in the subdirectory /validation to statistically validate the synthetic data.

The goal of any synthetic streamflow generator is to produce a time series of synthetic hydrologic variables that expand upon those in the historical record while reproducing their statistics. The /validation subdirectory of our repository provides Python plotting functions to visually and statistically compare the historical and synthetic hydrologic data. The function plotFDCrange.py plots the range of the flow duration (probability of exceedance) curves for each hydrologic variable across all historical and synthetic years. Lines 96-100 should be modified for your specific application. You may also have to modify line 60 to change the dimensions of the subplots in your figure. It’s currently set to plot a 2 x 2 figure for the four LSRB hydrologic variables.

plotFDCrange.py provides a visual, not statistical, analysis of the generator’s performance. An example plot from this function for the synthetic data generated for the Lower Susquehanna River Basin (LSRB) as described in Part I is shown below. These probability of exceedance curves were generated from 1000 years of synthetic hydrologic variables. Figure 1 indicates that the synthetic time series are successfully expanding upon the historical observations, as the synthetic hydrologic variables include more extreme high and low values. The synthetic hydrologic variables also appear unbiased, as this expansion is relatively equal in both directions. Finally, the synthetic probability of exceedance curves also follow the same shape as the historical, indicating that they reproduce the within-year distribution of daily values.

Figure 1

To more formally confirm that the synthetic hydrologic variables are unbiased and follow the same distribution as the historical, we can test whether or not the synthetic median and variance of real-space monthly values are statistically different from the historical using the function monthly-moments.py. This function is currently set up to perform this analysis for the flows at Marietta, but the site being plotted can be changed on line 76. The results of these tests for Marietta are shown in Figure 2. This figure was generated from a 100-member ensemble of synthetic series of length 100 years, and a bootstrapped ensemble of historical years of the same size and length. Panel a shows boxplots of the real-space historical and synthetic monthly flows, while panels b and c show boxplots of their means and standard deviations, respectively. Because the real-space flows are not normally distributed, the non-parametric Wilcoxon rank-sum test and Levene’s test were used to test whether or not the synthetic monthly medians and variances were statistically different from the historical. The p-values associated with these tests are shown in Figures 2d and 2e, respectively. None of the synthetic medians or variances are statistically different from the historical at a significance level of 0.05.

Figure 2

In addition to verifying that the synthetic generator reproduces the first two moments of the historical monthly hydrologic variables, we can also verify that it reproduces both the historical autocorrelation and cross-site correlation at monthly and daily time steps using the functions autocorr.py and spatial-corr.py, respectively. The autocorrelation function is again set to perform the analysis on Marietta flows, but the site can be changed on line 39. The spatial correlation function performs the analysis for all site pairs, with site names listed on line 75.

The results of these analyses are shown in Figures 3 and 4, respectively. Figures 3a and 3b show the autocorrelation function of historical and synthetic real-space flows at Marietta for up to 12 lags of monthly flows (panel a) and 30 lags of daily flows (panel b). Also shown are 95% confidence intervals on the historical autocorrelations at each lag. The range of autocorrelations generated by the synthetic series expands upon that observed in the historical while remaining within the 95% confidence intervals for all months, suggesting that the historical monthly autocorrelation is well-preserved. On a daily time step, most simulated autocorrelations fall within the 95% confidence intervals for lags up to 10 days, and those falling outside do not represent significant biases.

Figure 3

Figures 4a and 4b show boxplots of the cross-site correlation in monthly (panel a) and daily (panel b) real-space hydrologic variables for all pairwise combinations of sites. The synthetic generator greatly expands upon the range of cross-site correlations observed in the historical record, both above and below. Table 1 lists which sites are included in each numbered pair of Figure 4. Wilcoxon rank sum tests (panels c and d) for differences in median monthly and daily correlations indicate that pairwise correlations are statistically different (α=0.5) between the synthetic and historical series at a monthly time step for site pairs 1, 2, 5 and 6, and at a daily time step for site pairs 1 and 2. However, biases for these site pairs appear small in panels a and b. In summary, Figures 1-4 indicate that the streamflow generator is reasonably reproducing historical statistics, while also expanding on the observed record.

Figure 4

Table 1

Pair Number	Sites
1	Marietta and Muddy Run
2	Marietta and Lateral Inflows
3	Marietta and Evaporation
4	Muddy Run and Lateral Inflows
5	Muddy Run and Evaporation
6	Lateral Inflows and Evaporation