# Open Source Streamflow Generator Part II: Validation

This is the second part of a two-part blog post on an open source synthetic streamflow generator written by Matteo Giuliani, Jon Herman and me, which combines the methods of Kirsch et al. (2013) and Nowak et al. (2010) to generate correlated synthetic hydrologic variables at multiple sites. Part I showed how to use the MATLAB code in the subdirectory /stationary_generator to generate synthetic hydrology, while this post shows how to use the Python code in the subdirectory /validation to statistically validate the synthetic data.

The goal of any synthetic streamflow generator is to produce a time series of synthetic hydrologic variables that expand upon those in the historical record while reproducing their statistics. The /validation subdirectory of our repository provides Python plotting functions to visually and statistically compare the historical and synthetic hydrologic data. The function plotFDCrange.py plots the range of the flow duration (probability of exceedance) curves for each hydrologic variable across all historical and synthetic years. Lines 96-100 should be modified for your specific application. You may also have to modify line 60 to change the dimensions of the subplots in your figure. It’s currently set to plot a 2 x 2 figure for the four LSRB hydrologic variables.

plotFDCrange.py provides a visual, not statistical, analysis of the generator’s performance. An example plot from this function for the synthetic data generated for the Lower Susquehanna River Basin (LSRB) as described in Part I is shown below. These probability of exceedance curves were generated from 1000 years of synthetic hydrologic variables. Figure 1 indicates that the synthetic time series are successfully expanding upon the historical observations, as the synthetic hydrologic variables include more extreme high and low values. The synthetic hydrologic variables also appear unbiased, as this expansion is relatively equal in both directions. Finally, the synthetic probability of exceedance curves also follow the same shape as the historical, indicating that they reproduce the within-year distribution of daily values.

Figure 1
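To make the flow duration curve idea concrete, here is a minimal pure-Python sketch of how exceedance probabilities and the min/max envelope across years could be computed. It is not the code in plotFDCrange.py: the function names and the Weibull plotting position (rank divided by n + 1) are assumptions chosen for illustration, and in practice NumPy would be the natural tool.

```python
def exceedance_curve(flows):
    """Return (exceedance probabilities, flows sorted from high to low)."""
    ordered = sorted(flows, reverse=True)
    n = len(ordered)
    # Assumed plotting position: rank / (n + 1), where rank 1 is the largest flow.
    probs = [rank / (n + 1) for rank in range(1, n + 1)]
    return probs, ordered


def fdc_range(years):
    """Lowest and highest flow at each exceedance rank across equal-length years."""
    curves = [sorted(year, reverse=True) for year in years]
    lows = [min(vals) for vals in zip(*curves)]
    highs = [max(vals) for vals in zip(*curves)]
    return lows, highs


# Two toy "years" of three values each (hypothetical numbers).
years = [[5.0, 1.0, 3.0], [2.0, 8.0, 4.0]]
probs, ordered = exceedance_curve(years[0])
lows, highs = fdc_range(years)
```

Plotting `lows` and `highs` against `probs` for the historical and synthetic years separately gives the kind of shaded envelope that Figure 1 compares.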

To more formally confirm that the synthetic hydrologic variables are unbiased and follow the same distribution as the historical, we can test whether or not the synthetic median and variance of real-space monthly values are statistically different from the historical using the function monthly-moments.py. This function is currently set up to perform this analysis for the flows at Marietta, but the site being plotted can be changed on line 76. The results of these tests for Marietta are shown in Figure 2. This figure was generated from a 100-member ensemble of synthetic series of length 100 years, and a bootstrapped ensemble of historical years of the same size and length. Panel a shows boxplots of the real-space historical and synthetic monthly flows, while panels b and c show boxplots of their means and standard deviations, respectively. Because the real-space flows are not normally distributed, the non-parametric Wilcoxon rank-sum test and Levene’s test were used to test whether or not the synthetic monthly medians and variances were statistically different from the historical. The p-values associated with these tests are shown in Figures 2d and 2e, respectively. None of the synthetic medians or variances are statistically different from the historical at a significance level of 0.05.

Figure 2
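For readers unfamiliar with the rank-sum test, the sketch below shows the idea in pure Python using the normal approximation and no tie correction. It is an illustration only, not the code in monthly-moments.py; the function names are hypothetical, and in practice a library routine such as `scipy.stats.ranksums` (and `scipy.stats.levene` for the variance test, which is not sketched here) would normally be used.

```python
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (z statistic, p-value)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2          # expected rank sum under the null
    sd_w = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Toy samples (hypothetical): clearly shifted, then identical.
z_shift, p_shift = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
z_same, p_same = rank_sum_test([1, 2, 3], [1, 2, 3])
```

A p-value above the chosen significance level, as for the identical samples, means the test does not detect a difference in location between the two samples.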

In addition to verifying that the synthetic generator reproduces the first two moments of the historical monthly hydrologic variables, we can also verify that it reproduces both the historical autocorrelation and cross-site correlation at monthly and daily time steps using the functions autocorr.py and spatial-corr.py, respectively. The autocorrelation function is again set to perform the analysis on Marietta flows, but the site can be changed on line 39. The spatial correlation function performs the analysis for all site pairs, with site names listed on line 75.

The results of these analyses are shown in Figures 3 and 4, respectively. Figures 3a and 3b show the autocorrelation function of historical and synthetic real-space flows at Marietta for up to 12 lags of monthly flows (panel a) and 30 lags of daily flows (panel b). Also shown are 95% confidence intervals on the historical autocorrelations at each lag. The range of autocorrelations generated by the synthetic series expands upon that observed in the historical while remaining within the 95% confidence intervals for all months, suggesting that the historical monthly autocorrelation is well-preserved. On a daily time step, most simulated autocorrelations fall within the 95% confidence intervals for lags up to 10 days, and those falling outside do not represent significant biases.

Figure 3
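The quantity plotted in Figure 3 is the sample autocorrelation at each lag. A minimal sketch of that calculation is shown below; it is not taken from autocorr.py, the function name is an assumption, and the confidence intervals are deliberately left out because the post does not say how they were computed.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation of a series for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)  # lag-0 sum of squares
    acf = []
    for lag in range(1, max_lag + 1):
        num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
        acf.append(num / denom)
    return acf


# Toy alternating series (hypothetical): strongly negative at lag 1, positive at lag 2.
acf = autocorrelation([1, -1, 1, -1, 1, -1], max_lag=2)
```

Applying the same function to each synthetic realization and to the historical record gives the spread of autocorrelations that the figure compares.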

Figures 4a and 4b show boxplots of the cross-site correlation in monthly (panel a) and daily (panel b) real-space hydrologic variables for all pairwise combinations of sites. The synthetic generator greatly expands upon the range of cross-site correlations observed in the historical record, both above and below. Table 1 lists which sites are included in each numbered pair of Figure 4. Wilcoxon rank-sum tests (panels c and d) for differences in median monthly and daily correlations indicate that pairwise correlations are statistically different (α=0.05) between the synthetic and historical series at a monthly time step for site pairs 1, 2, 5 and 6, and at a daily time step for site pairs 1 and 2. However, biases for these site pairs appear small in panels a and b. In summary, Figures 1-4 indicate that the streamflow generator is reasonably reproducing historical statistics, while also expanding on the observed record.

Figure 4

Table 1

| Pair Number | Sites |
| --- | --- |
| 1 | Marietta and Muddy Run |
| 2 | Marietta and Lateral Inflows |
| 3 | Marietta and Evaporation |
| 4 | Muddy Run and Lateral Inflows |
| 5 | Muddy Run and Evaporation |
| 6 | Lateral Inflows and Evaporation |

# Open Source Streamflow Generator Part I: Synthetic Generation

This post describes how to use the Kirsch-Nowak synthetic streamflow generator to generate synthetic streamflow ensembles for water systems analysis. As Jon Lamontagne discussed in his introduction to synthetic streamflow generation, generating synthetic hydrology for water systems models allows us to stress-test alternative management plans under stochastic realizations outside of those observed in the historical record. These realizations may be generated assuming stationary or non-stationary models. In a few recent papers from our group applied to the Red River and Lower Susquehanna River Basins (Giuliani et al., 2017; Quinn et al., 2017; Zatarain Salazar et al., In Revision), we’ve generated stationary streamflow ensembles by combining methods from Kirsch et al. (2013) and Nowak et al. (2010). We use the method of Kirsch et al. (2013) to generate flows on a monthly time step and the method of Nowak et al. (2010) to disaggregate these monthly flows to a daily time step. The code for this streamflow generator, written by Matteo Giuliani, Jon Herman and me, is now available on Github. Here I’ll walk through how to use the MATLAB code in the subdirectory /stationary_generator to generate correlated synthetic hydrology at multiple sites, and in Part II I’ll show how to use the Python code in the subdirectory /validation to statistically validate the synthetic hydrology. As an example, I’ll use the Lower Susquehanna River Basin (LSRB).

A schematic of the LSRB, reproduced from Giuliani et al. (2014), is provided below. The system consists of two reservoirs: Conowingo and Muddy Run. For the system model, we generate synthetic hydrology upstream of the Conowingo Dam at the Marietta gauge (USGS station 01576000), as well as lateral inflows between Marietta and Conowingo, inflows to Muddy Run and evaporation rates over Conowingo and Muddy Run dams. The historical hydrology on which the synthetic hydrologic model is based consists of the historical record at the Marietta gauge from 1932-2001 and simulated flows and evaporation rates at all other sites over the same time frame generated by an OASIS system model. The historical data for the system can be found here.

The first step to use the synthetic generator is to format the historical data into an nD × nS matrix, where nD is the number of days of historical data with leap days removed and nS is the number of sites, or hydrologic variables. An example of how to format the Susquehanna data is provided in clean_data.m. Once the data has been reformatted, the synthetic generation can be performed by running script_example.m (with modifications for your application). Note that in the LSRB, the evaporation rates over the two reservoirs are identical, so we remove one of those columns from the historical data (line 37) for the synthetic generation. We also transform the historical evaporation with an exponential transformation (line 42) since the code assumes log-normally distributed hydrologic data, while evaporation in this region is more normally distributed. After the synthetic hydrology is generated, the synthetic evaporation rates are back-transformed with a log-transformation on line 60. While such modifications allow for additional hydrologic data beyond streamflows to be generated, for simplicity I will refer to all synthetic variables as “streamflows” for the remainder of this post. In addition to these modifications, you should also specify the number of realizations, nR, you would like to generate (line 52), the number of years, nY, to simulate in each realization (line 53) and a string with the dimensions nR × nY for naming the output file.

The actual synthetic generation is performed on line 58 of script_example.m which calls combined_generator.m. This function first generates monthly streamflows at all sites on line 10 where it calls monthly_main.m, which in turn calls monthly_gen.m to perform the monthly generation for the user-specified number of realizations. To understand the monthly generation, we denote the set of historical streamflows as $\mathbf{Q_H}\in \mathbb{R}^{N_H\times T}$ and the set of synthetic streamflows as $\mathbf{Q_S}\in \mathbb{R}^{N_S\times T}$, where $N_H$ and $N_S$ are the number of years in the historical and synthetic records, respectively, and T is the number of time steps per year. Here T=12 for 12 months. For the synthetic generation, the streamflows in $\mathbf{Q_H}$ are log-transformed to yield the matrix $Y_{H_{i,j}}=\ln(Q_{H_{i,j}})$, where i and j are the year and month of the historical record, respectively. The streamflows in $\mathbf{Y_H}$ are then standardized to form the matrix $\mathbf{Z_H}\in \mathbb{R}^{N_H\times T}$ according to equation 1:

1) $Z_{H_{i,j}} = \frac{Y_{H_{i,j}}-\hat{\mu_j}}{\hat{\sigma_j}}$

where $\hat{\mu_j}$ and $\hat{\sigma_j}$ are the sample mean and sample standard deviation of the j-th month’s log-transformed streamflows, respectively. These variables follow a standard normal distribution: $Z_{H_{i,j}}\sim\mathcal{N}(0,1)$.
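Equation 1 can be sketched in a few lines of Python. This is an illustration of the log-transform and month-by-month standardization only, not a translation of monthly_gen.m; the function name and the toy flows are assumptions.

```python
import math
from statistics import mean, stdev


def standardize_log_flows(q_hist):
    """q_hist: list of years, each a list of T positive monthly flows.

    Returns (Z_H, mu, sigma), where mu and sigma are the per-month sample
    mean and sample standard deviation of the log-transformed flows.
    """
    y = [[math.log(q) for q in year] for year in q_hist]
    months = list(zip(*y))                   # one tuple per month (column)
    mu = [mean(col) for col in months]
    sigma = [stdev(col) for col in months]   # sample standard deviation (n - 1)
    z = [[(row[j] - mu[j]) / sigma[j] for j in range(len(row))] for row in y]
    return z, mu, sigma


# Toy record: 3 years x 2 months, chosen so the log flows are 0,1,2 and 1,3,5.
q_hist = [
    [math.exp(0), math.exp(1)],
    [math.exp(1), math.exp(3)],
    [math.exp(2), math.exp(5)],
]
z_hist, mu, sigma = standardize_log_flows(q_hist)
```

Keeping `mu` and `sigma` is what later allows the synthetic values to be de-standardized back to log space and then to real space.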

For each site, we generate standard normal synthetic streamflows that reproduce the statistics of $\mathbf{Z_H}$ by first creating a matrix $\mathbf{C}\in \mathbb{R}^{N_S\times T}$ of randomly sampled standard normal streamflows from $\mathbf{Z_H}$. This is done by formulating a random matrix $\mathbf{M}\in \mathbb{R}^{N_S\times T}$ whose elements are independently sampled integers from $(1,2,...,N_H)$. Each element of $\mathbf{C}$ is then assigned the value $C_{i,j}=Z_{H_{(M_{i,j}),j}}$, i.e. the elements in each column of $\mathbf{C}$ are randomly sampled standard normal streamflows from the same column (month) of $\mathbf{Z_H}$. In order to preserve the historical cross-site correlation, the same matrix $\mathbf{M}$ is used to generate $\mathbf{C}$ for each site.
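The resampling step can be illustrated as follows. This is a sketch under assumptions, not the repository's MATLAB code: the function names are hypothetical, indices are 0-based, and the two toy sites are constructed so that it is easy to see that sharing $\mathbf{M}$ keeps them paired.

```python
import random


def bootstrap_indices(n_hist, n_syn, n_months, rng):
    """Matrix M: independently sampled historical year indices (0-based)."""
    return [[rng.randrange(n_hist) for _ in range(n_months)] for _ in range(n_syn)]


def resample(z_hist, m):
    """C[i][j] = Z_H[M[i][j]][j]: each column is sampled from the same month."""
    return [[z_hist[m[i][j]][j] for j in range(len(m[i]))] for i in range(len(m))]


rng = random.Random(0)
# Toy standardized flows, 3 years x 2 months; site B is site A plus 10.
z_site_a = [[0.1, 0.2], [1.1, 1.2], [2.1, 2.2]]
z_site_b = [[10.1, 10.2], [11.1, 11.2], [12.1, 12.2]]

m = bootstrap_indices(n_hist=3, n_syn=4, n_months=2, rng=rng)
c_a = resample(z_site_a, m)
c_b = resample(z_site_b, m)  # the same M is reused, so the sites stay paired
```

Because both sites draw the same historical year for a given synthetic month, whatever cross-site relationship existed in that historical month carries over to the synthetic record.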

Because of the random sampling used to populate $\mathbf{C}$, an additional step is needed to generate auto-correlated standard normal synthetic streamflows, $\mathbf{Z_S}$. Denoting the historical autocorrelation $\mathbf{P_H}$=corr($\mathbf{Z_H}$), where corr($\mathbf{Z_H}$) is the historical correlation between standardized streamflows in months i and j (columns of $\mathbf{Z_H}$), an upper right triangular matrix, $\mathbf{U}$, can be found using Cholesky decomposition (chol_corr.m) such that $\mathbf{P_H}=\mathbf{U^\intercal U}$. $\mathbf{Z_S}$ is then generated as $\mathbf{Z_S}=\mathbf{CU}$. Finally, for each site, the auto-correlated synthetic standard normal streamflows $\mathbf{Z_S}$ are converted back to log-space streamflows $\mathbf{Y_S}$ according to $Y_{S_{i,j}}=\hat{\mu_j}+Z_{S_{i,j}}\hat{\sigma_j}$. These are then transformed back to real-space streamflows $\mathbf{Q_S}$ according to $Q_{S_{i,j}}$=exp($Y_{S_{i,j}}$).
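A small worked example may help with the Cholesky step. The sketch below factors a toy 2 x 2 correlation matrix into $\mathbf{U^\intercal U}$ and forms $\mathbf{CU}$. It is a textbook Cholesky routine written for illustration, not chol_corr.m, and it assumes the correlation matrix is positive definite.

```python
def cholesky_upper(p):
    """Upper triangular U with P = U^T U, for symmetric positive definite P."""
    n = len(p)
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = p[i][j] - sum(u[k][i] * u[k][j] for k in range(i))
            if i == j:
                u[i][j] = s ** 0.5
            else:
                u[i][j] = s / u[i][i]
    return u


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Toy month-to-month correlation matrix (hypothetical values).
p_hist = [[1.0, 0.5], [0.5, 1.0]]
u = cholesky_upper(p_hist)

# Toy uncorrelated samples C (2 synthetic years x 2 months).
c = [[1.0, 0.0], [0.0, 1.0]]
z_syn = matmul(c, u)  # Z_S = C U

# Check the factorization by rebuilding P from U.
u_t = [list(col) for col in zip(*u)]
p_rebuilt = matmul(u_t, u)
```

Right-multiplying by $\mathbf{U}$ mixes each month's independent draw with those of the preceding months, which is what imposes the historical month-to-month correlation on each row.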

While this method reproduces the within-year log-space autocorrelation, it does not preserve year-to-year correlation, i.e. concatenating rows of $\mathbf{Q_S}$ to yield a vector of length $N_S\times T$ will yield discontinuities in the autocorrelation from month 12 of one year to month 1 of the next. To resolve this issue, Kirsch et al. (2013) repeat the method described above with a historical matrix $\mathbf{Q_H'}\in \mathbb{R}^{(N_H-1)\times T}$, where each row i of $\mathbf{Q_H'}$ contains historical data from month 7 of year i to month 6 of year i+1, removing the first and last 6 months of streamflows from the historical record. $\mathbf{U'}$ is then generated from $\mathbf{Q_H'}$ in the same way as $\mathbf{U}$ is generated from $\mathbf{Q_H}$, while $\mathbf{C'}$ is generated from $\mathbf{C}$ in the same way as $\mathbf{Q_H'}$ is generated from $\mathbf{Q_H}$. As before, $\mathbf{Z_S'}$ is then calculated as $\mathbf{Z_S'}=\mathbf{C'U'}$. Concatenating the last 6 columns of $\mathbf{Z_S'}$ (months 1-6) beginning from row 1 and the last 6 columns of $\mathbf{Z_S}$ (months 7-12) beginning from row 2 yields a set of synthetic standard normal streamflows that preserve correlation between the last month of the year and the first month of the following year. As before, these are then de-standardized and back-transformed to real space.
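The index bookkeeping in this step is easy to get wrong, so here is a small sketch that tracks only which year and month each entry comes from. The function names are hypothetical, rows are 0-based, and the Cholesky multiplication is omitted so that the labels remain visible.

```python
def shift_half_year(q):
    """Row i holds months 7-12 of year i followed by months 1-6 of year i+1."""
    return [q[i][6:] + q[i + 1][:6] for i in range(len(q) - 1)]


def stitch(z, z_prime):
    """Combine months 1-6 from the shifted matrix with months 7-12 from the original.

    Row r of the result takes the last 6 columns of z_prime[r] (months 1-6)
    and the last 6 columns of z[r + 1] (months 7-12).
    """
    return [z_prime[r][6:] + z[r + 1][6:] for r in range(len(z_prime))]


# Label each entry as 100 * year + month so its origin stays visible.
q = [[100 * year + month for month in range(1, 13)] for year in range(3)]
q_shifted = shift_half_year(q)   # 2 rows, each running July to June
stitched = stitch(q, q_shifted)  # 2 rows, each a complete January-December year
```

With labels in place of flows, each stitched row comes out as one complete calendar year, which confirms that the two halves line up as described. One row is lost in the process, so the result has one fewer year than the input.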

Once synthetic monthly flows have been generated, combined_generator.m then finds all historical total monthly flows to be used for disaggregation. When calculating all historical total monthly flows a window of +/- 7 days of the month being disaggregated is considered. That is, for January, combined_generator.m finds the total flow volumes in all consecutive 31-day periods within the window from 7 days before Jan 1st to 7 days after Jan 31st. For each month, all of the corresponding historical monthly totals are then passed to KNN_identification.m (line 76) along with the synthetic monthly total generated by monthly_main.m. KNN_identification.m identifies the k nearest historical monthly totals to the synthetic monthly total based on Euclidean distance (equation 2):

2) $d = \left[\sum^{M}_{m=1} \left({\left(q_{S}\right)}_{m} - {\left(q_{H}\right)}_{m}\right)^2\right]^{1/2}$

where ${(q_S)}_m$ is the real-space synthetic monthly flow generated at site m and ${(q_H)}_m$ is the real-space historical monthly flow at site m. The k-nearest neighbors are then sorted from i=1 for the closest to i=k for the furthest, and probabilistically selected for proportionally scaling streamflows in disaggregation. KNN_identification.m uses the Kernel estimator given by Lall and Sharma (1996) to assign the probability $p_n$ of selecting neighbor n (equation 3):

3) $p_{n} = \frac{\frac{1}{n}}{\sum^{k}_{i=1} \frac{1}{i}}$
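Equations 2 and 3 can be illustrated together with a short sketch. It is not a translation of KNN_identification.m or KNN_sampling.m: the function names, the toy monthly totals and the fixed random seed are all assumptions made for the example.

```python
import random


def nearest_neighbors(q_syn, q_hist_totals, k):
    """Indices of the k historical totals closest to q_syn (equation 2)."""
    def dist(q_h):
        return sum((s - h) ** 2 for s, h in zip(q_syn, q_h)) ** 0.5
    order = sorted(range(len(q_hist_totals)), key=lambda i: dist(q_hist_totals[i]))
    return order[:k]


def kernel_weights(k):
    """p_n = (1/n) / sum_{i=1..k} (1/i) for n = 1..k (equation 3)."""
    total = sum(1 / i for i in range(1, k + 1))
    return [(1 / n) / total for n in range(1, k + 1)]


# Toy historical monthly totals at two sites (hypothetical values).
hist = [[10.0, 1.0], [20.0, 2.0], [12.0, 1.5], [30.0, 3.0]]
neighbors = nearest_neighbors([11.0, 1.0], hist, k=2)  # sorted closest first
weights = kernel_weights(2)

# Probabilistic selection: closer neighbors are more likely to be chosen.
chosen = random.Random(1).choices(neighbors, weights=weights, k=1)[0]
```

With k = 2 the weights are 2/3 and 1/3, so the closest historical month is selected twice as often as the second closest.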

Following Lall and Sharma (1996) and Nowak et al. (2010), we use $k=\Big \lfloor N_H^{1/2} \Big \rceil$. After a neighbor is selected, the final step in disaggregation is to proportionally scale all of the historical daily streamflows at site m from the selected neighbor so that they sum to the synthetically generated monthly total at site m. For example, if the first day of the month of the selected historical neighbor represented 5% of that month’s historical flow, the first day of the month of the synthetic series would represent 5% of that month’s synthetically-generated flow. The random neighbor selection is performed by KNN_sampling.m (called on line 80 of combined_generator.m), which also calculates the proportion matrix used to rescale the daily values at each site on line 83 of combined_generator.m. Finally, script_example.m writes the output of the synthetic streamflow generation to files in the subdirectory /validation. Part II shows how to use the Python code in this directory to statistically validate the synthetically generated hydrology, that is, to ensure that it preserves the historical monthly and daily statistics, such as the mean, standard deviation, autocorrelation and spatial correlation.
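The choice of k and the proportional scaling can be sketched as follows. The function names and toy numbers are assumptions for illustration; the 70-year record length corresponds to the 1932-2001 period mentioned earlier, and the three-day "month" is shortened purely to keep the example readable.

```python
def choose_k(n_hist):
    """k is the square root of the number of historical years, rounded to the nearest integer."""
    return round(n_hist ** 0.5)


def disaggregate(monthly_total, neighbor_daily):
    """Scale the neighbor's daily flows so that they sum to the synthetic monthly total."""
    total = sum(neighbor_daily)
    return [monthly_total * d / total for d in neighbor_daily]


k = choose_k(70)  # 70 years of record, 1932-2001

# Toy neighbor whose first day carries 5% of its monthly flow.
neighbor_daily = [5.0, 45.0, 50.0]
synthetic_daily = disaggregate(200.0, neighbor_daily)
```

The first synthetic day then carries 5% of the 200-unit synthetic total, matching the example in the text, and the daily pattern of the historical neighbor is preserved.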

# Algorithm Diagnostics Walkthrough using the Lake Problem as an example (Part 3 of 3: Metrics-based analysis of algorithm performance)

Now that you have your desired metrics based on part 2 of this series, it is possible to gain more insight into your algorithm performance. When I performed this analysis for the actual study, I used the AWRAnalysis.java, Analysis_Attainment_LakeProblem.sh and HypervolumeEval.java files found in the Github repository as explained in the README. I later discovered it was possible to do this within the framework, so that option will be discussed here.

It is possible to calculate the hypervolume of a Pareto Approximate Front within the framework using the SetHypervolume class. For more information on the MOEAFramework classes, please see the following link (http://moeaframework.org/javadoc/index.html).

I used the following command: (Note the change to version 2.3 because I reran this command today to check I remembered it correctly although it seems there is now a version 2.4. It is always best to use the newest version.)


java -cp MOEAFramework-2.3-Demo.jar org.moeaframework.analysis.sensitivity.SetHypervolume myLake4ObjStoch.reference -e 0.01,0.01,0.0001,0.0001 myLake4ObjStoch.reference



This returns a hypervolume value between 0 and 1 that is useful for threshold calculations as shown below.

To calculate threshold attainments, use the Analysis class. Below is an example of performing attainment analysis within the framework instead of using AWRAnalysis.java.  This approach generates a huge number of files, which are best understood when plotted, a subject for a future post.


#!/bin/bash
#source setup_LTM.sh

dim=4
problem=myLake4ObjStoch
epsilon="0.01,0.01,0.0001,0.0001"

algorithms="Borg eMOEA eNSGAII GDE3 MOEAD NSGAII"
seeds="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50"
percentiles=$(seq 1 1 100)
thresholds=($(seq 0.01 0.01 1.0))

#compute averages across metrics
#echo "Computing averages across metrics..."
#for algorithm in ${algorithms}
#do
#    echo "Working on: "${algorithm}
#    java -classpath $(cygpath -wp $CLASSPATH) org.moeaframework.analysis.sensitivity.MetricFileStatistics --mode average --output $WORK/metrics/${algorithm}_${problem}.average $WORK/metrics/${algorithm}_${problem}_*.metrics
#done
#echo "Done!"

#compute search control metrics (for best and attainment)
echo "Computing hypervolume search control metrics..."
for algorithm in ${algorithms}
do
    echo "Working on: " ${algorithm}
    # starting index into thresholds: first script argument, or 0 if none is given
    counter=${1:-0}
    for percentile in ${percentiles}
    do
        java -classpath MOEAFramework-2.3-Demo.jar org.moeaframework.analysis.sensitivity.Analysis --parameterFile ./${algorithm}_params.txt --parameters ./${algorithm}_Latin --metric 0 --threshold ${thresholds[$counter]} --hypervolume 0.7986 ./SOW6/metrics/average_replace_NaNs/${algorithm}_${problem}.average > ./test/Hypervolume_${percentile}_${algorithm}.txt
        counter=$((counter+1))
    done
done
echo "Done!"

echo "Computing generational distance search control metrics..."
for algorithm in ${algorithms}
do
    echo "Working on: "${algorithm}
    counter=${1:-0}
    for percentile in ${percentiles}
    do
        java -classpath MOEAFramework-2.3-Demo.jar org.moeaframework.analysis.sensitivity.Analysis --parameterFile ./${algorithm}_params.txt --parameters ./${algorithm}_Latin --metric 1 --threshold ${thresholds[$counter]} ./SOW6/metrics/average_replace_NaNs/${algorithm}_${problem}.average > ./test/GenDist_${percentile}_${algorithm}.txt
        counter=$((counter+1))
    done
done
echo "Done!"

echo "Computing additive epsilon indicator search control metrics..."
for algorithm in ${algorithms}
do
    echo "Working on: " ${algorithm}
    counter=${1:-0}
    for percentile in ${percentiles}
    do
        java -classpath MOEAFramework-2.3-Demo.jar org.moeaframework.analysis.sensitivity.Analysis --parameterFile ./${algorithm}_params.txt --parameters ./${algorithm}_Latin --metric 4 --threshold ${thresholds[$counter]} ./SOW6/metrics/average_replace_NaNs/${algorithm}_${problem}.average > ./test/EpsInd_${percentile}_${algorithm}.txt
        counter=$((counter+1))
    done
done
echo "Done!"
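As a rough picture of what a threshold attainment curve represents, the sketch below computes, for each threshold, the fraction of sampled parameterizations whose metric reaches that fraction of a best-known value. This is one common reading of attainment for a larger-is-better metric such as hypervolume; it is an assumption about the concept, not a description of what the MOEAFramework Analysis class does internally, and the function name and numbers are hypothetical.

```python
def attainment(metric_values, best_value, thresholds):
    """Fraction of parameterizations with metric >= threshold * best_value.

    Assumes a larger-is-better metric; for smaller-is-better metrics such as
    generational distance the comparison would need to be reversed.
    """
    n = len(metric_values)
    return [
        sum(1 for v in metric_values if v >= t * best_value) / n
        for t in thresholds
    ]


# Toy averaged hypervolumes for four parameterizations (hypothetical values).
metric_values = [0.125, 0.25, 0.375, 0.5]
fractions = attainment(metric_values, best_value=0.5, thresholds=[0.5, 1.0])
```

Here three of the four parameterizations reach half of the best value, but only one reaches the best value itself.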



I did encounter some caveats while working through this process. Scripts for handling them and instructions are provided in the Diagnostic-Source README on Github. One caveat that is not covered there is increasing the speed of the hypervolume calculation. Please see Dave Hadka’s Hypervolume repository for more information (https://github.com/dhadka/Hypervolume).

# Algorithm Diagnostics Walkthrough using the Lake Problem as an example (Part 2 of 3: Calculate metrics for Analysis)

Tori Ward

This post continues from Part 1, which provided examples of using the MOEAFramework to generate Pareto approximate fronts for a comparative diagnostic study.

Once one has finished generating all of the approximate fronts and respective reference sets one hopes to analyze, metrics may be calculated within the MOEAFramework. I calculated metrics for both my local reference sets and all of my individual approximations of the Pareto front. The metrics for the individual approximations were averaged for each parameterization across all seeds to determine the expected performance for a single seed.
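The averaging described here can be pictured with a small Python sketch: for each parameterization, each metric is averaged across seeds. The nested-list layout, function name and numbers are assumptions for illustration, and the sketch does not handle missing or NaN entries.

```python
def average_across_seeds(metrics_by_seed):
    """metrics_by_seed[s][p] is the list of metric values for seed s, parameterization p.

    Returns one averaged list of metric values per parameterization.
    """
    n_seeds = len(metrics_by_seed)
    n_params = len(metrics_by_seed[0])
    averaged = []
    for p in range(n_params):
        rows = [metrics_by_seed[s][p] for s in range(n_seeds)]
        averaged.append([sum(col) / n_seeds for col in zip(*rows)])
    return averaged


# Two seeds x two parameterizations x two metrics (hypothetical values).
seed_1 = [[0.2, 1.0], [0.4, 3.0]]
seed_2 = [[0.4, 3.0], [0.8, 5.0]]
averaged = average_across_seeds([seed_1, seed_2])
```

Each row of `averaged` is then the expected performance of a single seed for that parameterization.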

Calculate Metrics

Local Reference Set Metrics

#!/bin/bash

NSAMPLES=50
NSEEDS=50
METHOD=Latin
PROBLEM=myLake4ObjStoch
ALGORITHMS=( NSGAII GDE3 eNSGAII MOEAD eMOEA Borg)

SEEDS=$(seq 1 ${NSEEDS})
JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar"
set -e

for ALGORITHM in ${ALGORITHMS[@]}
do
    NAME=${ALGORITHM}_${PROBLEM}
    PBS="\
#PBS -N ${NAME}\n\
#PBS -l nodes=1\n\
#PBS -l walltime=96:00:00\n\
#PBS -o output/${NAME}\n\
#PBS -e error/${NAME}\n\
cd \$PBS_O_WORKDIR\n\
java ${JAVA_ARGS} \
org.moeaframework.analysis.sensitivity.ResultFileEvaluator \
--b ${PROBLEM} --i ./SOW4/${ALGORITHM}_${PROBLEM}.reference \
--r ./SOW4/reference/${PROBLEM}.reference --o ./SOW4/${ALGORITHM}_${PROBLEM}.localref.metrics"
    echo -e $PBS | qsub
done

Individual Set Metrics

#!/bin/bash

NSAMPLES=50
NSEEDS=50
METHOD=Latin
PROBLEM=myLake4ObjStoch
ALGORITHMS=( NSGAII GDE3 eNSGAII MOEAD eMOEA Borg)

SEEDS=$(seq 1 ${NSEEDS})
JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar"
set -e

for ALGORITHM in ${ALGORITHMS[@]}
do
    for SEED in ${SEEDS}
    do
        NAME=${ALGORITHM}_${PROBLEM}_${SEED}
        PBS="\
#PBS -N ${NAME}\n\
#PBS -l nodes=1\n\
#PBS -l walltime=96:00:00\n\
#PBS -o output/${NAME}\n\
#PBS -e error/${NAME}\n\
cd \$PBS_O_WORKDIR\n\
java ${JAVA_ARGS} \
org.moeaframework.analysis.sensitivity.ResultFileEvaluator \
--b ${PROBLEM} --i ./SOW4/sets/${ALGORITHM}_${PROBLEM}_${SEED}.set \
--r ./SOW4/reference/${PROBLEM}.reference --o ./SOW4/metrics/${ALGORITHM}_${PROBLEM}_${SEED}.metrics"
        echo -e $PBS | qsub
    done
done



Average Individual Set Metrics across seeds for each parameterization

#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -N moeaevaluations
#PBS -j oe
#PBS -l walltime=96:00:00

cd "$PBS_O_WORKDIR"

NSAMPLES=50
NSEEDS=50
METHOD=Latin
PROBLEM=myLake4ObjStoch
ALGORITHMS=( NSGAII GDE3 eNSGAII MOEAD eMOEA Borg)

SEEDS=$(seq 1 ${NSEEDS})
JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar"
set -e

# Average the performance metrics across all seeds
for ALGORITHM in ${ALGORITHMS[@]}
do
    echo -n "Averaging performance metrics for ${ALGORITHM}..."
    java ${JAVA_ARGS} \
    org.moeaframework.analysis.sensitivity.SimpleStatistics \
    -m average --ignore -o ./metrics/${ALGORITHM}_${PROBLEM}.average ./metrics/${ALGORITHM}_${PROBLEM}_*.metrics
    echo "done."
done



At the end of this script, I also calculated the set contribution I mentioned earlier by including the following lines.

# Calculate set contribution
echo ""
echo "Set contribution:"
java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.SetContribution \
-e 0.01,0.01,0.001,0.01 -r ./reference/${PROBLEM}.reference ./reference/*_${PROBLEM}.combined

Part 3 covers using the MOEAFramework for further analysis of these metrics.

# Algorithm Diagnostics Walkthrough using the Lake Problem as an example (Part 1 of 3: Generate Pareto approximate fronts)

This three part series is an overview of the algorithm diagnostics I performed in my Lake Problem study with the hope that readers may apply the steps to any problem of interest. All of the source code for my study, including the scripts used for the diagnostics, can be found at https://github.com/VictoriaLynn/Lake-Problem-Diagnostics.

The first step to using the MOEAFramework for comparative algorithm diagnostics was to create the simulation model on which I would be assessing algorithm performance. The Lake Problem was written in C++. The executable alone could be used for optimization with Borg and I created a Java stub to connect the problem to the MOEAFramework (https://github.com/VictoriaLynn/Lake-Problem-Diagnostics/blob/master/Diagnostic-Source/myLake4ObjStoch.java). Additional information on this aspect of a comparative study can be found in examples 4 and 5 for the MOEAFramework (http://moeaframework.org/examples.html) and in Chapter 5 of the manual. I completed the study using version 2.1, which was the newest at the time. I used the all in one executable instead of the source code although I compiled my simulation code within the examples subfolder of the source code.

Once I had developed an appropriate simulation model to represent my problem, I could begin the diagnostic component of my study. I first chose algorithms of interest and determined the range of parameters from which I would like to sample. To determine parameter ranges, I consulted Table 1 of the 2013 AWR article by Reed et al.

Reed, P., et al. Evolutionary Multiobjective Optimization in Water Resources: The Past, Present, and Future. (Editor Invited Submission to the 35th Anniversary Special Issue), Advances in Water Resources, 51:438-456, 2013.

Example parameter files and the ones I used for my study can be found at https://github.com/VictoriaLynn/Lake-Problem-Diagnostics/tree/master/Diagnostic-Source/params.

Once I had established parameter files for sampling, I found chapter 8 of the MOEAFramework manual to be incredibly useful. Below I walk through the steps I took in generating approximations of the Pareto optimal front for my problem across multiple seeds, algorithms, and parameterizations. All of the commands have been consolidated into the file Lake_Problem_Comparative_Study.sh on Github, but I had many separate files during my study, which will be separated into steps here. It may have been possible to automate the whole process, but I liked breaking it up into separate scripts to make sure I checked that the output made sense after each step.

Step 1: Generate Parameter Samples

To generate parameter samples for each algorithm, I used the following code, which I kept in a file called sample_parameters.sh. I ran all .sh scripts using the general command sh script_name.sh.

NSAMPLES=500
METHOD=Latin
PROBLEM=myLake4ObjStoch
ALGORITHMS=(Borg MOEAD eMOEA NSGAII eNSGAII GDE3)
JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar"

# Generate the parameter samples
echo -n "Generating parameter samples..."
for ALGORITHM in ${ALGORITHMS[@]}
do
    java ${JAVA_ARGS} \
    org.moeaframework.analysis.sensitivity.SampleGenerator \
    --method ${METHOD} --n ${NSAMPLES} --p ${ALGORITHM}_params.txt \
    --o ${ALGORITHM}_${METHOD}
done
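For readers unfamiliar with the Latin sampling method named above, the following Python sketch shows the basic idea of Latin hypercube sampling: each parameter range is split into as many equal-width strata as there are samples, one value is drawn from each stratum, and the strata are shuffled independently for each parameter. This is a generic illustration, not the MOEAFramework SampleGenerator implementation, and the parameter bounds are hypothetical.

```python
import random


def latin_hypercube(n_samples, bounds, rng):
    """bounds: list of (low, high) per parameter. Returns n_samples rows of values."""
    columns = []
    for low, high in bounds:
        # One point drawn uniformly inside each of n_samples equal-width strata...
        points = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # ...then shuffled so strata pair up randomly across parameters.
        rng.shuffle(points)
        columns.append([low + p * (high - low) for p in points])
    return [list(row) for row in zip(*columns)]


# Hypothetical bounds for two algorithm parameters.
bounds = [(0.0, 1.0), (10.0, 20.0)]
samples = latin_hypercube(5, bounds, random.Random(42))
```

Each of the five rows is one parameterization, and every parameter has exactly one value in each fifth of its range, which gives more even coverage than plain random sampling for the same number of samples.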


Step 2: Optimize the problem using algorithms of interest

This step had two parts: optimization with Borg and optimization with the MOEAFramework algorithms. To optimize using Borg, one needs to request Borg at http://borgmoea.org/. This is the only step that needs to be completed outside of the MOEAFramework. I then used the following script to generate approximations to the Pareto front for all 500 samples and 50 random seeds. The -l and -u flags indicate the lower and upper bounds, respectively, for decision variable values. Fortunately, it should soon be possible to type one value and specify the number of variables with that bound instead of typing all 100 values as shown here.

#!/bin/bash
#50 random seeds

NSEEDS=50
PROBLEM=myLake4ObjStoch
ALGORITHM=Borg

SEEDS=$(seq 1 ${NSEEDS})

for SEED in ${SEEDS}
do
    NAME=${ALGORITHM}_${PROBLEM}_${SEED}
    PBS="\
#PBS -N ${NAME}\n\
#PBS -l nodes=1\n\
#PBS -l walltime=96:00:00\n\
#PBS -o output/${NAME}\n\
#PBS -e error/${NAME}\n\
cd \$PBS_O_WORKDIR\n\
./BorgExec -v 100 -o 4 -c 1 \
-l 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 \
-u 0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1 \
-e 0.01,0.01,0.0001,0.0001 -p Borg_params.txt -i Borg_Latin -s ${SEED} -f ./sets/${ALGORITHM}_${PROBLEM}_${SEED}.set -- ./LakeProblem4obj_control "
    echo -e $PBS | qsub
done

Optimization with the MOEAFramework allowed me to submit jobs for all remaining algorithms and seeds with one script as shown below. In my study, I actually submitted epsilon dominance algorithms (included -e flag) and point dominance algorithms (did not include -e flag) separately; however, it is my understanding that it would have been fine to submit jobs for all algorithms with the epsilon flag, especially since I converted all point dominance approximations to the Pareto front to epsilon dominance when generating reference sets.

#!/bin/bash

NSEEDS=50
PROBLEM=myLake4ObjStoch
ALGORITHMS=(MOEAD GDE3 NSGAII eNSGAII eMOEA)

SEEDS=$(seq 1 ${NSEEDS})
JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar"
set -e

for ALGORITHM in ${ALGORITHMS[@]}
do
    for SEED in ${SEEDS}
    do
        NAME=${ALGORITHM}_${PROBLEM}_${SEED}
        PBS="\
#PBS -N ${NAME}\n\
#PBS -l nodes=1\n\
#PBS -l walltime=96:00:00\n\
#PBS -o output/${NAME}\n\
#PBS -e error/${NAME}\n\
cd \$PBS_O_WORKDIR\n\
java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.Evaluator -p ${ALGORITHM}_params.txt -i ${ALGORITHM}_Latin -b ${PROBLEM} \
-a ${ALGORITHM} -e 0.01,0.01,0.0001,0.0001 -s ${SEED} -o ./sets/${NAME}.set"
        echo -e $PBS | qsub
    done
done



Step 3: Generate combined approximation set for each algorithm and Global reference set

Next, I generated a reference set for each algorithm’s performance. This was useful as it made it easier to generate the global reference set for all algorithms across all seeds and parameterizations and it allowed me to calculate a percent contribution for each algorithm to the global reference set. Below is the script for the algorithm reference sets:

#!/bin/bash

NSAMPLES=50
NSEEDS=50
METHOD=Latin
PROBLEM=myLake4ObjStoch
ALGORITHMS=( NSGAII GDE3 eNSGAII MOEAD eMOEA Borg)

JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar"
set -e

# Generate the combined approximation sets for each algorithm
for ALGORITHM in ${ALGORITHMS[@]}
do
    echo -n "Generating combined approximation set for ${ALGORITHM}..."
    java ${JAVA_ARGS} \
    org.moeaframework.analysis.sensitivity.ResultFileMerger \
    -b ${PROBLEM} -e 0.01,0.01,0.0001,0.0001 -o ./SOW4/reference/${ALGORITHM}_${PROBLEM}.combined \
    ./SOW4/sets/${ALGORITHM}_${PROBLEM}_*.set
    echo "done."
done

In the same file, I added the following lines to generate the global reference set while running the same script.
# Generate the reference set from all combined approximation sets
echo -n "Generating reference set..."
java ${JAVA_ARGS} org.moeaframework.util.ReferenceSetMerger \
    -e 0.01,0.01,0.0001,0.0001 -o ./SOW4/reference/${PROBLEM}.reference \
    ./SOW4/reference/*_${PROBLEM}.combined > /dev/null
echo "done."

If one wants to keep the decision variables associated with the reference set solutions, it is possible to use org.moeaframework.analysis.sensitivity.ResultFileMerger on all of the pertinent .set files. A final option for reference sets is to generate local reference sets for each parameterization of each algorithm. This was done with the following script:

#!/bin/bash

NSEEDS=50
ALGORITHMS=( GDE3 eMOEA Borg NSGAII eNSGAII MOEAD)
PROBLEM=myLake4ObjStoch
SEEDS=$(seq 1 ${NSEEDS})

# Evaluate all algorithms for all seeds
for ALGORITHM in ${ALGORITHMS[@]}
do
    java -cp MOEAFramework-2.1-Demo.jar org.moeaframework.analysis.sensitivity.ResultFileSeedMerger -d 4 -e 0.01,0.01,0.0001,0.0001 \
        --output ./SOW4/${ALGORITHM}_${PROBLEM}.reference ./SOW4/objs/${ALGORITHM}_${PROBLEM}*.obj
done
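
The percent contribution mentioned in Step 3 can be illustrated with a small sketch. This is not the calculation I ran; it is a simplified stand-in that assumes both files have already been reduced to objective values only, one solution per line with identical number formatting, and that lines starting with # are comments or separators. Under those assumptions, a solution counts as contributed when its line appears verbatim in both files. The function name percent_contribution and the demo values are hypothetical.

```shell
#!/bin/bash
# percent_contribution REFERENCE_FILE ALGORITHM_FILE
# Prints the percentage of reference-set lines that also appear in the
# algorithm's file. Exact text matching is an assumption of this sketch;
# it does not perform any epsilon-dominance comparison.
percent_contribution() {
    local reference_file="$1" algorithm_file="$2"
    awk '
        /^#/ || NF == 0 { next }                             # skip comments and blank lines
        FILENAME == ARGV[1] { in_algorithm[$0] = 1; next }   # first file: algorithm solutions
        { total++; if ($0 in in_algorithm) matched++ }       # second file: reference solutions
        END {
            if (total > 0) printf "%.1f\n", 100 * matched / total
            else print "0.0"
        }
    ' "$algorithm_file" "$reference_file"
}

# Demo with made-up objective values (not results from my study)
demo_reference=$(mktemp)
demo_algorithm=$(mktemp)
printf '0.1 0.2\n0.3 0.4\n0.5 0.6\n0.7 0.8\n#\n' > "$demo_reference"
printf '# made-up set\n0.3 0.4\n0.9 0.9\n#\n' > "$demo_algorithm"

demo_result=$(percent_contribution "$demo_reference" "$demo_algorithm")
echo "Contribution: ${demo_result}%"
rm -f "$demo_reference" "$demo_algorithm"
```

In the demo, one of the four reference solutions appears in the algorithm's file, so the sketch reports a 25.0% contribution.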



Part 2 of this post walks through my calculation of metrics.

# Getting Started: Git and GitHub

Getting started using Git and GitHub can be overwhelming.  The intent of this post is to provide basic background information and easy-to-follow instruction for a new user of Git and GitHub.  After reading this post, I recommend reading Jon Herman’s Intro to git: Part 1 and Part 2 posts for additional information, including greater detail on important commands.  Joe Kasprzyk’s post on GitHub Pages is also helpful.

What are Git and GitHub?

Git is an open source VCS (Version Control System). What does that mean? Essentially, it is a tool for managing and sharing file revisions. It can be used for code as well as other file types, such as Microsoft Word documents. Version control is important in group programming collaboration, so you definitely want to “git” on Git. Git is a distributed VCS, which allows you to push (share) and pull (acquire) version changes from a remote shared copy. Thus, you can work on your own changes to shared code locally, with the options of sending revisions to the remote master copy and incorporating collaborators’ changes from the remote copy into your local copy. Although Git is particularly useful for code collaboration, it is also beneficial for individual use to reduce headaches from losing changes or breaking code. To learn more about Git and how it differs from other VCSs, please see the Getting Started – Git Basics section of the Git Reference Book.

So, what is GitHub? GitHub hosts Git repositories (essentially project folders) and offers additional collaboration features. BitBucket is another example of a Git host. Since public repositories on GitHub are visible to everyone, other users can see how your code is evolving over time and offer input; this is the real power of Git / GitHub.

In order to use GitHub, you must first download Git and then set up GitHub. Both are typically operated from the command line (the shell), which lets you communicate with the operating system through typed commands rather than by point-and-click. However, if you are uncomfortable using the command line, there are GUIs available for both Git and GitHub.

Basic Terminology

There is quite a bit of lingo that you will want to get a handle on before continuing onward.  Below, I have provided a boiled down list of terms you need to know to get started.

Repository (or Repo): Location or “folder” for a project’s files and revision history

Fork: Copy of (or to copy) another user’s repo for you to use and/or edit without affecting the original repo

Clone: Copy of (or to copy) a repo on your local machine rather than on a server

Remote: Copy of a repo on a server that can be updated through syncing with local clones

Master Branch: Primary version of a repo

Branch:  Parallel version of a repo that allows you to make changes without affecting the master version

Upstream / Downstream: Upstream refers to previous versions or primary branches and downstream refers to changes on forks or branches you are working on.

Merge: Applying the changes from one branch to another

Commit: Change (or revision) made to a repo.  Be sure to write a clear commit message when “saving” or making the commit so that the next user understands the changes.

Pull: Taking changes from a remote repo and merging them with your local branch

Pull Request: Method to propose changes to a repo so that its owner or collaborators can review and merge them

Push: Sending updates to a remote repo

Owner: Original creator of a repo

Collaborator: One that is invited to contribute to a repo by the owner

Contributor:  One that has contributed to a repo without collaborator access
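
To make a few of the terms above concrete, here is a minimal sketch that exercises repository, commit, branch, and merge in a throwaway directory. It assumes Git is already installed; the file name notes.txt, the branch name experiment, and the demo identity are placeholders chosen only for illustration.

```shell
#!/bin/bash
# Work in a temporary directory so nothing here touches a real project.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

git init --quiet .                            # Repository: start tracking this folder
git config user.name "Demo User"              # identity for this demo repo only
git config user.email "demo@example.com"
primary=$(git symbolic-ref --short HEAD)      # name of the primary branch (e.g. master)

echo "first line" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes file"        # Commit: record a revision with a clear message

git checkout --quiet -b experiment            # Branch: parallel version of the repo
echo "second line" >> notes.txt
git commit --quiet -am "Add a second line to notes"

git checkout --quiet "$primary"               # return to the primary branch
git merge --quiet experiment                  # Merge: apply the branch's changes

commit_count=$(git rev-list --count HEAD)
echo "Commits on ${primary}: ${commit_count}"
```

Push, pull, fork, and pull request all involve a remote copy, so they are not shown here; they follow the same pattern once a remote repo is set up.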

Steps to Get Started

Follow the outlined steps below to get up-and-running on Git / GitHub.  Please provide comments if any steps are unclear.

2.  Install Git

For Windows Installation:

• Leave the default components
• Opt to use “Git from Git Bash only” to prevent changes to your PATH

3.  Set Up Git

After the download is complete, open Git Bash (Windows) or the terminal (Mac / Linux). Bash is a UNIX shell, which means that you need to use Linux commands rather than the Windows commands typically used in the Windows command prompt.

First, you want to make a few configuration changes to set up your identity so that your commits are labeled.  Since you will be using GitHub, no other setup is required for Git.


$ git config --global user.name "Your Name in Quotes"
$ git config --global user.email "Your E-mail in Quotes"



Second, you want to authenticate with GitHub from Git, which means that you will select a communications protocol, HTTPS or SSH, that will allow you to connect to a GitHub repo from Git. GitHub provides clear setup instructions for either choice at https://help.github.com/articles/set-up-git.

If you would like to limit time using the command line, you will want to download the GitHub desktop client (for Windows or for Mac).  This is especially helpful if you want to clone with SSH because the desktop client will configure SSH keys for you without use of the command line.
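
The choice between HTTPS and SSH shows up in the URL that Git stores for a remote. The sketch below, which assumes Git is installed, sets each form on a throwaway repo and reads it back without contacting GitHub. The values your-username and your-repo are placeholders, not a real repository.

```shell
#!/bin/bash
# OWNER and REPO are placeholders for illustration only.
OWNER="your-username"
REPO="your-repo"

# Use a throwaway repository so no real project is modified.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
git init --quiet .

# HTTPS form of a GitHub remote URL
git remote add origin "https://github.com/${OWNER}/${REPO}.git"
https_url=$(git remote get-url origin)

# SSH form of the same remote (requires SSH keys to be set up before use)
git remote set-url origin "git@github.com:${OWNER}/${REPO}.git"
ssh_url=$(git remote get-url origin)

echo "HTTPS: ${https_url}"
echo "SSH:   ${ssh_url}"
```

Switching protocols later only requires changing the remote URL as above; the local history is unaffected.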

What’s Next?

You are all set to start using Git / GitHub to collaborate on code. You will want to practice creating a repo, forking a repo, making a commit, etc. To do so, follow Jon Herman’s posts, Intro to git: Part 1 and Part 2.