# Fisheries Training Part 1 – Harvest Optimization and MOEA Diagnostics

Welcome to the second post in the Fisheries Training Series, in which we are studying decision making under deep uncertainty within the context of a complex harvested predator-prey fishery. The accompanying GitHub repository, containing all of the source code used throughout this series, is available here. The full, in-depth Jupyter Notebook version of this post is available in the repository as well.

This post builds off of the initial post, Fisheries Training 0: Exploring Predator-Prey Dynamics, and presents the following:

1. A brief re-cap of the harvested predator-prey model
2. Formulation of the harvesting policy and an overview of radial basis functions (RBFs)
3. Formulation of the policy objectives
4. A simulation model for the harvested system
5. Optimization of the harvesting policy using the PyBorg MOEA
• Installation of Platypus and PyBorg*
• Optimization problem formulation
• Basic MOEA diagnostics

Note
*The PyBorg MOEA used in this demonstration is derived from the Borg MOEA and may only be used with permission from its creators. Fortunately, it is freely available for academic and non-commercial use. Visit BorgMOEA.org to request access.

Now, onto the tutorial!

## Harvested predator-prey model

In the previous post, we introduced a modified form of the Lotka-Volterra system of ordinary differential equations (ODEs) defining predator-prey population dynamics.

This modified version includes a non-linear predator population growth dynamic original proposed by Arditi and Akçakaya (1990), and includes a harvesting parameter, $z$. This system of equations is defined in Hadjimichael et al. (2020) as:

$\frac{dx}{dt} = bx\Big(1 - \frac{x}{K}\Big) - \frac{\alpha xy}{y^m + \alpha hx} - zx$

$\frac{dy}{dt} = \frac{c\alpha xy}{y^m + \alpha hx} - dy$

Where $x$ is the prey population being harvested and $y$ is the predator population. Please refer to Post 0 of this series for the rest of the parameter descriptions, and for insights into the non-linear dynamics that result from these ODEs. It also demonstrates how the system alternates between ‘basins’ of stability and population collapse.

## Harvesting policy

In this post, we instead focus on the generation of harvesting policies which can be operated safely in the system without causing population collapse. Rather than assigning a deterministic (specific, pre-defined) harvest effort level for every time period, we instead design an adaptive policy which is a function of the current state of the system:

$z_t = f(\cdot)$

The problem then becomes the optimization of the control rule, $f(\cdot)$, rather than specific parameter values, $z = [z_1, z_2, ..., z_t]$. The process of optimizing the parameters of a state-aware control rule is known as Direct Policy Search (DPS; Quinn et al, 2017).

Previous work done by Quinn et al. (2017) showed that an adaptive policy, generated using DPS, was able to navigate deeply uncertain ecological tipping points more reliably than intertemporal policies which prescribed specific efforts at each timestep.

The core of the DPS method are radial basis functions (RBFs), which are flexible, parametric function formulations that map the current state of the system to policy action. A previous study by Giuliani et al (2015) demonstrated that RBFs are highly effective in generating Pareto-approximate sets of solutions, and that they perform well when applied to horizons different from the optimized simulation horizon.

There are various RBF approaches available, such as the cubic RBF used by Quinn et al. (2017). Here, we use the Gaussian RBF introduced by Hadjimichael et al. (2020), where the harvest effort during the next timestep, $z_{t+1}$, is mapped to the current prey population levels, $x_t$ by the function:

$z_{t+1} = \sum_{i=1}^n w_i \Big[exp\Big[-\Big(\frac{x_t-c_i}{b_i}\Big)^2\Big]\Big]$

In this formulation $c_i, r_i,$ and $w_i$ are the center, radius, and weights of each RBF $i$ respectively. Additionally, $n$ is the number of RBFs used in the function; in this study we use $n = 2$ RBFs. With two RBFs, there are a total of 6 parameters. Increasing the number of RBFs allows for more flexible function forms to be achieved. However, two RBFs have been shown to be sufficient for this problem.

The sum of the weights must be equal to one, such that:

$\sum_{i=1}^n w_i= 1$

The function harvest_streategy() is contained within the fish_game_functions.py script, which can be accessed here in the repository.

A simplified rendition of the harvest_strategy() function, evaluate_RBF(), is shown below and uses the RBF parameter values (i.e., $[c_1, b_1, w_1, c_2, b_2, w_2]$), and the current prey population, to calculate the next year’s harvesting effort.

import numpy as np
import matplotlib.pyplot as plt

def evaluate_RBF(x, RBF_params, nRBFs):
"""
Parameters:
-----------
x : float
The current state of the system.
RBF_params : list [3xnRBFs]
The RBF parameters in the order of [c, r, w,...,c, r, w].
nRBFs : int
The number of RBFs used in the mapping function.

Returns:
--------
z : float
The policy action.
"""

c = RBF_params[0::3]
r = RBF_params[1::3]
w = RBF_params[2::3]

# Normalize the weights
w_norm = []
if np.sum(w) != 0:
for w_i in w:
w_norm.append(w_i / np.sum(w))
else:
w_norm = (1/nRBFs)*np.ones(len(w))

z = 0.0

for i in range(nRBFs):

# Avoid division by zero
if r[i] != 0:
z = z + w[i] * np.exp(-((x - c[i])/r[i])**2)
else:
z = z + w[i] * np.exp(-((x - c[i])/(10**-6))**2)

# Impose limits on harvest effort
if z < 0:
z = 0
elif z > 1:
z = 1

return z


To better understand the nature of the harvesting policy, it is helpful to visualize the policy function, $z = f(\cdot)$.

For some arbitrary selection of RBF parameters:

$[c_1, b_1, w_1, c_2, b_2, w_2] = [0.2, 1.1, 0.41, 0.34,0.7, 0.59]$

The following function will plot the harvesting strategy:

def plot_RBF_policy(x_range, x_label, y_range, y_label, RBF_params, nRBFs):
"""
Parameters:
-----------
RBF_params : list [3xnRBFs]
The RBF parameters in the order of [c, r, w,...,c, r, w].
nRBFs : int
The number of RBFs used in the mapping function.

Returns:
--------
None.
"""
# Step size
n = 100
x_min = x_range[0]
x_max = x_range[1]
y_min = y_range[0]
y_max = y_range[1]

# Generate data
x_vals = np.linspace(x_min, x_max, n)
y_vals = np.zeros(n)

for i in range(n):
y = evaluate_RBF(x_vals[i], RBF_params, nRBFs)

# Check that assigned actions are within range
if y < y_min:
y = y_min
elif y > y_max:
y = y_max

y_vals[i] = y

# Plot
fig, ax = plt.subplots(figsize = (5,5), dpi = 100)
ax.plot(x_vals, y_vals, label = 'Policy', color = 'green')
ax.set_xlabel(x_label)
ax.set_ylabel(y_label)
ax.set_title('RBF Policy')
plt.show()
return


Let’s take a look at the policy that results from the random RBF parameters listed above. Setting my problem-specific inputs, and running the function:

# Set the RBF parameters
nRBFs = 2
RBF_params = [0.2, 1.1, 0.41, 0.34,0.7, 0.59]

# Specify plot ranges
x_range = [0, 1]
x_label = 'Population ($x$)'
y_range = [0,1]
y_label = 'Harvest Effort ($z$)'

# Plot the policy curve
plot_RBF_policy(x_range, x_label, y_range, y_label, RBF_params, nRBFs)


This policy does not make much intuitive sense… why should harvesting efforts be decreased when the fish population is large? Well, that’s because we chose these RBF parameter values randomly.

To demonstrate the flexibility of the RBF functions and the variety of policy functions that can result from them, I generated a few (n = 7) policies using a random sample of parameter values. The parameter values were sampled from a uniform distribution over each parameters range: $c_i, b_i, w_i \in [0,1]$. Below is a plot of the resulting random policy functions:

Finding the sets of RBF parameter values that result in Pareto-optimal harvesting policies is the next step in this process!

## Harvest strategy objectives

We take a multi-objective approach to the generation of a harvesting strategy. Given that the populations are vulnerable to collapse, it is important to consider ecological objectives in the problem formulation.

Here, we consider five objectives, described below.

#### Objective 1: Net present value

The net present value (NPV) is an economic objective corresponding to the amount of fish harvested.

During the simulation-optimization process (later in this post), we simulate a single policy $N$ times, and take the average objective score over the range of simulations. This method helps to account for variability in expected outcomes due to natural stochasticity. Here, we use $N = 100$ realizations of stochasticity.

With that in mind, the NPV ($O_1$) is calculated as:

$O_1 = \frac{1}{N} \sum_{i=1}^N\Big( \sum_{t=0}^T \frac{z_{t+1,i}x_{t,i}}{(1+\delta)^t}\Big)$

where $\delta$ is the discount rate which converts future benefits to present economic value, here $\delta = 0.05$.

#### Objective 2: Prey population deficit

The second objective aims to minimize the average prey population deficit, relative to the prey population carrying capacity, $K$:

$O_2 = \frac{1}{N} \sum_{i=1}^N\Big( \frac{1}{T} \sum_{t=1}^T \frac{K - x_{t,i}}{K}\Big)$

#### Objective 3: Longest duration of consecutive low harvest

In order to maintain steady harvesting levels, we minimize the longest duration of consecutive low harvests. Here, a subjective definition of low harvest is imposed. In a practical decision making process, this threshold may be solicited from the relevant stakeholders.

Objective 3 is defined as:

$O_3 = \frac{1}{N} \sum_{i=1}^N(max_T(\phi_{t,i}))$

where

And the low harvest limit is: $limit = 5\%$.

#### Objective 4: Worst harvest instance

In addition to avoiding long periods of continuously low harvest, the stakeholders have a desire to limit financial risks associated with very low harvests. Here, we minimize the worst 1% of harvest.

The fourth objective is defined as:

$O_4 = \frac{1}{N} \sum_{i=1}^N(percentile_T(z_{t+1,i}x_{t,i}, 1))$

#### Objective 5: Harvest variance

Lastly, policies which result in low harvest variance are more easily implemented, and can limit corresponding variance in fish populations.

The last objective minimizes the harvest variance, with the objective score defined as:

$O_5 = \frac{1}{N} \sum_{i=1}^N(Var_T(z_{t+1,i}x_{t,i}))$

#### Constraint: Avoid collapse of predator population

During the optimization process, we are able to include constraints on the harvesting policies.

Since population collapse is a stable equilibrium point, from which the population will not regrow, it is imperative to consider policies which prevent collapse.

With this in mind, the policy must not result in any population collapse across the $N$ realizations of environmental stochasticity. Mathematically, this is enforced by:

$\frac{1}{N} \sum_{i=1}^N(\Psi_{t,i})) = 0$

where

#### Problem formulation

Given the objectives described above, the optimization problem is:

$Minimize \ F(z_x) = (-O_1, O_2, O_3, -O_4, O_5)$

## Simulation model of the harvested system

Here, we provide an overview of the fish_game_5_objs() model which combines many of the preceding topics. The goal for this model is to take a set of RBF parameters, which define the harvesting strategy, simulate the policy for some length of time, and then return the objective scores resulting from the policy.

Later, this model will allow for the optimization of the harvesting policy RBF parameters through a Multi-Objective Evolutionary Algorithm (MOEA). The MOEA will evaluate many thousands of policies (RBF parameter combinations) and attempt to find, through evolution, those RBF parameters which yield best objective performance.

A brief summary of the model process is described here, but the curious learner is encouraged to take a deeper look at the code and dissect the process.

The model can be understood as having three major sections:

1. Initialization of storage vectors, stochastic variables, and assumed ODE parameters.
2. Simulation of policy and fishery populations over time period T.
3. Calculation of objective scores.
def fish_game_5_objs(vars):
"""
Defines the full, 5-objective fish game problem to be solved

Parameters
----------
vars : list of floats
Contains the C, R, W values

Returns
-------
objs, cnstr
"""

# Get chosen strategy
strategy = 'Previous_Prey'

# Define variables for RBFs
nIn = 1 # no. of inputs (depending on selected strategy)
nOut = 1 # no. of outputs (depending on selected strategy)
nRBF = 2 # no. of RBFs to use

nObjs = 5
nCnstr = 1 # no. of constraints in output

tSteps = 100 # no. of timesteps to run the fish game on
N = 100 # Number of realizations of environmental stochasticity

# Define assumed system parameters
a = 0.005
b = 0.5
c = 0.5
d = 0.1
h = 0.1
K = 2000
m = 0.7
sigmaX = 0.004
sigmaY = 0.004

# Initialize storage arrays for populations and harvest
x = np.zeros(tSteps+1) # Prey population
y = np.zeros(tSteps+1) # Predator population
z = np.zeros(tSteps+1) # Harvest effort

# Create array to store harvest for all realizations
harvest = np.zeros([N,tSteps+1])
# Create array to store effort for all realizations
effort = np.zeros([N,tSteps+1])
# Create array to store prey for all realizations
prey = np.zeros([N,tSteps+1])
# Create array to store predator for all realizations
predator = np.zeros([N,tSteps+1])

# Create array to store metrics per realization
NPV = np.zeros(N)
cons_low_harv = np.zeros(N)
harv_1st_pc = np.zeros(N)
variance = np.zeros(N)

# Create arrays to store objectives and constraints
objs = [0.0]*nObjs
cnstr = [0.0]*nCnstr

# Create array with environmental stochasticity for prey
epsilon_prey = np.random.normal(0.0, sigmaX, N)

# Create array with environmental stochasticity for predator
epsilon_predator = np.random.normal(0.0, sigmaY, N)

# Go through N possible realizations
for i in range(N):

# Initialize populations and values
x[0] = prey[i,0] = K
y[0] = predator[i,0] = 250
z[0] = effort[i,0] = harvest_strategy([x[0]], vars, [[0, K]], [[0, 1]], nIn, nOut, nRBF)
NPVharvest = harvest[i,0] = effort[i,0]*x[0]

# Go through all timesteps for prey, predator, and harvest
for t in range(tSteps):

# Solve discretized form of ODE at subsequent time step
if x[t] > 0 and y[t] > 0:
x[t+1] = (x[t] + b*x[t]*(1-x[t]/K) - (a*x[t]*y[t])/(np.power(y[t],m)+a*h*x[t]) - z[t]*x[t])* np.exp(epsilon_prey[i]) # Prey growth equation
y[t+1] = (y[t] + c*a*x[t]*y[t]/(np.power(y[t],m)+a*h*x[t]) - d*y[t]) *np.exp(epsilon_predator[i]) # Predator growth equation

# Solve for harvesting effort at next timestep
if t <= tSteps-1:
if strategy == 'Previous_Prey':
input_ranges = [[0, K]] # Prey pop. range to use for normalization
output_ranges = [[0, 1]] # Range to de-normalize harvest to
z[t+1] = harvest_strategy([x[t]], vars, input_ranges, output_ranges, nIn, nOut, nRBF)

# Store values in arrays
prey[i,t+1] = x[t+1]
predator[i,t+1] = y[t+1]
effort[i,t+1] = z[t+1]
harvest[i,t+1] = z[t+1]*x[t+1]
NPVharvest = NPVharvest + harvest[i,t+1]*(1+0.05)**(-(t+1))

# Solve for objectives and constraint
NPV[i] = NPVharvest
low_hrv = [harvest[i,j]<prey[i,j]/20 for j in range(len(harvest[i,:]))] # Returns a list of True values when there's harvest below 5%
count = [ sum( 1 for _ in group ) for key, group in itertools.groupby( low_hrv ) if key ] # Counts groups of True values in a row
if count: # Checks if theres at least one count (if not, np.max won't work on empty list)
cons_low_harv[i] = np.max(count)  # Finds the largest number of consecutive low harvests
else:
cons_low_harv[i] = 0
harv_1st_pc[i] = np.percentile(harvest[i,:],1)
variance[i] = np.var(harvest[i,:])

# Average objectives across N realizations
objs[0] = -np.mean(NPV) # Mean NPV for all realizations
objs[1] = np.mean((K-prey)/K) # Mean prey deficit
objs[2] = np.mean(cons_low_harv) # Mean worst case of consecutive low harvest across realizations
objs[3] = -np.mean(harv_1st_pc) # Mean 1st percentile of all harvests
objs[4] = np.mean(variance) # Mean variance of harvest

cnstr[0] = np.mean((predator < 1).sum(axis=1)) # Mean number of predator extinction days per realization

# output should be all the objectives, and constraint
return objs, cnstr


The next section shows how to optimize the harvest policy defined by vars, using the PyBorg MOEA.

## A (Very) Brief Overview of PyBorg

PyBorg is the secondary implementation of the Borg MOEA written entirely in Python by David Hadka and Andrew Dircks. It is made possible using functions from the Platypus optimization library, which is a Python evolutionary computing framework.

As PyBorg is Borg’s Python wrapper and thus derived from the original Borg MOEA, it can only be used with permission from its creators. To obtain permission for download, please visit BorgMOEA and complete the web form. You should receive an email with the link to the BitBucket repository shortly.

## Installation

The repository you have access to should be named ‘Serial Borg MOEA’ and contain a number of folders, including one called Python/. Within the Python/ folder, you will be able to locate a folder called pyborg. Once you have identified the folder, please follow these next steps carefully:

1. Check your current Python version. Python 3.5 or later is required to enable PyBorg implementation.
2. Download the pyborg folder and place it in the folder where this Jupyter Notebook all other Part 1 training material is located.
3. Install the platypus library. This can be in done via your command line by running one of two options:

If you are using pip:
pip install platypus-opt


If you are using conda:

conda config --add channels conda-forge
conda install platypus-opt

1. Make sure the following training startup files are located within the same folder as this Jupyter Notebook:
a) fish_game_functions.py: Contains all function definitions to setup the problem, run the optimization, plot the hypervolume, and conduct random seed analysis.
b) Part 1 - Harvest Optimization and MOEA Diagnostics.ipynb: This is the current notebook and where the Fisheries fame will be demonstrated.

We are now ready to proceed!

## Optimization of the Fisheries Game

### Import all libraries

All functions required for this post can be found in the fish_game_functions.py file. This code is adapted from Antonia Hadjimichael’s original post on exploring the Fisheries Game dynamics using PyBorg.

# import all required libraries
from platypus import Problem, Real, Hypervolume, Generator
from pyborg import BorgMOEA
from fish_game_functions import *
from platypus import Problem, Real, Hypervolume, Generator
from pyborg import BorgMOEA
import time
import random


### Formulating the problem

Define number of decision variables, constraints, and specify problem formulation:

# Set the number of decision variables, constraints and performance objectives
nVars = 6   # Define number of decision variables
nObjs = 5   # Define number of objectives
nCnstr = 1      # Define number of decision constraints

# Define the upper and lower bounds of the performance objectives
objs_lower_bounds = [-6000, 0, 0, -250, 0]
objs_upper_bounds = [0, 1, 100, 0, 32000]


### Initialize the problem for optimization

We call the fisheries_game_problem_setup.py function to set up the optimization problem. This function returns a PyBorg object called algorithm in this exercise that will be optimized in the next step.

def fisheries_game_problem_setup(nVars, nObjs, nCnstr, pop_size=100):
"""
Sets up and runs the fisheries game for a given population size

Parameters
----------
nVars : int
Number of decision variables.
nObjs : int
Number of performance objectives.
nCnstr : int
Number of constraints.
pop_size : int, optional
Initial population size of the randomly-generated set of solutions.
The default is 100.

Returns
-------
algorithm : pyBorg object
The algorthm to optimize with a unique initial population size.

"""
# Set up the problem
problem = Problem(nVars, nObjs, nCnstr)
nVars = 6   # Define number of decision variables
nObjs = 5   # Define number of objective -- USER DEFINED
nCnstr = 1      # Define number of decision constraints

problem = Problem(nVars, nObjs, nCnstr)

# set bounds for each decision variable
problem.types[0] = Real(0.0, 1.0)
problem.types[1] = Real(0.0, 1.0)
problem.types[2] = Real(0.0, 1.0)
problem.types[3] = Real(0.0, 1.0)
problem.types[4] = Real(0.0, 1.0)
problem.types[5] = Real(0.0, 1.0)

# all values should be nonzero
problem.constraints[:] = "==0"

# set problem function
if nObjs == 5:
problem.function = fish_game_5_objs
else:
problem.function = fish_game_3_objs

algorithm = BorgMOEA(problem, epsilons=0.001, population_size=pop_size)
return algorithm

# initialize the optimization
algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr)


### Define parameters for optimization

Before optimizing, we have to define our desired population size and number of function evaluations (NFEs). The NFEs correspond to the number of evolutions of the set of solutions. For complex, many-objective problems, it may be necessary for a large NFE.

Here, we start with a small limit on NFE, to test the speed of the optimization. Limiting the optimization to 100 NFE is going to produce relatively poor performing solutions, however it is a good starting point for our diagnostic tests.

init_nfe = 100
init_pop_size = 100


### Begin the optimization

In addition to running the optimization, we also time the optimization to get a general estimate on the time the full hypervolume analysis will require.

# begin timing the Borg run
borg_start_time = time.time()

algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr, pop_size=int(init_pop_size))
algorithm.run(int(init_nfe))

# end timing and print optimization time
borg_end_time = time.time()

borg_total_time = borg_end_time - borg_start_time

print(f"borg_total_time={borg_total_time}s")

Output: borg_total_time=33.62936472892761s

NOTICE:
Running the PyBrog MOEA 100 times took ~34 seconds (on the machine which this was written on…). Keep this in mind, that increasing the NFE will require correspondingly more time. If you increase the number too much, your machine may take a long time to compute the final Pareto-front.

Here, we plot a 3-dimensional plot showing the tradeoff between a select number of objectives. If you have selected the 5-objective problem formulation, you should select the three objectives you would like to analyze the tradeoff surface for. Please select the (abbreviated) objective names from the following list:

Objective 1: Mean NPV
Objective 2: Mean prey deficit
Objective 3: Mean WCLH
Objective 4: Mean 1% harvest
Objective 5: Mean harvest variance

# Plot objective tradeoff surface
fig_objs = plt.figure(figsize=(8,8))

# Select the objectives to plot from the list provided in the description above
obj1 = 'Mean NPV'
obj2 = 'Mean prey deficit'
obj3 = 'Mean 1% harvest'

plot_3d_tradeoff(algorithm, ax_objs, nObjs, obj1, obj2, obj3)


The objectives scores arn’t very good, but that is because the number of function evaluations is so low. In order to get a better set of solutions, we need to run the MOEA for many function evaluations.

The next section demonstrates the change in objective performance with respect to the number of function evaluations.

## MOEA Diagnostics

A good MOEA is assessed by it’s ability to quickly converge to a set of solutions (the Pareto-approximate set) that is also diverse. This means that the final set of solutions is close to the true set, as well as covers a large volume of the multi-dimensional problem space. There are three quantitative metrics via which convergence and diversity are evaluated:

1. Generational distance approximates the average distance between the true Pareto front and the Pareto-approximate reference set that your MOEA identifies. It is the easiest metric to meet.
2. Epsilon indicator is a harder metric than generational distance to me et. A high-performing MOEA will have a low epsilon indicator value when the distance of its worst-performing approximate solution from the true Pareto set is small.
3. Hypervolume measures the ‘volume’ that a Pareto front covers across all dimensions of a problem. It is the hardest metric to meet and the most computationally intensive.

Both the generational distance and epsilon indicator metrics require a reference set, which is the known, true Pareto front. Conversely, the hypervolume does not have such a requirement. Given that the Fisheries Game is a complex, multi-dimensional, many-stakeholder problem with no known solution, the hypervolume metric is thus the most suitable to evaluate the ability of PyBorg to quickly converge to a diverse Pareto-approximate set of solutions.

More detailed descriptions of each metric are provided in this handy blog post by Joe Kasprzyk.

### Hypervolume

The hypervolume is a measure of the multi-dimensional volume dominated by the approximated Pareto front. As the Pareto front advances toward the “ideal” solution, this value approaches 1.

The efficiency of an MOEA in optimizing a solution can be considered by measuring the hypervolume with respect to the number of function evaluations. This allows the user to understand how quickly the MOEA is converging to a good set of solutions, and how many function evaluations are needed to achieve a good set of solutions.

#### Defining hypervolume parameters

First, we define the maximum number of function evaluations (maxevals) and the NFE step size (frequency) for which we would like to evaluate the problem hypervolume over. Try modifying these values to see how the plot changes.

Mind that the value of maxevals should always be more than that of your initial NFE, and that the value of frequency should be less than that of the initial NFE. Both values should be integer values.

Also be mindful that increasing the maxevals > 1000 is going to result in long runtimes.

maxevals = 500
frequency = 100


#### Plotting the hypervolume

Using these parameters, we then plot the hypervolume graph, showing the change in hypervolume value over the NFEs.

fig_hvol = plt.figure(figsize=(10,7))

plot_hvol(algorithm, maxevals, frequency, objs_lower_bounds, objs_upper_bounds, ax_hvol)

plt.title('PyBorg Runtime (Hypervolume)')
plt.xlabel('Number of Function Evaluations')
plt.ylabel('Hypervolume')
plt.show()


#### Perform random seed analysis

Next, we perform random seed analysis (RSA).

Generally, RSA is performed to track an algorithm’s performance during search. In addition, it is also done to determine if an algorithm has discovered an acceptable approximation of the true Pareto set. More details on RSA can be found here in a blog post by Dave Gold.

For the Fisheries Game, we conduct RSA to determine if PyBorg’s performance is sensitive to the size of its initial population. We do this using the folllowing steps:

1. Run an ensemble of searches, each starting with a randomly sampled set of initial conditions (aka “random seeds”)
2. Combine search results across all random seeds to generate a “reference set” that contains only the best non-dominated solutions across the ensemble
3. Repeat steps 1 and 2 for an initial population size of 200, 400, etc.
pop_size_list = [100, 200, 400, 800, 1000]

fig_rand_seed = plt.figure(figsize=(10,7))

for p in range(len(pop_size_list)):
fisheries_game_problem_setup(nVars, nObjs, nCnstr, pop_size_list[p])
algorithm = fisheries_game_problem_setup(nVars, nObjs, nCnstr, pop_size=int(init_pop_size))
algorithm.run(int(init_nfe))

plot_hvol(algorithm, maxevals, frequency, objs_lower_bounds, objs_upper_bounds,
ax_rand_seed, pop_size_list[p])

plt.title('PyBorg Random Seed Analysis')
plt.xlabel('Number of Function Evaluations')
plt.ylabel('Hypervolume')
plt.legend()
plt.show()


Notice that the runs performed with different initial population sizes tend to converge toward a similar hypervolume value after 500 NFEs.

This reveals that the PyBorg MOEA is not very sensitive to the specific initial parameters; it is adaptable enough to succeed under different configurations.

## Conclusion

A classic decision-making idiom says ‘defining the problem is the problem’. Hopefully, this post has revealed that to be true; we have shown that changes to the harvesting strategy functions, simulation model, or objective scores can result in changes to the resulting outcomes.

And if you’ve made it this far, congratulations! Take a minute to think back on the progression of this post: we revisited the harvested predator-prey model, formulated the harvesting policy using RBFs, and formulated the policy objectives and its associated simulation model. Next, we optimized the harvesting policy using the PyBorg MOEA and performed basic MOEA diagnostics using hypervolume as our measure, and executed random seed analysis.

If you’ve progressed through this tutorial using the Jupyter Notebook, we encourage you to re-visit the source code involved in this process. The next advisable step is to re-produce this problem from scratch, as this is the best way to develop a detailed understanding of the process.

Next time, we will explore the outcomes of this optimization, by considering the tradeoffs present across the Pareto set of solutions.

# Containerizing Rhodium with Docker

Ever installed a new library only for it to throw version depreciation errors up on your terminal? Or have warnings print in your output line instead of the figure you so painstakingly coded? Fear not – containerization is here to save the day! But before we get too excited, there are a few things (and terms) to learn about containerization using Docker.

In this post, we will be walking through a brief description of containerization and explain a few of its key terms. At the end, we will perform an exercise by containerizing the Rhodium robust decision-making library. More information about this library and how it can be used for exploratory modeling can be found in this post by Andrew Dircks. For a specific application of this library to the Lake Problem, please refer to this post by Antonia Hadjimichael.

## Explaining containerization (and what is a base image?)

Using the image above, picture your hardware (laptop, desktop, supercomputer) as a large cargo ship, with its engines being its operating system. In the absence of containerization, an application (app) is developed in a specific computing environment, akin to placing cargo in a permanent storage hold under the deck of a ship. Methods for cargo loading and removal are strongly dictated by the shape and size of the ship. Similarly, a non-containerized app can only be reliably executed given that it is installed in a computing environment that is nearly or almost completely identical to that in which is was developed in.

On the contrary, containerization bundles everything an app might need to run in a ‘container’ – the code, its required libraries, and their associated dependencies – therefore enabling an app to be run consistently on any infrastructure. By extension, this renders a containerized application version- and operating system (OS)-independent. These ‘containers’ are thus easily loaded and installed onto any ‘cargo ship’. The piece of software that enables the efficient execution of containerized apps is the container engine. This nifty tool is responsible for handling app user input and ensuring the correct installation, startup and running of the containerized app. The engine also pulls, loads, and builds the container image, which is a (misleadingly-named) file, or repository of files, that contains all the information that the engine will need to build the app on a new machine.

In this post, we will be walking through the containerization of the Rhodium library using Docker, which is a container hub that let’s you develop, store and build your container images. It is the first and most commonly-used container hub (at the moment).

## Let’s containerize!

### Setup

If you use either a Windows or Mac machine, please install Docker Desktop from this site. Linux machines should first install Docker and then Docker Compose. Make sure to create an account and login.

Next, clone the PRIM, Platypus, and Rhodium repositories onto your local machine. You can directly download a .zip file of the repository here or you can clone the repository via your command line/terminal into a folder within a directory of your choice:

C:\Users\username> cd directory-of-choice


Great, your repositories are set and ready to go! These should result in three new folders: Rhodium, Platypus, and PRIM. Now, in the same terminal window, navigate to the PRIM folder and run the following:

C:\Users\username\directory-of-choice\PRIM>python setup.py develop


Repeat for the Platypus folder. This is to make sure that you have both PRIM and Project Platypus installed and setup on your local machine.

### Updating the requirements.txt file

Now, navigate back to the original directory-of-choice. Open the Rhodium folder, locate and open the requirements.txt file. Modify it so it looks like this:

matplotlib==3.5.1
numpy==1.22.1
pandas==1.4.0
mpldatacursor==0.7.1
six==1.16.0
scipy==1.7.3
prim
platypus-opt
sklearn==1.0.2


This file tells Docker that these are the required versions of libraries to install when building and installing your app.

### Creating a Dockerfile

To begin building the container image for Docker to pack and build as Rhodium’s container, first create a new text file and name is Dockerfile within the Rhodium folder. Make sure to remove the .txt extension and save it as “All types” to avoid appending an extension. Open it using whichever text file you are comfortable with. The contents of this file should look like as follows. Note that the comments are for explanatory purposes only.

# state the base version of Python you are working with
# for my machine, it is Python 3.9.1
FROM python:3.9.1

# set the Rhodium repository as the container
WORKDIR /rhodium_app

# copy the requirements file into the new working directory
COPY requirements.txt .

# install all libraries and dependencies declared in the requirements file
RUN pip install -r requirements.txt

# copy the rhodium subfolder into the new working directory
# find this subfolder within the main Rhodium folder
COPY rhodium/ .

# this is the command the run when the container starts
CMD ["python", "./setup.py"]


The “.” indicates that you will be copying the file from your present directory into your working directory.

### Build the Docker image

Once again in your terminal, check that you are in the same directory as before. Then, type in the following:

\$ docker build -t rhodium_image .


Hit enter. If the containerization succeeded, you should see the following in your terminal (or something similar to it):

Congratulations, you have successfully containerized Rhodium! You are now ready for world domination!

## References

2022. [online] Available at: <https://www.redhat.com/en/topics/cloud-native-apps/what-is-containerization&gt; [Accessed 1 February 2022].

Education, I., 2022. containerization. [online] Ibm.com. Available at: <https://www.ibm.com/cloud/learn/containerization&gt; [Accessed 1 February 2022].

Iordache, A., Scott, D., Turner, S. and Dalleau, F., 2022. Containerized Python Development – Part 1 – Docker Blog. [online] Docker Blog. Available at: <https://www.docker.com/blog/containerized-python-development-part-1/&gt; [Accessed 1 February 2022].

Runnable Docker Guides. 2022. Dockerize your Python Application. [online] Available at: <https://runnable.com/docker/python/dockerize-your-python-application&gt; [Accessed 1 February 2022].

# Teaching Tools for Complex Adaptive Systems

This semester, I am taking a foundational class in the Systems Engineering department here at Cornell and I wanted to use this blog post to relay some cool packages and tools that we have used that hopefully can be useful teaching material for emerging faculty or anyone looking for interactive systems tutorials.

To begin, we have to first define what falls under the umbrella of complex adaptive systems. In a nutshell, these systems tend to (1) have networks of many components, (2) typically involve non-linear interactions between components, (3) exhibit self-organizing behavior, (4) have the potential to exhibit emergent properties. One really beautiful website that explains these properties in more detail is Complexity Explained, which started as a community outreach project to try to explain complex systems to a wider audience within the science community and the public. The website features interactive animations of systems properties and a short booklet that can be downloaded (in many languages) with key concepts.

It is well known that complex systems are hard for humans to understand because many of the characteristics are non-intuitive for us. For example, self-organizing behavior is often contradictory to our own lives (when can you remember a time that a system around you naturally seemed to become more orderly as time passed?). Emergent properties can come about in long time scales that are often far distanced from the original action. We can’t always understand how decisions on the microscale resulted in large macroscale processes. Thus, in order to best approach complex systems, we must have the ability to interact with them, model them, and map out their complex behavior under many conditions. Below, I am introducing some tools that might help foster more understanding of these ideas using simple, yet dynamically rich cases.

## PyCX

One of the main creators of the Complexity Explained website and a visiting lecturer to my systems class is Hiroki Sayama, a world-renowned researcher and director of the Center for Collective Dynamics of Complex Systems at Binghamton University. Dr. Sayama has created a python package called PyCX that contains sample Python codes of complex systems that a user can run interactively and then manipulate or build off of. Simply download the package off of GitHub and all of the code and a simulator will be available to you. Figure 1 shows an example interactive simulation of a Turing pattern. In 1952, Alan Turing authored a paper where he described how patterns in animals’ coats such such as stripes and spots, can arise naturally from a chaotic system. He uses a simple set of reaction-diffusion equations to describe this process. Figure 1 shows the python simulator in PyCX, the equation for the Turing pattern, and the evolution from the random initialization to the ordered spots.

PyCX also allows you to toggle the parameters of the problem, which can express how small perturbations in the system can lead to substantially different outcomes. You can adjust these parameters within the source python code (which I believe is more useful for students rather than just clicking a “play” button). Figure 2 shows the difference in behavior across a forest fire model when the initial density is adjusted from 35% to 40% of the space.

## Golly- Game of Life Simulator

Golly is an open-source tool for visualizing cellular automata, including Conway’s Game of Life. Golly allows the user to draw different patterns and apply specific rules for how the systems evolve. You can stop the simulation midway and apply different rules to the existing patterns.

## Swarm Behavior

Dr. Sayama also developed a really interesting Java application to study swarm behavior, or collective behavior that is exhibited by entities, typically animals. This application, called swarm chemistry creates agents with different kinetic parameters that dictate dynamics. The application allows you to mix agents into a single population and observe how emergent dynamics form. Figure 4 shows the opening interface when you click the .jar executable. The application brings up 6 random agents that exhibit some dynamic behavior. By clicking on any two agents, you will create a new population that shows how the dynamics of the agents interact (Figure 5). You can keep mixing agents and adding more random swarms. You can individually mutate certain swarms or edit the parameters as well. The pictures do not do this application justice. It is super fun (and slightly addicting) and a great way to get students excited about the concepts.

I had so much fun using these packages in class and I hope that these tools can help you/your students become more engaged and excited about complex systems!

## References

My knowledge of these tools came from Hiroki Sayama’s guest lectures in SYSEN 6000 at Cornell University and from:

Sayama, H. (2015) Introduction to the Modeling and Analysis of Complex Systems,Open SUNY Textbooks, Milne Library, State University of New York at Geneseo.

# Introduction to PyBorg – basic setup and running

PyBorg is a new secondary implementation of Borg, written entirely in Python using the Platypus optimization library. PyBorg was developed by Andrew Dircks based on the original implementation in C and it is intended primarily as a learning tool as it is less efficient than the original C version (which you can still use with Python but through the use of the plugin “wrapper” also found in the package). PyBorg can be found in the same repository where the original Borg can be downloaded, for which you can request access here: http://borgmoea.org/#contact

This blogpost is intended to demonstrate this new implementation. To follow along, first you need to either clone or download the BitBucket repository after you gain access.

Setting up the required packages is easy. In your terminal, navigate to the Python directory in the repository and install all prerequisites using python setup.py install. This will install all requirements (i.e. the Platypus library, numpy, scipy and six) for you in your current environment.

You can test that everything works fine by running the optimization on the DTLZ2 test function, found in dtlz2.py. The script creates an instance of the problem (as it is already defined in the Platypus library), sets it up as a ploblem for Borg to optimize and runs the algorithm for 10,000 function evaluations:

    # define a DTLZ2 problem instance from the Platypus library
nobjs = 3
problem = DTLZ2(nobjs)

# define and run the Borg algorithm for 10000 evaluations
algorithm = BorgMOEA(problem, epsilons=0.1)
algorithm.run(10000)


A handy 3D scatter plot is also generated to show the optimization results.

The repository also comes with two other scripts dtlz2_runtime.py and dtlz2_advanced.py.
The first demonstrates how to use the Platypus hypervolume indicator at a specified runtime frequency to get learn about its progress as the algorithm goes through function evaluations:

The latter provides more advanced functionality that allows you define custom parameters for Borg. It also includes a function to generate runtime data from the run. Both scripts are useful to diagnose how your algorithm is performing on any given problem.

The rest of this post is a demo of how you can use PyBorg with your own Python model and all of the above. I’ll be using a model I’ve used before, which can be found here, and I’ll formulate it so it only uses the first three objectives for the purposes of demonstration.

The first thing you need to do to optimize your problem is to define it. This is done very simply in the exact same way you’d do it on Project Platypus, using the Problem class:

from fishery import fish_game
from platypus import Problem, Real
from pyborg import BorgMOEA

# define a problem
nVars = 6
nObjs = 3

problem = Problem(nVars, nObjs) # first input is no of decision variables, second input is no of objectives
problem.types[:] = Real(0, 1) #defines the type and bounds of each decision variable
problem.function = fish_game #defines the model function


This assumes that all decision variables are of the same type and range, but you can also define them individually using, e.g., problem.types[0].

Then you define the problem for the algorithm and set the number of function evaluations:

algorithm = BorgMOEA(problem, epsilons=0.001) #epsilons for each objective
algorithm.run(10000) # number of function evaluations


If you’d like to also produce a runtime file you can use the detailed_run function included in the demo (in the files referenced above), which wraps the algorithm and runs it in intervals so the progress can be monitored. You can combine it with runtime_hypervolume to also track your hypervolume indicator. To use it you need to define the total number of function evaluations, the frequency with which you’d like the progress to be monitored and the name of the output file. If you’d like to calculate the Hypervolume (you first need to import it from platypus) you also need to either provide a known reference set or define maximum and minimum values for your solutions.

maxevals = 10000
frequency = 100
output = "fishery.data"
hv = Hypervolume(minimum=[-6000, 0, 0], maximum=[0, 1, 100])

nfe, hyp = detailed_run(algorithm, maxevals, frequency, output, hv)


My full script can be found below. The detailed_run function is an edited version of the default that comes in the demo to also include the hypervolume calculation.

from fishery import fish_game
from platypus import Problem, Real, Hypervolume
from pyborg import BorgMOEA
from runtime_diagnostics import detailed_run

# define a problem
nVars = 6 # no. of decision variables to be optimized
nObjs = 3

problem = Problem(nVars, nObjs) # first input is no of decision variables, second input is no of objectives
problem.types[:] = Real(0, 1)
problem.function = fish_game

# define and run the Borg algorithm for 10000 evaluations
algorithm = BorgMOEA(problem, epsilons=0.001)
#algorithm.run(10000)

# define detailed_run parameters
maxevals = 10000
frequency = 100
output = "fishery.data"
hv = Hypervolume(minimum=[-6000, 0, 0], maximum=[0, 1, 100])

nfe, hyp = detailed_run(algorithm, maxevals, frequency, output, hv)

# plot the results using matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax.scatter([s.objectives[0] for s in algorithm.result],
[s.objectives[1] for s in algorithm.result],
[s.objectives[2] for s in algorithm.result])
ax.set_xlabel('Objective 1')
ax.set_ylabel('Objective 2')
ax.set_zlabel('Objective 3')
ax.scatter(-6000, 0, 0, marker="*", c='orange', s=50)
plt.show()

plt.plot(nfe, hyp)
plt.title('PyBorg Runtime Hypervolume Fish game')
plt.xlabel('Number of Function Evaluations')
plt.ylabel('Hypervolume')
plt.show()


It produces the following two figures:

# A video training on Rhodium

A few weeks ago I filmed a video training guide to the Rhodium framework for the annual meeting of the society for Decision Making Under Deep Uncertainty. Rhodium is a Python library that facilitates Many Objective Robust Decision making. The training walks through a demonstration of Rhodium using the Lake Problem. The training introduces a live Jupyter notebook Antonia and I created using Binder.

1. Watch the demo video below
2. Access the Binder Hub this link: https://mybinder.org/v2/gh/dgoldri25/Rhodium/7982d8fcb1de9a84f074cc
3. Click on the file called “DMDU_Rhodium_Demo.ipynb” to open the live demo
4. Begin using Rhodium!

# PyCharm and Git for productive multi-project workflows

I wanted to write this blogpost because I’ve seen great improvements to my workflow when I transitioned to this system and thought others might benefit also. My everyday research tasks require the following:

• a Python development environment on my local machine
• management of project-specific dependencies
• version control my changes
• execution on some high-performance computing resource.

My local machine runs on Mac OS, but everything I show here should be directly translatable to Windows or other operating systems. My setup is the following:

• Anaconda – to manage my Python environments and packages
• PyCharm – the Python development environment
• Git(Hub) – for version control

These are the steps I follow every time I start a new project:

1. Create an empty repository on GitHub
2. Clone the empty repository on my local machine
3. Open PyCharm and select the directory of the repository I just created

When it opens, the PyCharm project will be empty and will have a default Python interpreter associated with it. What I do is I create a separate Conda environment for each of my projects, so there’s a clean separation between the packages used by each.

4. Create python environment specific to this project, by going to Preferences and selecting your current project. There, you can define your project’s (Python) interpreter. Clicking on it just shows the default Python 2.7 interpreter, which we would like to change.

As you can see, I have a separate Conda environment for each of my projects, so I manage packages and dependencies for each one.

Here I create a new environment for my new project.

5. Manage packages needed. There’s two ways for this: either through PyCharm or through Anaconda. Through PyCharm, you can use the same page to install, uninstall or update packages as needed.

Through Anaconda, you can use the Navigator, which also allows you to customize several other things about your environment, like which applications you’d like to work with.

6. Set up version control and use code on other computing resources. PyCharm has Git features integrated (overviewed already in this blog here and here) and creating a project the way I showed also ensures that PyCharm knows which repository you’re working with, without you having to set it manually. I use the built-in PyCharm functionality to commit my changes to my repository, but you can also do it through the Terminal or other means.

7. Set up project on computing resources. To do so, you need two main components. A clone of your repository in the cluster you’re working on and an environment .yml file (I explain what this is and how to generate it with one command here), listing all your environment’s dependencies. Create a virtual environment for the project in the cluster and pull any updates from your local machine.

This is more or less all I do. I have virtual environments for each of my projects both locally and on the clusters I am working on and use PyCharm and Git to manage all the dependencies and versions. I have been using this setup for the past 5-6 months and I have seen a lot of improvements in my organization and productivity, so hopefully others will find it helpful also.

# Using Rhodium for exploratory modeling

Rhodium is a powerful, simple, open source Python library for multiobjective robust decision making. As part of Project Platypus, Rhodium is compatible with Platypus (a MOEA optimization library) and PRIM (the Patent Rule Induction Method for Python), making it a valuable tool for bridging optimization and analysis.

In the Rhodium documentation, a simple example of optimization and analysis uses the Lake Problem (DPS formulation). The actual optimization is performed in the line:

optimize(model, "NSGAII", 10000)


This optimize function uses the Platypus library directly for optimization; here the NSGAII algorithm is used for 10,000 function evaluations on the defined Lake Problem (model). This optimization call is concise and simple, but there are a few reasons why it may not be ideal.

1. Speed. Python, an interpreted language, is inherently slower than compiled languages (Java, C/C++, etc.) The Platypus library is built entirely in Python, making optimization slow.
2. Scalability. Platypus has support for parallelizing optimization, but this method is not ideal for large-scale computational experiments on computing clusters.
3. MOEA Suite. State of the art MOEAs such as the Borg MOEA are not implemented in Platypus for licensing reasons, so it is not usable directly by Rhodium.

Thus, external optimization is necessary for computationally demanding Borg runs. Luckily, Rhodium is easily compatible with external data files, so analysis with Rhodium of independent optimizations is simple. In this post, I’ll use a sample dataset obtained from a parallel Borg run of the Lake Problem, using the Borg wrapper.

The code and data used in this post can be found in this GitHub repository. lakeset.csv contains a Pareto approximate Lake Problem set. Each line is a solution, where the first six values are the decision variables and the last four are the corresponding objectives values.

We’ll use Pandas for data manipulation. The script below reads the sample .csv file with Pandas, converts it to a list of Python dictionaries, and creates a Rhodium DataSet. There are a few important elements to note. First, the Pandas to_dict function takes in an optional argument ‘records’ to specify the format of the output. This specific format creates a list of Python dictionaries, where each element of the list is an individual solution (i.e. a line from the .csv file) with dictionary keys corresponding to the decision / objective value names and dictionary values as each line’s data. This is the format necessary for making a Rhodium DataSetwhich we create by calling the constructor with the dictionary as input.

import pandas as pd
from rhodium import *

# use pandas to read the csv file

# convert the pandas data frame to a Python dict in record format
dictionary = frame.to_dict('records')

# create a Rhodium DataSet instance from the Python dictionary
dataset = DataSet(dictionary)


Printing the Rhodium DataSet with print(dataset) yields:

...
...
Index 204:
c1: 0.286373779
r1: 0.126801547
w1: 0.6265428129999999
c2: -0.133307575
r2: 1.3584425430000002
w2: 0.10987546599999999
benefit: -0.412053431
concentration: 0.359441661
inertia: -0.98979798
reliability: -0.9563


Once we have a Rhodium DataSet instantiated, we access many of the library’s functionalities, without performing direct optimization with Platypus. For example, if we want the policy with the lowest Phosphorus concentration (denoted by the ‘concentration’ field), the following code outputs:

policy = dataset.find_min('concentration')
print(policy)

{'c1': 0.44744488600000004, 'r1': 0.9600368159999999, 'w1': 0.260339899, 'c2': 0.283860122, 'r2': 1.246763577, 'w2': 0.5300663529999999, 'benefit': -0.213267399, 'concentration': 0.149320863, 'inertia': -1.0, 'reliability': -1.0}


Rhodium also offers powerful plotting functionalities. For example, we can easily create a Parallel Axis plot of our data to visualize the trade-offs between objectives. The following script uses the parallel_coordinates function in Rhodium on our external dataset. Here, since parallel_coordinates takes a Rhodium model as input, we can: 1) define the external optimization problem as a Rhodium model, or 2) define a ‘dummy’ model that gives us just enough information to create plots. For the sake of simplicity, we will use the latter, but the first option is simple to set up if there exists a Python translation of your problem/model. Note, to access the scenario discovery and sensitivity analysis functionalities of Rhodium, it is necessary to create a real Rhodium Model.

# define a trivial "dummy" model in Rhodium with an arbitrary function
model = Model(lambda x: x)

# set up the model's objective responses to match the keys in your dataset
# here, all objectives are minimized
# this is the only information needed to create a parallel coordinate plot
model.responses = [Response("benefit", Response.MINIMIZE),
Response("concentration", Response.MINIMIZE),
Response("inertia", Response.MINIMIZE),
Response("reliability", Response.MINIMIZE)]

# create the parallel coordinate plot from the results of our external optimization
fig = parallel_coordinates(model, dataset, target="bottom",
brush=[Brush("reliability < -0.95"), Brush("reliability >= -0.95")])


# A template for reproducible papers

Writing fully reproducible papers is something everyone talks about but very few people actually do. Following nice examples I’ve seen developed by others (see here and here), I wanted to develop a GitHub template that I could easily use to organize the analysis I perform for each paper. I wanted it to be useful for the Reed group in general, but also anyone else who’d like to use it, so the version I’m presenting today is an initial version that will be adapted and evolve as our needs grow.

The template can be found here: https://github.com/antonia-had/paper_template and this blogpost will discuss its contents. The repository is set up as a template, so you can use “Import repository” when you create a new repository for your project or click on the green “Use this template” button on the top right.

The idea is that everything is organized and documented well so that another person can easily replicate your work. This will help with your own tools being more widely used and cited, but also future group members to easily pick up from where you left. The other selfish way in which this has helped me is that it forces me to spend some time and arrange things from the beginning so I can be more organized (and therefore more productive) during the project. Most importantly, when a paper does get accepted you don’t need to go back and organize everything so it looks halfway decent for a public repository. For these reasons I try to use a template like this from the early stages of a project.

A lot of the template is self explanatory, but I’ll go through to explain what is in it in short. The idea is you take it and just replace the text with your own in the README files and use it as a guide to organize your paper analysis and results.

There are directories to organize your content to code, data, and results (or anything else that works for you). Every directory has its own README listing its contents and how they should be used. All code that you didn’t write and data that you didn’t generate need to be cited. Again, this is useful to document from the beginning so you don’t need to search for it later.

Most of my work is done in Python, so I wrote up how to handle Python dependencies. The way I suggest going about it is through a ‘.yml‘ file that specifies all the dependencies (i.e. all the packages and versions your script uses) for your project. I believe the best way to handle this is by creating a Python environment for every project you work on so you can create a separate list of dependencies for each. We have a nice blogpost on how to create and manage Python environments here.

When the project is done and you’re ready to submit or publish your paper, export all dependencies by running:

conda env export > environment.yml --no-builds

and store your environment.yml in the GitHub repository. When someone else needs to replicate your results, they would just need to create the same Python environment (running conda env create --file environment.yml) before executing your scripts.

Finally, you could automate the analysis and figure production with a makefile that executes your scripts so who ever is replicating does not need to manually execute all of them. This also helps avoiding somebody executing the scripts in the wrong order. An example of this can be found in this template. The makefile can also be in Python, like Julie did here.

In recognition that this is not an exhaustive template of everything one might need, the aim is to have this blog post and the template itself evolve as the group identifies needs and material to be added.

# Make your Git repository interactive with Binder

Have you ever tried to demo a piece of software you wrote only to have the majority of participants get stuck when trying to configure their computational environment? Difficulty replicating computational environments can prevent effective demonstration or distribution of even simple codes. Luckily, new tools are emerging that automate this process for us. This post will focus on Binder, a tool for creating custom computing environments that can be distributed and used by many remote users simultaneously. Binder is language agnostic tool, and can be used to create custom environments for R, Python and Julia. Binder is powered by BinderHub, an open source service in the cloud. At the bottom of this post, I’ll provide an example of an interactive Python Jupyter Notebook that I created using BinderHub.

## BinderHub

BinderHub combines two useful libraries: repo2docker and JupyterHub. repo2docker is a tool to build, run and push Docker images from source code repositories. This allows you to create copies of custom environments that users can replicate on any machine. These copies are can be stored and distributed along with the remote repository. JuptyerHub is a scalable system that can be used to spawn multiple Jupyter Notebook servers. JuptyerHub takes the Docker image created by repo2docker and uses it to spawn a Jupyter Notebook server on the cloud. This server can be accessed and run by multiple users at once. By combining repo2docker and JupyterHub, BinderHub allows users to both replicate complex environments and easily distribute code to large numbers of users.

## Creating your own BinderHub deployment

Creating your own BinderHub deployment is incredibly easy. To start, you need a remote repository containing two things: (1) a Jupyter notebook with supporting code and (2) configuration files for your environment. Configuration files can either be an environment.yml file (a standard configuration file that can be generated with conda, see example here) or a requirements.txt file (a simple text file that lists dependencies, see example here).

To create an interactive BinderHub deployment:

1. Push your code to a remote repository (for example Github)
2. Go to mybinder.org and paste the repository’s URL into the dialoge box (make sure to select the proper hosting service)
3. Specify the branch if you are not on the Master
4. Click “Launch”

The website will generate a URL that you can copy and share with users. I’ve created an example for our Rhodium tutorial, which you can find here:

https://mybinder.org/v2/gh/dgoldri25/Rhodium/master?filepath=DMDU_Rhodium_Demo.ipynb

To run the interactive Jupyter Notebook, click on the file titled “Rhodium_Demo.ipynb”. Happy sharing!

# EnGauge: R Code Repository for Environmental Gauge Data Acquisition, Processing, and Visualization

### Introduction and Motivation

Gauge data is an essential component of water systems research projects; however, data acquisition, processing, and exploratory (spatio-temporal) data analysis often consumes a large chunk of limited project research time. I developed the EnGauge GitHub repository to reduce the time required to download, process, and explore streamflow, water quality, and weather station gauge data that are hosted primarily on U.S. government servers. This repository compiles and modifies functions from other Packages for Hydrological Data Retrieval and Statistical Analysis, and develops new functions for processing and exploring the data.

### Data Acquisition

Given a polygon shapefile of the region of interest and an optional radial buffer size, the types of gauge data downloaded can include:

1. USGS streamflow from the NWIS portal
2. EPA STORET, USGS, USDA and other water quality data via the water quality portal
3. NOAA ACIS, GHCN weather station data

The USGS R package dataRetrieval and the NOAA rnoaa package contain the primary functions used for data acquisition. Additional references to learn about these packages are available in the EnGauge README file and at the provided web links.

### Data Processing

Significant processing is required to use some of these gauge datasets for environmental modeling. The EnGauge repository has functions that may be used to address the following common data processing needs:

1. Check for duplicate records
2. Check for zeros and negative values
3. Check detection limits
4. Fill date gaps (add NAs to dates missing from timeseries)
5. Aggregate to daily, monthly, and/or annual timeseries
6. Project spatial data to a specified coordinate system
7. Write processed data to shapefiles, .txt files, and lists that can be loaded into other software for further analysis and/or modeling.

### Data Visualization and Exploratory Data Analysis – From GitHub Example

This example is applied to the Gwynns Falls watershed in the Baltimore Ecosystem Study Long Term Ecological Research site. The following figures are some of the output from the EnGague USGSdataRetrieval.R script (as of commit 2fc84cd).

1. Record lengths at each gauge
2. Locations of sites with zero and/or negative values
3. Locations of sites with different water quality information: total nitrogen and total phosphorus in this example
4. Locations of sites with certain weather station data: maximum temperature in this example
5. Visualizing quality codes on timeseries
6. Summary exploratory spatial data analysis for sites
7. Summary daily, monthly, annual information
8. Monthly heatmap
9. Outlier visualization: currently implements a simplistic global spatio-temporal method defined by flows greater than a selected quantile. Plots offer qualitative support for the flows at other stations on the dates with high outliers at the reference station.
10. DEM vs. Gauge Elevation: If you supply a DEM, the reported gauge elevation can be compared to the DEM elevation within the region of interest (ROI)
11. Seasonal Scatterplot with Histograms: If you have two timeseries of different data types, e.g. streamflow and water quality, a scatterplot by season may be made (not in example code, but a function is available in the repository).

### Concluding Thoughts

This repository can be used to download gauge data from several sources, to employ standard data processing methods across those sources, and to explore the resulting data. Spend less time getting your data ready to do your research, and more time thinking about what your data are telling you and actually using it for modeling. Check out the EnGague repository for your next research project!