# Introduction

Being able to automate data retrieval helps alleviate encountering irritating, repetitive tasks. Often times, these data files can be separated by a range of times, site locations, or even type of measurement, causing them to be cumbersome to download manually. This blog post outlines how to download multiple zipped csv files from a webpage using both R and Python.

We will specifically explore downloading historical hourly locational marginal pricing (LMP) data files from PJM, a regional transmission organization coordinating a wholesale electricity market across 13 Midwestern, Southeastern, and Northeastern states. The files in question are located here.

# First, Using Python

A distinct advantage of using Python over R is being able to write fewer lines for the same results. However, the differences in the time required to execute the code are generally minimal between the two languages.

### Constructing URLs

The URL for the most recent file from the webpage for August 2017 has the following format: “http://www.pjm.com/pub/account/lmpmonthly/201708-da.csv
Notably, the URL is straightforward in its structure and allows for a fairly simple approach in attempting to create this URL from scratch.

To begin, we can deconstruct the URL into three aspects: the base URL, the date, and the file extension.
The base URL is fairly straightforward because it is constant: “http://www.pjm.com/pub/account/lmpmonthly/
As is the file extension: “-da.csv”

However, the largest issue presented is how to approach recreating the date: “201708”
It is in the form “yyyymm”, requiring a reconstruction of a 4-digit year (luckily these records don’t go back to the first millennium) and a 2-digit month.

In Python, this is excessively easy to reconstruct using the .format() string modifying method. All that is required is playing around with the format-specification mini-language within the squiggly brackets. Note that the squiggly brackets are inserted within the string with the following form: {index:modifyer}

In this example, we follow the format above and deconstruct the URL then reassemble it within one line in Python. Assigning this string to a variable allows for easy iterations through the code by simply changing the inputs. So let’s create a quick function named url_creator that allows for the input of a date in the format (yyyy, mm) and returns the constructed URL as a string:

def url_creator(date):
"""
Input is list [yyyy, mm]
"""
yyyy, mm = date

return "http://www.pjm.com/pub/account/lmpmonthly/{0:}{1:02d}-da.zip".format(yyyy, mm)


To quickly generate a list of the dates we’re interested in, creating a list of with entries in the format (yyyy, mm) for each of the month’s we’re interested in is easy enough using a couple of imbedded for loops. If you need alternative dates, you can easily alter this loop around or manually create a list of lists to accommodate your needs:

#for loop method
dates = []
for i in range(17):
for j in range(12):
dates.append([2001 + i, j + 1])

#alternative manual method
dates = [[2011, 1], [2011, 2], [2011, 3]]


### Retrieving and Unpacking Zipped Files

Now that we can create the URL for this specific file type by calling the url_creator function and have all of the dates we may be interested in, we need to be able to access, download, unzip/extract, and subsequently save these files. Utilizing the urllib library, we can request and download the zipped file. Note that urllib.request.urlretrieve only retrieves the files and does not attempt to read it. While we could simply save the zipped file at this step, it is preferred to extract it to prevent headaches down the line.

Utilizing the urllib library, we can extract the downloaded files to the specified folder. Notably, I use the operating system interface method getcwd while extracting the file to save the resulting csv file into the directory where this script is running. Following this, the extracted file is closed.

import zipfile, urllib, os
from urllib.request import Request,urlopen, urlretrieve

for date in dates:
baseurl = url_creator(date)

local_filename, headers = urllib.request.urlretrieve(url = baseurl)
zip_ref = zipfile.ZipFile(file = local_filename, mode = 'r')
zip_ref.extractall(path = os.getcwd())     #os.getcwd() directs to current working directory
zip_ref.close()


At this point, the extracted csv files will be located in the directory where this script was running.

# Alternatively, Using R

We will be using an csv files to specify date ranges instead of for loops as shown in the Python example above. The advantage of using the csv files rather than indexing i=2001:2010 is that if you have a list of non-consecutive months or years, it is sometimes easier to just make a list of the years and cycle through all of the elements of the list rather than trying to figure out how to index the years directly. However, in this example, it would be just as easy to index through the years and month rather than the csv file.

This first block of code sets the working directory and reads in two csv files from the directory. The first csv file contains a list of the years 2001-2010. The second CSV file lists out the months January-December as numbers.

#Set working directory
#specifically specified for the directory where we want to "dump" the data
setwd("E:/Data/WebScraping/PJM/Markets_and_operations/Energy/Real_Time_Monthly/R")

#Create csv files with a list of the relevant years and months


The breakdown of the URL was previously shown. To create this in R, we will generate a for-loop that will change this link to cycle through all the years and months listed in the csv files. The index i and j iterate through the elements in the year and month csv file respectively. (The if statement will be discussed at the end of this section). We then use the “downloadurl” and “paste” functions to turn the entire link into a string. The parts of the link surrounded in quotes are unchanging and denoted in green. The “as.character” function is used to cycle in the elements from the csv files and turn them into characters. Finally, the sep=’’ is used to denote that there should be no space separating the characters. The next two lines download the file into R as a temporary file.

#i indexes through the years and j indexes through the months
for (i in (1:16)){
for( j in (1:12)){
if (j>9){

#dowload the url and save it as a temporary file

temp="tempfile"


Next, we unzip the temporary file and finds the csv (which is in the form: 200101-da.csv), reading in the csv file into the R environment as “x”. We assign the name “200101-da” to “x”. Finally, it is written as a csv file and stored in the working directory, the temporary file is deleted, and the for-loop starts over again.

    #read in the csv into the global environment
#Assign a name to the csv "x"
newname=paste(as.character(years[i,1]),as.character(months[j,1]),sep="")
assign(newname,x)
#create a csv that is stored in your working directory
write.csv(x,file=paste(newname,".csv",sep=""))

}

#If j is between 1 and 10, the only difference is that a "0" has to be added
#in front of the month number


### Overcoming Formatting

The reason for the “if” statement is that it is not a trivial process to properly format the dates in Excel when writing to a csv to create the list of dates. When inputing “01”, the entry is simply changed to “1” when the file is saved as a csv. However, note that in the name of the files, the first nine months have a 0 preceding their number. Therefore, the URL construct must be modified for these months. The downloadurl has been changed so that a 0 is added before the month number if j < 10. Aside from the hard-coded zeroes, the block of code is the same as the one above.

else {

temp="tempfile"

newname=paste(as.character(years[i,1]),"0",as.character(months[j,1]),sep="")
assign(newname,x)
write.csv(x,file=paste(newname,".csv",sep=""))

}


# Acknowledgements

This was written in collaboration with Rohini Gupta who contributed by defining the problem and explaining her approach in R.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1650441. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Versions: R = 3.4.1 Python = 3.6.2
To see the code in its entirety, please visit the linked GitHub Repository.

# Python example: Hardy Cross method for pipe networks

I teach a class called Water Resources Engineering at the University of Colorado Boulder, and I have often added in examples where students use Python to perform the calculations.  For those of you who are new to programming, you might get a kick out of this example since it goes through an entire procedure of how to solve an engineering problem using Python, and the numpy library.  The example is two YouTube videos embedded below! The best way to watch them, though, is in full screen, which you can get to by clicking the name of the video, opening up a new YouTube tab in your browser.

Let us know if you have any questions in the comments section below. For more on Python from this blog, search “Python” in the search box.

# Open Source Streamflow Generator Part II: Validation

This is the second part of a two-part blog post on an open source synthetic streamflow generator written by Matteo Giuliani, Jon Herman and me, which combines the methods of Kirsch et al. (2013) and Nowak et al. (2010) to generate correlated synthetic hydrologic variables at multiple sites. Part I showed how to use the MATLAB code in the subdirectory /stationary_generator to generate synthetic hydrology, while this post shows how to use the Python code in the subdirectory /validation to statistically validate the synthetic data.

The goal of any synthetic streamflow generator is to produce a time series of synthetic hydrologic variables that expand upon those in the historical record while reproducing their statistics. The /validation subdirectory of our repository provides Python plotting functions to visually and statistically compare the historical and synthetic hydrologic data. The function plotFDCrange.py plots the range of the flow duration (probability of exceedance) curves for each hydrologic variable across all historical and synthetic years. Lines 96-100 should be modified for your specific application. You may also have to modify line 60 to change the dimensions of the subplots in your figure. It’s currently set to plot a 2 x 2 figure for the four LSRB hydrologic variables.

plotFDCrange.py provides a visual, not statistical, analysis of the generator’s performance. An example plot from this function for the synthetic data generated for the Lower Susquehanna River Basin (LSRB) as described in Part I is shown below. These probability of exceedance curves were generated from 1000 years of synthetic hydrologic variables. Figure 1 indicates that the synthetic time series are successfully expanding upon the historical observations, as the synthetic hydrologic variables include more extreme high and low values. The synthetic hydrologic variables also appear unbiased, as this expansion is relatively equal in both directions. Finally, the synthetic probability of exceedance curves also follow the same shape as the historical, indicating that they reproduce the within-year distribution of daily values.

Figure 1

To more formally confirm that the synthetic hydrologic variables are unbiased and follow the same distribution as the historical, we can test whether or not the synthetic median and variance of real-space monthly values are statistically different from the historical using the function monthly-moments.py. This function is currently set up to perform this analysis for the flows at Marietta, but the site being plotted can be changed on line 76. The results of these tests for Marietta are shown in Figure 2. This figure was generated from a 100-member ensemble of synthetic series of length 100 years, and a bootstrapped ensemble of historical years of the same size and length. Panel a shows boxplots of the real-space historical and synthetic monthly flows, while panels b and c show boxplots of their means and standard deviations, respectively. Because the real-space flows are not normally distributed, the non-parametric Wilcoxon rank-sum test and Levene’s test were used to test whether or not the synthetic monthly medians and variances were statistically different from the historical. The p-values associated with these tests are shown in Figures 2d and 2e, respectively. None of the synthetic medians or variances are statistically different from the historical at a significance level of 0.05.

Figure 2

In addition to verifying that the synthetic generator reproduces the first two moments of the historical monthly hydrologic variables, we can also verify that it reproduces both the historical autocorrelation and cross-site correlation at monthly and daily time steps using the functions autocorr.py and spatial-corr.py, respectively. The autocorrelation function is again set to perform the analysis on Marietta flows, but the site can be changed on line 39. The spatial correlation function performs the analysis for all site pairs, with site names listed on line 75.

The results of these analyses are shown in Figures 3 and 4, respectively. Figures 3a and 3b show the autocorrelation function of historical and synthetic real-space flows at Marietta for up to 12 lags of monthly flows (panel a) and 30 lags of daily flows (panel b). Also shown are 95% confidence intervals on the historical autocorrelations at each lag. The range of autocorrelations generated by the synthetic series expands upon that observed in the historical while remaining within the 95% confidence intervals for all months, suggesting that the historical monthly autocorrelation is well-preserved. On a daily time step, most simulated autocorrelations fall within the 95% confidence intervals for lags up to 10 days, and those falling outside do not represent significant biases.

Figure 3

Figures 4a and 4b show boxplots of the cross-site correlation in monthly (panel a) and daily (panel b) real-space hydrologic variables for all pairwise combinations of sites. The synthetic generator greatly expands upon the range of cross-site correlations observed in the historical record, both above and below. Table 1 lists which sites are included in each numbered pair of Figure 4. Wilcoxon rank sum tests (panels c and d) for differences in median monthly and daily correlations indicate that pairwise correlations are statistically different (α=0.5) between the synthetic and historical series at a monthly time step for site pairs 1, 2, 5 and 6, and at a daily time step for site pairs 1 and 2. However, biases for these site pairs appear small in panels a and b. In summary, Figures 1-4 indicate that the streamflow generator is reasonably reproducing historical statistics, while also expanding on the observed record.

Figure 4

Table 1

 Pair Number Sites 1 Marietta and Muddy Run 2 Marietta and Lateral Inflows 3 Marietta and Evaporation 4 Muddy Run and Lateral Inflows 5 Muddy Run and Evaporation 6 Lateral Inflows and Evaporation

# Open Source Streamflow Generator Part I: Synthetic Generation

This post describes how to use the Kirsch-Nowak synthetic streamflow generator to generate synthetic streamflow ensembles for water systems analysis. As Jon Lamontagne discussed in his introduction to synthetic streamflow generation, generating synthetic hydrology for water systems models allows us to stress-test alternative management plans under stochastic realizations outside of those observed in the historical record. These realizations may be generated assuming stationary or non-stationary models. In a few recent papers from our group applied to the Red River and Lower Susquehanna River Basins (Giuliani et al., 2017; Quinn et al., 2017; Zatarain Salazar et al., In Revision), we’ve generated stationary streamflow ensembles by combining methods from Kirsch et al. (2013) and Nowak et al. (2010). We use the method of Kirsch et al. (2013) to generate flows on a monthly time step and the method of Nowak et al. (2010) to disaggregate these monthly flows to a daily time step. The code for this streamflow generator, written by Matteo Giuliani, Jon Herman and me, is now available on Github. Here I’ll walk through how to use the MATLAB code in the subdirectory /stationary_generator to generate correlated synthetic hydrology at multiple sites, and in Part II I’ll show how to use the Python code in the subdirectory /validation to statistically validate the synthetic hydrology. As an example, I’ll use the Lower Susquehanna River Basin (LSRB).

A schematic of the LSRB, reproduced from Giuliani et al. (2014) is provided below. The system consists of two reservoirs: Conowingo and Muddy Run. For the system model, we generate synthetic hydrology upstream of the Conowingo Dam at the Marietta gauge (USGS station 01576000), as well as lateral inflows between Marietta and Conowingo, inflows to Muddy Run and evaporation rates over Conowingo and Muddy Run dams. The historical hydrology on which the synthetic hydrologic model is based consists of the historical record at the Marietta gauge from 1932-2001 and simulated flows and evaporation rates at all other sites over the same time frame generated by an OASIS system model. The historical data for the system can be found here.

The first step to use the synthetic generator is to format the historical data into an nD × nS matrix, where nD is the number of days of historical data with leap days removed and nS is the number of sites, or hydrologic variables. An example of how to format the Susquehanna data is provided in clean_data.m. Once the data has been reformatted, the synthetic generation can be performed by running script_example.m (with modifications for your application). Note that in the LSRB, the evaporation rates over the two reservoirs are identical, so we remove one of those columns from the historical data (line 37) for the synthetic generation. We also transform the historical evaporation with an exponential transformation (line 42) since the code assumes log-normally distributed hydrologic data, while evaporation in this region is more normally distributed. After the synthetic hydrology is generated, the synthetic evaporation rates are back-transformed with a log-transformation on line 60. While such modifications allow for additional hydrologic data beyond streamflows to be generated, for simplicity I will refer to all synthetic variables as “streamflows” for the remainder of this post. In addition to these modifications, you should also specify the number of realizations, nR, you would like to generate (line 52), the number of years, nY, to simulate in each realization (line 53) and a string with the dimensions nR × nY for naming the output file.

The actual synthetic generation is performed on line 58 of script_example.m which calls combined_generator.m. This function first generates monthly streamflows at all sites on line 10 where it calls monthly_main.m, which in turn calls monthly_gen.m to perform the monthly generation for the user-specified number of realizations. To understand the monthly generation, we denote the set of historical streamflows as $\mathbf{Q_H}\in \mathbb{R}^{N_H\times T}$ and the set of synthetic streamflows as $\mathbf{Q_S}\in \mathbb{R}^{N_S\times T}$, where $N_H$ and $N_S$ are the number of years in the historical and synthetic records, respectively, and T is the number of time steps per year. Here T=12 for 12 months. For the synthetic generation, the streamflows in $\mathbf{Q_H}$ are log-transformed to yield the matrix $Y_{H_{i,j}}=\ln(Q_{H_{i,j}})$, where i and j are the year and month of the historical record, respectively. The streamflows in $\mathbf{Y_H}$ are then standardized to form the matrix $\mathbf{Z_H}\in \mathbb{R}^{N_H\times T}$ according to equation 1:

1) $Z_{H_{i,j}} = \frac{Y_{H_{i,j}}-\hat{\mu_j}}{\hat{\sigma_j}}$

where $\hat{\mu_j}$ and $\hat{\sigma_j}$ are the sample mean and sample standard deviation of the j-th month’s log-transformed streamflows, respectively. These variables follow a standard normal distribution: $Z_{H_{i,j}}\sim\mathcal{N}(0,1)$.

For each site, we generate standard normal synthetic streamflows that reproduce the statistics of $\mathbf{Z_H}$ by first creating a matrix $\mathbf{C}\in \mathbb{R}^{N_S\times T}$ of randomly sampled standard normal streamflows from $\mathbf{Z_H}$. This is done by formulating a random matrix $\mathbf{M}\in \mathbb{R}^{N_S\times T}$ whose elements are independently sampled integers from $(1,2,...,N_H)$. Each element of $\mathbf{C}$ is then assigned the value $C_{i,j}=Z_{H_{(M_{i,j}),j}}$, i.e. the elements in each column of $\mathbf{C}$ are randomly sampled standard normal streamflows from the same column (month) of $\mathbf{Z_H}$. In order to preserve the historical cross-site correlation, the same matrix $\mathbf{M}$ is used to generate $\mathbf{C}$ for each site.

Because of the random sampling used to populate $\mathbf{C}$, an additional step is needed to generate auto-correlated standard normal synthetic streamflows, $\mathbf{Z_S}$. Denoting the historical autocorrelation $\mathbf{P_H}$=corr($\mathbf{Z_H}$), where corr($\mathbf{Z_H}$) is the historical correlation between standardized streamflows in months i and j (columns of $\mathbf{Z_H}$), an upper right triangular matrix, $\mathbf{U}$, can be found using Cholesky decomposition (chol_corr.m) such that $\mathbf{P_H}=\mathbf{U^\intercal U}$. $\mathbf{Z_S}$ is then generated as $\mathbf{Z_S}=\mathbf{CU}$. Finally, for each site, the auto-correlated synthetic standard normal streamflows $\mathbf{Z_S}$ are converted back to log-space streamflows $\mathbf{Y_S}$ according to $Y_{S_{i,j}}=\hat{\mu_j}+Z_{S_{i,j}}\hat{\sigma_j}$. These are then transformed back to real-space streamflows $\mathbf{Q_S}$ according to $Q_{S_{i,j}}$=exp($Y_{S_{i,j}}$).

While this method reproduces the within-year log-space autocorrelation, it does not preserve year to-year correlation, i.e. concatenating rows of $\mathbf{Q_S}$ to yield a vector of length $N_S\times T$ will yield discontinuities in the autocorrelation from month 12 of one year to month 1 of the next. To resolve this issue, Kirsch et al. (2013) repeat the method described above with a historical matrix $\mathbf{Q_H'}\in \mathbb{R}^{N_{H-1}\times T}$, where each row i of $\mathbf{Q_H'}$ contains historical data from month 7 of year i to month 6 of year i+1, removing the first and last 6 months of streamflows from the historical record. $\mathbf{U'}$ is then generated from $\mathbf{Q_H'}$ in the same way as $\mathbf{U}$ is generated from $\mathbf{Q_H}$, while $\mathbf{C'}$ is generated from $\mathbf{C}$ in the same way as $\mathbf{Q_H'}$ is generated from $\mathbf{Q_H}$. As before, $\mathbf{Z_S'}$ is then calculated as $\mathbf{Z_S'}=\mathbf{C'U'}$. Concatenating the last 6 columns of $\mathbf{Z_S'}$ (months 1-6) beginning from row 1 and the last 6 columns of $\mathbf{Z_S}$ (months 7-12) beginning from row 2 yields a set of synthetic standard normal streamflows that preserve correlation between the last month of the year and the first month of the following year. As before, these are then de-standardized and back-transformed to real space.

Once synthetic monthly flows have been generated, combined_generator.m then finds all historical total monthly flows to be used for disaggregation. When calculating all historical total monthly flows a window of +/- 7 days of the month being disaggregated is considered. That is, for January, combined_generator.m finds the total flow volumes in all consecutive 31-day periods within the window from 7 days before Jan 1st to 7 days after Jan 31st. For each month, all of the corresponding historical monthly totals are then passed to KNN_identification.m (line 76) along with the synthetic monthly total generated by monthly_main.mKNN_identification.m identifies the k nearest historical monthly totals to the synthetic monthly total based on Euclidean distance (equation 2):

2) $d = \left[\sum^{M}_{m=1} \left({\left(q_{S}\right)}_{m} - {\left(q_{H}\right)}_{m}\right)^2\right]^{1/2}$

where ${(q_S)}_m$ is the real-space synthetic monthly flow generated at site m and ${(q_H)}_m$ is the real-space historical monthly flow at site m. The k-nearest neighbors are then sorted from i=1 for the closest to i=k for the furthest, and probabilistically selected for proportionally scaling streamflows in disaggregation. KNN_identification.m uses the Kernel estimator given by Lall and Sharma (1996) to assign the probability $p_n$ of selecting neighbor n (equation 3):

3) $p_{n} = \frac{\frac{1}{n}}{\sum^{k}_{i=1} \frac{1}{i}}$

Following Lall and Sharma (1996) and Nowak et al. (2010), we use $k=\Big \lfloor N_H^{1/2} \Big \rceil$. After a neighbor is selected, the final step in disaggregation is to proportionally scale all of the historical daily streamflows at site m from the selected neighbor so that they sum to the synthetically generated monthly total at site m. For example, if the first day of the month of the selected historical neighbor represented 5% of that month’s historical flow, the first day of the month of the synthetic series would represent 5% of that month’s synthetically-generated flow. The random neighbor selection is performed by KNN_sampling.m (called on line 80 of combined_generator.m), which also calculates the proportion matrix used to rescale the daily values at each site on line 83 of combined_generator.m. Finally, script_example.m writes the output of the synthetic streamflow generation to files in the subdirectory /validation. Part II shows how to use the Python code in this directory to statistically validate the synthetically generated hydrology, meaning ensure that it preserves the historical monthly and daily statistics, such as the mean, standard deviation, autocorrelation and spatial correlation.

# Using Borg in Parallel and Serial with a Python Wrapper – Part 2

This blog post is Part 2 of a two-part series that will demonstrate how I have coupled a pure Python simulation model with the Borg multi-objective evolutionary algorithm (MOEA). I recommend reading Part 1 of this series before you read Part 2. In Part 1, I explain how to get Borg and provide sample code showing how you can access Borg’s serial and/or parallelized (master-slave) implementations through a Python wrapper (borg.py). In Part 2, I provide details for more advanced simulation-optimization setups that require you pass additional information from the borg wrapper into the simulation model (the “function evaluation”) other than just decision variable values.

In Part 1, the example simulation model I use (PySedSim) is called through a function handle “Simulation_Caller” in the example_sim_opt.py file. Borg needs only this function handle to properly call the simulation model in each “function evaluation”. Borg’s only interaction with the simulation model is to pass the simulation model’s function handle (e.g., “Simulation_Caller”) the decision variables, and nothing else. In many circumstances, this is all you need.

However, as your simulation-optimization setup becomes more complex, in order for your simulation model (i.e., the function evaluation) to execute properly, you may need to pass additional arguments to the simulation model from Borg other than just the decision variables. For example, in my own use of Borg in a simulation-optimization setting, in order to do optimization I first import a variety of assumptions and preferences to set up a Borg-PySedSim run. Some of those assumptions and preferences are helpful to the simulation model (PySedSim) in determining how to make use of the decision variable values Borg feeds it. So, I would like to pass those relevant assumptions and preferences directly into the Borg wrapper (borg.py), so the wrapper can in turn pass them directly into the simulation model along with the decision variable values.

Before I show how to do this, let me provide a more concrete example of how/why I am doing this in my own research. In my current work, decision variable values represent parameters for a reservoir operating policy that is being optimized. The simulation model needs to know how to take the decision variable values and turn them into a useful operating policy that can be simulated. Some of this information gets imported in order to run Borg, so I might as well pass that information directly into the simulation model while I have it on hand, rather than importing it once again in the simulation model.

To do what I describe above, we just need to modify the two functions in the example_sim_opt.py module so that a new argument “additional_inputs” is passed from borg to the simulation handle.  Using my python code from blog post 1, I provide code below that is modified in the Simulation_Caller() function on lines 5, 21, 22 and 27; and in the Optimization() function on lines 55, 56 and 70. After that code, I then indicate how I modify the borg.py wrapper so it can accept this information.

import numpy as np
import pysedsim # This is your simulation model
import platform  # helps identify directory locations on different types of OS

'''
Purpose: Borg calls this function to run the simulation model and return multi-objective performance.

Note: You could also just put your simulation/function evaluation code here.

Args:
vars: A list of decision variable values from Borg
additional_inputs: A list of python data structures you want to pass from Borg into the simulation model.
Returns:
performance: policy's simulated objective values. A list of objective values, one value each of the objectives.
'''

borg_vars = vars  # Decision variable values from Borg

# Unpack lists of additional inputs from Borg (example assumes additional inputs is a python list with two items)

# Reformat decision variable values as necessary (.e.g., cast borg output parameters as array for use in simulation)
op_policy_params = np.asarray(borg_vars)
# Call/run simulation model with decision vars and additional relevant inputs, return multi-objective performance:
return performance

def Optimization():

'''

Purpose: Call this method from command line to initiate simulation-optimization experiment

Returns:
--pareto approximate set file (.set) for each random seed
--Borg runtime file (.runtime) for each random seed

'''

import borg as bg  # Import borg wrapper

parallel = 1  # 1= master-slave (parallel), 0=serial

# The following are just examples of relevant MOEA specifications. Select your own values.
nSeeds = 25  # Number of random seeds (Borg MOEA)
num_dec_vars = 10  # Number of decision variables
n_objs = 6  # Number of objectives
n_constrs = 0  # Number of constraints
num_func_evals = 30000  # Number of total simulations to run per random seed. Each simulation may be a monte carlo.
runtime_freq = 1000  # Interval at which to print runtime details for each random seed
decision_var_range = [[0, 1], [4, 6], [-1,4], [1,2], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1]]
epsilon_list = [50000, 1000, 0.025, 10, 13, 4]  # Borg epsilon values for each objective
borg_dict_1 = {'simulation_preferences_1': [1,2]}  # reflects data you want Borg to pass to simulation model
borg_dict_2 = {'simulation_preferences_2': [3,4]}  # reflects data you want Borg to pass to simulation model

# Where to save seed and runtime files
main_output_file_dir = 'E:\output_directory'  # Specify location of output files for different seeds
os_fold = Op_Sys_Folder_Operator()  # Folder operator for operating system
output_location = main_output_file_dir + os_fold + 'sets'

# If using master-slave, start MPI. Only do once.
if parallel == 1:
bg.Configuration.startMPI()  # start parallelization with MPI

# Loop through seeds, calling borg.solve (serial) or borg.solveMPI (parallel) each time
for j in range(nSeeds):
# Instantiate borg class, then set bounds, epsilon values, and file output locations
borg = bg.Borg(num_dec_vars, n_objs, n_constrs, Simulation_Caller, add_sim_inputs = [borg_dict_1, borg_dict_2])
borg.setBounds(*decision_var_range)  # Set decision variable bounds
borg.setEpsilons(*epsilon_list)  # Set epsilon values
# Runtime file path for each seed:
runtime_filename = main_output_file_dir + os_fold + 'runtime_file_seed_' + str(j+1) + '.runtime'
if parallel == 1:
# Run parallel Borg
result = borg.solveMPI(maxEvaluations='num_func_evals', runtime=runtime_filename, frequency=runtime_freq)

if parallel == 0:
# Run serial Borg
result = borg.solve({"maxEvaluations": num_func_evals, "runtimeformat": 'borg', "frequency": runtime_freq,
"runtimefile": runtime_filename})

if result:
# This particular seed is now finished being run in parallel. The result will only be returned from
# one node in case running Master-Slave Borg.
result.display()

# Create/write objective values and decision variable values to files in folder "sets", 1 file per seed.
f = open(output_location + os_fold + 'Borg_DPS_PySedSim' + str(j+1) + '.set', 'w')
f.write('#Borg Optimization Results\n')
f.write('#First ' + str(num_dec_vars) + ' are the decision variables, ' + 'last ' + str(n_objs) +
' are the ' + 'objective values\n')
for solution in result:
line = ''
for i in range(len(solution.getVariables())):
line = line + (str(solution.getVariables()[i])) + ' '

for i in range(len(solution.getObjectives())):
line = line + (str(solution.getObjectives()[i])) + ' '

f.write(line[0:-1]+'\n')
f.write("#")
f.close()

# Create/write only objective values to files in folder "sets", 1 file per seed. Purpose is so that
# the file can be processed in MOEAFramework, where performance metrics may be evaluated across seeds.
f2 = open(output_location + os_fold + 'Borg_DPS_PySedSim_no_vars' + str(j+1) + '.set', 'w')
for solution in result:
line = ''
for i in range(len(solution.getObjectives())):
line = line + (str(solution.getObjectives()[i])) + ' '

f2.write(line[0:-1]+'\n')
f2.write("#")
f2.close()

print("Seed %s complete") %j

if parallel == 1:
bg.Configuration.stopMPI()  # stop parallel function evaluation process

def Op_Sys_Folder_Operator():
'''
Function to determine whether operating system is (1) Windows, or (2) Linux

Returns folder operator for use in specifying directories (file locations) for reading/writing data pre- and
post-simulation.
'''

if platform.system() == 'Windows':
os_fold_op = '\\'
elif platform.system() == 'Linux':
os_fold_op = '/'
else:
os_fold_op = '/'  # Assume unix OS if it can't be identified

return os_fold_op


Next, you will need to acquire the Borg wrapper using the instructions I specified in my previous blog post. You will need to make only two modifications: (1) modify the Borg class in borg.py so it accepts the inputs you want to pass to the simulation; and (2) some additional internal accounting in borg.py to ensure those inputs are passed to the borg.py methods that deal with your function handle. I will address these two in order.

First, modify the Borg class in borg.py so it now accepts an additional input (I only show some of the borg.py code here, just to indicate where changes are being made):


class Borg:
def __init__(self, numberOfVariables, numberOfObjectives, numberOfConstraints, function, epsilons = None, bounds = None, directions = None, add_sim_inputs=None):

# add_sim_inputs is the new input you will pass to borg



Then, modify the portion of the borg.py wrapper where self.function is called, so it can accommodate any simulation inputs you have specified.


self.function = _functionWrapper(function, numberOfVariables, numberOfObjectives, numberOfConstraints, directions)
else:
# More simulation inputs are specified and can be passed to the simulation handle



After the above, the last step is to modify the _functionWrapper method in borg.py:


def _functionWrapper(function, numberOfVariables, numberOfObjectives, numberOfConstraints, directions=None, addl_inputs=None):
# addl_inputs will be passed into the simulation model
def innerFunction(v,o,c):
global terminate
try:
result = function(*[v[i] for i in range(numberOfVariables)])
else:
result = function([v[i] for i in range(numberOfVariables)], addl_inputs)



# Making Watershed Maps in Python

This post builds off of earlier posts by Jon Lamontagne and Jon Herman on making global maps in Python using matplotlib and basemap. However rather than making a global map, I’ll show how to zoom into a particular region, here the Red River basin in East Asia. To make these maps, you’ll need to have basemap installed (from github here, or using a Windows installer here).

The first step is to create a basemap. Both Jons used the ‘robin’ global projection to do this in their posts. Since I’m only interested in a particular region, I just specify the bounding box using the lower and upper latitudes and longitudes of the region I’d like to plot. As Jon H points out, you can also specify the resolution (‘f’ = full, ‘h’ =high, ‘i’ = intermediate, ‘l’ = low, ‘c’ = crude), and you can even use different ArcGIS images for the background (see here). I use ‘World_Shaded_Relief’. It’s also possible to add a lot of features such as rivers, countries, coastlines, counties, etc. I plot countries and rivers. The argument ‘zorder’ specifies the order of the layering from 1 to n, where 1 is the bottom layer and n the top.


from mpl_toolkits.basemap import Basemap
from matplotlib import pyplot as plt

fig = plt.figure()
fig.set_size_inches([17.05,8.15])

# plot basemap, rivers and countries
m = basemap(llcrnrlat=19.5, urcrnrlat=26.0, llcrnrlon=99.6, urcrnr=107.5, resolution='h')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)



The above code makes the following image (it takes some time, since I’m using high resolution):

Now let’s add a shaded outline of the Red River basin. To do this, you need a shapefile of the basin. The FAO provides a shapefile of major watersheds in the world, from which you can extract the watershed you’re interested in using ArcGIS (see instructions here). In this shapefile, the Red River is labeled by its name in Vietnamese, ‘Song Hong.’ I chose not to draw the bounds of the basin in my map because it would be too busy with the country borders. Instead, I shaded the region gray (facecolor=’0.33′) with a slightly darker border (edgecolor=’0.5′) and slight transparency (alpha=0.5). To do that, I had to collect all of the patches associated with the shapefile (which I called ‘Basin’ when reading it in) that needed to be shaded.


from matplotlib.patches import Polygon
from matplotlib.collections import Patch Collection

# plot Red River basin
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
if info['OBJECTID'] == 1: # attribute in attribute table of shapefile
patches.append(Polygon(np.array(shape), True))



This creates the following image:

Now let’s add the locations of major dams and cities in the basin using ‘scatter‘. You could again do this by adding a shapefile, but I’m just going to add their locations manually, either by uploading their latitude and longitude coordinates from a .csv file or by passing them directly.


import numpy as np

# plot dams
damsLatLong = np.loadtxt('DamLocations.csv', delimiter=',', skiprows=1, usecols=[1,2])
x, y = m(damsLatLong[:,1], damLatLong[:,0]) # m(longitude, latitude)
m.scatter(x, y, c='k', s = 150, marker = '^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)



This makes the following image:

If we want to label the dams and cities, we can add text specifying where on the map we’d like them to be located. This may require some guess-and-check work to determine the best place (comment if you know a better way!). I temporarily added gridlines to the map to aid in this process using ‘drawparallels‘ and ‘drawmeridians‘.


# label dams and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center', va='center', color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')



Now our map looks like this:

That looks nice, but it would be helpful to add some context as to where in the world the Red River basin is located. To illustrate this, we can create an inset of the greater geographical area by adding another set of axes with its own basemap. This one can be at a lower resolution.


from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1) # locations
axins.set_xlim(90, 115) # longitude boundaries of inset map
axins.set_ylim(8, 28) # latitude boundaries of inset map

# remove tick marks from inset axes
plt.xticks(visible=False)
plt.yticks(visible=False)

# add basemap to inset map
m2 = Basemap(llcrnrlat=8.0, urcrnclat=28.0, llcrnr=90.0, urcrnrlon=115.0, resolution='l', ax=axins)
m2.drawcountries(color='k', linewidth=0.5)



This image looks like this:

Now let’s highlight a country of interest (Vietnam) in green and also add the Red River basin in light gray again.


# plot Vietnam green in inset
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
if info['Joiner'] == 1:
patches2.append(Polygon(np.array(shape), True))

# shade Red River basin gray in inset



Now our map looks like this:

Finally, let’s label the countries in the inset. Some of the countries are too small to fit their name inside, so we’ll have to create arrows pointing to them using ‘annotate‘. In this function, ‘xy’ specifies where the arrow points to and ‘xytext’ where the text is written relative to where the arrow points.


# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(102.5, 20.2, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(9.5, 21.0, 'China', fontsize=11, ha='center', va='center', color='k')

# add arrows to label Vietnam and Cambodia
plt.annotate('Vietnam', xy=(108.0, 14.0), xycoords='data', xytext=(5.0, 20.0), textcoords='offset points', \
color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)
plt.annotate('Cambodia', xy=(104.5, 12.0), xycoords='data', xytext=(-60.0, -25.0), textcoords='offset points', \
color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)



Now our map looks like this:

I think that’s pretty good, so let’s save it ;). See below for all the code used to make this map, with all the import statements at the beginning rather than sporadically inserted throughout the code!

If you’re looking for any other tips on how to make different types of maps using basemap, I recommend browsing through the basemap toolkit documentation and this basemap tutorial, where I learned how to do most of what I showed here.


from mpl_toolkits.basemap import Basemap
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes
from matplotlib import pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
import numpy as np

# set-up Vietnam basemap
fig = plt.figure()
fig.set_size_inches([17.05, 8.15])

# plot basemap, rivers and countries
m = Basemap(llcrnrlat=19.5,urcrnrlat=26.0,llcrnrlon=99.6,urcrnrlon=107.5,resolution='h')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)

# plot Red River basin
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
if info['OBJECTID'] == 1:
patches.append(Polygon(np.array(shape), True))

# plot dams
x, y = m(damsLatLong[:,1], damsLatLong[:,0])
m.scatter(x, y, c='k', s=150, marker='^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)

# label reservoirs and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center',va='center',color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1)
axins.set_xlim(90, 115)
axins.set_ylim(8,28)

plt.xticks(visible=False)
plt.yticks(visible=False)

m2 = Basemap(llcrnrlat=8.0,urcrnrlat=28.0,llcrnrlon=90.0,urcrnrlon=115.0,resolution='l',ax=axins)
m2.drawcountries(color='k',linewidth=0.5)

# plot Vietnam green in inset
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
if info['Joiner'] == 1:
patches2.append(Polygon(np.array(shape), True))

# shade Red River basin gray in inset

# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center',va='center',color='k')
plt.text(102.5, 20.2, 'Laos', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'Thailand', fontsize=11, ha='center', va='center', color='k')
plt.text(96.5, 21.0, 'Myanmar', fontsize=11, ha='center', va='center', color='k')

plt.annotate('Vietnam', xy=(108.0,14.0), xycoords='data', xytext=(5.0,20.0), textcoords='offset points', \
color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)
plt.annotate('Cambodia', xy=(104.5,12.0), xycoords='data', xytext=(-60.0,-25.0), textcoords='offset points', \
color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)

fig.savefig('RedRiverMap.png')
fig.clf()



# Using Borg in Parallel and Serial with a Python Wrapper – Part 1

Simulation and optimization are frequently used to solve complex water resources and environmental systems problems. By itself, a simulation model begs the question “what to simulate?” Similarly, by itself, an optimization model begs the question “is the solution really best?” For this reason, simulation and optimization models are frequently coupled.

This blog post is part 1 of a multi-part series that will demonstrate how I have coupled a pure Python simulation model with the multi-objective evolutionary optimization algorithm Borg. In this post, I will show how you can access Borg’s serial and/or parallelized (master-slave) implementations through a Python wrapper (borg.py).

Please see this previous blog post for some background about Borg, and how to obtain it. My instructions below assume you have access to the Borg files.

In the setup I will describe below, Borg parameterizes and iteratively refines solutions (e.g., reservoir operating policies) to a problem, optimizing them in response to their simulated performance with respect to multiple objectives.

• Serial (i.e., borg.c, libborg.so, etc.) and/or master-slave (i.e., borgms.c, libborgms.so, etc.) implementations of Borg, depending upon your ability to parallelize.
• Python wrapper for Borg (borg.py), which will allow you to to access Borg easily in Python.

You will need to create the following files yourself (I provide sample code below for these files):

• example_sim_opt.py—A python module that should contain two main functions:
1. A simulation caller, which takes decision variables and returns multi-objective performance. This function is called “Simulation_Caller” in the example_sim_opt.py code below.
2. An optimization function, which calls the Borg MOEA through its python wrapper borg.py. This borg.py wrapper provides extensive docstring documentation regarding required arguments, returned values, etc., so I do suggest reading through the wrapper if you have questions (e.g., about the python data types of arguments and returns).

Note that the file and function names above are just example names. You can name the above files whatever you want. Just be sure to modify the code I provide below to reflect the new names.

A sample of code for example_sim_opt.py is as follows:

import numpy as np
import pysedsim # This is your simulation model
import platform  # helps identify directory locations on different types of OS

def Simulation_Caller(vars):
'''
Purpose: Borg calls this function to run the simulation model and return multi-objective performance.

Note: You could also just put your simulation/function evaluation code here.

Args:
vars: A list of decision variable values from Borg

Returns:
performance: policy's simulated objective values. A list of objective values, one value each of the objectives.
'''

borg_vars = vars  # Decision variable values from Borg

# Reformat decision variable values as necessary (.e.g., cast borg output parameters as array for use in simulation)
op_policy_params = np.asarray(borg_vars)
# Call/run simulation model, return multi-objective performance:
performance = pysedsim.PySedSim(decision_vars = op_policy_params)
return performance

def Optimization():

'''

Purpose: Call this method from command line to initiate simulation-optimization experiment

Returns:
--pareto approximate set file (.set) for each random seed
--Borg runtime file (.runtime) for each random seed

'''

import borg as bg  # Import borg wrapper

parallel = 1  # 1= master-slave (parallel), 0=serial

# The following are just examples of relevant MOEA specifications. Select your own values.
nSeeds = 25  # Number of random seeds (Borg MOEA)
num_dec_vars = 10  # Number of decision variables
n_objs = 6  # Number of objectives
n_constrs = 0  # Number of constraints
num_func_evals = 30000  # Number of total simulations to run per random seed. Each simulation may be a monte carlo.
runtime_freq = 1000  # Interval at which to print runtime details for each random seed
decision_var_range = [[0, 1], [4, 6], [-1,4], [1,2], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1]]
epsilon_list = [50000, 1000, 0.025, 10, 13, 4]  # Borg epsilon values for each objective

# Where to save seed and runtime files
main_output_file_dir = 'E:\output_directory'  # Specify location of output files for different seeds
os_fold = Op_Sys_Folder_Operator()  # Folder operator for operating system
output_location = main_output_file_dir + os_fold + 'sets'

# If using master-slave, start MPI. Only do once.
if parallel == 1:
bg.Configuration.startMPI()  # start parallelization with MPI

# Loop through seeds, calling borg.solve (serial) or borg.solveMPI (parallel) each time
for j in range(nSeeds):
# Instantiate borg class, then set bounds, epsilon values, and file output locations
borg = bg.Borg(num_dec_vars, n_objs, n_constrs, Simulation_Caller)
borg.setBounds(*decision_var_range)  # Set decision variable bounds
borg.setEpsilons(*epsilon_list)  # Set epsilon values
# Runtime file path for each seed:
runtime_filename = main_output_file_dir + os_fold + 'runtime_file_seed_' + str(j+1) + '.runtime'
if parallel == 1:
# Run parallel Borg
result = borg.solveMPI(maxEvaluations='num_func_evals', runtime=runtime_filename, frequency=runtime_freq)

if parallel == 0:
# Run serial Borg
result = borg.solve({"maxEvaluations": num_func_evals, "runtimeformat": 'borg', "frequency": runtime_freq,
"runtimefile": runtime_filename})

if result:
# This particular seed is now finished being run in parallel. The result will only be returned from
# one node in case running Master-Slave Borg.
result.display()

# Create/write objective values and decision variable values to files in folder "sets", 1 file per seed.
f = open(output_location + os_fold + 'Borg_DPS_PySedSim' + str(j+1) + '.set', 'w')
f.write('#Borg Optimization Results\n')
f.write('#First ' + str(num_dec_vars) + ' are the decision variables, ' + 'last ' + str(n_objs) +
' are the ' + 'objective values\n')
for solution in result:
line = ''
for i in range(len(solution.getVariables())):
line = line + (str(solution.getVariables()[i])) + ' '

for i in range(len(solution.getObjectives())):
line = line + (str(solution.getObjectives()[i])) + ' '

f.write(line[0:-1]+'\n')
f.write("#")
f.close()

# Create/write only objective values to files in folder "sets", 1 file per seed. Purpose is so that
# the file can be processed in MOEAFramework, where performance metrics may be evaluated across seeds.
f2 = open(output_location + os_fold + 'Borg_DPS_PySedSim_no_vars' + str(j+1) + '.set', 'w')
for solution in result:
line = ''
for i in range(len(solution.getObjectives())):
line = line + (str(solution.getObjectives()[i])) + ' '

f2.write(line[0:-1]+'\n')
f2.write("#")
f2.close()

print("Seed %s complete") %j

if parallel == 1:
bg.Configuration.stopMPI()  # stop parallel function evaluation process

def Op_Sys_Folder_Operator():
'''
Function to determine whether operating system is (1) Windows, or (2) Linux

Returns folder operator for use in specifying directories (file locations) for reading/writing data pre- and
post-simulation.
'''

if platform.system() == 'Windows':
os_fold_op = '\\'
elif platform.system() == 'Linux':
os_fold_op = '/'
else:
os_fold_op = '/'  # Assume unix OS if it can't be identified

return os_fold_op


The following is an example of how you would submit a batch script on a Linux cluster to run a parallelized simulation-optimization experiment using the example_sim_opt.py and borg.py files. Note that in the parallelized version, simulations (i.e., “function evaluations”) are being run in parallel by separate processors.

You would need the following two files:

1. example_sim_opt_caller.py. This is a python file that is used to call example_sim_opt.py
2. example_sim_opt_batch_script.pbs. This is batch script that runs example_sim_opt_caller.py in parallel on a cluster using open MPI.

Example code for example_sim_opt_caller.py:


'''
Purpose: To initiate the optimization process, which will iteratively call the simulation model.
'''

import example_sim_opt  # Import main optimization module that uses borg python wrapper

# Module within example
example_sim_opt.Optimization()



Example code for example_sim_opt_batch_script.pbs:


#PBS -l nodes=8:ppn=16
#PBS -l walltime=24:00:00
#PBS -j oe
#PBS -o pysedsim_output.out

cd \$PBS_O_WORKDIR
source /etc/profile.d/modules.sh