# Welcome to our blog!

Welcome to Water Programming! This blog is a collaborative effort by Pat Reed’s group at Cornell, Joe Kasprzyk’s group at CU Boulder, Jon Herman’s group at UC Davis, and others who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software:  Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here. To gain access please email joseph.kasprzyk “at” colorado.edu.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email joseph.kasprzyk “at” colorado.edu. You’ll need a WordPress.com account.

# Introduction

Automating data retrieval helps you avoid irritating, repetitive manual downloads. Often, these data files are separated by a range of times, site locations, or measurement types, making them cumbersome to download by hand. This blog post outlines how to download multiple zipped csv files from a webpage using both R and Python.

We will specifically explore downloading historical hourly locational marginal pricing (LMP) data files from PJM, a regional transmission organization coordinating a wholesale electricity market across 13 Midwestern, Southeastern, and Northeastern states. The files in question are located here.

# First, Using Python

A distinct advantage of using Python over R is that the same result usually takes fewer lines of code. However, the differences in execution time between the two languages are generally minimal.

### Constructing URLs

The URL for the most recent file from the webpage, for August 2017, has the following format: “http://www.pjm.com/pub/account/lmpmonthly/201708-da.zip”
The URL is straightforward in its structure, which allows for a fairly simple approach to constructing it from scratch.

To begin, we can deconstruct the URL into three parts: the base URL, the date, and the file extension.
The base URL is fairly straightforward because it is constant: “http://www.pjm.com/pub/account/lmpmonthly/”
As is the file extension: “-da.zip”

The remaining piece is the date: “201708”
It is in the form “yyyymm”, requiring a 4-digit year (luckily these records don’t go back to the first millennium) and a zero-padded 2-digit month.

In Python, this is easy to reconstruct using the .format() string method. All that is required is using the format-specification mini-language within the curly braces. Note that the curly braces are inserted within the string in the following form: {index:format_spec}
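As a quick illustration (the values here are just examples): {0} substitutes the first argument unchanged, while {1:02d} renders the second argument as a zero-padded, two-digit integer.

```python
# {0} inserts the first argument as-is; {1:02d} zero-pads the second
# argument to two digits, so August 2017 becomes "201708".
date_string = "{0}{1:02d}".format(2017, 8)
print(date_string)  # -> 201708
```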

In this example, we follow the format above, deconstruct the URL, and then reassemble it within one line in Python. Assigning this string to a variable allows for easy iteration by simply changing the inputs. So let’s create a quick function named url_creator that takes a date in the format (yyyy, mm) and returns the constructed URL as a string:

def url_creator(date):
    """
    Input is a list [yyyy, mm]
    """
    yyyy, mm = date

    return "http://www.pjm.com/pub/account/lmpmonthly/{0}{1:02d}-da.zip".format(yyyy, mm)
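As a quick sanity check (the function is repeated here so the snippet is self-contained), passing [2017, 8] should reproduce the August 2017 URL discussed above:

```python
def url_creator(date):
    """Input is a list [yyyy, mm]; returns the download URL as a string."""
    yyyy, mm = date
    return "http://www.pjm.com/pub/account/lmpmonthly/{0}{1:02d}-da.zip".format(yyyy, mm)

print(url_creator([2017, 8]))
# -> http://www.pjm.com/pub/account/lmpmonthly/201708-da.zip
```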


To quickly generate the dates we’re interested in, we can create a list with entries in the format (yyyy, mm) for each month using a couple of nested for loops. If you need alternative dates, you can easily alter this loop or manually create a list of lists to accommodate your needs:

#for loop method
dates = []
for i in range(17):
    for j in range(12):
        dates.append([2001 + i, j + 1])

#alternative manual method
dates = [[2011, 1], [2011, 2], [2011, 3]]


### Retrieving and Unpacking Zipped Files

Now that we can create the URL for this specific file type by calling the url_creator function and have all of the dates we may be interested in, we need to be able to access, download, unzip/extract, and subsequently save these files. Utilizing the urllib library, we can request and download the zipped file. Note that urllib.request.urlretrieve only retrieves the file and does not attempt to read it. While we could simply save the zipped file at this step, it is preferable to extract it now to prevent headaches down the line.

Utilizing the zipfile library, we can extract the downloaded file to a specified folder. Notably, I use the operating system interface function os.getcwd while extracting so the resulting csv file is saved into the directory where this script is running. Following this, the extracted file is closed.

import zipfile, urllib, os
from urllib.request import Request, urlopen, urlretrieve

for date in dates:
    baseurl = url_creator(date)

    local_filename, headers = urllib.request.urlretrieve(url = baseurl)
    zip_ref = zipfile.ZipFile(file = local_filename, mode = 'r')
    zip_ref.extractall(path = os.getcwd())     #os.getcwd() directs to current working directory
    zip_ref.close()


At this point, the extracted csv files will be located in the directory where this script was running.
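The loop above needs network access to PJM’s servers. As an offline illustration of just the extraction step, the following self-contained sketch (filenames and contents are made up for the example) creates a small zipped csv, extracts it with the same zipfile calls, and confirms the csv lands in the target directory:

```python
import os, tempfile, zipfile

# Work in a throwaway directory so we don't clutter the real cwd
workdir = tempfile.mkdtemp()

# Create a stand-in for a downloaded archive: a zip holding one csv
archive_path = os.path.join(workdir, "201708-da.zip")
with zipfile.ZipFile(archive_path, mode='w') as zf:
    zf.writestr("201708-da.csv", "hour,lmp\n1,25.3\n2,27.1\n")

# Extract exactly as in the download loop above
zip_ref = zipfile.ZipFile(file=archive_path, mode='r')
zip_ref.extractall(path=workdir)
zip_ref.close()

print(os.path.exists(os.path.join(workdir, "201708-da.csv")))  # -> True
```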

# Alternatively, Using R

We will be using csv files to specify the date ranges, instead of the for loops shown in the Python example above. The advantage of using csv files rather than indexing i=2001:2010 is that if you have a list of non-consecutive months or years, it is often easier to list the dates and cycle through the elements of that list than to figure out how to index the dates directly. In this example, however, it would be just as easy to index through the years and months rather than the csv files.

This first block of code sets the working directory and reads in two csv files from the directory. The first csv file contains a list of the years 2001-2010. The second csv file lists out the months January-December as numbers.

#Set working directory
#specifically the directory where we want to "dump" the data
setwd("E:/Data/WebScraping/PJM/Markets_and_operations/Energy/Real_Time_Monthly/R")

#Read in the csv files listing the relevant years and months
#(filenames here are placeholders; use whatever you named your two lists)
years=read.csv("years.csv",header=FALSE)
months=read.csv("months.csv",header=FALSE)

The breakdown of the URL was previously shown. To create it in R, we will write a for-loop that cycles the link through all the years and months listed in the csv files. The indices i and j iterate through the elements of the year and month csv files, respectively. (The if statement will be discussed at the end of this section.) We then use the paste function to assemble the entire link into a string, stored in the variable downloadurl: the parts of the link surrounded by quotes are unchanging, the as.character function converts the elements cycled in from the csv files into characters, and sep="" specifies that no space should separate the pieces. The next two lines download the file into R as a temporary file.

#i indexes through the years and j indexes through the months
for (i in (1:16)){
for( j in (1:12)){
if (j>9){

#build the url as a string, then download it and save it as a temporary file
downloadurl=paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1]),as.character(months[j,1]),"-da.zip",sep="")
temp="tempfile"
download.file(downloadurl,temp)

Next, we unzip the temporary file, find the csv inside (which has a name of the form 200101-da.csv), and read it into the R environment as “x”. We then assign a name built from the year and month (e.g. “200101”) to “x”. Finally, x is written as a csv file stored in the working directory, the temporary file is deleted, and the for-loop starts over again.

#unzip the temp file and read the csv into the global environment
x=read.csv(unz(temp,paste(as.character(years[i,1]),as.character(months[j,1]),"-da.csv",sep="")))
#Assign a name to the csv "x"
newname=paste(as.character(years[i,1]),as.character(months[j,1]),sep="")
assign(newname,x)
#create a csv that is stored in your working directory
write.csv(x,file=paste(newname,".csv",sep=""))
#delete the temporary file
unlink(temp)

}

#If j is between 1 and 9, the only difference is that a "0" has to be added
#in front of the month number

### Overcoming Formatting

The reason for the “if” statement is that it is not a trivial process to properly format the dates in Excel when writing the list of dates to a csv. When entering “01”, the entry is simply changed to “1” when the file is saved as a csv. However, note that in the names of the files, the first nine months have a 0 preceding their number. Therefore, the URL construction must be modified for these months. The downloadurl has been changed so that a 0 is added before the month number when j < 10. Aside from the hard-coded zeroes, this block of code is the same as the one above.
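As an aside (not part of the R script), Python sidesteps this zero-padding issue entirely, since padding is built into the language:

```python
# str.zfill pads with leading zeros, so months 1-9 and 10-12 take the
# same code path: 1 -> "01", 10 -> "10".
for year, month in [(2001, 1), (2001, 10)]:
    print("{}{}".format(year, str(month).zfill(2)))
# -> 200101
# -> 200110
```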

else {

#months 1-9 need a "0" added in front of the month number
downloadurl=paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1]),"0",as.character(months[j,1]),"-da.zip",sep="")
temp="tempfile"
download.file(downloadurl,temp)

x=read.csv(unz(temp,paste(as.character(years[i,1]),"0",as.character(months[j,1]),"-da.csv",sep="")))
newname=paste(as.character(years[i,1]),"0",as.character(months[j,1]),sep="")
assign(newname,x)
write.csv(x,file=paste(newname,".csv",sep=""))
unlink(temp)

}
}
}


# Acknowledgements

This was written in collaboration with Rohini Gupta who contributed by defining the problem and explaining her approach in R.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1650441. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Versions: R = 3.4.1 Python = 3.6.2
To see the code in its entirety, please visit the linked GitHub Repository.

# Generating Interactive Visuals in R

### Visuals vs. Visual Analytics

How do visuals differ from visual analytics? In a scientific sense, a visual is a broad term for any picture, illustration, or graph that can be used to convey an idea. However, visual analytics is more than just generating a graph of complex data and handing it to a decision maker. Visual analytic tools help create graphs that allow the user to interact with the data, whether that involves manipulating a graph in three-dimensional space or allowing users to filter or brush for solutions that match certain criteria. Ultimately, visual analytics seeks to help in making decisions as fast as possible and to “enable learning through continual [problem] reformulation” (Woodruff et al., 2013) by presenting large data sets in an organized way so that the user can better recognize patterns and make inferences.

My goal with this blog post is to introduce two R libraries that are particularly useful to develop interactive graphs that will allow for better exploration of a three-dimensional space. I have found that documentation on these libraries and potential errors was sparse, so this post will consolidate my hours of Stack Overflow searching into a step-by-step process to produce beautiful graphs!

### R Libraries

### Use rgl to create a GIF of a 3D graph

Spinning graphs can be especially useful to visualize a 3D Pareto front, and they make a nice visual for a PowerPoint presentation. I will be using an example three-objective Pareto set from Julie’s work on the Red River Basin for this tutorial. The script is broken down and explained in the following sections.


#Set working directory
setwd("C:/Users/Me/Folder/Blog_Post_1")

#Read in csv of pareto set
data=read.csv("ieee_synthetic_thinned.csv")

#Create three vectors for the three objectives
hydropower=data$WcAvgHydro
deficit=data$WcAvgDeficit
flood=data$WcAvgFlood

In this first block of code, the working directory is set, the data set is imported from a csv file, and each column of the data frame is saved as a conveniently named vector. Now we will generate the plot.

#call the rgl library
library(rgl)

#Adjust the size of the window
par3d(windowRect=c(0,0,500,500))

If the rgl package isn’t installed on your computer yet, simply type install.packages("rgl") into the console. Otherwise, use the library function to call the rgl package. The next line of code adjusts the window that the graph will pop up in. The default window is very small, and as such, the movie will have a low resolution if the window is not adjusted.

#call the RColorBrewer library for the color palette
library(RColorBrewer)

#Plot the set in 3D space
plot3d(hydropower,deficit,flood,col=brewer.pal(8,"Blues"), size=2, type='s', alpha=0.75)

Let’s plot these data in 3D space. The first three arguments of the plot3d function are the x, y, and z vectors, respectively. The rest of the parameters are subject to your personal preference. I used Color Brewer (install the “RColorBrewer” package) to color the data points in different blue gradients: the first value is the number of colors that you want, and the second value is the color set. Color Brewer sets can be found here: http://www.datavis.ca/sasmac/brewerpal.html. My choice of colors is arbitrary, so I opted not to create a color scale. Creating a color scale is more involved in rgl; one option is to split your data into classes and use legend3d and the cut function to slice your legend into color levels, but there simply isn’t an easy way to create a color scale in rgl. Finally, I wanted my data points to be spheres, of size 2, that are 75% opaque, which is specified with type, size, and alpha respectively. plot3d will open a window with your graph, which you can rotate with your mouse. Now, let’s make a movie of the graph.
The movie3d function requires that you install ImageMagick, a piece of software that creates a GIF by stitching together multiple pictures. ImageMagick also has cool functionality for editing, resizing, and layering pictures. It can be installed from within R using the first two lines of code below. Make sure not to re-run these lines once ImageMagick is installed. Note that ImageMagick doesn’t have to be installed in your directory, just on your computer.

require(installr)
install.ImageMagick()

#Create a spinning movie of your plot
movie3d(spin3d(axis = c(0, 0, 1)), duration = 20, dir = getwd())

The last line of code generates the movie. I have specified that I want the plot to spin about the z axis, set a duration (you can play around with the number to see what suits your data), and asked for the movie to be saved in my current working directory. The resulting GIF is below. If the GIF has stopped running, reload the page and scroll down to this section again.

I have found that creating the movie can be a bit finicky, and this last step is where errors usually occur. When you execute your code, make sure that you keep the plot window open while ImageMagick stitches together the snapshots, otherwise you will get an error. If you have errors, please feel free to share, because I most likely had them at one point and was able to ultimately fix them. Overall, I found this package useful for a quick overview of the 3D space, but I wasn’t pleased with the way the axis values and titles sometimes overlap when the graph spins. The way to work around this is to set the labels and title to NULL and insert your own non-moving labels and title when you add the GIF to a PowerPoint presentation.

### Use plotly to create an interactive scatter plot

I much prefer the plotly package to rgl for its aesthetics, the ease of creating a color scale, and the ability to mouse over points to obtain coordinate values in a scatter plot.
Plotly is an open-source JavaScript graphing library with an R API. The first step is to create a Plotly account at https://plot.ly/. Once you have confirmed your email address, head to https://plot.ly/settings/api to get an API key. Click the “regenerate key” button and you’ll get a 20-character key that will be used to create a shareable link to your chart. Perfect, now we’re ready to get started!

setwd("C:/Users/Me/Folder/Blog_Post_1")

library(plotly)
library(ggplot2)

#Set environment variables
Sys.setenv("plotly_username"="rg727")
Sys.setenv("plotly_api_key"="insert key here")

#Read in pareto set data
pareto=read.csv("ieee_synthetic_thinned.csv")

Set the working directory, load the relevant libraries, set the environment variables, and read in the data set. Be sure to insert your API key; you will need to regenerate a new key every time you make a new graph. Also, note that your data must be in the form of a data frame for plotly to work.

#Plot your data
plot=plot_ly(pareto, x = ~WcAvgHydro, y = ~WcAvgDeficit, z = ~WcAvgFlood, color = ~WcAvgFlood, colors = c('#00d6f7', '#ebfc2f')) %>%
add_markers() %>%
#Add a main title and axes titles
layout(title="Pareto Set", scene = list(xaxis = list(title = 'Hydropower'), yaxis = list(title = 'Deficit'), zaxis = list(title = 'Flood')))

#call the name of your plot to make it appear in the viewer
plot

To correctly use the plot_ly command, the first input needed is the data frame, followed by the column names of the x, y, and z columns in the data frame, each preceded by a “~”. I decided that I wanted the colors to scale with the value of the z variable. The colors were defined using color codes available at http://www.color-hex.com/. Use the layout function to add a main title and axis labels. Finally, call the name of your plot and you will see it appear in the viewer at the lower right of your screen. If your viewer shows up blank with only the color scale, click on the viewer or click “zoom”.
Depending on how large the data set is, it may take some time for the graph to load.

#Create a link to your chart and call it to launch the window
chart_link = api_create(plot, filename = "public-graph")
chart_link

Finally, create the chart link using the first line of code above; the next line will launch the graph in Plotly. Copy and save the URL, and anyone with it can access your graph, even if they don’t have a Plotly account. Play around with the cool capabilities of my Plotly graph, like mousing over points, rotating, and zooming! https://plot.ly/~rg727/1/

Sources:

http://www.sthda.com/english/wiki/a-complete-guide-to-3d-visualization-device-system-in-r-r-software-and-data-visualization

https://plot.ly/r/3d-scatter-plots/

Woodruff, M.J., Reed, P.M. & Simpson, T.W. Struct Multidisc Optim (2013) 48: 201. https://doi.org/10.1007/s00158-013-0891-z

James J. Thomas and Kristin A. Cook (Ed.) (2005). Illuminating the Path: The R&D Agenda for Visual Analytics. National Visualization and Analytics Center.

# Python example: Hardy Cross method for pipe networks

I teach a class called Water Resources Engineering at the University of Colorado Boulder, and I have often added examples where students use Python to perform the calculations. For those of you who are new to programming, you might get a kick out of this example, since it goes through an entire procedure for solving an engineering problem using Python and the numpy library.

The example is two YouTube videos embedded below! The best way to watch them, though, is in full screen, which you can get to by clicking the name of the video, opening up a new YouTube tab in your browser.

Let us know if you have any questions in the comments section below. For more on Python from this blog, search “Python” in the search box.

# Time Series Modeling: ARMA Notation (part 1)

Hydrological, meteorological, and ecological observations are often a special type of data: a time series.
A time series consists of observations (say, streamflow) at equally-spaced intervals over some period of time. Many of us on this blog are interested in running simulation-optimization models which receive time series data as an input. But the time series data from the historical record may be insufficient for our work; we also want to create synthetic time series data to explore a wider range of scenarios. To do so, we need to fit a time series model. If you are uncertain why we would want to generate synthetic data, check out Jon L.’s post “Synthetic streamflow generation” for some background. If you are interested in some applications, read up on this 2-part post from Julie.

A common time series model is the autoregressive moving average (ARMA) model. This model has many variations, including the autoregressive integrated moving average (ARIMA) model, seasonal ARIMA (SARIMA) models, and ARIMA models with external covariates (ARIMAX and SARIMAX). This class of models is useful, but it has its own special notation which can be hard to unpack. Take the SARIMA model for example:

$\Phi(B^S)\phi(B)(x_t-\mu)=\Theta(B^S)\theta(B)\epsilon_t\qquad\qquad(1)$

Confused yet? Me too. What are those functions? What does the B stand for? To help figure that out, I’m devoting two blog posts to breaking down some time series notation into bite-sized pieces. In this first post, I will unpack the ARMA model (eq. 2) and in the second I will move on to the SARIMA model (eq. 1). Here we go!

$\phi(B)(x_t-\mu)=\theta(B)\epsilon_t\qquad\qquad(2)$

### Autoregressive (AR) and moving average (MA) models

An ARMA model is a generalized form of two different models: the autoregressive (AR) and moving average (MA). Both the AR (eq. 3) and MA (eq. 4) models have a single parameter, p and q, respectively, which represents the order of the model.

$AR(p): x_t=c+\phi_1x_{t-1}+\phi_2x_{t-2}+...+\phi_px_{t-p}+\epsilon_t\qquad\qquad(3)$

$MA(q): x_t=\mu+\epsilon_t+\theta_1\epsilon_{t-1}+\theta_2\epsilon_{t-2}+...+\theta_q\epsilon_{t-q}\qquad\qquad(4)$

The c and μ are constants, the x’s are the time series observations, the θ’s and φ’s are weighting parameters for the different lagged terms, and ε represents a random error term (i.e., it has a normal distribution with mean zero). You can see already how these equations might get a bit tedious to write out. Using what is known as a backshift operator and defining specific polynomials for each model, we can use less ink to get the same point across.

#### Backshift operator

The backshift (also known as the lag) operator, B, is used to designate different lags on a particular time series observation. Applying the backshift operator to the observation at the current timestep, x_t, yields the one from the previous timestep, x_{t-1} (also known as lag 1).

$Bx_t=x_{t-1}$

It doesn’t save much ink in this simple example, but with more model terms the backshift operator comes in handy. Using this operator, we can represent any lagged term by raising B to the power of the desired lag. Let’s say we want to represent lag 2 of x_t:

$B^2x_t=x_{t-2}$

Or possibly the lag 12 term:

$B^{12}x_t=x_{t-12}$

#### Example 1: AR(2) – order two autoregressive model

Let’s apply the backshift operator to the AR(2) model as an example. First, let’s specify the model in our familiar notation:

$AR(2): x_t=c+\phi_1x_{t-1}+\phi_2x_{t-2}+\epsilon_t$

Now, let’s apply the backshift operator:

$x_t=c+\phi_1Bx_t+\phi_2B^2x_t+\epsilon_t$

Notice that x_t shows up a few times in this equation, so let’s rearrange the model and simplify:

$(1-\phi_1B-\phi_2B^2)x_t=c+\epsilon_t$

Once we’ve gotten to this point, we can define a backshift polynomial to further distill this equation down. For order two autoregressive models, this polynomial is defined as

$\phi(B)=1-\phi_1B-\phi_2B^2$

Combine this with the above equation to get the final form of the AR(2) equation:

$\phi(B)x_t=c+\epsilon_t$

#### Example 2: MA(1) – order one moving average model

Starting to get the hang of it? Now we’re going to apply the same approach to an MA(1) model:

$MA(1): x_t=\mu+\epsilon_t+\theta_1\epsilon_{t-1}$

Now let’s apply the backshift operator:

$x_t=\mu+\epsilon_t+\theta_1B\epsilon_t$

Rearrange and simplify by grouping the ε_t terms together:

$x_t=\mu+(1+\theta_1B)\epsilon_t$

Define a backshift polynomial to substitute for the terms in the parentheses:

$\theta(B)=1+\theta_1B$

Substitute the polynomial to reach the compact notation:

$x_t=\mu+\theta(B)\epsilon_t$

### Autoregressive moving average (ARMA) models

Now that we’ve had some practice with the AR and MA models, we can move on to ARMA models. As the name implies, the ARMA model is simply a hybrid of the AR and MA models. As a shorthand, AR(p) is equivalent to ARMA(p,0) and MA(q) is the same as ARMA(0,q). The full ARMA(p,q) model is as follows:

$ARMA(p,q): x_t=\mu+\epsilon_t+\phi_1x_{t-1}+\phi_2x_{t-2}+...+\phi_px_{t-p}+\theta_1\epsilon_{t-1}+\theta_2\epsilon_{t-2}+...+\theta_q\epsilon_{t-q}$

#### Example 3: ARMA(1,2)

For the grand finale, let’s take the ARMA model from its familiar (but really long) form and put it in more compact notation. As an example we’ll look at the ARMA(1,2) model:

$ARMA(1,2): x_t=\mu+\epsilon_t+\phi_1x_{t-1}+\theta_1\epsilon_{t-1}+\theta_2\epsilon_{t-2}$

First, apply the backshift operator:

$x_t=\mu+\epsilon_t+\phi_1Bx_t+\theta_1B\epsilon_t+\theta_2B^2\epsilon_t$

Rearrange and simplify by grouping the terms from the current timestep, t. (If you are confused by this step, check out Clarifying Note #2.)

$(1-\phi_1B)(x_t-\mu)=(1+\theta_1B+\theta_2B^2)\epsilon_t$

Substitute the polynomials defined for AR and MA to reach the compact notation:

$\phi(B)(x_t-\mu)=\theta(B)\epsilon_t$

where $\phi(B)=1-\phi_1B$ and $\theta(B)=1+\theta_1B+\theta_2B^2$.

And that’s it! Hopefully that clears up ARMA model notation. In the next post, I’ll go into seasonal ARIMA models, which incorporate seasonality and non-stationarity into ARMA models.

### Clarifying Notes

1. There are many different conventions for the symbols used in these equations. For example, the backshift operator (B) is also known as the lag operator (L). Furthermore, sometimes the constants used in AR, MA, and ARMA models are omitted with the assumption that the series is centered around 0. I’ve decided to use the form which corresponds to agreement between a few sources with which I’m familiar and is consistent with their Wikipedia pages.

2. What does it mean for a backshift operator to be applied to a constant, like μ in equation 2? Based on my understanding, a backshift operator has no effect on constants: Bμ = μ. This makes sense because a backshift operator is time-dependent but a constant is not. I don’t know why some of these equations have constants multiplied by the backshift operator, but it appears to be the convention, even though it seems more confusing to me.

3. One question you may be asking is, “why don’t we just use summation terms to shorten these equations?” For example, why don’t we represent the AR(p) model like this?

$AR(p): x_t=c+\sum_{i=1}^{p}\phi_ix_{t-i}+\epsilon_t$

We can definitely represent these equations with a summation, and for simple models (like the ones we’ve discussed) that might make more sense. However, as these models get more complicated, the backshift operators and polynomials make things more efficient.
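To tie the notation back to actual numbers, here is a minimal sketch (not from the original post; parameter values are made up) that simulates an AR(2) process exactly as written in eq. 3. With the noise switched off, a stationary AR(2) settles to its mean c / (1 - φ1 - φ2), which gives a quick way to check the recursion:

```python
import random

def simulate_ar2(c, phi1, phi2, n, sigma=1.0, seed=None):
    """Simulate an AR(2) process: x_t = c + phi1*x_{t-1} + phi2*x_{t-2} + eps_t."""
    rng = random.Random(seed)
    x = [c, c]  # arbitrary start-up values
    for _ in range(n):
        eps = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
        x.append(c + phi1 * x[-1] + phi2 * x[-2] + eps)
    return x

# With sigma=0 the series converges to c / (1 - phi1 - phi2):
# 1 / (1 - 0.5 - 0.2) = 1 / 0.3, about 3.333
series = simulate_ar2(c=1.0, phi1=0.5, phi2=0.2, n=500, sigma=0.0)
print(round(series[-1], 3))  # -> 3.333
```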
## References

Applied Time Series Analysis, The Pennsylvania State University: https://onlinecourses.science.psu.edu/stat510/ (Note: this is a GREAT resource for anyone looking for a more extensive treatment of time series analysis. This blog post was inspired largely by my own attempt to understand Lessons 1-4 and apply them to my own research.)

Chatfield, Chris. The Analysis of Time Series: An Introduction. CRC Press, 2016.

Wikipedia: https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model

# Introduction to Docker

In this post we’ll learn the principles of Docker, and how to use Docker with large quantities of input / output data.

### 1. What is Docker?

Docker is a way to build virtual machines from a file called the Dockerfile. That virtual machine can be built anywhere with the help of that Dockerfile, which makes Docker a great way to port models and the architecture that is used to run them (e.g., the Cube: yes, the Cube can be ported in that way, with the right Dockerfile, even though that is not the topic of this post). Building it creates an image (a file), and a container is a running instance of that image, where one can log on and work. By definition, containers are transient, and removing a container does not affect the image.

### 2. Basic Docker commands

This part assumes that we already have a working Dockerfile. A Dockerfile contains a series of instructions to build the container we want to work in. To build a container for the WBM model from a Dockerfile, let us go to the folder where the Dockerfile is and enter:

docker build -t myimage -f Dockerfile .

The call docker build means that we want to run a Dockerfile; -t means that we name, or “tag”, our image, here giving it the name “myimage”; -f specifies which Dockerfile we are using, in case there are several in the current folder; and “.” says that we run the Dockerfile and build the container in the current folder.
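For readers who don’t yet have a Dockerfile on hand, a minimal sketch might look like the following (the base image and packages are illustrative assumptions, not the WBM model’s actual requirements):

```dockerfile
# Illustrative Dockerfile sketch -- base image and packages are assumptions
FROM ubuntu:20.04

# Install whatever toolchain your model needs
RUN apt-get update && apt-get install -y build-essential python3

# Copy the model's source code into the image
COPY . /model
WORKDIR /model

# Default command when a container starts
CMD ["/bin/bash"]
```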
Options -t and -f are optional in theory, but the tag -t is very important, as it gives a name to your built image. If we don’t do that, we’ll have to go through the whole build every time we want to run a Docker container from the Dockerfile, which would waste a lot of time.

Once the Docker image is built, we can run it; in other words, have a virtual machine running on the computer / cluster / cloud where we are working. To do that, we enter:

docker run -dit myimage

The options are as follows: -d means that we do not directly enter the container, and instead have it running in the background while the call returns the container’s hexadecimal ID; -i means that we keep standard input open; and -t allocates a terminal. The final argument, myimage, is the tag we gave the image at build time.

We can now check that a container is running by listing all the running containers with:

docker ps

In particular, this displays a list of the hexadecimal IDs associated with each running container. After that, we can enter the container by typing:

docker exec -i -t hexadecimalID /bin/bash

where -i and -t play the same roles as before, the hexadecimal ID identifies the container (which we retrieved with docker ps), and the second argument, /bin/bash, starts a shell inside the container in the standard way.

Once in the container, we can run all the processes we want. Once we are ready to exit the container, we can exit it by typing… exit.

Once outside of the container, we can re-enter it as long as it still runs. If we want it to stop running, we use the following command to “kill” it (not my choice of words!):

docker kill hexadecimalID

A shortcut to calling all these commands in succession is to use the following version of docker run:

docker run -it myimage /bin/bash

This command logs us onto a container from the image as if we had typed run and exec at the same time (using the shell /bin/bash). Note that option -d is not used in this call.
Also note that upon typing exit, we will not only exit the container, but also stop it. This means that we don’t have to retrieve its hexadecimal ID to log on to the container, nor to kill it.

Even if the container is not running any more, it can be re-started and re-entered by retrieving its hexadecimal ID. The docker ps command only lists running containers, so to list all the containers, including those that are no longer running, we type:

docker ps -a

We can then restart and re-enter the container with the following commands:

docker restart hexadecimalID
docker exec -it hexadecimalID /bin/bash

Note the absence of options for docker restart. Once we are truly done with a container, it can be removed from the list of previously running containers by using:

docker rm hexadecimalID

Note that you can only remove a container that is not running.

### 3. Working with large input / output data sets

Building large quantities of data directly into the container when calling docker build has three major drawbacks. First, building the Docker image will take much more time, because we will need to transfer all that data every time we call docker build. This wastes a lot of time if we are tinkering with the structure of our container and running the Dockerfile several times. Second, every container will take up a lot of space on disk, which can prove problematic if we are not careful and have many containers for the same image (it is so easy to run new containers!). Third, output data will be generated within the container and will need to be copied to another place while still inside the container.

An elegant workaround is to “mount” input and output directories to the container, by passing these folders with the -v option to the docker run command:

docker run -it -v path/to/inputs -v path/to/outputs myimage /bin/bash

or

docker run -dit -v path/to/inputs -v path/to/outputs myimage

The -v option is an abbreviation for “volume”.
This way, the inputs and outputs directories (which live on the same host as the container) are used directly by the Docker image. If new outputs are produced, they can be added directly to the mounted output directory, and that data will be kept in that directory when exiting / killing the container. It is also worth noting that we don't need to call -v again if we restart the container after killing it.

A side issue with Docker is how to manage user permissions on the outputs a container produces, but 1) that issue arises whether or not we use the -v option, and 2) that is a tale for another post.

Acknowledgements: thanks to Julie Quinn and Bernardo Trindade from this research group, who started exploring Docker right before me, making it that much easier for me to get started. Thanks also to the Cornell-based IT support of the Aristotle cloud, Bennet Wineholt and Brandon Baker.

# Exploring the stability of systems of ordinary differential equations – an example using the Lotka-Volterra system of equations

Stability is important when dealing with dynamical systems because we generally like the systems we make decisions on to be predictable. As such, we'd like to know whether a small change in initial conditions could lead to similar behavior. Do our solutions all tend to the same point? Would slightly different initial conditions lead to the same or to a completely different point for our system?

This blog post will consider the stability of dynamical systems of the form:

dx/dt = F1(x, y)
dy/dt = F2(x, y)

the equilibria of which are denoted by x* and y*, respectively. I will use the example of the Lotka-Volterra system of equations, which is the most widely known model of the many predator-prey/parasite-host interactions encountered in natural systems. The Lotka-Volterra predator-prey equations were discovered independently by Alfred Lotka and Vito Volterra in 1925-26.
Volterra arrived at these equations while trying to explain why, immediately after WWI, the number of predatory fish was much larger than before the war. The system is described by the following equations:

dx/dt = bx - axy
dy/dt = caxy - dy

where a, b, c, d > 0 are the parameters describing the growth, death, and predation of the fish. In the absence of predators, the prey population (x) grows exponentially with an intrinsic rate of growth b. Total predation is proportional to the abundance of prey and the abundance of predators, at a constant predation rate a. New predator abundance is proportional to the total predation (axy) at a constant conversion rate c. In the absence of prey, the predator population decreases at a mortality rate d.

The system demonstrates oscillating behavior, as presented in the following figure for parameters a=1, b=1, c=2, d=1.

Volterra's explanation for the rise in the numbers of predatory fish was that fishing reduces the rate of increase of the prey numbers and thus increases the rate of decrease of the predators; fishing does not change the interaction coefficients. So the number of predators is decreased by fishing, and the number of prey increases as a consequence. Without any fishing activity (during the war), the number of predators increased, which also led to a decrease in the number of prey fish.

To determine the stability of a system of this form, we first need to find its equilibria, i.e., the values of x and y for which:

dx/dt = dy/dt = 0

An obvious equilibrium exists at x=0 and y=0, which kinda means that everything's dead. We'll first look at a system that's still alive, i.e., x>0 and y>0:

x* = d/ca

and

y* = b/a

Looking at these expressions for the equilibria, we can also see that the zero-growth isoclines for the two species are straight lines, given by y = b/a for the prey and x = d/ca for the predator, one horizontal and one vertical in the (x,y) plane.
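The oscillating behavior described above can be reproduced numerically. Below is a minimal sketch using a hand-rolled fixed-step RK4 integrator with the same parameters as the figure (a=1, b=1, c=2, d=1); the initial condition (x0=1, y0=0.5) is my own assumption, since the post does not state one:

```python
import numpy as np

# Lotka-Volterra right-hand side: dx/dt = b*x - a*x*y, dy/dt = c*a*x*y - d*y
a, b, c, d = 1.0, 1.0, 2.0, 1.0

def f(z):
    x, y = z
    return np.array([b * x - a * x * y, c * a * x * y - d * y])

def rk4(z0, dt, n_steps):
    """Integrate with a fixed-step 4th-order Runge-Kutta scheme."""
    traj = np.empty((n_steps + 1, 2))
    traj[0] = z0
    z = np.array(z0, dtype=float)
    for i in range(n_steps):
        k1 = f(z)
        k2 = f(z + 0.5 * dt * k1)
        k3 = f(z + 0.5 * dt * k2)
        k4 = f(z + dt * k3)
        z = z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        traj[i + 1] = z
    return traj

# Assumed initial condition (x0, y0) = (1, 0.5); integrate to t = 50
traj = rk4([1.0, 0.5], dt=0.01, n_steps=5000)
```

Plotting traj over time shows both populations cycling around the coexistence equilibrium (x*, y*) = (d/ca, b/a) = (0.5, 1).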
In dynamical systems, the behavior of the system near an equilibrium relates to the eigenvalues of the Jacobian (J) of F(x,y) at that equilibrium. If the eigenvalues all have negative real parts, then the equilibrium is considered to be a stable node; if the eigenvalues all have positive real parts, then the equilibrium is considered to be an unstable node. In the case of complex eigenvalues, the equilibrium is considered a focus point, and its stability is determined by the sign of the real part of the eigenvalues. I found the following graphic from Scholarpedia to be a useful illustration of these categorizations.

So we can now evaluate the stability of our equilibria. First we calculate the Jacobian of our system:

J = | b - ay    -ax     |
    | cay       cax - d |

and then plug in our estimated equilibrium (x*, y*) = (d/ca, b/a):

J* = | 0    -d/c |
     | cb    0   |

To find the eigenvalues of this matrix we need to find the values of λ that satisfy det(J - λI) = 0, where I is the identity matrix and det denotes the determinant. This gives:

λ² + bd = 0, i.e., λ = ±i√(bd)

Our eigenvalues are therefore complex, with their real parts equal to 0. The equilibrium is therefore a focus point, right between instability and asymptotic stability. What this means for points that start out near the equilibrium is that they neither converge towards it nor diverge away from it. The solutions of this system are therefore periodic, oscillating around the equilibrium point with a period of 2π/√(bd), with no trend either towards the equilibrium or away from it.

One can arrive at the same conclusion by looking at the trace (τ) of the Jacobian and its determinant (Δ). The trace is exactly zero and the determinant is positive (both b, d > 0), which puts the system right in between stability and instability.

Now let's look at the equilibrium where x*=0 and y*=0, a.k.a. total death. The Jacobian there is:

J* = | b    0 |
     | 0   -d |

Both b and d are positive real numbers, which means the eigenvalues (b and -d) are real and of different signs. This makes (0,0) an unstable saddle point.
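The eigenvalue calculation at the coexistence equilibrium can be cross-checked numerically, again with the parameters from the figure (a=1, b=1, c=2, d=1):

```python
import numpy as np

# Parameters from the oscillation example above
a, b, c, d = 1.0, 1.0, 2.0, 1.0

# Jacobian of the Lotka-Volterra system evaluated at the coexistence
# equilibrium (x*, y*) = (d/(c*a), b/a); the diagonal entries vanish there
J = np.array([[0.0, -a * d / (c * a)],
              [c * a * (b / a), 0.0]])

eigvals = np.linalg.eigvals(J)
# The eigenvalues are the purely imaginary pair ±i*sqrt(b*d), so nearby
# orbits are periodic with period 2*pi/sqrt(b*d)
```

With these parameters, sqrt(b*d) = 1, so the eigenvalues come out as ±i and the trace is exactly zero while the determinant (bd) is positive.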
This is important because if the equilibrium of total death were a stable point, initially low population levels would tend to converge towards extinction. The fact that this equilibrium is unstable means that the dynamics of the system make it difficult to achieve total death, and that prey and predator populations could be infinitesimally close to zero and still recover.

Now consider a system where we've somehow killed all the predators (y=0). The prey would continue to grow exponentially with a growth rate b. This is generally unrealistic for real-life systems because it assumes infinite resources for the prey. A more realistic model would have the prey exhibit logistic growth, with a carrying capacity K. The carrying capacity of a biological species is the maximum population size of the species that can be sustained indefinitely given the necessary resources. The model therefore becomes:

dx/dt = bx(1 - x/K) - axy
dy/dt = caxy - dy

where a, b, c, d, K > 0.

To check this system's stability we have to go through the same exercise. The predator equation has remained the same, so x* = d/ca. For zero prey growth:

y* = (b/a)(1 - x*/K)

Calculating the eigenvalues becomes a tedious exercise at this point, and the time of writing is 07:35 PM on a Friday. I'd rather apply a small trick instead and use the isoclines to derive the stability of the system. The isocline for predator zero-growth has remained the same (x = d/ca), which is a straight line (vertical on the (x,y) plane we drew before). The isocline for the prey's zero-growth has changed to:

y = b/a - (b/aK)x

which is again a straight line, with a slope of -b/aK, i.e., it's decreasing when moving from left to right (as the prey increases).

Now looking at the signs in the Jacobian of the first system:

J* = | 0   - |
     | +   0 |

we see no self-dependence for either of the two species (the two 0s), we see that as the predator increases the prey decreases (-), and that as the prey increases the predator increases too (+).
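The same trace/determinant reasoning can be verified numerically for the logistic version. The carrying capacity K=10 below is an arbitrary choice for illustration; the other parameters match the earlier example:

```python
import numpy as np

a, b, c, d, K = 1.0, 1.0, 2.0, 1.0, 10.0

x_eq = d / (c * a)                 # predator zero-growth isocline (unchanged)
y_eq = (b / a) * (1.0 - x_eq / K)  # prey zero-growth isocline at x = x_eq

# Jacobian of the logistic-prey model at the coexistence equilibrium;
# the top-left entry, -b*x_eq/K, is the prey's new negative self-dependence
J = np.array([[-b * x_eq / K, -a * x_eq],
              [c * a * y_eq, 0.0]])

tau = np.trace(J)          # negative trace...
delta = np.linalg.det(J)   # ...and positive determinant imply stability
eigvals = np.linalg.eigvals(J)
```

Both eigenvalues come out with negative real parts, confirming that adding a carrying capacity turns the neutral cycles into a stable equilibrium.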
For our logistic growth the signs in the Jacobian change to:

J* = | -   - |
     | +   0 |

because now there's a negative self-dependence for the prey: as its numbers increase, its rate of growth decreases. This makes the trace (τ) of the Jacobian negative and the determinant positive, which implies that our system is now stable. Plotting the exact same dynamic system, but now including a carrying capacity, we can see how the two populations converge to specific numbers.

# Let your Makefile make your life easier

This semester I'm taking my first official CS class here at Cornell, CS 5220 Applications of Parallel Computers, taught by Dave Bindel (for those of you in the Reed group or at Cornell, I would definitely recommend taking this class, which is offered every other year, once you have some experience coding in C or C++). In addition to the core material we've been learning in the class, I've been learning a lot by examining the structure and syntax of the code snippets written by the instructor and TA that are offered as starting points and examples for our class assignments. One element in particular that stood out to me from our first assignment was the many function calls made through the makefile. This post will first take a closer look at the typical composition of a makefile and then examine how we can harness the structure of a makefile to help improve workflow on complicated projects.

## Dissecting a typical makefile

On the most basic level, a makefile simply consists of a series of rules that each have an associated set of actions. Makefiles are how you use the "make" utility, a software package installed on all Linux systems. Make has its own syntax, similar to bash but with some distinct idiosyncrasies. For example, make allows you to store a snippet of code in what's called a "macro" (these are pretty much analogous to variables in most other languages).
A macro to store the flags you would like to run with your compiler could be defined like this:

CFLAGS = -g -Wall

To reference the CFLAGS macro, use a dollar sign and parentheses, like this:

$(CFLAGS)

There is also a series of "special" predefined macros that can be used in any makefile and are fairly common; you can find them here.

Now that we've discussed makefile syntax, let's take a look at how rules are structured within a makefile. A rule specified by a makefile has the following shape:

target: prerequisites
recipe
...
...

The target is usually the name of the file that is generated by a program, for example an executable or object file. A prerequisite is a specified input used to create the target (a target can often depend on several files). The recipe is the action that make carries out for the intended target (note: every line of the recipe must be indented with a tab).

For example, a rule to build an executable called myProg from a c file called myProg.c using the gcc compiler with flags defined in CFLAGS might look like this:

myProg: myProg.c
    gcc $(CFLAGS) -o $@ $<

(Here $@ and $< are two of those special predefined macros: they expand to the target, myProg, and the first prerequisite, myProg.c, respectively.)

## Make the makefile do the work

The most common rules within makefiles call the compiler to build code (hence the name "makefile"), and many basic makefiles are used for this sole purpose. However, a rule simply sends the series of commands specified by its recipe to the command line, and a rule can actually specify any action or series of actions that you want. A ubiquitous example of a rule that specifies an action is "clean", which may be implemented like this:

clean:
    rm -rf *.o $(PROGRAM)

where PROGRAM is a macro containing the name of the executable compiled by the makefile. In this example, the rule's target is an action rather than an output file. You can call "clean" by simply typing "make clean" into the command line, and you will remove all .o files and the executable PROGRAM from your working directory.

Utilizing this application of rules, we can now have our makefile do a lot of our command line work for us. For example, we could create a rule called "run" which submits a series of PBS jobs to a cluster:

run:
    qsub job-$*.pbs

We can then enter “make run” into the command line to execute this rule, which will submit the .pbs jobs for us (note that this will not perform any of the other rules defined in the makefile). Using this rule may be handy if we have a large number of jobs to submit.

Alternatively, we could make a rule called "plot" which calls a plotting script written in Python:

plot:
    python plotter.py $(PLOTFILES)

where PLOTFILES is a macro containing the names of the files to be plotted and plotter.py is a Python script that takes the file names as input.
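For concreteness, here is a hypothetical sketch of what such a plotter.py could look like; it is not from the original post, and it assumes each input file is a two-column CSV:

```python
# plotter.py -- hypothetical example, not from the CS 5220 makefile.
# Reads each file named on the command line as a two-column CSV (x, y)
# and saves a line plot next to it as a .png.
import sys

import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. on a cluster
import matplotlib.pyplot as plt
import numpy as np


def plot_file(fname):
    data = np.loadtxt(fname, delimiter=",")
    fig, ax = plt.subplots()
    ax.plot(data[:, 0], data[:, 1])
    out = fname.rsplit(".", 1)[0] + ".png"
    fig.savefig(out)
    plt.close(fig)
    return out


if __name__ == "__main__":
    for fname in sys.argv[1:]:
        print(plot_file(fname))
```

Because the script loops over sys.argv, "make plot" hands it the whole $(PLOTFILES) list in one call.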

Those are just two examples (loosely based on a makefile given in CS 5220) of how you can use a makefile to do your command line work for you, but the possibilities are endless!! Ok, maybe that’s getting a bit carried away, but I do find this functionality to be a simple and elegant way to improve the efficiency of your workflow on complex projects.

For some background on makefiles, I'd recommend taking a look at Billy's post from last year. I also found the GNU make user manual helpful, as well as this tutorial from Swarthmore that has some nice example makefiles.