Guide to Your First Year in the Reed Research Group

I’m finishing up my first year as an MS/PhD student in the Reed Research Group and I would like to use this blog post to formally list resources within the blog that I found especially useful and relevant to my first year of training. We are also at the point where many of the senior students in the group are moving on to new positions, so I would also like to use this blog post to consolidate tips and tricks that I learned from them that will hopefully be helpful to future students as well.

Blog Posts

There are 315 blog posts on this Water Programming Blog. Chances are, if you have a question, it has already been answered in one of these posts. However, when I first joined the group, it was sometimes hard for me to know what I was even supposed to be searching for. Here are some blog posts that I found particularly useful when I started out or ones that I continue to regularly refer to.

Getting Oriented with the Cube

What even is a cluster? I had no idea when I first arrived but this post brought me up to speed.

Understanding the Terminal

  1. Using MobaXterm as a terminal is incredibly intuitive, especially for someone like me who had rarely touched a terminal in undergrad. MobaXterm allows you to drag and drop files from your computer directly into your directory on the Cube. Furthermore, with the MobaXterm graphical SFTP browser you can navigate through your directories similarly to a Windows environment. I found that it was easier to use other terminal environments like Cygwin after I had gotten used to the terminal through MobaXterm. See Dave’s post here.
  2. Once you are oriented with how the terminal works, the best thing to do is practice navigating using Linux commands. Linux commands can also be very helpful for file manipulation and processing. When I first started training, I was much more comfortable opening text files, for example, in Excel, and making the necessary changes. However, very quickly, I was confronted with manipulating hundreds of text files or set files at a time, which forced me to learn Linux commands. After I learned how to properly use these commands, I wished I had started using them a long time ago. You will work much more efficiently if you start practicing the Linux commands listed in Bernardo’s blog post.
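As a minimal sketch of the kind of batch edit described above, the following Bash snippet creates a few text files in a temporary directory and then changes all of them with one loop. The file names (run_1.txt and so on) and the text being replaced are assumptions made up for illustration; they are not taken from Bernardo’s post.

```shell
#!/bin/bash
# Work in a throwaway directory so nothing real is modified.
WORKDIR=$(mktemp -d)
cd "$WORKDIR" || exit 1

# Create three small example files (hypothetical names and contents).
for n in 1 2 3
do
    printf 'seed=%s\nstatus=draft\n' "$n" > "run_${n}.txt"
done

# Edit every file in one pass: replace "draft" with "final".
# sed writes the edited text to a new file, which then replaces the original.
for f in run_*.txt
do
    sed 's/draft/final/' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

# Count how many files now contain the new text.
NUM_EDITED=$(grep -l 'status=final' run_*.txt | wc -l)
echo "Edited $NUM_EDITED files"
```

The same loop works unchanged for three files or three hundred, which is where the command line starts to pay off compared with opening each file by hand.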

Using Borg and the MOEA Framework

Most of my second semester was spent reproducing Julie Quinn’s Lake Problem paper, which is when I first started to understand how to use Borg. It took me entirely too long to realize that the commands in Jazmin’s tutorials here and here are completely generalizable for any application requiring the MOEA framework or Borg. Since these tutorials are done so early in training, it is very easy to forget that they may be useful later and applied to problems other than DTLZ. I found myself referring back many times to these posts to remember the commands needed to generate a reference set from multiple seeds and how to execute Borg using the correct flags.

Using GitHub, Bitbucket, and Git commands

I had heard GitHub tossed around by CS majors in undergrad but it never occurred to me that I would be using it one day. Now, I have realized what a great tool it is for code version control. If used correctly, it makes sharing code with collaborators so much cleaner and more organized. However, before you can “clone” the contents of anyone’s repository to your own computer, you need an SSH key, which was not obvious to me as a newbie to both GitHub and Bitbucket. You also need a different SSH key for every computer that you use. To generate an SSH key, refer to 2) of this post. Then you can add the generated keys in your profile settings on your GitHub and Bitbucket accounts.

Once you have keys, you can start cloning directories and pushing changes from your local version to the repository that you cloned from using Git commands outlined in this blog post.

Pro Tips

A consolidation of notes that I wrote down from interactions with senior students in the group that have proven to be useful:

  1.  If you can’t get your set files to merge, make sure there is a # sign at the end of each set file.
  2. If a file is too big to view, use the head or tail command to see the first few lines or last lines of a file to get an idea of what the contents of the file look like.
  3. Every time you submit a job, a file with the name of the job script and job number will appear in your directory. If your code crashes and you aren’t sure where to start, this file is a good place to see what might be going on. I was using Borg and couldn’t figure out why it was crashing after just 10 minutes of running because no errors were being returned. When I looked at this file, hundreds of outputs had been printed that I had forgotten to comment out. This had overloaded the system and caused it to crash.
  4. If you want to compile a file or series of files, use the command make. If you have multiple make files in one folder, then you’ll need to use the command make -f followed by the name of the make file you want to use. If you get odd errors when using the make command, try make clean first and then recompile.
  5. Most useful Cube commands:

    qsub to submit a job

    qdel job number if you want to delete a job on the cube

    qsub -I to start an interactive node. If you start an interactive node, you have one node all to yourself. If you want to run something that might take a while but not necessarily warrant submitting a job, then use an interactive node (don’t run anything large on the command line). However, be aware that you won’t be able to use your terminal until your job is done. If you exit out of your terminal, then you will be kicked out of your interactive node.
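Tips 1 and 2 can be tried out with a short Bash sketch. It writes a hypothetical set file (example.set, filled with made-up numbers) into a temporary directory, previews it with head and tail, and then checks whether the last line is a # sign. The file name and contents are assumptions for illustration only.

```shell
#!/bin/bash
WORKDIR=$(mktemp -d)
SETFILE="$WORKDIR/example.set"

# Write 100 made-up solution lines followed by the terminating "#".
for n in $(seq 1 100)
do
    echo "0.$n 0.5 0.25"
done > "$SETFILE"
echo "#" >> "$SETFILE"

# Preview the file without opening all of it.
head -n 3 "$SETFILE"    # first three lines
tail -n 2 "$SETFILE"    # last two lines

# Check that the file ends with a "#" line.
LAST_LINE=$(tail -n 1 "$SETFILE")
if [ "$LAST_LINE" = "#" ]
then
    echo "$SETFILE ends with #"
else
    echo "$SETFILE is missing the final #"
fi
```

Wrapping the last few lines in a loop over all of your set files is a quick way to find the one that is stopping a merge.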

In retrospect, I see just how much I have learned in just one year of being in the research group. When you start, it can seem like a daunting task. However, it is important to realize that all of the other students in the group were in your position at one point. By making use of all the resources available to you and with time and a lot of practice, you’ll get the hang of it!

Setting Up and Customizing Python Environments using Conda

Typing ‘python’ into your command line launches the default global Python environment (which you can change by changing your path) that includes every package you’ve likely installed since the dawn of man (or since you adopted your machine).

But what happens when you are working between Python 2.7 and Python 3.x due to collaboration, using Python 3.4 because the last time you updated your script was four years ago, collaborating with others and want to ensure reproducibility and compatible environments, or banging your head against the wall because that one Python library installation is throwing up errors (shakes fist at PIL/Pillow)?

Creating Python environments is a straightforward solution to save you headaches down the road.

Python environments are a topic that many of us have feared through the years due to ambiguous definitions filled with waving hands. An environment is simply the domain in which users run software or scripts. With this same train of thought, a Python environment is the domain, with all of its installed Python packages, in which a user (you!) executes a script (usually interfacing through an IDE or Terminal/Command Prompt).

However, different scripts will work or fail in different environments. To avoid having to use all of these packages at once or having to completely reinstall Python, what we want to do is create new and independent Python environments. Applications of these environments include:

  • Have multiple versions of Python (e.g. 2.7 and 3.4 and 3.6) installed on your machine at once that you can easily switch between
  • Work with specific versions of packages and ensure they don’t update for the specific script you’re developing
  • Allow for individuals to install the same, reproducible environment between workstations
  • Create standardized environments for seamless collaboration
  • Use older versions of packages to utilize outdated code

Creating Your First Python Environment

One problem that recently arose in Ithaca was that someone was crunching towards deadlines and could only run PIL (Python Imaging Library) on their home machine and not their desktop on campus due to package installation issues. This individual had the following packages they needed to install while using Python 2.7.5:

  • PIL
  • matplotlib
  • numpy
  • pandas
  • statsmodels
  • seaborn

To start, let’s first create an environment! To do this, we will be using Conda (install Anaconda for new users or MiniConda for anyone who doesn’t want their default Python environment to be jeopardized. If you want to avoid using Conda, feel free to explore Pipenv). As a quick note on syntax, I will be running everything in Windows 7 and every command I am using can be found on the Conda Cheatsheet. Only slight variations are required for MacOS/Linux.

First, with your Command Prompt open, type the following command to create the environment we will be working in:

conda create --name blog_pil_example python=2.7.5

[Screenshot: conda_create.PNG]

At this point, a new environment titled blog_pil_example with Python 2.7.5 has been created. Congrats! Don’t forget to take screenshots to add to your new environment’s baby book (or just use the one above if it’s not your first environment).

From here, we need to activate the environment before interacting with it. To see which environments are available, use the following:

conda env list

Now, let’s go ahead and activate the environment that we want (blog_pil_example):

activate blog_pil_example

To leave the environment you’re in, simply use the following command:

deactivate

(For Linux and macOS, put ‘source ’ prior to these commands. Newer versions of Conda use conda activate and conda deactivate on all platforms.)

[Screenshot: conda_env_list.PNG]

We can see in the screenshot above that multiple other environments exist, but the selected/activated environment is shown in parentheses. Note that you’re still navigating through the same directories as before; you’re just selecting and running a different version of Python and installed packages when you’re using this environment.

Building Your Python Environment

(Installing Packages)

Now onto the real meat and potatoes: installing the necessary packages. While you can use pip at this point, I’ve found Conda has run into fewer issues over the past year.  (Read into channel prioritization if you’re interested in where package files are being sourced from and how to change this.) As a quick back to basics, we’re going to install one of the desired packages, matplotlib, using Conda (or pip). Using these ensures that the proper versions of the packages for your environment (i.e. the Python version and operating system) are retrieved. At the same time, all dependent packages will also be installed (e.g. numpy). Use the following command when in the environment and confirm you want to install matplotlib:

conda install matplotlib

Note that you can specify a version much like how we specified the python version above for library compatibility issues:

conda install matplotlib=2.2.0

If you wish to remove matplotlib, use the following command:

conda remove matplotlib

If you wish to update a specific package, run:

conda update matplotlib

Or to update all packages:

conda update --all

Additionally, you can prevent specific packages from updating by creating a pinned file in the environment’s conda-meta directory. Be sure to do this prior to running the command to update all packages! 
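As a rough sketch of what a pinned file looks like, the Bash snippet below writes one into a stand-in directory. With a real environment, ENV_DIR would be the environment’s own folder (the path shown by conda env list); here it is a temporary directory so the snippet can be run safely. The package versions are illustrative assumptions, not the versions used in this tutorial.

```shell
#!/bin/bash
# Stand-in for the environment folder. To pin packages in a real
# environment, replace this with the path listed by `conda env list`.
ENV_DIR=$(mktemp -d)
mkdir -p "$ENV_DIR/conda-meta"

# One package specification per line (versions are illustrative only).
cat > "$ENV_DIR/conda-meta/pinned" <<'EOF'
numpy 1.14.*
matplotlib ==2.2.0
EOF

# Show the result.
cat "$ENV_DIR/conda-meta/pinned"
```

With a file like this in place, updating everything should leave the listed packages within the pinned versions.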

After installing all of the packages that were required at the start of this tutorial, let’s look into which packages are actually installed in this environment:

conda list

[Screenshot: conda_list.PNG]

By only installing the required packages, Conda was kind and installed all of the dependencies at the same time. Now you have a Python environment that you’ve created from scratch and developed into a hopefully productive part of your workflow.

Utilizing Your Python Environment

The simplest way to utilize your newly created Python environment is to run python directly in the Command Prompt above. You can run any script when this environment is activated (shown in the parentheses on the left of the command line) to utilize this setup!

If you want to use this environment in your IDE of choice, you can simply point the interpreter to this new environment. In PyCharm, you can easily create a new Conda Environment when creating a new project, or you can point the interpreter to a previously created environment (instructions here).

Additional Resources

For a good ground-up and more in depth tutorial with visualizations on how Conda works (including directory structure, channel prioritization) that has been a major source of inspiration and knowledge for me, please check out this blog post by Gergely Szerovay.

If you’re looking for a great (and nearly exhaustive) source of Python Packages (both current and previous versions), check out Gohlke’s webpage. To install these packages, download the associated file for your system (32/64 bit and then your operating system) then use pip to install the file (in Command Prompt, navigate to the folder the .whl file is located in, then type ‘pip install <file_name>’). I’ve found that installing packages this way sometimes allows me to step around errors I’ve encountered while using

You can also create environments for R. Check it out here.

If you understand most of the materials above, you can now claim to be environmentally conscious!

A Deeper Dive into Principal Component Analysis

This post is meant to be a continuation of Dave Gold’s introductory post on Principal Component Analysis, which is an excellent explanation on how to conduct a PCA and visualize the principal components. The goal of this post is to elaborate on how to proceed after you have conducted a PCA and to address some common questions and concerns associated with the method.

Performing a PCA in R

Often times, you will perform a PCA on large datasets that contain many variables and/or many observations. One such dataset that will be used as an example is the Living Blended Drought Atlas (LBDA) which is a reconstruction of the Palmer Drought Severity Index (PDSI) over the contiguous United States from 1473-2005. This dataset contains 4968 columns, or variables, each of which is a grid cell over the U.S., and 533 rows, each of which is a yearly observation. We will call this dataset, X. In matrix notation, we will denote the PCA analysis formula as:

U=XW    (1)

where X is the dataset, W is the weighting matrix, whose columns are the key patterns in the data, and U is the matrix whose columns are the resulting principal components (PCs). You can perform a PCA on this dataset with a single function in R, prcomp.

 PCA=prcomp(X, scale=TRUE)   # use scale=FALSE to skip scaling

The first input into the function is your data matrix and the second input is used to declare if your dataset should be scaled to have a unit variance before the PCA is conducted. There are various other inputs into the function, listed here, that can be included if necessary.

prcomp returns a list that we have called “PCA”. The three sets of results in it that we will use here are:

  1. sdev: the standard deviations of the principal components (if you square them, you get the eigenvalues of the covariance/correlation matrix)
  2. rotation: the loading matrix whose columns are the eigenvectors (W in the equation above)
  3. x: the rotated data or your PCs (The columns of U in the equation above)

And that’s it! You have the results of the PCA. Now comes the more difficult part: interpreting them.

How do I choose how many PCs to keep?

The dimensions associated with equation (1) are as follows:

    U=XW

(nxk)=(nxk)(kxk) 

If the number of observations is much larger than the number of variables in the dataset, i.e. n>>k, then the PCA will return k distinct eigenvectors. In our case, since n is smaller than our number of variables, the PCA will return at most n eigenvectors with non-zero eigenvalues. Either way, we have many supposedly distinct patterns. How do we decide how many of those patterns to keep?

The answer is not always clear and most often subjective and case-dependent. One common tool used is a scree plot or an eigenvalue spectrum.

Scree Plot

Each column of our W matrix is a distinct, independent pattern, also called an empirical orthogonal function (EOF). Each EOF is responsible for explaining some amount of variance in the dataset. A scree plot, shown in Figure 1, allows you to visualize this variance breakdown.

Figure 1: Scree Plot

 

On the x-axis of the scree plot is the EOF (we’ve chosen to keep 10) and on the y-axis is the variance explained by that EOF. Each variance is equivalent to the eigenvalue associated with its respective eigenvector. You can find the eigenvalues/variances by squaring the values of sdev returned by prcomp.
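In symbols, the quantities behind the scree plot can be written as follows. The notation is an assumption introduced here for illustration: s_i is the i-th entry of sdev, \lambda_i is the i-th eigenvalue, and f_i is the fraction of the total variance explained by the i-th EOF.

```latex
\lambda_i = s_i^{2}, \qquad
f_i = \frac{\lambda_i}{\sum_{j} \lambda_j}
```

The sum runs over all eigenvalues returned by the PCA, so the f_i add up to one. Plotting either \lambda_i or f_i against i gives the same shape, and therefore the same elbow.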

Generally speaking, one can look for the “elbow” of the scree plot to determine at what point to truncate the EOFs and retain up to the EOF before the elbow. At about the 5th EOF, the graph starts to level off and all subsequent EOFs start to contribute about the same amount of variance. Therefore, the elbow of the graph is located at about the 5th EOF and you will retain the first 4 EOFs.

North’s Rule of Thumb

North’s Rule of Thumb is a more precise way of truncating and involves creating confidence intervals around your estimates of the variance. The rule states that you should truncate EOFs only when the confidence intervals of the variances start to intersect. At this point, the eigenvectors are considered too close to be interpretable and spacing might be due to sampling error rather than a clear distinction between the variances [1].
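The rule is commonly stated (following North et al., 1982) with an approximate sampling error for each eigenvalue. The form below is that commonly quoted version rather than something spelled out in this post, and N here is assumed to be the number of independent samples, which may be smaller than the number of rows in X if the observations are correlated in time.

```latex
\delta\lambda_i \approx \lambda_i \sqrt{\frac{2}{N}}, \qquad
\text{interval for EOF } i:\;
\left[\lambda_i - \delta\lambda_i,\; \lambda_i + \delta\lambda_i\right]
```

When the interval for EOF i overlaps the interval for EOF i+1, the two eigenvalues cannot be told apart within sampling error, which is the point at which the rule suggests truncating.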

Rotated EOFs

At some point, your EOFs might start to exhibit patterns that can be hard to interpret or attribute to a physical phenomenon. It is not uncommon for these types of patterns to result from pure noise in your data, especially if you are analyzing later EOFs that explain a very small amount of the variance [2].

Rotating EOFs is a practice that is done to simplify the patterns obtained in EOFs and make them more interpretable. A varimax orthogonal rotation can be used to determine an optimum rotation matrix that maximizes the variance in the columns of W. The variance of the columns is maximized by driving some of the loadings to zero and trying to maximize the values of other loadings. In R, this is done using the varimax function.

 my.varimax=varimax(PCA$rotation[,1:10]) 

In the above command, my input is the first 10 EOFs from the original weighting matrix. The result, my.varimax, is a list with the following components:

  1. loadings: the resulting rotated loading matrix
  2. rotmat: the rotation matrix

The new loading matrix, Wrot, is still orthogonal after the rotation, and its columns (the rotated EOFs) are, therefore, still orthonormal. However, multiplication of the rotated loading matrix by the original dataset, X, to obtain a new U, results in principal components that are not guaranteed to be independent. This can be seen through further inspection of the correlation matrix associated with U. Unfortunately, this is a tradeoff associated with obtaining EOFs that are simpler and easier to interpret.
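A short derivation shows where the tradeoff comes from. The notation is assumed here for illustration: W_m holds the first m columns of W (m = 10 in the command above), R is the orthogonal rotation matrix (rotmat), S is the covariance matrix of the centered (and, if requested, scaled) data X, and \Lambda_m is the diagonal matrix of the first m eigenvalues.

```latex
\begin{aligned}
W_{rot} &= W_m R, \qquad R^{\top} R = I \\
W_{rot}^{\top} W_{rot} &= R^{\top} W_m^{\top} W_m R = R^{\top} R = I \\
U_{rot} &= X W_{rot} \\
\operatorname{Cov}(U_{rot}) &= W_{rot}^{\top} S \, W_{rot} = R^{\top} \Lambda_m R
\end{aligned}
```

The second line confirms that the rotated loadings stay orthonormal. The last line is diagonal only in special cases (for example, when R just reorders or flips the sign of columns), so in general the rotated principal components are correlated with each other.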

Sources:

[1] http://yyy.rsmas.miami.edu/users/bmapes/teaching/MPO581_2011/EOF_chapter_DelSole.pdf

[2] Hannachi, A., Jolliffe, I.T., and Stephenson, D.B. (2007), Empirical orthogonal functions and related techniques in atmospheric science: A review, International Journal of Climatology, 27, 1119-1152.

*All information or figures not specifically cited came from class notes and homework from Dr. Scott Steinschneider’s class, BEE 6300: Environmental Statistics 

Simple Bash shell scripts that have made my life easier

I’ve recently been using Bash shell scripts to improve the efficiency of my workflow when working on Linux systems and I thought I would share some of them here. I’m fairly new to Linux, so this post is not meant to be a comprehensive guide on how to write shell scripts; rather, I hope the scripts in this post can serve as examples for those who may also be learning Linux and are unsure of where or how to start writing shell scripts. I didn’t write any of these from scratch; most of the scripts are based off files shared with me by group members Julie Quinn, Bernardo Trindade and Jazmin Zatarian Salazar. If you’re interested in learning more about any of the commands used in these scripts I’ve put some references I found useful at the end of this post. If you’re more experienced in writing shell scripts, please feel free to put tips or suggestions in the comments.

1. A simple script for making directories

For my research I’m processing results of a Monte Carlo simulation for several solutions found through multi-objective search and I needed to make folders in several locations to store the output from each solution. My first instinct was to make each directory separately using the mkdir command in the command line, but this quickly got tedious. Instead I used a bash script to loop through all the solution numbers and create a new directory for each. For more on using loops in Bash, check out this reference.

#!/bin/bash

# This script will create directories named "Solution_<number>" for
# a set of numbered solutions

# specify solution numbers
SOLUTIONS=('162' '1077' '1713' '1725' '1939' '2191' '2290' '2360')

# create a variable to store the string "Solution_"
DIRECTORY="Solution_"

# loop over solution numbers
for i in "${SOLUTIONS[@]}"
do
    # create a separate directory for each solution
    mkdir "${DIRECTORY}${i}"
done

2. Calling a Java function and saving the output

The MOEA framework is a tool written in Java with all sorts of cool functions. I used it to generate 1024 Latin hypercube samples across a given range for each of the 8 solutions mentioned above. Using a shell script allows you to easily set up the arguments needed for the MOEA framework, call the Java function and save the output to your desired file format. The MOEA framework’s tool spits out a .txt file, but this script uses the “sed” command to save it as a .csv file. More on “sed” can be found in the reference at the end of this post.

#!/bin/bash
# this shell script will call the MOEA framework's Latin Hypercube
# Sampling tool to create 1024 samples from a set of
# prespecified ranges for each of 8 solutions

# create variables to store Java arguments
JAVA_ARGS="-Xmx1g -classpath MOEAFramework-1.16-Executable.jar"
NUM_SAMPLES=1024
METHOD=latin

# these are the solutions we will create samples from
SOLUTIONS=('162' '1077' '1713' '1725' '1939' '2191' '2290' '2360')

# loop through solutions
for i in ${SOLUTIONS[@]}
do
    # define names for input (ranges) and output file names
    RANGES_FILENAME=${i}ranges.txt
    OUTPUT_FILENAME=Solution${i}_Samples.txt
    CSV_FILENAME=Solution${i}_Samples.csv

    # Call MOEA framework from JAVA using specified arguments to
    # create LHS Samples, specify OUTPUT_FILENAME as output
    java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.SampleGenerator -m ${METHOD} -n ${NUM_SAMPLES} -p ${RANGES_FILENAME} -o ${OUTPUT_FILENAME}

    # Use the sed command to create a new comma separated values file
    # from original output .txt file
    sed 's/ /,/g' ${OUTPUT_FILENAME} > ${CSV_FILENAME} 

    # remove .txt files
    rm $OUTPUT_FILENAME
done
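To see what the sed step does on its own, here is a small self-contained sketch. It uses a made-up two-line sample file in a temporary directory instead of real MOEA framework output, so the file contents are assumptions for illustration.

```shell
#!/bin/bash
WORKDIR=$(mktemp -d)
TXT_FILE="$WORKDIR/Solution162_Samples.txt"
CSV_FILE="$WORKDIR/Solution162_Samples.csv"

# Two made-up, space-separated sample rows.
printf '0.1 0.2 0.3\n0.4 0.5 0.6\n' > "$TXT_FILE"

# s/ /,/g replaces every space with a comma
# (the trailing g means "all matches on each line", not just the first).
sed 's/ /,/g' "$TXT_FILE" > "$CSV_FILE"

# Show the converted file.
cat "$CSV_FILE"
```

Because sed only reads the input file and the result is redirected to a new file, the original .txt file is left untouched until it is explicitly removed.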

3. A piping example

Piping allows you to link together programs by making the output from one program or function the input to another. The script below was originally written by my friend Shrutarshi Basu for a class project we were working on together. This script is made to process the output from the Borg MOEA for 9 random seeds of the DTLZ2 benchmarking problem across several different algorithmic configurations, seen in the code as “masters” (for more on this see Jazmin’s post here). In addition to calling Java tools from the MOEAframework, Basu uses piping to link the Linux commands “tac”, “sed”, “grep” and “cut”.  For more on each of these commands, see the links at the bottom of this post.


#!/bin/bash
# Usage: pass the index of the last master as the first argument ($1)

# loop over seeds (note: {0..9} expands to ten values, 0 through 9)
for i in {0..9}
do
    obj=DTLZ2_S${i}.obj
    output=dtlz2.volume

    # loop over masters, numbered 0 through $1
    for m in $(seq 0 "$1")
    do
        runtime=DTLZ2_S${i}_M${m}.runtime
        mobj=DTLZ2_S${i}_M${m}.obj

        # extract objectives from output:
        # reverse the file, keep everything up to the first "//" line,
        # drop the "//" lines, keep columns 15-19, restore the order
        echo "Extracting objectives"
        tac "${runtime}" | sed -n '1,/\/\// p' | grep -v "//" | cut -d' ' -f15-19 | tac > "${mobj}"
    done

    # combine objectives into one file
    echo "Combining objectives"
    java -cp ../../moea.jar org.moeaframework.analysis.sensitivity.ResultFileSeedMerger \
        -d 5 -e 0.01,0.01,0.01,0.01,0.01 \
        -o "${obj}" DTLZ2_S${i}_M*.obj

    # calculate the hypervolume
    echo "Finding final hypervolume"
    hvol=$(java -cp ../../moea.jar HypervolumeEval "${obj}")

    printf "%s %s\n" "$i" "$hvol" >> "${output}"
    echo "Done with seed $i"
done
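The piped line in the script is easier to follow on a tiny example. The sketch below runs the same chain of commands on a made-up runtime file with two snapshots separated by “//” lines. The file contents and the choice of columns 2-3 as the objectives are assumptions for illustration; real Borg runtime files have more columns. Note that tac is part of GNU coreutils, so this assumes a Linux system.

```shell
#!/bin/bash
WORKDIR=$(mktemp -d)
RUNTIME_FILE="$WORKDIR/example.runtime"

# Made-up runtime file: two snapshots, each introduced by a "//" line.
# Column 1 stands in for a decision variable, columns 2-3 for objectives.
cat > "$RUNTIME_FILE" <<'EOF'
//NFE=100
1 0.9 0.8
2 0.7 0.6
//NFE=200
3 0.5 0.4
4 0.3 0.2
EOF

# tac      reverses the line order, so the last snapshot comes first
# sed -n   prints from line 1 down to the first "//" line it meets
# grep -v  drops the "//" line itself
# cut      keeps only columns 2-3 (space-delimited)
# tac      puts the remaining lines back in their original order
LAST_OBJS=$(tac "$RUNTIME_FILE" | sed -n '1,/\/\// p' | grep -v "//" | cut -d' ' -f2-3 | tac)
echo "$LAST_OBJS"
```

Each command only sees the output of the one before it, so the chain can be built up and checked one stage at a time by running it with the later stages removed.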

Additional References and Links

 

A completely non-exhaustive list of tutorial resources for scientific computing

This is a short blog post to put together a list of resources with tutorials (similar to what’s usually found on this blog) for various programming languages. It is by no means exhaustive, so please comment if you feel there’s an important one I left out.

Matlab:

https://www.arnevogel.com/

Kinda new blog with Matlab tutorials on numerical methods

https://blogs.mathworks.com/loren/

One of the MathWorks blogs, great tutorials.

https://www.mathworks.com/support/learn-with-matlab-tutorials.html

Matlab tutorials from MathWorks

http://undocumentedmatlab.com/

More Matlab tutorials, a lot of material on many topics

Python:

https://glowingpython.blogspot.com/

Not updated very frequently, but good data analysis and visualization tutorials are posted

https://pythonprogramming.net/

Updated regularly, some great data visualization and analysis tutorials. The tutorials come with videos. I really like this site.

http://treyhunner.com/

Updated regularly (about once a month) with python tutorials and general python coding insights. I really like the writing style of this author. 

https://docs.scipy.org/doc/scipy/reference/tutorial/

Tutorials on using SciPy

C and C++:

https://www.cprogramming.com/

C and C++ programming tutorials, tips and tricks

http://www.cplusplus.com/articles/

Not really updated anymore but some good basic tutorials are listed

https://blog.knatten.org/

Hadn’t been updated in a while, but it looks like it’s been picked up again. Good for general C++ programming and good practice.

http://www.bfilipek.com/

C++ tutorials

General:

https://towardsdatascience.com/

This is a great general resource not devoted to a particular language. They cover topics in data science, machine learning and programming in general. Some great tutorials for Python and R. 

https://projecteuler.net/

Mathematical programming problems to help with learning any language

https://github.com/EbookFoundation/free-programming-books/blob/master/free-programming-books.md

Free programming books repository

Reddits (some of the bigger ones):

/r/matlab (General on matlab, community provides help with coding)

/r/programming (General on programming)

/r/learnprogramming (Community provides help with debugging questions)

/r/python (General on python, community provides help with coding)

/r/learnpython (Community provides help with python questions, smaller than /r/python)

/r/cpp (General on C++)

/r/cpp_questions (Community provides help with C++ questions)

I’ve also recently made /r/sci_comp which has very little activity for the moment, but the aim is to create a community with general resources on coding for scientific applications.

 

Python example: Hardy Cross method for pipe networks

I teach a class called Water Resources Engineering at the University of Colorado Boulder, and I have often added in examples where students use Python to perform the calculations.  For those of you who are new to programming, you might get a kick out of this example since it goes through an entire procedure of how to solve an engineering problem using Python, and the numpy library.  The example is two YouTube videos embedded below! The best way to watch them, though, is in full screen, which you can get to by clicking the name of the video, opening up a new YouTube tab in your browser.

Let us know if you have any questions in the comments section below. For more on Python from this blog, search “Python” in the search box.

Introduction to Docker

In this post we’ll learn the principles of Docker, and how to use Docker with large quantities of data in input / output.

1. What is Docker?

Docker is a way to build lightweight, virtual-machine-like environments, called containers, from a file called the Docker file. That environment can be built anywhere with the help of that Docker file, which makes Docker a great way to port models and the architecture that is used to run them (e.g., the Cube: yes, the Cube can be ported in that way, with the right Docker file, even though that is not the topic of this post). Building it creates an image (a file), and a container is a running instance of that image, where one can log on and work. By definition, containers are transient and removing one does not affect the image.

2. Basic Docker commands

This part assumes that we already have a working Docker file. A Docker file contains a series of instructions used to build the image for the container we want to work in.

To build an image for the WBM model from a Docker file, let us go to the folder where the Docker file is and enter:

docker build -t myimage -f Dockerfile .

The call docker build means that we want to build an image from a Docker file; -t means that we name, or “tag”, our image, here by giving it the name “myimage”; -f specifies which Docker file we are using, in case there are several in the current folder; and “.” sets the current folder as the build context, i.e. the folder whose contents are available to the Docker file during the build. Options -t and -f are optional in theory, but the tag -t is very important as it gives a name to your built image. Without a name, the image can only be referred to by its hexadecimal ID, which makes it much harder to find and reuse; rebuilding it every time we want to run a container would waste a lot of time.

Once the Docker image is built, we can run it. In other words, we can have a container running on the computer / cluster / cloud where we are working. To do that, we enter:

docker run -dit myimage

The three options are as follows: -d means that we do not directly enter the container, and instead have it running in the background, while the call returns the container’s hexadecimal ID. -i means that we keep the standard input open. Finally, -t allocates a terminal for the container (unlike the -t of docker build, it is not a tag); the name of the Docker image to run (here, “myimage”) is given as the last argument.

We can now check that the container is running by listing all the running containers with:

docker ps

In particular, this displays a list of the hexadecimal IDs associated with each running container. After that, we can enter the container by typing:

 docker exec -i -t hexadecimalID /bin/bash 

where -i is the same as before and -t again allocates a terminal; the container to enter is identified by its hexadecimal ID (that we retrieved with docker ps). The last argument, /bin/bash, is the command to run inside the container, here a standard Bash shell.

Once in the container, we can run all the processes we want. Once we are ready to exit the container, we can exit it by typing… exit.

Once outside of the container, we can re-enter it as long as it still runs. If we want it to stop running, we use the following command to “kill” it (not my choice of words!):

 docker kill hexadecimalID 

A short cut to calling all these commands in succession is to use the following version of docker run:

 docker run -it myimage /bin/bash 

This command logs us onto a new container as if we had typed run and exec at the same time (using the shell /bin/bash). Note that option -d is not used in this call. Also note that upon typing exit, we will not only exit the container, but also stop it. This means that we don’t have to retrieve its hexadecimal ID to log on to the container, nor to kill it.

Even if the container is not running any more, it can be re-started and re-entered by retrieving its hexadecimal ID. The docker ps command only lists running containers, so to list all the containers, including those that are no longer running, we type:

 docker ps -a

We can then restart and re-enter the container with the following commands:


docker restart hexadecimalID

docker exec -it hexadecimalID /bin/bash

Note the absence of options for docker restart. Once we are truly done with a container, it can be removed from the lists of previously running containers by using:

 docker rm hexadecimalID 

Note that you can only remove a container that is not running.

3. Working with large input / output data sets.

Building large quantities of data directly into the container when calling docker build has three major drawbacks. First, building the docker image will take much more time because we will need to transfer all that data every time we call docker build. This will waste a lot of time if we are tinkering with the structure of our container and are running the Docker file several times. Second, every container will take up a lot of space on the disk, which can prove problematic if we are not careful and have many containers for the same image (it is so easy to run new containers!). Third, output data will be generated within the container and will need to be copied to another place while still in the container.

An elegant workaround is to “mount” input and output directories to the container, by passing these folders with the -v option as we use the docker run command. Each -v argument pairs a directory on the host with the location where it should appear inside the container, in the form host_path:container_path:

 docker run -it -v /path/to/inputs:/inputs -v /path/to/outputs:/outputs myimage /bin/bash 

or

 docker run -dit -v /path/to/inputs:/inputs -v /path/to/outputs:/outputs myimage 

Here /inputs and /outputs are example locations inside the container, and the host paths should be absolute. The -v option is an abbreviation for “volume”. This way, the inputs and outputs directories (set on the same host as the container) are used directly by the container. If new outputs are produced, they can be added directly to the mounted output directory, and that data will be kept in that directory when exiting / killing the container. It is also worth noting that we don’t need to call -v again if we restart the container after killing it.
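As a hedged sketch of how the mount arguments fit together, the Bash snippet below assembles a docker run command in which each -v argument takes the form host_path:container_path, with an absolute path on the host side. It only prints the command rather than running it, so it can be tried on a machine without Docker. The folder names (inputs, outputs) and the mount points inside the container (/inputs, /outputs) are hypothetical.

```shell
#!/bin/bash
# Hypothetical host folders, made absolute with $PWD, and hypothetical
# mount points inside the container.
HOST_INPUTS="$PWD/inputs"
HOST_OUTPUTS="$PWD/outputs"
CONTAINER_INPUTS="/inputs"
CONTAINER_OUTPUTS="/outputs"
IMAGE="myimage"

# Build the command as plain text (this simple form assumes the paths
# contain no spaces).
DOCKER_CMD="docker run -it -v ${HOST_INPUTS}:${CONTAINER_INPUTS} -v ${HOST_OUTPUTS}:${CONTAINER_OUTPUTS} ${IMAGE} /bin/bash"

# Dry run: print the command so it can be inspected before use.
echo "$DOCKER_CMD"
```

Printing the command first is a cheap way to check that the host and container paths are paired the way you intended before any data is written.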

A side issue with Docker is how to manage user permissions on the outputs a container produces, but 1) that issue arises whether or not we use the -v option, and 2) this is a tale for another post.

Acknowledgements: thanks to Julie Quinn and Bernardo Trindade from this research group, who started exploring Docker right before me, making it that much easier for me to get started. Thanks also to the Cornell-based IT support of the Aristotle cloud, Bennet Wineholt and Brandon Baker.