In this post we’ll learn the principles of Docker, and how to use Docker with large input / output data sets.
1. What is Docker?
Docker is a way to build container environments, similar to lightweight virtual machines, from a file called the Dockerfile. That environment can be rebuilt anywhere with the help of that Dockerfile, which makes Docker a great way to port models and the architecture that is used to run them (e.g., the Cube: yes, the Cube can be ported in that way, with the right Dockerfile, even though that is not the topic of this post). Building creates an image (a file), and a container is a running instance of that image, where one can log on and work. By design, containers are transient, and removing a container does not affect the image.
2. Basic Docker commands
This part assumes that we already have a working Dockerfile. A Dockerfile contains a series of instructions that Docker runs to build the image our containers will be based on.
To build an image for the WBM model from a Dockerfile, let us go to the folder where the Dockerfile is and enter:
docker build -t myimage -f Dockerfile .
docker build means that we want to build an image from a Dockerfile;
-t means that we name, or “tag”, our image, here by giving it the name “myimage”;
-f specifies which Dockerfile we are using, in case there are several in the current folder;
“.” says that the build context is the current folder, which is where Docker looks for the Dockerfile and the files the build needs.
The -f option is optional in theory (by default, Docker looks for a file named Dockerfile), but the tag -t is very important, as it gives a name to your built image. If we don’t tag the image, we’ll have to track down its auto-generated hexadecimal ID every time we want to run a container from it. This would waste a lot of time.
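For reference, a Dockerfile for a simple Python-based model might look like the sketch below. The base image, requirements file, and layout are illustrative assumptions, not the actual WBM setup:

```dockerfile
# Illustrative Dockerfile; base image and packages are assumptions,
# not the actual WBM setup.
FROM python:3.10-slim

# Install the model's dependencies (hypothetical requirements file)
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Copy the model code into the image and set the working directory
COPY . /app
WORKDIR /app
```

Each instruction adds a layer to the image, which is why the build is fast to re-run when only the last steps change.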
Once the Docker image is built, we can run it. In other words, have a container running on the computer / cluster / cloud where we are working. To do that, we enter:
docker run -dit myimage
The three options are as follows:
-d means that we do not directly enter the container, and instead have it running in the background, while the call returns the container’s hexadecimal ID.
-i means that we keep the standard input open.
-t allocates a pseudo-terminal for the container. Note that the image name (here, “myimage”) is not an option but the last argument of the command.
We can now check that the container is running by listing all the running containers with:
docker ps
In particular, this displays a list of hexadecimal IDs associated with each running container. After that, we can enter the container by typing:
docker exec -i -t hexadecimalID /bin/bash
-i is the same as before, and -t again gives us a pseudo-terminal to type in. hexadecimalID is the ID of the container that we retrieved with docker ps. The last argument, /bin/bash, is the command to run inside the container, here a standard shell.
Once in the container, we can run all the processes we want. When we are ready to leave the container, we exit it by typing exit.
Once outside of the container, we can re-enter it as long as it is still running. If we want it to stop running, we use the following command to “kill” it (not my choice of words!):
docker kill hexadecimalID
A shortcut for calling all these commands in succession is to use the following version of docker run:
docker run -it myimage /bin/bash
This command logs us onto the container as if we had typed run and exec at the same time (using the shell
/bin/bash). Note that option
-d is not used in this call. Also note that upon typing
exit, we will not only leave the container, but also stop it, since the shell is the container’s main process. This means that we don’t have to retrieve its hexadecimal ID to log on to the container, nor to kill it.
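Putting the two workflows side by side (assuming the image has been built and tagged as myimage, and with hexadecimalID standing in for the actual ID reported by docker ps):

```shell
# Background workflow: run detached, enter, exit, then kill by ID
docker run -dit myimage
docker ps                                # look up the container's ID
docker exec -it hexadecimalID /bin/bash
exit                                     # leaves the container running
docker kill hexadecimalID

# Shortcut workflow: run and enter in one step
docker run -it myimage /bin/bash
exit                                     # also stops the container
```

The background workflow is the better fit when long processes should keep running after we log off; the shortcut is convenient for quick interactive sessions.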
Even if the container is not running any more, it can be re-started and re-entered by retrieving its hexadecimal ID. The
docker ps command only lists running containers, so to list all the containers, including those that are no longer running, we type:
docker ps -a
We can then restart and re-enter the container with the following commands:
docker restart hexadecimalID
docker exec -it hexadecimalID /bin/bash
Note the absence of options for
docker restart. Once we are truly done with a container, it can be removed from the list of stopped containers by using:
docker rm hexadecimalID
Note that you can only remove a container that is not running.
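If stopped containers accumulate (it is easy to run new ones and forget about them), they can be removed in one go by feeding all container IDs to docker rm. The -q option makes docker ps print only the IDs:

```shell
# Remove all stopped containers; running ones are skipped with an error
docker rm $(docker ps -a -q)
```

This is a common cleanup pattern; only use it once any container you still care about is safely tagged as an image or has its outputs saved.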
3. Working with large input / output data sets
Building large quantities of data directly into the image when calling
docker build has three major drawbacks. First, building the Docker image will take much more time, because we will need to transfer all that data every time we call docker build. This will waste a lot of time if we are tinkering with the structure of our container and are running the Dockerfile several times. Second, the resulting image will take up a lot of space on the disk, which can prove problematic if we are not careful and keep several large images around. Third, output data will be generated within the container and will need to be copied to another place while still in the container, or it will be lost when the container is removed.
An elegant workaround is to “mount” input and output directories into the container, by passing these folders with the
-v option as we use the
docker run command:
docker run -it -v /path/to/inputs:/inputs -v /path/to/outputs:/outputs myimage /bin/bash
docker run -dit -v /path/to/inputs:/inputs -v /path/to/outputs:/outputs myimage
Each -v argument takes the form host_path:container_path, so the host folder on the left appears inside the container at the path on the right.
The -v option is an abbreviation for “volume”. This way, the input and output directories (which live on the same host as the container) are used directly by the container. If new outputs are produced, they can be written directly to the mounted output directory, and that data will remain on the host when exiting / killing the container. It is also worth noting that we don’t need to pass
-v again if we restart the container after killing it: the mounts are preserved.
A side issue with Docker is how to manage user permissions on the outputs a container produces, but 1) that issue arises whether or not we use the
-v option, and 2) this is a tale for another post.
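As an illustration, here is a sketch of the full input / output workflow. The paths /data/inputs and /data/outputs are hypothetical host folders, and run_model and results.csv are hypothetical names standing in for the actual model executable and its output:

```shell
# Mount hypothetical host folders into the container
docker run -it -v /data/inputs:/inputs -v /data/outputs:/outputs myimage /bin/bash

# Inside the container, the model reads from /inputs and writes to /outputs:
#   ./run_model /inputs/forcing.csv > /outputs/results.csv

# Back on the host, after exiting (or even killing) the container,
# /data/outputs/results.csv is still there.
exit
```

The key point is that nothing written under /outputs lives in the container’s own writable layer, so no copying step is needed before the container is stopped or removed.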
Acknowledgements: thanks to Julie Quinn and Bernardo Trindade from this research group, who started exploring Docker right before me, making it that much easier for me to get started. Thanks also to the Cornell-based IT support of the Aristotle cloud, Bennet Wineholt and Brandon Baker.