Introduction to Docker

In this post we’ll learn the principles of Docker, and how to use Docker with large quantities of data in input / output.

1. What is Docker?

Docker is a way to build virtual machines from a file called the Docker file. That virtual machine can be built anywhere with the help of that Docker file, which makes Docker a great way to port models and the architecture that is used to run them (e.g., the Cube: yes, the Cube can be ported in that way, with the right Docker file, even though that is not the topic of this post). Building it creates an image (a file), and a container is a running instance of that image, where one can log on and work. By definition, containers are transient and removing does not affect the image.

2. Basic Docker commands

This part assumes that we already have a working Docker file. A docker file runs a series of instructions to build the container we want to work in.

To build a container for the WBM model from a Docker file, let us go to the folder where the Docker file is and enter:

docker build -t myimage -f Dockerfile .

The call docker build means that we want to run a Docker file; -t means that we name, or “tag” our image, here by giving it the name of “myimage”; -f specifies which Docker file we are using, in case there are several in the current folder, and “.” says that we run the Docker file and build the container in the current folder. Options -t and -f are optional in theory, but the tag -t is very important as it gives a name to your built image. If we don’t do that, we’ll have to go through the whole build every time we want to run a Docker container from the Docker file. This would waste a lot of time.

Once the Docker image is built, we can run it. In other words, have a virtual machine running on the computer / cluster / cloud where we are working. To do that, we enter:

docker run -dit myimage

The three options are as follows: -d means that we do not directly enter the container, and instead have it running in the background, while the call returns the containers hexadecimal ID. -i means that we keep the standard input open. Finally, -t is our tag, which is the name of the docker image (here, “myimage”).

We can now check that the image is running by listing all the running images with:

docker ps

In particular, this lists displays a list of hexadecimal IDs associated to each running image. After that, we can enter the container by typing:

 docker exec -i -t hexadecimalID /bin/bash 

where -i is the same as before, but -t now refers to the hexadecimal ID of the tagged image (that we retrieved with docker ps). The second argument /bin/bash simply sets the directory of the shell in a standard way.

Once in the container, we can run all the processes we want. Once we are ready to exit the container, we can exit it by typing… exit.

Once outside of the container, we can re-enter it as long as it still runs. If we want it to stop running, we use the following command to “kill” it (not my choice of words!):

 docker kill hexadecimalID 

A short cut to calling all these commands in succession is to use the following version of docker run:

 docker run -it myimage /bin/bash 

This command logs us onto the image as if we had typed run and exec at the same time (using the shell /bin/bash). Note that option -d is not used in this call. Also note that upon typing exit, we will not only exit the container, but also kill the running Docker image. This means that we don’t have to retrieve its hexadecimalID to log on to the image, nor to kill it.

Even if the container is not running any more, it can be re-started and re-entered by retrieving its hexadecimal ID. The docker ps command only lists running containers, so to list all the containers, including those that are no longer running, we type:

 docker ps -a

We can then restart and re-enter the container with the following commands:


docker restart hexadecimalID

docker exec -it hexadecimalID /bin/bash

Note the absence of options for docker restart. Once we are truly done with a container, it can be removed from the lists of previously running containers by using:

 docker rm hexadecimalID 

Note that you can only remove a container that is not running.

3. Working with large input / output data sets.

Building large quantities of data directly into the container when calling docker build has three major drawbacks. First, building the docker image will take much more time because we will need to transfer all that data every time we call docker build. This will waste a lot of time if we are tinkering with the structure of our container and are running the Docker file several times. Second, every container will take up a lot of space on the disk, which can prove problematic if we are not careful and have many containers for the same image (it is so easy to run new containers!). Third, output data will be generated within the container and will need to be copied to another place while still in the container.

An elegant workaround is to “mount” input and output directories to the container, by calling these folders with the -v option as we use the docker run command:

 docker run -it -v path/to/inputs -v path/to/outputs myimage /bin/bash 

or

 docker run -dit -v path/to/inputs -v path/to/outputs myimage 

The -v option is abbreviation for “volume”. This way, the inputs and outputs directories (set on the same host as the container) are used directly by the Docker image. If new outputs are produced, they can be added directly to the mounted output directory, and that data will be kept in that directory when exiting / killing the container. It is also worth noting that we don’t need to call -v again if we restart the container after killing it.

A side issue with Docker is how to manage user permissions on the outputs a container produces, but 1) that issue arises whether or not we use the -v option, and 2) this is a tale for another post.

Acknowledgements: thanks to Julie Quinn and Bernardo Trindade from this research group, who started exploring Docker right before me, making it that much easier for me to get started. Thanks also to the Cornell-based IT support of the Aristotle cloud, Bennet Wineholt and Brandon Baker.

 

 

 

 

 

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s