Containerizing your code for HPC (Docker + Singularity)

If you’ve ever run into software dependency issues when trying to download and run a computer model on a new machine, you’re in good company. Dependency issues are a common headache for modelers, whether you are trying to run someone else’s code or share your own code with collaborators. Problems can arise due to differences in operating system (e.g., Windows vs macOS vs Ubuntu), compilers (e.g., GNU vs Intel), software versions (e.g., Python 3.3 vs 3.10), or other factors. These challenges are exacerbated on high performance computing (HPC) clusters, which often have highly specialized architectures and software environments that are beyond our control. For example, I regularly develop code on my laptop, test parallel jobs on TheCube cluster at Cornell, and finally run much larger production jobs on the Stampede2 cluster at TACC. Ensuring compatibility across these three environments can be a challenge, especially for models involving complex software interdependencies and/or MPI parallel communication. Thankfully, containers offer an elegant solution to many of these challenges.

Containerization tools such as Docker allow you to bundle an application into a self-contained, ready-to-run package. The container includes its own operating system and all libraries and dependencies needed to run your code. The application then runs inside this custom container, isolated from the particular machine you are running it on, which makes it much simpler to build portable applications. This benefits researchers by letting you spend more time on scientific research and model building and less time installing finicky software dependencies. It also encourages “open science” and collaboration by streamlining the process of sharing and borrowing code.

However, HPC clusters present some unique challenges to the “build once, run anywhere” mantra of containerization. First, Docker containers generally require root permissions, which are disallowed on shared supercomputers. This challenge can be overcome by using a second containerization tool called Singularity. Second, when running parallel applications with MPI, or running applications on GPUs, the container must be built in a way that is consistent with the hardware and software setup of the cluster. This challenge is harder to address, and does reduce the portability of the container somewhat, but as we shall see it can still be overcome with careful container design.

This blogpost will demonstrate how to build containers to run parallelized code on HPC clusters. We will build two different containers: one which can be run on two different TACC clusters, Stampede2 and Frontera, and one which can be run on two different Cornell Center for Advanced Computing (CAC) clusters, TheCube and Hopper. Both of these can also be run on a personal laptop. If you want to follow along with this post and/or containerize your own code, the first thing you will need to do is to download and install Docker Desktop and set up an account on DockerHub. You can find instructions for doing that, as well as a lot more great information on containerization and HPC, in this documentation from the Texas Advanced Computing Center (TACC). You can also find more information on Docker in earlier Water Programming Blog posts by Lillian Lau, Rohini Gupta, and Charles Rouge.

Containerizing a Cython reservoir model with Docker

All code related to this blog post can be found in the “WPB_containerization” directory of my “cython_tutorial” GitHub repository, here. If you want to follow along, you can clone this repository and navigate in your terminal to the WPB_containerization directory. This directory contains a simple reservoir simulation model which is built using Cython, a tool that converts Python-like code into typed, compiled C code, which can significantly improve execution time. Cython is beyond the scope of this blog post (though I may cover it in a future post, so stay tuned!), but the interested reader can find an overview in the base directory of this GitHub repository, and references therein. For this post, suffice it to say that Cython introduces non-trivial software dependencies and build complexities beyond a more standard Python application, which makes it a good candidate for containerization. Additionally, this directory contains scripts for running parallel simulations across multiple nodes on an HPC cluster using MPI, which introduces additional containerization challenges.

The first container we will build is for the TACC clusters Stampede2 and Frontera. The main step involved in building a Docker container is writing the Dockerfile (see the file “Dockerfile_tacc”), which outlines the steps needed to build the container:

FROM tacc/tacc-ubuntu18-impi19.0.7-common 

RUN apt-get update && apt-get upgrade -y && apt-get install -y libjpeg-dev python3-pip 

RUN pip3 install numpy pandas matplotlib cython mpi4py 

ADD . /code 

WORKDIR /code 

RUN python3 setup.py build_ext --inplace 

RUN chmod +rx /code/run_reservoir_sim.py 

ENV PATH "/code:$PATH" RUN useradd -u 8877 <user>

The first line tells Docker which “base image” to start from. There are a wide variety of base images to choose from, ranging from standard operating systems (e.g., Ubuntu 18.04) to smaller specialized images for software languages like Python. Due to the complexity of TACC’s hardware and software infrastructure, they provide special base images that are set up to efficiently interface with the clusters’ particular processors, compilers, communication networks, etc. The base image imported here is designed to operate on both Stampede2 and Frontera. In general, it should also work on your local computer, which helps streamline application development.

The second line updates the Ubuntu operating system’s packages and installs pip for Python 3 along with another necessary library (libjpeg). Third, we install the Python libraries needed for our application, including Cython and mpi4py. We then add the contents of the current working directory into the container, in a new directory called /code, and make this the working directory inside the container. Next, we run setup.py, which cythonizes and compiles our application. We then make the file “run_reservoir_sim.py” executable, so that we can run it directly from the container, and add the /code directory to the container’s path. Lastly, we change the user from root to <user> (fill in your user profile name on your computer) to avoid permissions issues on some machines.

To build the container, we run the command:

docker build -f Dockerfile_tacc -t <username>/cython_reservoir:0.1 .

where <username> should be replaced with your DockerHub username. This command will run each line in our Dockerfile to build the container image locally, tagging it as “cython_reservoir” with the version tag 0.1.
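To make the image available on other machines (including the clusters below), we then push it to DockerHub, which creates a new repository called “cython_reservoir” under your account:

docker push <username>/cython_reservoir:0.1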

To run the container locally, we run the command:

docker run --rm -v <working_directory>/results:/code/results <username>/cython_reservoir:0.1 mpirun -np 4 run_reservoir_sim.py

This command has several parts. First, “docker run” means to create and start a container from the image (this is like instantiating an object from a class). The “--rm” flag means that the container is removed after running (just the instantiated container, not the image itself). Next, the “-v” flag mounts the directory “/code/results” inside the container to “<working_directory>/results” on your machine, where <working_directory> should be replaced with the full path of your working directory. This allows us to access the simulation results even after the container is removed. Lastly, “mpirun -np 4 run_reservoir_sim.py” tells the container to run the script across four processors – this will run four independent simulations and store each result as a csv file in the results directory.
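After the container exits, you can verify that the results were written through the mounted directory, for example:

ls <working_directory>/results

which should show one csv file for each of the four simulations (the exact file names depend on the simulation script).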

Running on TACC clusters with Singularity

Now that we have verified that this container works on a local laptop, let’s move to Stampede2, a supercomputer owned by TACC (with allocations managed through NSF XSEDE). Unfortunately, Docker isn’t allowed on TACC’s clusters due to permission issues. However, another containerization software called Singularity is allowed, and has the ability to run containers created through Docker. After logging into the cluster through the terminal, we need to load the Singularity module:

module load tacc-singularity

We also need to create a directory called “results” (e.g., with mkdir results). TACC doesn’t allow Singularity to be run directly from the login node, so we need to request an interactive session:

idev -m 30

When the interactive session on a compute node has been granted, we can “pull” our container from DockerHub:

singularity pull docker://ahamilton144/cython_reservoir:0.1

Singularity creates and downloads a single file, cython_reservoir_0.1.sif, as a Singularity representation of the Docker container. Once this has downloaded, you can exit your interactive session. We will run the job itself non-interactively by submitting it to the SLURM scheduler, just like you would with a non-containerized job:

sbatch submit_reservoir_stampede2.sh

This submission script, in addition to typical SLURM directives, contains the line

mpirun singularity run cython_reservoir_0.1.sif run_reservoir_sim.py

And that’s it! This command will run the simulation across all processors assigned to the job, and store the results in the results folder. These same commands and container can also be used on the Frontera cluster (also owned by TACC), using the “submit_reservoir_frontera.sh” script. As you can see, this makes it straightforward to develop and run code across your local laptop and two different TACC clusters, without having to worry about software dependency issues.
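For reference, here is a minimal sketch of what such a submission script might contain. The #SBATCH directives (job name, queue, allocation, node and task counts, wall time) are placeholders that you would adapt to your own job and cluster, and the actual scripts in the repository may differ:

#!/bin/bash
#SBATCH --job-name=reservoir_sim          ### placeholder job name
#SBATCH --output=reservoir_sim.%j.out     ### file to capture stdout/stderr
#SBATCH --partition=<queue-name>          ### placeholder queue/partition for your cluster
#SBATCH --account=<allocation>            ### placeholder allocation/project, if your cluster requires one
#SBATCH --nodes=2                         ### placeholder number of nodes
#SBATCH --ntasks-per-node=4               ### placeholder MPI tasks per node
#SBATCH --time=00:30:00                   ### placeholder wall-clock limit

### load the Singularity module ("singularity" rather than "tacc-singularity" on CAC clusters)
module load tacc-singularity

### run the containerized simulation across all MPI tasks assigned to the job
mpirun singularity run cython_reservoir_0.1.sif run_reservoir_sim.py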

Customizing containers to run on non-TACC clusters

Unfortunately, it’s not quite so simple to port containers to other HPC systems if your application involves MPI parallel processing. This is because it is necessary for the MPI configuration inside the container to be interoperable with the MPI configuration in the cluster’s environment. TACC’s customized base images take care of this complexity for us, but may not port to alternative systems, such as CAC’s TheCube and Hopper. In this case, we need to provide explicit instructions in the Dockerfile for how to set up MPI:

### start with ubuntu base image 
FROM ubuntu:18.04 

### install basics, python3, and modules needed for application 
RUN apt-get update && apt-get upgrade -y && apt-get install -y build-essential zlib1g-dev libjpeg-dev python3-pip openssh-server wget 
RUN pip3 install Pillow numpy pandas matplotlib cython 

### install openMPI version 4.0.5, consistent with Hopper & TheCube 
RUN wget 'https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.5.tar.gz' -O openmpi-4.0.5.tar.gz 
RUN tar -xzf openmpi-4.0.5.tar.gz openmpi-4.0.5; cd openmpi-4.0.5; ./configure --prefix=/usr/local; make all install 
RUN ldconfig 

### install mpi4py now that openmpi is installed 
RUN pip3 install mpi4py 

### add all code from current directory into “code” directory within container, and set as working directory 
ADD . /code 
WORKDIR /code 
ENV PATH "/code:$PATH" 

### compile cython for this particular application 
RUN python3 setup.py build_ext --inplace 

### set python file as executable so it can be run by docker/singularity 
RUN chmod +rx /code/run_reservoir_sim.py 

### change username from root 
RUN useradd -u 8877 <username>

For this container, we start with a base Ubuntu image rather than TACC’s specialized image. We then install several important libraries and Python modules. The third block of commands downloads and installs OpenMPI version 4.0.5, which is the version installed on Hopper. TheCube has an older version, 3.1.4, but I found version 4.0.5 to be backwards compatible and to work on both CAC clusters. The rest of the commands are equivalent to the first Dockerfile. We can now build this second container as a separate version, 0.2, and push it to DockerHub with docker push as before.

docker build -f Dockerfile -t <username>/cython_reservoir:0.2 .

We can run this container on our local machine using the same command from above, but substituting 0.2 for 0.1. Similarly, we can log onto TheCube or Hopper and execute the same commands from above, again substituting 0.2 for 0.1. Note that the singularity module on CAC clusters is “singularity” rather than “tacc-singularity”. “submit_reservoir_cube.sh” and “submit_reservoir_hopper.sh” can be used to submit to the SLURM scheduler on these systems.
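For example, on Hopper the full sequence of commands might look something like this (TheCube is the same except for the submission script name):

mkdir results
module load singularity
singularity pull docker://<username>/cython_reservoir:0.2
sbatch submit_reservoir_hopper.sh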

Thus, while MPI does compromise container portability somewhat, it is relatively straightforward to incorporate separate builds for different HPC systems using custom Dockerfiles. With the Dockerfile in place for each system, a streamlined workflow can be developed that allows for easy project development and deployment across multiple HPC systems.

Many thanks to TACC staff for teaching me all about containers at a recent virtual workshop, and to CAC staff who helped install and troubleshoot Singularity on their system.

Containerizing Rhodium with Docker

Ever installed a new library only for it to throw version deprecation errors up on your terminal? Or had warnings print in your output instead of the figure you so painstakingly coded? Fear not – containerization is here to save the day! But before we get too excited, there are a few things (and terms) to learn about containerization using Docker.

In this post, we will be walking through a brief description of containerization and explain a few of its key terms. At the end, we will perform an exercise by containerizing the Rhodium robust decision-making library. More information about this library and how it can be used for exploratory modeling can be found in this post by Andrew Dircks. For a specific application of this library to the Lake Problem, please refer to this post by Antonia Hadjimichael.

Explaining containerization (and what is a base image?)

Visualizing containerization as a portable, app-specific virtual environment (Source: https://www.armor.com/resources/blog/containerization-the-need-to-know/)

Using the image above, picture your hardware (laptop, desktop, supercomputer) as a large cargo ship, with its engines being its operating system. In the absence of containerization, an application (app) is developed in a specific computing environment, akin to placing cargo in a permanent storage hold under the deck of a ship. Methods for cargo loading and removal are strongly dictated by the shape and size of the ship. Similarly, a non-containerized app can only be reliably executed in a computing environment that is nearly identical to the one in which it was developed.

On the contrary, containerization bundles everything an app might need to run in a ‘container’ – the code, its required libraries, and their associated dependencies – therefore enabling an app to be run consistently on any infrastructure. By extension, this renders a containerized application version- and operating system (OS)-independent. These ‘containers’ are thus easily loaded and installed onto any ‘cargo ship’. The piece of software that enables the efficient execution of containerized apps is the container engine. This nifty tool is responsible for handling app user input and ensuring the correct installation, startup and running of the containerized app. The engine also pulls, loads, and builds the container image, which is a (misleadingly-named) file, or repository of files, that contains all the information that the engine will need to build the app on a new machine.

In this post, we will be walking through the containerization of the Rhodium library using Docker, a containerization platform that lets you develop, build, and store container images (its associated registry, DockerHub, hosts them for sharing). It is currently the most widely used containerization platform.

Let’s containerize!

Setup

If you use either a Windows or Mac machine, please install Docker Desktop from this site. Linux machines should first install Docker and then Docker Compose. Make sure to create an account and log in.

Next, clone the PRIM, Platypus, and Rhodium repositories onto your local machine. You can directly download a .zip file of the repository here or you can clone the repository via your command line/terminal into a folder within a directory of your choice:

C:\Users\username> cd directory-of-choice
C:\Users\username\directory-of-choice>git clone https://github.com/lbl59/Rhodium.git
C:\Users\username\directory-of-choice>git clone https://github.com/Project-Platypus/Platypus.git
C:\Users\username\directory-of-choice>git clone https://github.com/Project-Platypus/PRIM.git

Great, your repositories are set and ready to go! These should result in three new folders: Rhodium, Platypus, and PRIM. Now, in the same terminal window, navigate to the PRIM folder and run the following:

C:\Users\username\directory-of-choice\PRIM>python setup.py develop

Repeat for the Platypus folder, as shown below. This is to make sure that you have both PRIM and Project Platypus installed and set up on your local machine.
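That is, the corresponding command for Platypus would be:

C:\Users\username\directory-of-choice\Platypus>python setup.py develop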

Updating the requirements.txt file

Now, navigate back to the original directory-of-choice. Open the Rhodium folder, locate and open the requirements.txt file. Modify it so it looks like this:

matplotlib==3.5.1
numpy==1.22.1
pandas==1.4.0
mpldatacursor==0.7.1
six==1.16.0
scipy==1.7.3
prim
platypus-opt
scikit-learn==1.0.2

This file tells Docker which versions of the required libraries to install when building the container for your app.

Creating a Dockerfile

To begin building the container image for Docker to pack and build as Rhodium’s container, first create a new text file named Dockerfile within the Rhodium folder. Make sure to remove the .txt extension and save it as “All types” to avoid appending an extension. Open it using whichever text editor you are comfortable with. The contents of the file should look as follows. Note that the comments are for explanatory purposes only.

# state the base version of Python you are working with
# for my machine, it is Python 3.9.1
FROM python:3.9.1

# set /rhodium_app as the working directory inside the container
WORKDIR /rhodium_app

# copy the requirements file into the new working directory
COPY requirements.txt .

# install all libraries and dependencies declared in the requirements file
RUN pip install -r requirements.txt

# copy the rhodium subfolder into the new working directory
# find this subfolder within the main Rhodium folder
COPY rhodium/ .

# this is the command to run when the container starts
CMD ["python", "./setup.py"]

The “.” in the COPY commands indicates that the files are being copied into the current working directory inside the container (here, /rhodium_app).

Build the Docker image

Once again in your terminal, navigate to the Rhodium folder containing the Dockerfile you just created. Then, type in the following:

$ docker build -t rhodium_image .

Hit enter. If the containerization succeeded, you should see the following in your terminal (or something similar to it):

(Screenshot: terminal output from a successful build.)

Congratulations, you have successfully containerized Rhodium! You are now ready for world domination!
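To quickly test the image, you can start a container from it with a minimal command like the one below, which simply executes the Dockerfile’s CMD (python ./setup.py) inside the container and removes the container when it finishes:

$ docker run --rm rhodium_image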
