Containerizing your code for HPC (Docker + Singularity)

If you’ve ever run into software dependency issues when trying to download and run a computer model on a new machine, you’re in good company. Dependency issues are a common headache for modelers, whether you are trying to run someone else’s code or share your own code with collaborators. Problems can arise due to differences in operating system (e.g., Windows vs OSX vs Ubuntu), compilers (e.g., Gnu vs Intel), software versions (e.g., Python 3.3 vs 3.10), or other factors. These challenges are exacerbated on high performance computing (HPC) clusters, which often have highly specialized architectures and software environments that are beyond our control. For example, I regularly develop code on my laptop, test parallel jobs on TheCube cluster at Cornell, and finally run much larger production jobs on the Stampede2 cluster at TACC. Ensuring compatibility across these three environments can be a challenge, especially for models involving complex software interdependencies and/or MPI parallel communication. Thankfully, containers offer an elegant solution to many of these challenges.

Containers such as Docker allow you to package an application in a self-contained, ready-to-run package. The container includes its own operating system and all libraries and dependencies needed to run your code. The application then runs inside this custom container where it is isolated from the individual machine that you are running it on, making it much simpler to build portable applications. This has several benefits for researchers by allowing you to spend more time on scientific research and model building and less time on installing finicky software dependencies. Additionally, it can encourage “open science” and collaboration by streamlining the process of sharing and borrowing code.

However, HPC clusters present some unique challenges to the “build once, run anywhere” mantra of containerization. First, Docker containers generally require root permissions that are disallowed on shared supercomputers. This challenge can be overcome by using a second type of container called Singularity. Second, when running parallel applications with MPI, or running applications on GPUs, the container must be built in a way that is consistent with the hardware and software setup of the cluster. This challenge is harder to address, and does reduce the portability of the container somewhat, but as we shall see it can still be overcome with careful container design.

This blogpost will demonstrate how to build containers to run parallelized code on HPC clusters. We will build two different containers: one which can be run on two different TACC clusters, Stampede2 and Frontera, and one which can be run on two different Cornell Center for Advanced Computing (CAC) clusters, TheCube and Hopper. Both of these can also be run on a personal laptop. If you want to follow along with this post and/or containerize your own code, the first thing you will need to do is to download and install Docker Desktop and set up an account on DockerHub. You can find instructions for doing that, as well as a lot more great information on containerization and HPC, in this documentation from the Texas Advanced Computing Center (TACC). You can also find more information on Docker in earlier Water Programming Blog posts by Lillian Lau, Rohini Gupta, and Charles Rouge.

Containerizing a Cython reservoir model with Docker

All code related to this blogpost can be found in the “WPB_containerization” directory of my “cython_tutorial” GitHub repository, here. If you want to follow along, you can clone this repository and navigate in your terminal to the WPB_containerization directory. This directory contains a simple reservoir simulation model which is built using Cython, a software used to convert Python-like code into typed, compiled C code, which can significantly improve execution time. Cython is beyond the scope of this blog post (though I may cover it in a future post, so stay tuned!), but the interested reader can find an overview in the base directory of this GitHub repository, and references therein. For this post, suffice it to say that Cython introduces non-trivial software dependencies and build complexities beyond a more standard Python application, which makes it a good candidate for containerization. Additionally, this directory contains scripts for running parallel simulations across multiple nodes on an HPC cluster using MPI, which introduces additional containerization challenges.

The first container we will build is for the TACC clusters Stampede2 and Frontera. The main step involved in building a Docker container is writing the Dockerfile (see the file “Dockerfile_tacc”), which outlines the steps needed to build the container:

FROM tacc/tacc-ubuntu18-impi19.0.7-common 

RUN apt-get update && apt-get upgrade -y && apt-get install -y libjpeg-dev python3-pip 

RUN pip3 install numpy pandas matplotlib cython mpi4py 

ADD . /code WORKDIR /code 

RUN python3 setup.py build_ext --inplace 

RUN chmod +rx /code/run_reservoir_sim.py 

ENV PATH "/code:$PATH" RUN useradd -u 8877 <user>

The first line tells Docker which “base image” to start from. There are a wide variety of base images to choose from, ranging from standard operating systems (e.g., Ubuntu 18.04) to smaller specialized images for software languages like Python. Due to the complexity of TACC’s hardware and software infrastructure, they provide special base images that are set up to efficiently interface with the clusters’ particular processors, compilers, communication networks, etc. The base image imported here is designed to operate on both Stampede2 and Frontera. In general, it should also work on your local computer as well, which helps streamline application development.

The second line updates the Ubuntu operating system and installs Python3 and another necessary library. Third, we install Python libraries needed for our application, including Cython and mpi4py. We then add the contents of the current working directory into the container, in a new directory called /code, and make this the working directory inside the container. Next, we run setup.py, which cythonizes and compiles our application. We then make the file “run_reservoir_sim.py” executable, so that we can run it directly from the container, and add the /code directory to the container’s path. Lastly, we change the user from root to <user> (fill in your user profile name on your computer) to avoid permissions issues on some machines.

To build the container, we run the command:

docker build -f Dockerfile_tacc -t <username>/cython_reservoir:0.1 .

where <username> should be replaced with your DockerHub username. This command will run each line in our Dockerfile to build the image of the container, then push the image to Dockerhub and create a new project called “cython_reservoir”, with the version tag 0.1.

To run the container locally, we run the command:

docker run --rm -v <working_directory>/results:/code/results <username>/cython_reservoir:0.1 mpirun -np 4 run_reservoir_sim.py

This command has several parts. First, “docker run” means to create and start a container from the image (this is like instantiating an object from a class). “—rm” means that the container can be removed after running (just the instantiated container, not the image itself). Next, the “-v” flag creates a directory mount from “code/results” inside the container to “<working_directory>/results” on your machine, where <working_directory> should be replaced with the full path of your working directory. This allows us to access the simulation results even after the container is removed. Lastly, “mpirun -np 4 run_reservoir_sim.py” tells the container to run the script across four processors – this will run four independent simulations and store each result as a csv file in the results directory.

Running on TACC clusters with Singularity

Now that we have verified that this container works on a local laptop, let’s move to Stampede2, a supercomputer owned by TACC (with allocations managed through NSF XSEDE). Unfortunately, Docker isn’t allowed on TACC’s clusters due to permission issues. However, another containerization software called Singularity is allowed, and has the ability to run containers created through Docker. After logging into the cluster through the terminal, we need to load the Singularity module:

module load tacc-singularity

We also need to create a directory called “results”. TACC doesn’t allow Singularity to be run directly from the login node, so we need to request an interactive session:

idev -m 30

When the interactive session on a compute node has been granted, we can “pull” our container from DockerHub:

singularity pull docker://ahamilton144/cython_reservoir:0.1

Singularity creates and downloads a single file, cython_reservoir_0.1.sif, as a Singularity representation of the Docker container. Once this has downloaded, you can exit your interactive session. We will run the job itself non-interactively by submitting it to the SLURM scheduler, just like you would with a non-containerized job:

sbatch submit_reservoir_stampede2.sh

This submission script, in addition to typical SLURM directives, contains the line

mpirun singularity run cython_reservoir_0.1.sif run_reservoir_sim.py

And that’s it! This command will run the simulation across all processors assigned to the job, and store the results in the results folder. These same commands and container can also be used on the Frontera cluster (also owned by TACC), using the “submit_reservoir_frontera.sh” script. As you can see, this makes it straightforward to develop and run code across your local laptop and two different TACC clusters, without having to worry about software dependency issues.

Customizing containers to run on non-TACC clusters

Unfortunately, it’s not quite so simple to port containers to other HPC systems if your application involves MPI parallel processing. This is because it is necessary for the MPI configuration inside the container to be interoperable with the MPI configuration in the cluster’s environment. TACC’s customized base images take care of this complexity for us, but may not port to alternative systems, such as CAC’s TheCube and Hopper. In this case, we need to provide explicit instructions in the Dockerfile for how to set up MPI:

### start with ubuntu base image 
FROM ubuntu:18.04 

### install basics, python3, and modules needed for application 
RUN apt-get update && apt-get upgrade -y && apt-get install -y build-essential zlib1g-dev libjpeg-dev python3-pip openssh-server 
RUN pip3 install Pillow numpy pandas matplotlib cython 

### install openMPI version 4.0.5, consistent with Hopper & TheCube 
RUN wget 'https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.5.tar.gz' -O openmpi-4.0.5.tar.gz 
RUN tar -xzf openmpi-4.0.5.tar.gz openmpi-4.0.5; cd openmpi-4.0.5; ./configure --prefix=/usr/local; make all install 
RUN ldconfig 

### install mpi4py now that openmpi is installed 
RUN pip3 install mpi4py 

### add all code from current directory into “code” directory within container, and set as working directory 
ADD . /code WORKDIR /code ENV PATH "/code:$PATH" 

### compile cython for this particular application 
RUN python3 setup.py build_ext --inplace 

### set python file as executable so it can be run by docker/singularity 
RUN chmod +rx /code/run_reservoir_sim.py 

### change username from root 
RUN useradd -u 8877 <username>

For this container, we start with a base Ubuntu image rather than TACC’s specialized image. We then install several important libraries and Python modules. The third block of commands downloads and installs OpenMPI version 4.0.5, which is the version installed on Hopper. TheCube has an older version, 3.1.4, but I found version 4.0.5 to be backwards compatible and to work on both CAC clusters. The rest of the commands are equivalent to the first Dockerfile. We can now build this second container and push it to Dockerhub as a separate version, 0.2.

docker build -f Dockerfile -t <username>/cython_reservoir:0.2 .

We can run this container on our local machine using the same command from above, but substituting 0.2 for 0.1. Similarly, we can log onto TheCube or Hopper and execute the same commands from above, again substituting 0.2 for 0.1. Note that the singularity module on CAC clusters is “singularity” rather than “tacc-singularity”. “submit_reservoir_cube.sh” and “submit_reservoir_hopper.sh” can be used to submit to the SLURM scheduler on these systems.

Thus, while MPI does compromise container portability somewhat, it is relatively straightforward to incorporate separate builds for different HPC systems using custom Dockerfiles. With the Dockerfile in place for each system, a streamlined workflow can be developed that allows for easy project development and deployment across multiple HPC systems.

Many thanks to TACC staff for teaching me all about containers at a recent virtual workshop, and to CAC staff who helped install and troubleshoot Singularity on their system.

Continuous Deployment with GitHub Actions (or, What Gives Life to a Living eBook?)

Continuous Deployment with GitHub Actions (or, What Gives Life to a Living eBook?)

Last month we introduced our free eBook Addressing Uncertainty in MultiSector Dynamics Research. Did you know that, to date, we have published 74 revisions? Without an automated release and deployment process, this would have been very tedious! But, with the help of GitHub Actions, every small edit, minor clarification, or major improvement can go live within moments of merging the code. Read on to learn how the eBook team leverages GitHub Actions for CI/CD (Continuous Integration / Continuous Delivery).

GitHub Workflow

A reliable CI/CD strategy depends on a robust code review process⁠—continuously delivering bugs and typos will not impress anyone! There are many acceptable workflows; the best one to use will depend on team composition and codebase. In our case, a feature branching workflow suffices:

Feature Branching Workflow

The feature branching workflow consists of a main code branch representing published assets. New features or bug fixes require authors to create a new branch from main, implement their changes, and then submit a pull request back into main. A pull request is a formal code review process that gives other authors a chance to provide feedback on the proposed changes. Once consensus has been reached, the feature branch is merged into the main branch, kicking off the CI/CD process.

Automation

While the code review process is designed to catch content and conceptual errors, subtle process and system based errors can often slip through the cracks. Thus the first step in the CI/CD process should be running a suite of automated tests that span a range of systems, behaviors, and any known pain points. The depth and breadth of these tests should be sufficient to ensure an adequate degree of publication readiness without being overly burdensome to maintain. The test suite can grow over time!

For the eBook, our tests simply ensure that the Python dependencies install correctly on Linux, Mac, and Windows containers, that the supporting code can be imported on these systems without error, and that the HTML and PDF versions of the publication generate successfully. If any tests fail, the publication process is cancelled and the authors are notified with details of the failure.

This test, release, and publication process is orchestrated by GitHub Actions. Read on to learn more!

GitHub Actions

GitHub Actions are available to any project hosted in GitHub—totally free for public repositories, and free to a limited extent for private repositories. An Action can be defined as a unit of work performed using a virtual machine in response to an event. In GitHub, Actions are defined using YAML files places into the .github/workflows directory of a repository. YAML (YAML Ain’t Markup Language) is a concise, human-readable syntax for conveying semi-structured data. The minimal content of a GitHub Action includes a name, one or more event triggers, and one or more jobs. A job consists of a name, one or more virtual machine targets, and one or more steps. For example, the eBook test Action looks like this:

.github/workflows/01_test.yml
name: Test
on:
  push:
    branches: [ main ]
jobs:
  test:
      runs-on: ${{ matrix.os }}
      strategy:
        matrix:
          os: [ubuntu-latest, macos-latest, windows-latest]
      steps:
        - uses: actions/checkout@v2
        - name: Set up Python
          uses: actions/setup-python@master
          with:
            python-version: 3.9
        - name: Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -r requirements.txt
        - name: Run tests
          run: |
            pip install pytest
            pytest

There is a lot going on here, so let’s take it step by step!

name: Test
on:
  push:
    branches: [ main ]

This snippet gives our Action a name, and specifies that it should trigger on updates to the main branch of the repository.

jobs:
  test:
      runs-on: ${{ matrix.os }}
      strategy:
        matrix:
          os: [ ubuntu-latest, macos-latest, windows-latest ]

This snippet defines a job within our “Test” Action named “test”, and then uses special syntax to declare that the job should be run on three different virtual machines: the latest Ubuntu Linux, macOS, and Windows containers. Running the tests on multiple operating systems helps catch bugs with system-specific dependencies.

      steps:
        - uses: actions/checkout@v2
        - name: Set up Python
          uses: actions/setup-python@master
          with:
            python-version: 3.9
        - name: Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -r requirements.txt
        - name: Run tests
          run: |
            pip install pytest
            pytest

This snippet outlines the actual units of work within the job; each “-” separates a unique task. The uses syntax is special in that it allows one to leverage tasks written by others hosted in the GitHub Actions Marketplace. The actions/checkout@v2 task clones the repository onto the virtual machine, and the actions/setup-python@master task installs and configures the specified Python version. The final two steps use the run directive to invoke custom code, in this case installing Python dependencies and running the Python test suites.

Deployment

Once the tests successfully pass, it’s time to publish! Since the eBook is essentially a web app, GitHub Pages is the perfect deployment platform. Pages hosts the content of a branch as a website, and is free for public repositories.

If you followed along with the previous eBook post, you learned about the Python, Sphinx, and Restructured Text workflow for compiling the eBook content into a polished product. Let’s create a GitHub Action to compile the eBook and deploy it to GitHub Pages! Here’s the full YAML file:

.github/workflows/02_deploy.yml
name: Deploy
on:
  push:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install latex dependencies
        run: sudo apt-get update -y && sudo apt-get install -y texlive latexmk texlive-latex-recommended texlive-latex-extra texlive-fonts-recommended ghostscript
      - name: Update pip and install python dependencies
        working-directory: 'docs/'
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Build html and pdf ebook
        working-directory: 'docs/'
        run: |
          make html latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
        continue-on-error: true
      - name: Get current datetime in ISO format
        id: date
        run: echo "::set-output name=date::$(date -u +'%Y-%m-%d')"
      - name: Create GitHub release
        id: gh_release
        uses: softprops/action-gh-release@v1
        with:
          files: docs/build/latex/addressinguncertaintyinmultisectordynamicsresearch.pdf
          tag_name: ${{ steps.date.outputs.date }}v${{ github.run_number }}
      - name: Commit the compiled files
        run: |
          cd docs/build/html
          git init
          git add -A
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git commit -m 'deploy' -a || true
      - name: Push changes to gh-pages
        uses: ad-m/github-push-action@master
        with:
          branch: gh-pages
          directory: docs/build/html
          force: true
          github_token: ${{ secrets.GITHUB_TOKEN }}

A lot to unpack here! Let’s take it step by step. As before, we start by naming the Action, triggering it on updates to the main branch, declaring that it should run only on an ubuntu-latest virtual machine, checking out out the code, and setting up Python. Then we get into the new job steps:

      - name: Install latex dependencies
        run: sudo apt-get update -y && sudo apt-get install -y texlive latexmk texlive-latex-recommended texlive-latex-extra texlive-fonts-recommended ghostscript

This step installs all the operating system dependencies needed to support the Latex syntax and compilation to PDF. There was some trial and error involved in getting this right, but once correct it should be pretty stable.

      - name: Build html and pdf ebook
        working-directory: 'docs/'
        run: |
          make html latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
          make latexpdf --keep-going LATEXMKOPTS="-interaction=nonstopmode" || true
        continue-on-error: true

This step runs the Sphinx makefile to compile the HTML and PDF versions of the eBook. The verbosity and repetitiveness of these commands works around some unusual oddities of the Latex and PDF compilation. --keep-going LATEXMKOPTS="-interaction=nonstopmode" prevents the command from waiting for user input. || true and the repeated make latexpdf lines allow the PDF engine to fully resolve all the references in the restructured text files; otherwise the PDF file would be incomplete and garbled (this one stumped us for awhile!).

      - name: Get current datetime in ISO format
        id: date
        run: echo "::set-output name=date::$(date -u +'%Y-%m-%d')"
      - name: Create GitHub release
        id: gh_release
        uses: softprops/action-gh-release@v1
        with:
          files: docs/build/latex/addressinguncertaintyinmultisectordynamicsresearch.pdf
          tag_name: ${{ steps.date.outputs.date }}v${{ github.run_number }}

To make it easier to chronologically place our eBook releases, we wanted to include a date stamp in our version tags. The first step above assigns the date to a variable. The second step above creates and tags an official GitHub release (using the date and an auto-incrementing run number), and includes the PDF version of the eBook as an asset attached to the release.

      - name: Commit the compiled files
        run: |
          cd docs/build/html
          git init
          git add -A
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git commit -m 'deploy' -a || true
      - name: Push changes to gh-pages
        uses: ad-m/github-push-action@master
        with:
          branch: gh-pages
          directory: docs/build/html
          force: true
          github_token: ${{ secrets.GITHUB_TOKEN }}

These two steps cause the GitHub Actions user to commit the compiled HTML files and force push them to the gh-pages branch of our repository, using a secret token. This is a common “hack” to enable publishing only the desired web assets and not the entire repository. Never force push to other shared code branches!

Action Status

Check the status of the CI/CD pipeline using the Actions tab of the GitHub repository. Successful Actions show a check mark, in progress Actions show a spinner, and failed Actions show an X. Clicking into a particular Action will show more details, log messages, and error traces if relevant.

The GitHub Actions Tab

Slack Integration

To take our workflow to the next level (and to avoid the need to read even more email 😅 ), we added the GitHub app to our eBook Slack channel. This adds a bot that can subscribe to GitHub repositories and report on activity; for instance: new issues, new pull requests, and new releases. We can then discuss and iterate inline in Slack, without having to jump to other apps or sites.

GitHub Slack Integration

To add the GitHub bot to a channel, right click on the channel name, select “Open channel details”, and navigate to the “Integrations” tab. From here, you can choose “Add apps” and search for GitHub. Once added, type a bot command such as /github subscribe [repository name] to start receiving notifications in the channel. The bot can also be used to open and close issues!

Conclusion

Using the GitHub Actions workflow to automate the testing and publication of our eBook enabled our team to focus more on the content and quality of release rather than stumbling over the intricacies of the publication process. CI/CD has been a powerful tool in software communities, but can greatly benefit researchers and academics as well. We hope our learnings presented above will speed you along in your own workflows!