Easy batch parallelization of code in any language using mpi4py

The simplest form of parallel computing is what’s known as “embarrassingly” parallel processes. These processes involve fully independent runs of a model or script where little or no communication is needed across parallel processes. A common example is Monte Carlo evaluation, when we run a model over an ensemble of inputs. To parallelize an embarrassingly parallel application we simply need to send a set of commands to the cluster telling it to run each sample on a different core (or set of cores). For small applications, this can be done by submitting each run individually. For larger applications, SLURM Job Arrays (which are nicely detailed in Antonia’s post, here) can efficiently batch large number of function calls to independent computing cores. While this method is efficient and effective, I find it sometimes can be hard to keep track of, as you may be submitting tens or hundreds of jobs at a time. An alternative approach to submitting embarrassingly parallel tasks is to utilize MPI with Python to dispatch and organize jobs.

I like the MPI / Python combo because it consolidates all parallel applications into a single job, meaning you have one job to keep track of on a cluster at a time, and one output file generated by the batch set of model runs. I also find Python slightly easier to edit and debug than Bash scripts (which are used to create job arrays). Additionally, it’s very easy to assign each computing core a set of function evaluations to run (this can also be done with Job arrays, but again, I find Python easier to work with). Though Python is the language used to coordinate parallel tasks, we can use it to parallelize code in any language, as I’ll demonstrate below.

In this post I’ll first provide some background on MPI and its Python implementation, mpi4py. Next I’ll provide an example I’ve developed to demonstrate how to batch run a Matlab code on a cluster. The examples presented here are derived from some of Bernardo’s code in his post on Parallel programming in C/C++, which you can find here.

A very light introduction to MPI

MPI stands for “Message Passing Interface” and is the standard library for distributed memory parallelization (for background, see this post). To understand how MPI works, it’s helpful to define some of it’s basic components.

  1. Tasks: I’ll use the term task to define a processor (or group of processors) assigned to perform a specific set of instructions. These instructions may by a single evaluation of a function, or a set of function evaluations
  2. Communicators: A communicator is a group of MPI task units that are permitted to communicate with each other. In advanced MPI applications you may have multiple communicators, but for embarrassingly parallel applications we’ll only use one. The default communicator is called “MPI_COMM_WORLD” (I don’t know why, if anyone does please feel free to share in the comments), and that’s what I’ll work with here.
  3. Ranks: Each MPI task is assigned a unique identifier within the communicator called a rank. The processors running each task can access their own rank number, which will play an important role in how we use MPI for embarrassingly parallel applications.

A example schematic of the MPI_COMM_WORLD communicator with six tasks and their associated ranks is shown below.

mpi4py

MPI is implemented in Python with the mpi4py library. When we run an MPI code on a cluster, MPI creates the communicator and assigns each task a rank, then each task unit independently load the script. The processor/s associated with a task can then access their own unique rank.

The following snip of code loads this library, accesses the communicator and stores the rank of the given process:

# load the mpi4py library
from mpi4py import MPI

# access the MPI COMM WORLD communicator and assign it to a variable
comm = MPI.COMM_WORLD

# get the rank of the current process (different for each process on the cluster)
rank = comm.Get_rank()

Example of using mpi4py to batch parallel jobs

Here, I’ll parallelize the submission of a Matlab script called demoScript.m. This script reads an input file from a specific file location and prints out the contents of that file. For example purposes I’ve created 20 input files, each in their own folders. The folders are called “input_sample_0”, “input_sample_1” etc.. Each input_sample folder contains a file called “sample_data.txt”, which contains one line of text reading: “This is data for run <sample_number>”.

All code for this example can be found on Github, here: https://github.com/davidfgold/mpi4py_blog.git

Batching runs of demoScript.m process involves three components:

  1. Write demoScript.m so that it reads the sample number from the input.
  2. Write a Python script that will use mpi4py to distribute calls of demoScript.m. Here I’ll call this script “callDemoScript.py”
  3. Write a Bash script that sets up your MPI run and calls the Python function. Here I’ll call this script “submitDemoScript.sh”

1. demoScript.m

The demo Matlab script is found below. It reads in two arguments that are called from the command line. The first argument is the rank, which will vary for each task, and the second is the sample number, which will specify which input folder to read from.

%%%%%%%%%%%%%%%%%%%%
% demoScript.m
%
% reads an input file from a given sample number (specified via command line)
% prints output from the sample file associated with the sample number
% also prints the rank for demonstration purposes
%%%%%%%%%%%%%%%%%%%%

% read in command line input
arg_list = argv();
rank = arg_list{1,1}; % rank is the first argument
sample = arg_list{2, 1}; % sample number is the second argument

% Create a string that contains the location of the proper sample directory
sample_out = fileread(strcat("input_sample_", sample, "/sample_data.txt"));

% create a string to print the rank number
rank_call = strcat("This is rank_", rank, ", recieving the following input: \n");

% format the output and print
output = strcat(rank_call, sample_out);
fprintf(output)

2. callDemoScript.py

The second component is a Python script that uses mpi4py to call demoScript.m many times across different tasks. Each task will run a number of samples equal to a variable called “N_SAMPLES_PER_TASK” which will be fed to this script when it is called.

'''
callDemoScript.py

Called to batch demoScript.m across multiple MPI tasks

Reads in the total tasks and number of samples per task from command line.
'''
# load necessary libraries
from mpi4py import MPI
import numpy as np
import sys
import os
import time

# locate the COMM WORLD communicator
comm = MPI.COMM_WORLD

# get the number of the current rank
rank = comm.Get_rank()

# read in arguments from the submission script
TOTAL_TASKS = int(sys.argv[1]) # number of MPI processes
N_SAMPLES_PER_TASK = int(sys.argv[2]) # number of runs per/task

# loop through samples assigned to current rank
for i in range(N_SAMPLES_PER_TASK):
	sample= rank + TOTAL_TASKS * i

	# write the command that will be sent to the terminal (here RUN will replace the {})
	terminal_command = "octave-cli ./demoScript.m {} {} ".format(rank, sample)

	# write the terminal command to the process
	os.system(terminal_command)

	# sleep before submitting the next command
	time.sleep(1) # optional, for memory intensive submissions

comm.Barrier()

submitDemoscript.sh

The final component is a Bash script that will send this MPI job to the cluster. Here I’ll use SLURM to create 4 MPI tasks across 2 Nodes (each node will have 2 associated task). This will create a total of 4 MPI tasks, and each task will be assigned 5 samples to run.

I wrote this for a local cluster at Cornell, note that I had to load two modules to run Python and a third to run Octave (which is used to call Matlab scripts on Linux). I’ll call the Python script with mpirun, and then specify the total number of MPI tasks before making the function call. The output of the script is printed to a text file called demoOutput.txt

# Set up your parallel runs
SAMPLES_PER_TASK=5 # number of runs for each MPI task
N_NODES=2 # number of nodes
TASKS_PER_NODE=2 # number of tasks per node

TOTAL_TASKS=$(($N_NODES*$TASKS_PER_NODE)) # total number of tasks

# Submit the parallel job
#!/bin/bash
#SBATCH -n $(TOTAL_TASKS) -N $(N_NODES)
#SBATCH --time=0:01:00
#SBATCH --job-name=demoMPI4py
#SBATCH --output=output/demo.out
#SBATCH --error=output/demo.err
#SBATCH --exclusive
module load py3-mpi4py
module load py3-numpy
module load octave/6.3.0

mpirun -np $TOTAL_TASKS python3 callDemoScript.py $TOTAL_TASKS $SAMPLES_PER_TASK > demoOutput.txt

Additional resources

Putting some thought into how you design a set of parallel runs can save you a lot of time and headache. The example above has worked well for me when submitting sets of embarrassingly parallel tasks, but each application will be different, so take the time to find the procedure that works best for you. Our blog and the internet are full of resources that can help you parallelize your code, below are some suggestions:

Performing Experiments on HPC Systems

Scaling experiments: how to measure the performance of parallel code on HPC systems

Parallel processing with R on Windows

How to automate scripts on a cluster

Parallelization of C/C++ and Python on Clusters

Developing parallelised code with MPI for dummies, in C (Part 1/2)

Cornell CAC glossery on HPC terms: https://cvw.cac.cornell.edu/main/glossary

A great MPI tutorial I found online: https://mpitutorial.com/tutorials/

How to automate scripts on a cluster

There are several reasons why you might need to schedule or automate your scripts on a personal machine or a cluster:

  • You’re waiting for a job to finish before submitting another
  • You’d like to automate regular backups or cleanups of your data (e.g., move new data to another location or remove unnecessary output files)
  • You need to submit jobs to get around node limitations (e.g., you’d like to spread out the submissions over several days)
  • You need to retrieve regularly updated data (e.g., you have a model that uses daily precipitation data and you’d like to automatically collect them every day)

Cron is a utility program on Unix operating systems that allows you to schedule or repeat such tasks in the future. There’s a crontab file associated with every user in a cluster, where you’ll input all the information needed to schedule and automate your tasks. Note that not all clusters automatically allow their users to run cron jobs[1], for example, I can use it on the Reed Group’s Cube cluster, but not on XSEDE’s Comet.

To edit the crontab file associated with your user, type the following in your command line:

crontab -e

This will open a text editor (like Vim) which you can edit. To simply view your current crontab without editing, run:

crontab -l

Crontab syntax is made up of two parts: the timer indicating when to run and the command to run:

Source

The timer accepts five fields, indicating the time and day for the command to run:

  • Minute — minute of the hour, from 0 to 59
  • Hour — hour of the day, from 0 to 23
  • Day of the month — day of the month, from 1 to 31
  • Month — month of the year, from 1 to 12
  • Day of the week — day of the week, from 0 to 7

For example the following would execute script.sh on January 2nd at 9:00AM:

0 9 2 1 * /home/user/scripts/script.sh

Special characters are naturally very useful here, as they allow multiple execution times or ranges:

Asterisk (*) — to use all scheduling parameters in a field, for example, run the script, every day at midnight:

0 0 * * * /home/user/scripts/script.sh

Comma (,) — to use more than one scheduling parameter in a field, for example, run the script every day at midnight and 12PM:

0 0,12 * * * /home/user/scripts/script.sh

Slash (/) — to create predetermined time intervals, for example, run the script every four hours:

0 */4 * * * /home/user/scripts/script.sh

Hyphen (-) — to determine a range of values in a field, for example, run the script every minute during the first 10 minutes of every hour, every day

0-10 * * * * /home/user/scripts/script.sh

Hyphens and slashes can be combined, for example, to run a script every 5 minutes during the first 30 minutes of every hour, every day:

0-30/5 * * * * /home/user/scripts/script.sh

Last (L) — this character can only be used in the day-of-the-month and day-of-the-week fields to specify the last occurrence of something, for example the last day of the month (which could differ):

0 9 L * * /home/user/scripts/script.sh

or, to specify constructs such as “the last Friday” of a every month:

0 9 * * 5L /home/user/scripts/script.sh

Weekday ( W) — this character is only allowed on the day-of-month field and is used to determine the closest weekday to that day of the month. For instance, using “15W” indicates to cron to run the script on the nearest weekday to the 15th day of the month. If the 15th is a Saturday, the script will be executed on Friday the 14th. If the 15th is a Sunday, the script will be executed on Monday the 16th. If the 15th is a weekday, the script will be executed on the same day:

0 0 15W * * /home/user/scripts/script.sh

Hash (#) — this character is only allowed in the day-of-week field and is used to specify constructs such as the second Friday of every month:

0 0 * * 5#2 /home/user/scripts/script.sh

Lastly, if you’d like to be notified whenever a script is executed you can use the MAILTO parameter, with your email address.

The important thing to remember when running cron on a cluster (as opposed to your own machine) is that it will launch a shell that with a new clean environment (i.e., without the environment variables that are automatically applied when you log on an interactive shell) and it will likely not be able to recognize some commands or where your modules are. This can be easily addressed by sourcing your bash_rc or bash_profile from your home directory before running anything. You also need to remember that it will launch at your home directory and you need to specify the absolute path of the scripts to be executed, or change directory before executing them.

For example my crontab file on the Reed Group cluster looks like this:

#!/bin/bash
MAILTO=myemail@cornell.edu
00 10 * * * . $HOME/.bashrc; cd /directory/where/my/project/is; git pull; sbatch ./script.sh
30 10 * * * . $HOME/.bashrc; cd /directory/where/my/project/is; git add . ; git commit -m 'fetched data'; git push

This does the following:
Every day at 10am it sources my bashrc profile so it knows all my environment variables. It changes to the directory of my project and pulls from git any new updates to that project. It then submits a script using sbatch. I get an email at the same time, with the text that would that would have appeared in my command line had I executed these commands in an interactive node (i.e., the git information and a line saying Submitted batch job xxxxx).
Then, every day at 10:30 am, I commit and push the new data back to git.


[1] If you’re just a regular user on a cluster you might need to request to be granted access. If you have root privileges (say, on a personal machine), you need to edit your cron allow and deny files:

/etc/cron.allow
/etc/cron.deny

More Terminal Schooling

You are probably asking yourself “and why do I need more terminal schooling?”. The short answer is: to not have to spend as much time as you do on the terminal, most of which spent (1) pushing arrow keys thousands of times per afternoon to move through a command or history of commands, (2) waiting for a command that takes forever to be done running before you can run anything else, (3) clicking all over the place on MobaXTerm and still feeling lost, (4) manually running the same command multiple times with different inputs, (5) typing the two-step verification token every time you want to change a “+” to a “-” on a file on a supercomputer, (6) waiting forever for a time-consuming run done in serial on a single core, and (7, 8, …) other useless and horribly frustrating chores. Below are some tricks to make your Linux work more efficient and reduce the time you spend on the terminal. From now on, I will use a “$” sign to indicate that what follows is a command typed in the terminal.

The tab autocomple is your best friend

When trying to do something with that file whose name is 5480458 characters long, be smart and don’t type the whole thing. Just type the first few letters and hit tab. If it doesn’t complete all the way it’s because there are multiple files whose names begin with the sequence of characters. In this case, hitting tab twice will return the names of all such files. The tab autocomplete works for commands as well.

Ctrl+r for search through previous commands

When on the terminal, hit ctrl+r to switch to reverse search mode. This works like a simple search function o a text document, but instead looking in your bash history file for commands you used over the last weeks or months. For example, if you hit ctrl+r and type sbatch it will fill the line with the last command you ran that contained the word sbatch. If you hit ctrl+r again, it will find the second last used command, and so on.

Vim basics to edit files on a system that requires two-step authentication

Vim is one the most useful things I have came across when it comes to working on supercomputers with two-step identity verification, in which case using MobaXTerm of VS Code requires typing a difference security code all the time. Instead of uploading a new version of a code file every time you want to make a simple change, just edit the file on the computer itself using Vim. To make simple edits on your files, there are very few commands you need to know.

To open a file with Vim from the terminal: $ vim <file name> or $ vim +10 <file name>, if you want to open the file and go straight to line 10.

Vim has two modes of operation: text-edit (for you to type whatever you want in the file) and command (replacement to clicking on file, edit, view, etc. on the top bar of notepad). When you open Vim, it will be in command mode.

To switch to text-edit mode, just hit either “a” or “i” (you should then see “– INSERT –” at the bottom of the screen). To return to command mode, hit escape (Esc). When in text-edit more, the keys “Home,” “End,” “Pg Up,” “Pg Dn,” “Backspace,” and “Delete” work just like on Notepad and MS Word.

When in command mode, save your file by typing :w + Enter, save and quite with :wq, and quit without saving with :q!. Commands for selecting, copying and pasting, finding and replacing, replacing just one character, deleting a line, and other more advanced tasks can be found here. There’s also a great cheatsheet for Vim here. Hint: once you learn some more five to ten commands, making complex edits on your file with Vim becomes blazingly fast.

Perform repetitive tasks on the terminal using one-line Bash for-loops.

Instead of manually typing a command for each operation you want to perform on a subset of files in a directory (“e.g., cp file<i>.csv directory300-400 for i from 300 to 399 , tar -xzvf myfile<i>.tar.gz, etc.), you can use a Bash for-loop if using the is not possible.

Consider a situation in which you have 10,000 files and want to move files number 200 to 299 to a certain directory. Using the wildcard “*” in this case wouldn’t be possible, as result_2<i>.csv would return result_2.csv, result_20.csv to result_29.csv, and result_2000.csv to result_2999.csv as well–sometimes you may be able to use Regex, but that’s another story. To move a subset of result files to a directory using a Bash for-loop, you can use the following syntax:

$ for i in {0..99}; do cp result_2$i results_200s/; done

Keep in mind that you can have multiple commands inside a for-loop by separating them with “;” and also nest for-loops.

Run a time-intensive command on the background with an “&” and keep doing your terminal work

Some commands may take a long time to run and render the terminal unusable until it’s complete. Instead of opening another instance of the terminal and login in again, you can send a command to the background by adding “&” at the end of it. For example, if you want to extract a tar file with dozens of thousands of files in it and keep doing your work as the files are extracted, just run:

$ tar -xzf my_large_file.tar.gz &

If you have a directory with several tar files and want to extract a few of them in parallel while doing your work, you can use the for-loop described above and add “&” to the end of the tar command inside the loop. BE CAREFUL, if your for-loop iterates over dozens or more files, you may end up with your terminal trying to run dozens or more tasks at once. I accidentally crashed the Cube once doing this.

Check what is currently running on the terminal using ps

To make sure you are not overloading the terminal by throwing too many processes at it, you can check what it is currently running by running the command ps. For example, if I run an program with MPI creating two processes and run ps before my program is done, it will return the following:

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
$ mpirun -n 2 ./triangleSimulation -I Tests/test_input_file_borg.wp &
[1] 6129     <-- this is the process ID
bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ ps
 PID TTY TIME CMD
 8 tty1 00:00:00 bash
 6129 tty1 00:00:00 mpirun    <-- notice the process ID 6129 again
 6134 tty1 00:00:00 triangleSimulat
 6135 tty1 00:00:00 triangleSimulat
 6136 tty1 00:00:00 ps

Check the output of a command running on the background

If you run a program on the background its output will not be printed on the screen. To know what’s happening with your program, send (to pipe) its output to a text file using the “>” symbol, which will be updated continuously as your program is running, and check it with cat <file name>, less +F<file name>, tail -n<file name>, or something similar. For example, if test_for_background.sh is a script that will print a number on the screen every one second, you could do the following (note the “> pipe.csv” in the first command):

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ ./test_for_background.sh > pipe.csv &
 [1] 6191

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ cat pipe.csv
 1
 2

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ cat pipe.csv
 1
 2
 3

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ tail -3 pipe.csv
 8
 9
 10

This is also extremely useful in situations when you want to run a command that takes long to run but whose outputs are normally displayed one time on the screen. For example, if you want to check the contents of a directory with thousands of files to search for a few specific files, you can pipe the output of ls to a file and send it to the background with ls > directory_contents.txt & and search the resulting text file for the file of interest.

System monitor: check core and memory usage with htop, or top if htop is not available

If ps does not provide enough information given your needs, such as if you’re trying to check if your multi-thread application is using the number of cores it should, you can try running htop instead. This will show on your screen something along the lines of the Performance view  of Windows’ Task Manager, but without the time plot. It will also show how much memory is being used, so that you do not accidentally shut down a node on an HPC system. If htop is not available, you can try top.

Running make in parallel with make -j for much shorter compiling time

If a C++ code is properly modularized, make can compile certain source code files in parallel. To do that, run make -j<number of cores> <rule in makefile>. For example, the following command would compile WaterPaths in parallel over four cores:

$ make -j4 gcc

For WaterPaths, make gcc takes 54s on the Cube, make -j4 gcc takes 15s, make -j8 gcc takes 9s, so the time and patience savings are real if you have to compile the code various times per day. To make your life simpler, you can add an alias to bash_aliases such as alias make='make -j4' (see below in section about .bash_aliases file). DO NOT USE MAKE -J ON NSF HPC SYSTEMS: it is against the rules. On the cube keep it to four cores or less not to disturb other users, but use all cores available if on the cloud or iterative section.

Check the size of files and directories using du -hs

The title above is quite self-explanatory. Running du -hs <file name> will tell you its size.

Check the data and time a file was created or last modified using the stat command

Also rather self-explanatory. Running stat <file name> is really useful if you cannot remember on which file you saved the output last time you ran your program.

Split large files into smaller chunks with the split command and put them back together with cat

This works for splitting a large text file into files with fewer lines, as well as for splitting large binary files (such as large tar files) so that you can, for example, upload them to GitHub or e-mail them to someone. To split a text file with 10,000 into ten files with 1,000 lines each, use:

 $ split -l 1000 myfile myfile_part

This will result in ten files called myfile_part00, myfile_part01, and so on with 1,000 lines each. Similarly, the command below would break a binary file into parts with 50 MB each:

 $ split -b 50m myfile myfile_part

To put all files back together in either case, run:

$ cat myfile_part* myfile

More information about the split command can be found in Joe’s post about it.

Checking your HPC submission history with `sacct`

Another quite sulf-explanatory tile. If you want to remember when you submitted something, such as to check if an output file resulted from this or that submission (see stat command), just run the command below in one line:

$ sacct -S 2019-09-18 -u bct52 --format=User,JobID,Jobname,start,end,elapsed,nnodes,nodelist,state

This will result in an output similar to the one below:

bct52 979 my_job 2019-09-10T21:48:30 2019-09-10T21:55:08 00:06:38 1 c0001 COMPLETED
bct52 980 skx_test_1 2019-09-11T01:44:08 2019-09-11T01:44:09 00:00:01 1 c0001 FAILED
bct52 981 skx_test_1 2019-09-11T01:44:33 2019-09-11T01:56:45 00:12:12 1 c0001 CANCELLED
bct52 1080 skx_test_4 2019-09-11T22:07:03 2019-09-11T22:08:39 00:01:36 4 c[0001-0004] COMPLETED
1080.0 orted 2019-09-11T22:08:38 2019-09-11T22:08:38 00:00:00 3 c[0002-0004] COMPLETED

Compare files with meld, fldiff, or diff

There are several programs to show the differences between text files. This is particularly useful when you want to see what the changes between different versions of the same file, normally a source code file. If you are on a computer running a Linux OS or have an X server like Xming installed, you can use meld and kdiff3 for pretty outputs on a nice GUI or fldiff to quickly handle a files with huge number of difference. Otherwise, diff will show you the differences in a cruder pure-terminal but still very much functional manner. The syntax for all of them is:

$ <command> <file1> <file2>

Except for diff, for which it is worth calling with the --color option:

$ diff --color <file1> <file2>

If cannot run a graphical user interface but is feeling fancy today, you can install the ydiff Python extension with (done just once):

$ python3 -m pip install --user ydiff 

and pipe diff’s output to it with the following:

$diff -u <file1> <file2> | python3 -m ydiff -s

This will show you the differences between two versions of a code file in a crystal clear, side by side, and colorized way.

Creating a .bashrc file for a terminal that’s easy to work with and good (or better) to look at

When we first login to several Linux systems the terminal is all black with white characters, in which it’s difficult find the commands you typed amidst all the output printed on the screen, and with limited autocomplete and history search. In short, it’s a real pain and you makes you long for Windows as much as for you long for your mother’s weekend dinner. There is, however, a way of making the terminal less of a pain to work with, which is by creating a file called .bashrc with the right contents in your home directory. Below is an example of a .bashrc file with the following features for you to just copy and paste in your home directory (e.g., /home/username/, or ~/ for short):

  • Colorize your username and show the directory you’re currently in, so that it’s easy to see when the output of a command ends and the next one begins–as in section “Checking the output of a command running on the background.”
  • Allow for a search function with the up and down arrow keys. This way, if you’re looking for all the times you typed a command starting with sbatch, you can just type “sba” and hit up arrow until you find the call you’re looking for.
  • A function that allows you to call extract and the compressed file will be extracted. No more need to tar with a bunch of options, unzip, unrar, etc. so long as you have all of them installed.
  • Colored man pages. This means that when you look for the documentation of a program using man, such as man cat to see all available options for the cat command, the output will be colorized.
  • A function called pretty_csv to let you see csv files in a convenient, organized and clean way from the terminal, without having to download it to your computer.
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi

# Load aliases
if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi

# Automatically added by module
shopt -s expand_aliases

if [ ! -z "$PS1" ]; then
PS1='\[\033[G\]\[\e]0;\w\a\]\n\[\e[1;32m\]\u@\h \[\e[33m\]\w\[\e[0m\]\n\$ '
bind '"\e[A":history-search-backward'
bind '"\e[B":history-search-forward'
fi

set show-all-if-ambiguous on
set completion-ignore-case on
export PATH=/usr/local/gcc-7.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/gcc-7.1/lib64:$LD_LIBRARY_PATH

history -a
export DISPLAY=localhost:0.0

sshd_status=$(service ssh status)
if [[ $sshd_status = *"is not running"* ]]; then
sudo service ssh --full-restart
fi

HISTSIZE=-1
HISTFILESIZE=-1

extract () {
if [ -f $1 ] ; then
case $1 in
*.tar.bz2)   tar xvjf $1    ;;
*.tar.gz)    tar xvzf $1    ;;
*.bz2)       bunzip2 $1     ;;
*.rar)       unrar x $1       ;;
*.gz)        gunzip $1      ;;
*.tar)       tar xvf $1     ;;
*.tbz2)      tar xvjf $1    ;;
*.tgz)       tar xvzf $1    ;;
*.zip)       unzip $1       ;;
*.Z)         uncompress $1  ;;
*.7z)        7z x $1        ;;
*)           echo "don't know how to extract '$1'..." ;;
esac
else
echo "'$1' is not a valid file!"
fi
}

# Colored man pages
export LESS_TERMCAP_mb=$'\E[01;31m'
export LESS_TERMCAP_md=$'\E[01;31m'
export LESS_TERMCAP_me=$'\E[0m'
export LESS_TERMCAP_se=$'\E[0m'
export LESS_TERMCAP_so=$'\E[01;44;33m'
export LESS_TERMCAP_ue=$'\E[0m'
export LESS_TERMCAP_us=$'\E[01;32m'

# Combine multiline commands into one in history
shopt -s cmdhist

# Ignore duplicates, ls without options and builtin commands
HISTCONTROL=ignoredups
export HISTIGNORE="&:ls:[bf]g:exit"

pretty_csv () {
cat "$1" | column -t -s, | less -S
}

There are several .bashrc example files online with all sorts of functionalities. Believe me, a nice .bashrc will make your life A LOT BETTER. Just copy and paste the above into a text file called .bashrc and sent it to your home directory in your local or HPC system terminal.

Make the terminal far less user-friendly and less archane by setting up a .bash_aliases file

You should also have a .bash_aliases file to significantly reduce typing and colorizing the output of commands you often use for ease of navigation. Just copy all the below into a file called .bash_aliases and copy into your home directory (e.g., /home/username/, or ~/ for short). This way, every time you run the command between the word “alias” and the “=” sign, the command after the “=”sign will be run.

alias ls='ls --color=tty'
alias ll='ls -l --color=auto'
alias lh='ls -al --color=auto'
alias lt='ls -alt --color=auto'
alias uu='sudo apt-get update && sudo apt-get upgrade -y'
alias q='squeue -u '
alias qkill='scancel $(qselect -u bct52)'
alias csvd="awk -F, 'END {printf \"Number of Rows: %s\\nNumber of Columns: %s\\n\", NR, NF}'"
alias grep='grep --color=auto'                          #colorize grep output
alias gcc='gcc -fdiagnostics-color=always'                           #colorize gcc output
alias g++='g++ -fdiagnostics-color=always'                          #colorize g++ output
alias paper='cd /my/directory/with/my/beloved/paper/'
alias res='cd /my/directory/with/my/ok/results/'
alias diss='cd /my/directory/of/my/@#$%&/dissertation/'
alias aspell='aspell --lang=en --mode=tex check'
alias aspellall='find . -name "*.tex" -exec aspell --lang=en --mode=tex check "{}" \;'
alias make='make -j4'

Check for spelling mistakes in your Latex files using aspell

Command-line spell checker, you know what this is.

aspell --lang=en --mode=tex check'

To run aspell check on all the Latexfiles in a directory and its subdirectories, run:

find . -name "*.tex" -exec aspell --lang=en --mode=tex check "{}" \;

Easily share a directory on certain HPC systems with others working on the same project [Hint from Stampede 2]

Here’s a great way to set permissions recursively to share a directory named projdir with your research group:

$ lfs find projdir | xargs chmod g+rX

Using lfs is faster and less stressful on Lustre than a recursive chmod. The capital “X” assigns group execute permissions only to files and directories for which the owner has execute permissions.

Run find and replace in all files in a directory [Hint from Stampede 2]

Suppose you wish to remove all trailing blanks in your *.c and *.h files. You can use the find command with the sed command with in place editing and regular expressions to this. Starting in the current directory you can do:

$ find . -name *.[ch] -exec sed -i -e ‘s/ +$//’ {} \;

The find command locates all the *.c and *.h files in the current directory and below. The -exec option run the sed command replacing {} with the name of each file. The -i option tells sed to make the changes in place. The s/ +$// tells sed to replace one or blanks at the end of the line with nothing. The \; is required to let find know where the end of the text for the -exec option. Being an effective user of sed and find can make a great different in your productivity, so be sure to check Tina’s post about them.

Other post in this blog

Be sure to look at other posts in this blog, such as Jon Herman’s post about ssh, Bernardo’s post about other useful Linux commands organized by task to be performed, and Joe’s posts about grep (search inside multiple files) and cut.

Remote terminal environment using VS Code for Windows and Mac

On Windows machines, the application MobaXterm is a valuable tool for computing on virtual machines and working through SSH clients. David Gold’s blog post walks through the installation and use of this app, which works well in Windows environments.

Working remotely on my Mac laptop, I have been struggling to achieve the same workflow as in the office, with a Windows machine. Unfortunately, MobaXterm is not available for download on Mac OS. Looking for alternatives, I discovered that using VS Code with the “Remote – SSH” extension is a great replacement with significant advantages to MobaXterm, as it an SSH client interface and code editor in one.

A screenshot from my VS Code remote interface, with the graphical file browser on the left panel, the SSH server terminal on the bottom-right, and the VS Code editor on the top-right.

Here’s how you can set up a remote session on Mac (and Windows) using VS Code: 

  1. Install the VS Code application here. For installation help and a brief overview of the app, check out this video.
  2. With VS Code opened, go to View -> Extensions, and search “Remote – SSH.” Click on the extension and press the green “Install” button. You should see the message “This extension is enabled globally” appear. Check out this extension’s description below (I’ll run through the basics in this post).
  3. On the bottom left of your screen, there should be a small green box with two opposite pointing arrow heads. Click this.
The green box is the Remote – SSH extension.
  1. Choose the first pop-up option “Remote-SSH: Connect to host…” and then select “Add New SSH Host…”.
Click the first box and then the “Add New SSH Host” button to connect to your SSH client.
  1. Here, enter your remote SSH username@serverid (here at Cornell, this would be yournetid@thecube.cac.cornell.edu to connect to our remote computing cluster, the Cube).
  2. In the same pop-up window, click the remote server that you just added. A new window will open and prompt you to enter your password for the server.
  3. Now, you in are in your remote SSH environment. Click “Open folder…” and select “OK” to see your remote directory on the left. You can navigate through these files in your remote machine the same way as MobaXterm. Click View -> Terminal to see your SSH command line on the bottom of the screen (here’s where you can actually run the programs on your cluster).

Now using VS Code, you can install other extensions to aid in code editing in different languages (here’s an article with a few good ones for various uses). This environment has the same functionality as MobaXterm, without having to switch applications for editing code. Run your cluster programs in the terminal window and edit the code in the main VS Code editor!

Job scheduling on HPC resources

Architecture of a HPC Cluster

Modern High Performance Computing (HPC) resources are usually composed of a cluster of computing nodes that provide the user the ability to parallelize tasks and greatly reduce the time it takes to perform complex operations. A node is usually defined as a discrete unit of a computer system that runs its own instance of an operating system. Modern nodes have multiple chips, often known as Central Processing Units or CPUs, which each contain multiple cores each capable of processing a separate stream of instructions (such as a single Monte Carlo run). An example cluster configuration is shown in Figure 1.

garbage

Figure 1. An example cluster configuration

To efficiently make use of a cluster’s computational resources, it is essential to allow multiple users to use the resource at one time and to have an efficient and equatable way of allocating and scheduling computing resources on a cluster. This role is done by job scheduling software. The scheduling software is accessed via a shell script called in the command line. A scheduling  script does not actually run any code, rather it provides a set of instructions for the cluster specifying what code to run and how the cluster should run it. Instructions called from a scheduling script may include but are not limited to:

  • What code would you like the cluster to run
  • How would you like to parallelize your code (ie MPI, openMP ect)
  • How many nodes would you like to run on
  • How many core per processor would you like to run (normally you would use the maximum allowable per processor)
  • Where would you like error and output files to be saved
  • Set up email notifications about the status of your job

This post will highlight two commonly used Job Scheduling Languages, PBS and SLURM and detail some simple example scripts for using them.

PBS

The Portable Batching System (PBS) was originally developed by NASA in the early 1990’s [1] to facilitate access to computing resources.  The intellectual property associated with the software is now owned by Altair Engineering. PBS is a fully open source system and the source code can be found here. PBS is the job scheduler we use for the Cube Cluster here at Cornell.

An annotated PBS submission script called “PBSexample.sh” that runs a C++ code called “triangleSimulation.cpp” on 128 cores can be found below:

#PBS -l nodes=8:ppn=16    # how many nodes, how many cores per node (ppn)
#PBS -l walltime=5:00:00  # what is the maximum walltime for this job
#PBS -N SimpleScript      # Give the job this name.
#PBS -M email.cornell.edu # email address for notifications
#PBS -j oe                # combine error and output file
#PBS -o outputfolder/output.out # name output file

cd $PBS_O_WORKDIR # change working directory to current folder

#module load openmpi/intel # load MPI (Intel implementation)
time mpirun ./triangleSimulation -m batch -r 1000 -s 1 -c 5 -b 3

To submit this PBS script via the command line one would type:

qsub PBSexample.sh

Other helpful PBS commands for UNIX can be found here. For more on PBS flags and options, see this detailed post from 2012 and for more example PBS submission scripts see Jon Herman’s Github repository here.

SLURM

A second common job scheduler is know as SLURM. SLURM stands for “Simple Linux Utility Resource Management” and is the scheduler used on many XSEDE resources such as Stampede2 and Comet.

An example SLURM submission script named “SLURMexample.sh” that runs “triangleSimulation.cpp” on 128 core can be found below:


#!/bin/bash
#SBATCH --nodes=8             # specify number of nodes
#SBATCH --ntasks-per-node=16  # specify number of core per node
#SBATCH --export=ALL
#SBATCH -t 5:00:00            # set max wallclock time
#SBATCH --job-name="triangle" # name your job #SBATCH --output="outputfolder/output.out"

#ibrun is the command for MPI
ibrun -v ./triangleSimulation -m batch -r 1000 -s 1 -c 5 -b 3 -p 2841

To submit this SLURM script from the command line one would type:

sbatch SLURM

The Cornell Center  for Advanced Computing has an excellent SLURM training module within the introduction to Stampede2 workshop that goes into detail on how to most effectively make use of SLURM. More examples of SLURM submission scripts can be found on Jon Herman’s Github. Billy also wrote a blog post last year about debugging with SLURM.

References

  1. https://en.wikipedia.org/wiki/Portable_Batch_System

Simple Bash shell scripts that have made my life easier

I’ve recently been using Bash shell scripts to improve the efficiency of my workflow when working on Linux systems and I thought I would share some of them here. I’m fairly new to Linux so this post is not meant to be a comprehensive guide on how to write shell scripts rather, I hope the scripts in this post can serve as examples for those who may also be learning Linux and unsure of where or how to start writing shell scripts. I didn’t write any of these from scratch, most of the scripts are based off files shared with me by group members Julie Quinn, Bernardo Trindade and Jazmin Zatarian Salazar. If you’re interested in learning more about any of the commands used in these scripts I’ve put some references I found useful at the end of this post. If you’re more experienced in writing shell scripts, please feel free to put tips or suggestions in the comments.

1. A simple script for making directories

For my research I’m processing results of a monte carlo simulation for several solutions found through multi-objective search and I needed to make folders in several locations to store the output from each solution. My first instinct was to make each directory separately using the mkdir command in the command line, but this quickly got tedious. Instead I used a bash script to loop through all the solution numbers and create a new directory for each. For more on using loops in Bash, check out this reference.

#!/bin/bash#!/bin/bash

# This script will create directories named "Solution_*.txt" for
# a set of numbered solutions 

# specify solution numbers
SOLUTIONS=('162' '1077' '1713' '1725' '1939' '2191' '2290' '2360')

# create a variable to store the string "Solution_"
DIRECTORY="Solution_" 

# loop over solution numbers
for i in ${SOLUTIONS[@]}
do
# create a separate directory for each solution
mkdir $DIRECTORY${i}
done

2. Calling a Java function and saving the output

The MOEA framework is a tool written in Java with all sorts of cool functions. I used it to generate 1024 latin hypercube samples across a given range for each of the 8 solutions mentioned above. Using a shell script allows for you to easily set up the arguments needed for the MOEA framework, call the Java function and save the output to your desired file format. The MOEA framework’s tool spits out a .txt file, but this script uses the “sed” command to save it as a .csv file. More on “sed” can be found in the reference at the end of this post.

#!/bin/bash#!/bin/bash
# this shell script will call the MOEA framework's Latin Hypercube
# Sampling tool to create 1024 samples from a set of
# prespecified ranges for each of 8 solutions

# create variables to store Java arguments
JAVA_ARGS="-Xmx1g -classpath MOEAFramework-1.16-Executable.jar"
NUM_SAMPLES=1024
METHOD=latin

# these are the solutions we will create samples from
SOLUTIONS=('162' '1077' '1713' '1725' '1939' '2191' '2290' '2360')

# loop through solutions
for i in ${SOLUTIONS[@]}
do
    # define names for input (ranges) and output file names
    RANGES_FILENAME=${i}ranges.txt
    OUTPUT_FILENAME=Solution${i}_Samples.txt
    CSV_FILENAME=Solution${i}_Samples.csv

    # Call MOEA framework from JAVA using specified arguments to
    # create LHS Samples, specify OUTPUT_FILENAME as output
    java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.SampleGenerator -m ${METHOD} -n ${NUM_SAMPLES} -p ${RANGES_FILENAME} -o ${OUTPUT_FILENAME}

    # Use the sed command tocreate new comma separated values file
    # from original output .txt file
    sed 's/ /,/g' ${OUTPUT_FILENAME} &amp;amp;amp;amp;gt; ${CSV_FILENAME} 

    # remove .txt files
    rm $OUTPUT_FILENAME
done

3. A piping example

Piping allows you to link together programs by making the output from one program or function the input to another. The script below was originally written by my friend Shrutarshi Basu for a class project we were working on together. This script is made to process the output from the Borg MOEA for 9 random seeds of the DTLZ2 benchmarking problem across several different algorithmic configurations, seen in the code as “masters” (for more on this see Jazmin’s post here). In addition to calling Java tools from the MOEAframework, Basu uses piping to link the Linux commands “tac”, “sed”, “grep” and “cut”.  For more on each of these commands, see the links at the bottom of this post.


# loop over each of 9 seeds
for i in {0..9}
do
obj=DTLZ2_S${i}.obj
output=dtlz2.volume

# loop over masters
for m in $(seq 0 $1)
do
runtime=DTLZ2_S${i}_M${m}.runtime
mobj=DTLZ2_S${i}_M${m}.obj

# extract objectives from output
echo "Extracting objectives"
tac ${runtime} | sed -n '1,/\/\// p' | grep -v "//" | cut -d' ' -f15-19 | tac > ${mobj};
done

# combine objectives into one file
echo "Combining objectives"
java -cp ../../moea.jar org.moeaframework.analysis.sensitivity.ResultFileSeedMerger \
-d 5 -e 0.01,0.01,0.01,0.01,0.01 \
-o ${obj} DTLZ2_S${i}_M*.obj

# calculate the hypervolume
echo "Finding final hypervolume"
hvol=$(java -cp ../../moea.jar HypervolumeEval ${obj})

printf "%s %s\n" "$i" "$hvol" >> ${output}
echo "Done with seed $i"
done

Additional References and Links

 

Enhance your (Windows) remote terminal experience with MobaXterm

Jazmin and Julie recently introduced me to a helpful program for Windows called “MobaXterm” that has significantly sped up my workflow when running remotely on the Cube (our cluster here at Cornell). MobaXterm bills itself as an “all in one” toolbox for remote computing. The program’s interface includes a terminal window as well as a graphical SFTP browser. You can link the terminal to the SFTP browser so that as you move through folders on the terminal the browser follows you. The SFTP browser allows you to view and edit files using your text editor of choice on your windows desktop, a feature that I find quite helpful for making quick edits to shell scripts or pieces of code as go.

mobaxtermsnip

A screenshot of the MobaXterm interface. The graphical SFTP browser is on the left, while the terminal is on the right (note the checked box in the center of the left panel that links the browser to the terminal window).

 

You can set up a remote Cube session using MobaXterm with the following steps:

  1. Download MobaXterm using this link
  2.  Follow the installation instructions
  3. Open MobaXterm and select the “Session” icon in the upper left corner.
  4. In the session popup window, select a new SSH session in the upper left, enter “thecube.cac@cornell.edu” as the name of the remote host and enter your username.
  5. When the session opens, check the box below the SFTP browser on the left to link the browser to your terminal
  6. Run your stuff!

Note that for a Linux system, you can simply link your file browser window to your terminal window and get the same functionality as MobaXterm. MobaXterm is not available for Mac, but Cyberduck and Filezilla are decent alternatives. An alternative graphical SFTP browser for Windows is WinSCP, though I prefer MobaXterm because of its linked terminal/SFTP interface.

For those new to remote computing, ssh or UNIX commands in general, I’d recommend checking out the following posts to get familiar with running on a remote cluster:

 

 

 

Porting GCAM onto a UNIX cluster

Hi All,

One of my recent research tasks has been to port GCAM (Global Change Assessment Model) on to the Cube, our groups HPC cluster, and some of the TACC resources.  This turned out to be more challenging than I had anticipated, so I wanted to share my experiences in the hopes that it will save you all some time in the future.  This post might be a bit pedestrian for some readers, but I hope it’s helpful to folks that are new to C++ and Linux.

Before getting started, some background on GCAM.  GCAM was developed by researches at the Joint Global Change Research Institute (JGCRI) at the Pacific Northwest National Lab (PNNL).  GCAM is an integrated assesment model (IAM), which means it pairs a climate model with a global economic model.  GCAM is unique from many IAMs in that it has a fairly sophisticated representation of various sectors of the global economy modeled on a regional level…meaning the model is a beast compared to other IAMs (like DICE). I’ll be writing a later post on IAMs more generally.  GCAM is coded in C++ and makes extensive use of xml databases for both input and output.  The code is open source (available here), has a Wiki (here), and a community listserve where researches can pose questions to their peers.

There are three flavors of the model available: one for Windows users, one for Mac users, and one for Unix users.  The Windows version comes compiled, with a nice user-interface for post-processing the results.  The Unix version comes as uncompiled code, and requires the installation of some third party C++ libraries.  For those who’d like to sniff around GCAM the Windows or Mac versions are a good starting point, but if you’re going to be doing HPC heavy lifting, you’ll need to work with the Unix code.

The GCAM Wiki and the detailed user manual provide excellent documentation about running GCAM the Model Interface tools, but are a bit limited when describing how to compile the Unix version of the code.  One helpful page on the Wiki can be found here.  GCAM comes with a version of the Boost C++ library in the standard Unix download (located in Main_User_Workspace/libs).  GCAM also uses the Berkeley DBXML library, which can be downloaded here.  You’ll want to download version 2.5.16 to be sure it runs well with GCAM.

Once you’ve downloaded DBXML, you’ll need to build the library.  As it turns out, this is fairly easy.  Once you’ve ported the DBXML library onto your Unix cluster, simply navigate into the main directory and run the script buildall.sh.  The buildall.sh script accepts flags that allow you to customize your build.  We’re going to use the -b flag to build the 64-bit version of DBXML:

sh buildall.sh -b 64

once you’ve build the DBXML library, you are nearly ready to build GCAM, but must first export the paths to the Boost library and to the DBXML include and lib directories.  It seems that the full paths must be declared.  In other words, they should read something like:

export BOOST_INCLUDE=/home/fs02/pmr82_0001/jrl276/GCAM_Haewon_trans/libs/boost_1_43_0/
export DBXML_INCLUDE=/home/fs02/pmr82_0001/jrl276/dbxml-2.5.16/install/include/
export DBXML_LIB=/home/fs02/pmr82_0001/jrl276/dbxml-2.5.16/install/lib/

One tricky point here is that the full path must be typed out instead of using ‘~’ as a short-cut (I’m not sure why). Once this is done, you are ready to compile gcam.  Navigate into the ‘Main_User_Workspace’ directory and type the command:


make gcam

Compiling GCAM takes several minutes, but at the end there should be a gcam.exe executable file located in the ‘/exe/’  folder of the ‘Main_User_Workspace’.  It should be noted that there is a multi-thread version of gcam (more info here), which I’ll be compiling in the next few days.  If there are any tricks to that process, I’ll post again with some tips.

That’s it for now.

Useful Linux commands to handle text files and speed up work

Most of us, nice and normal human beings, tend to prefer programs with GUIs over typing commands on a command prompt because the former looks more “real” and is more visual than the latter. However, one thing we don’t know (or, sometimes, don’t want to know) is that learning a few terminal commands can dramatically increase productivity. These commands can save us a lot of time by sparing us from opening and closing programs, navigating through menus and windows, moving the mouse around, as well as moving the hand back and forth from the mouse to the keyboard.

This post will mention and briefly describe some useful “intermediate level” Linux commands (some basic commands are presented in this post by Jon Herman), which can be called from a Linux OS, Cygwin (mostly), or Mac. Among the numerous tedious tasks these commands can greatly simplify is the particularly interesting chore of handling text files, be they scripts or data files. Commands for other tasks are covered as well. Keep in mind that the symbol * is a wild card (character that can mean any string of characters when searching for files), which is really useful when the goal is to repeatedly apply one command to multiple files. For all commands listed here skip the “$” character.

DELIMITER SEPARATED FILES HANDLING

  • Remove columns 30 to 36 (starting from 0) from a comma separated file and export the output to another file.
    $ cut -d',' -f1-30,36 input.file >> output.file

    (More information on the post by Joe Kasprzyk)

  • Print only columns 2 and 4 (starting from 1) of a comma separated file.
    $ awk -F "," '{print $2,$4}' input.file >> output.file
  • Count number of columns in a file separated either by spaces or commas:
    $ head -1 input.file | sed 's/[^, ]//g' | wc -c
    or:
    $ awk -F "[, ]" 'END{print NF}' input.file
  • Print lines of a comma separated file in which the value in the 2nd column is lower than 100 and the value in the 5th column is higher than 0.3:
    $ awk -F "," '$2<100 && $5>0.3' input.file >> output.file
  • Print lines between 10th and 20th lines (not inclusive) of a file:
    $ awk 'NR>10 && NR<20' input.file >> output.file
  • Add a string to the end of multiple files:
    $ echo "your string" | tee -a *.set
  • Add a string to the end of one file:
    $ echo "your string" >> file

FILE SEARCHING

  • Find all text files in a folder that contain a certain string:
    $ grep -rn './folder' -e your_string
  • Find files recursively (* is a wildcard):
    $ find -type f -name name_of*.file

FILES INFO

  • See the contents of a zip/tar file without extracting it. Press q to quit.
    $ less file.tar
  • Count number of lines in a file:
    $ wc -l your.file
  • List all files with a certain extension in a directory:
    $ ls *.ext
  • Print files and folders in tree fashion:
    $ tree
  • Print the size of all subfolders and files in (sub)folders to a certain max depth in the folder hierarchy:
    $ du -h -a --max-depth=2

IMAGE HANDLING

  • Convert svg files to png (you need to have Inkscape installed):
    $ inkscape input.svg -d 300 -e output.png
  • Convert svg files to pdf-latex (you need to have Inkscape installed):
    $ inkscape input.svg --export-pdf output.pdf --export-latex
  • Rotate a picture:
    $ convert Fig6_prim.png -rotate 90 Fig6_prim_rotated.png

MISCELLANEOUS

  • See the history of commands you have typed:
    $ history
  • See a calendar (month and year optional):
    $ cal [month] [year]
  • Combine pdf files into one (you need to have pdftk installed):
    $ pdftk file1.pdf file2.pdf file3.pdf cat output newfile.pdf
    or, to merge all pdf files in a directory:
    $ pdftk *.pdf cat output newfile.pdf

    In order to see how to combine only certain pagers of pdf files, as well as how to splits all pages into separate pdf files, see this page.

  • See the manual of a command:
    $ man command

Another useful idea is that of piping outputs of a command to another command. For example, if you want print the number of files in a directory, you can pipe the output of the ls command (list all files in a directory) to the wc -l command (count the number of lines in a string). For this, use the “|” character:

$ ls | wc -l

However, you may want instead to check the number of lines in all the 20 files in a directory at once, which can also be achieved by combining the ls and wc commands with the command xargs. The command then would look like:

$ ls | xargs wc -l

The command xargs breaks down the output from ls into one string per line and then calls wc -l for each line of the output of ls.

Hope all this saves you some time!

Running jobs on the supercomputer: JANUS

The power of supercomputing is undeniable. However, there is often a hurdle in syntax to get jobs to run on them. What I’m including below are ways to submit jobs to run on the CU-Boulder supercomputer, JANUS, which I hope will be helpful.

To log on, open up a terminal window (e.g. Terminal on a Mac or CygWin on a PC): ssh <username>@login.rc.colorado.edu

To copy items to JANUS from a shell, simply use the following:

scp <path and filename on local machine>   <username>@login.rc.colorado.edu:<destination path on JANUS>/

The purpose of the job script is to tell JANUS where to run the job. I will cover two types of job scripts, (1) to submit a job to an entire node, and (2) to submit to a single processor. Note, nodes on JANUS contain multiple processors, usually more than 12, so that if you have a memory intensive job you may wish to submit the former. Also, the jobs that occupy entire nodes offer the user a larger number of total processors to work with (several thousand cores versus several hundred). Nevertheless, here are the examples:

1. Example script to submit to a node is below: The body of text should be saved to a text file with a “.sh” suffix (i.e. shell script). Also notice that lines that begin with “#” are not read by the program, but rather are for comments/documentation. To submit the script, first be sure you’ve loaded the slurm module:

module load slurm

sbatch <path and filename of script>

#!/bin/bash
# Lines starting with #SBATCH are interpreted by slurm as arguments.
#

# Set the name of the job, e.g. MyJob
#SBATCH -J MyJob

#
# Set a walltime for the job. The time format is HH:MM:SS - In this case we run for 12 hours. **Important, this length should be commensurate with the type of node
# you're submitting to, debug is less than 1 hour, but others can be much longer, check the online documentation for assistance

#SBATCH --time=12:00:00
#
# Select one node
#
#SBATCH -N 1

# Select one task per node (similar to one processor per node)
#SBATCH --ntasks-per-node 12
# Set output file name with job number

#SBATCH -o MyJob-%j.out

# Use the standard 'janus' queue. This is confusing as the online documentation is incorrect, use the below to get a simple 12 core node

#SBATCH --qos janus

# The following commands will be executed when this script is run.

# **Important, in order to get 12 commands to run at the same time on your node, enclose them in parentheses "()" and follow them with an ampersand "&"

# to get all jobs to run in the background. The last thing is be sure to include a "wait" command at the end, so that the job script waits to terminate until these

# jobs complete. Theoretically you could have more than 12 command below.

# ** Note replace the XCMDX commands below with the full path to your executable as well as any command line options exactly how you'd run them from the

# command line.

echo The job has begun

(XCMD1X) &

(XCMD2X) &

(XCMD3X) &

(XCMD4X) &

(XCMD5X) &

(XCMD6X) &

(XCMD7X) &

(XCMD8X) &

(XCMD9X) &

(XCMD10X) &

(XCMD11X) &

(XCMD12X) &

# wait ensures that job doesn't exit until all background jobs have completed

wait

EOF

2. Example script to submit to a single processor is below. The process is almost identical to above, except for 4 things: (i) the queue that we’ll submit to is called ‘serial’, (ii) number of tasks per node is 1, (iii) the number of executable lines is 1, and (iv) we do not need the “wait” command.

#!/bin/bash

# Lines starting with #SBATCH are interpreted by slurm as arguments.

#

# Set the name of the job, e.g. MyJob

#SBATCH -J MyJob

#

# Set a walltime for the job. The time format is HH:MM:SS - In this case we run for 6 hours. **Important, this length should be commensurate with the type of node

# you're submitting to, debug is less than 1 hour, but others can be much longer, check the online documentation for assistance

#SBATCH --time=6:00:00

#

# Select one node

#

#SBATCH -N 1

# Select one task per node (similar to one processor per node)

#SBATCH --ntasks-per-node 1

# Set output file name with job number

#SBATCH -o MyJob-%j.out

# Use the standard 'serial' queue. This is confusing as the online documentation is incorrect, use the below to get a single processor

#SBATCH --qos serial

# The following commands will be executed when this script is run.

# ** Note replace the XCMDX commands below with the full path to your executable as well as any command line options exactly how you'd run them from the

# command line.

echo The job has begun

XCMDX

EOF