Debug in Real-time on SLURM

Debugging code by submitting jobs to a supercomputer is an inefficient process. It goes something like this:

  1. Submit job and wait in queue
  2. Check for errors/change code
  3. (repeat endlessly until your code works)

Debugging in Real-Time:

There's a better way to debug that doesn't require waiting in the queue every time you want to check your code. On SLURM, you can debug in real-time like so:
  1. Request a debugging or interactive node and wait in queue
  2. Check for errors/change code continuously until code is fixed or node has timed out

Example (using Summit supercomputer at University of Colorado Boulder):

  1. Log into terminal (PuTTY, Cygwin, etc.)
  2. Navigate to directory where the file to be debugged is located using ‘cd’ command
  3. Load SLURM
    • $ module load slurm
  4. Enter the ‘sinteractive’ command
    • $ sinteractive
  5. Wait in line for permission to use the node (you will have high priority with a debugging QOS, so it shouldn't take long)
  6. Once you are granted permission, the node is yours! Now you can debug to your heart's content (or until you run out of time).
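
Putting steps 2 through 4 together, a typical session looks something like this (the project path here is hypothetical, and the options sinteractive accepts vary by site):

 $ cd /projects/myuser/mycode    # hypothetical path to the code being debugged
 $ module load slurm
 $ sinteractive                  # wait here until the scheduler grants you a node
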
I'm usually debugging shell scripts on Unix. If you want advice on that topic, check out this link. I prefer the '-x' option (shown below), but there are many others available.
Debugging a shell script in Unix using the '-x' option:
 $ bash -x mybashscript.bash
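
As a minimal sketch of what '-x' gives you, consider a toy script (the file name is made up):

 #!/bin/bash
 # trace_me.bash -- toy script to demonstrate -x tracing
 name="world"
 echo "hello $name"

Running "bash -x trace_me.bash" prints each command, with variables expanded, prefixed by '+' before executing it:

 + name=world
 + echo 'hello world'
 hello world
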
Hopefully this was helpful! Please feel free to edit/comment/improve as you see fit.

Common PBS Batch Options

PBS is a handy cluster scheduling software, usually wrapped around a grid manager like MOAB.  It's useful in that you can supply options on the command line or in a batch script.  Arguments placed on the command line when calling the qsub command take precedence over those in the script, so you can build a general script and then test or vary it by changing the options on the command line.  PSU has a pretty good guide to get you started using the PBS system, which can be read here.  However, there are some other options which are exceptionally useful for the moderate user.  In particular, the ability to pass the current environment, set up email notification, and redirect output are handy things to be able to use and modify.  An example script and header are presented below:

——————————————————————————————————————————–
#!/bin/csh
#PBS -N SimpleScript                     #Give the job this name.
#PBS -M youremailhere@psu.edu            #A single user; send notification emails there.
#PBS -m a                                #Send notification on aborts (a).
#PBS -V                                  #Pass the current environment variables to the job.
#PBS -l nodes=1:ppn=1                    #Request a single node, single core for this job.
#PBS -l walltime=96:00:00                #Request a maximum wall time of 96 hours [HH:MM:SS format].
#PBS -o output/$PBS_JOBNAME.out          #Redirect STDOUT to ./output/$PBS_JOBNAME.out
#PBS -e error/$PBS_JOBNAME.err           #Redirect STDERR to ./error/$PBS_JOBNAME.err

env                                      #Echo the environment variables.
cd $PBS_O_WORKDIR                        #PBS starts your job in your home directory; cd to the submit/work directory.
echo -n CWD:; /bin/pwd                   #Echo the current working directory path.
echo $PBS_JOBNAME is live...             #Print to STDOUT (really, the file declared above) that the job is live.
sleep 30                                 #Sleep for 30 seconds, then exit.
——————————————————————————————————————————–

In this case, I've configured the job to be named "SimpleScript", to email the user "youremailhere@psu.edu" if the job aborts, to use the same environment as the one the qsub command was issued from, to request 1 node and 1 processor on that node with a maximum run time of 96 hours, and to redirect the error/output messages to separate directories under my working directory.  Clearly this is a very simple example, given that it prints some basic info, pauses, and exits.  If you were going to run a process or other program, you'd put your commands in place of the sleep command.  Still, it provides a copy-and-paste example of commonly used options that you can include in your own batch scripts.  In case you want to modify those options, there's a brief review of the most commonly changed ones below.  For a more complete list, head to the NCCS listing of common PBS options:

http://www.nccs.gov/computing-resources/phoenix/running-jobs/common-pbs-options/

Commonly Used Options:

These options can either be given on the command line, as in:
qsub -N SimpleScript -j oe <batchScriptFile>

Or included in the batch script file using the #PBS directive prefix, as in:
#PBS -N SimpleScript

Recall that you can mix and match options on the command line and in the batch script, but be aware that the command line options override those in the batch file.
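
For example, you could give the script above a quick ten-minute test run under a different name without editing the file (the script file name here is hypothetical):

 $ qsub -N QuickTest -l walltime=00:10:00 simpleScript.csh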

[-N] Name: Declares the name of the job.  It may be up to 15 characters in length and must consist of printable, non-white-space characters, with the first character alphabetic.

[-o] Output File Path: Defines the path to, and name of, the file to which STDOUT will be redirected.

[-e] Error File Path: Defines the path to, and name of, the file to which STDERR will be redirected.

[-j] Join STD* streams: Declares if the standard error stream of the job will be merged with the standard output stream of the job.

An option argument value of oe directs that the two streams be merged, intermixed, as standard output.  The path and name of the file can then be specified with the -o option.

An option argument value of eo directs that the two streams be merged, intermixed, as standard error.  The path and name of the file can then be specified with the -e option.

If the join argument is n or the option is not specified, the two streams will be kept as two separate files.

[-V] Pass Environment: This option declares that all environment variables in the qsub command’s environment are to be exported to the batch job.

[-m] Mail Options: Defines the set of conditions under which the execution server will send a mail message about the job.  The mail_options argument is a string consisting of either the single character "n", or one or more of the characters "a", "b", and "e".

If the character “n” is specified, no mail will be sent.

For the letters “a”, “b”, and “e”:

  • a  mail is sent when the job is aborted by the batch system.
  • b  mail is sent when the job begins execution.
  • e  mail is sent when the job terminates.

If the -m option is not specified, mail will be sent if the job is aborted.

[-M] User List: A list of users to send email notifications to.  The user_list argument is of the form:

user[@host][,user[@host],…]
If unset, the list defaults to the submitting user at the qsub host, i.e., the job owner.
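
As a sketch, the following directives together would email two users when the job begins, aborts, or ends (the addresses are placeholders):

 #PBS -m abe
 #PBS -M user1@psu.edu,user2@psu.edu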

[-l, 'ell'] Resource List: Defines the resources required by the job and establishes a limit on the amount of each resource that can be consumed.  The list is of the form:
resource_name[=[value]][,resource_name[=[value]],…]
Common arguments for this flag are "walltime" and "nodes".  The walltime argument sets the wall clock limit for the job in HH:MM:SS format; check with your sysadmin to see if there's a maximum limit on this time.  The nodes argument defines how many nodes the job grabs, and its ppn property sets how many cores per node (as in the nodes=1:ppn=1 line above).
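
For instance, a single -l directive can request two 8-core nodes for twelve hours:

 #PBS -l nodes=2:ppn=8,walltime=12:00:00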

[-a] Date/Time: Declares the time after which the job is eligible for execution.  The date_time argument is in the form:
[[[[CC]YY]MM]DD]hhmm[.SS]

Where  CC is the first two digits of the year (the century), YY is the second two digits of the year, MM is the two digits for the month, DD is the day of the month, hh is the hour, mm is the minute, and the optional SS is the seconds.
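
So, for example, to hold a job until 11:30 PM on December 31, 2025, you would submit it as (the script name is hypothetical):

 $ qsub -a 202512312330 myScript.csh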

Environment Variables Available to Job:

You can use these variables in your scripts as though they already exist in your environment; PBS sets them up as soon as your job starts running.

PBS_O_WORKDIR – the absolute path of the current working directory of the qsub command. You must ‘cd’ to this directory if you want to work in the folder you submitted the job from.
PBS_JOBNAME – the job name supplied by the user.
PBS_O_HOST – the name of the host upon which the qsub command is running.
PBS_SERVER – the hostname of the pbs_server which qsub submits the job to.
PBS_O_QUEUE – the name of the original queue to which the job was submitted.
PBS_ARRAYID – each member of a job array is assigned a unique identifier (see -t)
PBS_ENVIRONMENT – set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job, see -I option.
PBS_JOBID – the job identifier assigned to the job by the batch system.
PBS_NODEFILE – the name of the file containing the list of nodes assigned to the job (for parallel and cluster systems). This file is particularly useful if you want to log in to remote machines for parallel debugging.
PBS_QUEUE – the name of the queue from which the job is executed.
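
As a quick sketch of how these variables get used in practice, here is a csh fragment that moves to the submit directory and launches a hypothetical MPI program (./mysim) on every core the scheduler assigned; note that mpirun's flags can vary between MPI installations:

 cd $PBS_O_WORKDIR                                   #Work where the job was submitted from
 set NP = `wc -l < $PBS_NODEFILE`                    #The node file holds one line per assigned core
 mpirun -np $NP -machinefile $PBS_NODEFILE ./mysim   #Launch the (placeholder) program on all of them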


Using YAML in C++

YAML stands for "YAML Ain't Markup Language".  It is a "human friendly data serialization standard for all programming languages".  What this means is that a human can read the files you write in YAML, and there are libraries and packages in almost every language that can also parse these files.  It's a slightly more formal way to do parameter files and input files for models, since all the reading and error catching is provided by a library instead of by lines and lines of tedious code that you write.

I’m just playing around with this right now so I’ll share my notes here as I get it working.

The C++ libraries are available here.

  1. Follow the instructions on the website to download the zip file.
  2. The next instructions will either work on your Linux desktop or on the cluster.  They will probably work in Windows too, but I haven’t tried that yet.  I successfully ran the trial on my home computer running Ubuntu 11.10, but now I will replicate the process on the cluster.  Unzip the contents of the file on your computer of choice.
  3. Follow the website instructions to create the build directory in your folder and navigate to it.
  4. On the cluster, make sure to enable the program cmake by typing “module load cmake”.  Then, once you are in the build directory, you want to run cmake on the files in the outer directory, so type “cmake ..”
  5. When cmake runs successfully, it generates a custom Makefile just for you and your system.  To run the makefile, simply type "make".  You should see some colorful output showing that the library has compiled successfully.
  6. At the end, you’ll have a library called libyaml-cpp.a in your build directory.  Success!
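
Condensed into commands, steps 2 through 6 look roughly like this (the zip file name will vary with the release you downloaded):

 $ unzip yaml-cpp-0.3.0.zip    # file name varies by release
 $ cd yaml-cpp
 $ mkdir build
 $ cd build
 $ module load cmake           # on the cluster only
 $ cmake ..
 $ make                        # produces libyaml-cpp.a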

Now we have a brand-new yaml-cpp library that contains all the functions you’ll need to parse yaml in your own program.  How do we test it out?  I’m glad you asked.

  1. Create a new folder that's outside of the yaml-cpp folder.  You can call it "program1" or some other name.  Into that folder, copy libyaml-cpp.a from your yaml-cpp/build/ folder.  Also navigate into the /include/ folder inside yaml-cpp, and you'll find another folder called yaml-cpp.  This folder contains the headers for all the functions inside the library.  In your project folder, you can either copy it over as /include/yaml-cpp, or just as /yaml-cpp.  In my project, I just copied it as yaml-cpp, in order to not have too many folders lying around.
  2. On the yaml-cpp site, try the monsters example at this page.  You’ll need a file called monsters.yaml, and the main cpp file, which I called test.cpp.  Here’s an important tip that it took me about a day (and help from the internet) to figure out: Only use spaces when indenting your blocks in the yaml file, not tabs!
  3. Now compile your program.  You can use a command like this: “g++ -Wall -I. -g test.cpp -lyaml-cpp -L. -o monsterstest” which tells the compiler to find your include paths in the working folder (referred to with a dot), and to name the executable “monsterstest”.
  4. Run the program using "./monsterstest".  Did it work?  If so, great!
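
If you'd like something smaller than the monsters example to start from, here is a minimal sketch of the same pattern.  The file name, YAML contents, and field names are stand-ins of my own, and the Parser/GetNextDocument calls follow the old-style yaml-cpp API of this era; newer releases replace them with YAML::LoadFile:

// test2.cpp -- minimal sketch of reading a YAML sequence of maps
#include <fstream>
#include <iostream>
#include <string>
#include "yaml-cpp/yaml.h"

int main()
{
    // parameters.yaml might contain (indented with spaces, never tabs!):
    // - name: ogre
    //   power: 9
    // - name: dragon
    //   power: 12
    std::ifstream fin("parameters.yaml");
    YAML::Parser parser(fin);
    YAML::Node doc;
    parser.GetNextDocument(doc);        // read the first YAML document

    for (unsigned i = 0; i < doc.size(); i++) {
        std::string name;
        int power;
        doc[i]["name"] >> name;         // extract fields by key
        doc[i]["power"] >> power;
        std::cout << name << " has power " << power << std::endl;
    }
    return 0;
}

Compile it the same way as in step 3, swapping in test2.cpp and a new executable name.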

In a later post, I'll give some example code that could be used to read objective titles, epsilons, constraints, model parameters, and so forth from a yaml file.  My idea is to have a master yaml file that contains all the parameters for a run.  The yaml can then be read by script programs that write input text files, Java classes, or anything else you'd like.  The yaml will also be accessible to the C++ wrapper that interfaces with MOEAframework, and it can even be used directly by your simulation model.  This will give the user a lot of control, in a format that is flexible and fairly easy to use.  But more on that later!

The Cluster and Basic UNIX Commands

In this tutorial, you will log onto a computing cluster and get comfortable with some basic UNIX commands.  This post is about 2 years old at this point! It was originally written by Jon Herman and edited by Joe Kasprzyk, most recently on 9/27/2013.

One comment before we get started. At first, this post was written to help get started on the Penn State cluster. Now, folks in our research groups may be using computers at Cornell, the University of Colorado, through the NSF XSEDE system, or elsewhere! But generally the steps are about the same.

What is a cluster anyway?

When you use Excel or Matlab on your own laptop, all the calculations are done right on your computer's processor. On the internet, though, we're used to having calculations done remotely, "in the cloud", on a server somewhere. For example, when you upload a video to YouTube, the conversion from your video format to Flash isn't done on your laptop; it's done somewhere in Iowa.

Using a computing cluster is the same idea. It may be fine to run a single MOEA run on your own laptop, but what happens when you want to run 50 random trials? Or the function evaluation time is really long? Plus, your laptop may not be that powerful and you may want to turn it off and go home, or someone might spill something on it, etc.

So using a computing cluster takes all the calculations and performs them somewhere else: on the cluster! The idea is that you upload your files to a server, where you can interact with the computer remotely, submit the computing jobs, and then download the results. For example, you can compile your code on the cluster (on the initial computer that you connect to, called the login node) and then submit a remote job that gets performed on the compute nodes.

You’ll need to interact with the cluster in two ways.

  1. Enter commands on the command line.  Use this to submit jobs, run programs, process files, etc. There are several software packages available to do this.  If you're on a Mac or Linux machine, you should just be able to use the terminal.  On Windows there are several options.  The first is SSH Secure Shell, which can be downloaded from the Penn State ITS center (if you're at Penn State).  The second, which a lot of members in the group use, is Cygwin, which installs many Unix-like programs integrated within the Windows environment.  Third is a selection of terminal programs such as PuTTY. On a Mac, I've seen people use a program called Fugu. But the workflow is similar across most of these programs:

Each of these options uses SSH to connect to the cluster. SSH stands for “secure shell” and provides remote access to the command line interface on the clusters. You will first need to define a connection—if you use the “Quick Connect” option, you will need to re-enter the connection information every time. To simplify future access, use the Profiles -> Add Profile option, then use Profiles -> Edit Profile to define the profile you just created.

A remote connection requires a host name, a user name, and a password. The host name will depend on which cluster you want to access.

Penn State: Right now, the largest and most powerful cluster at Penn State is Cyberstar (hostname: cyberstar.psu.edu). You can also access smaller clusters which may be less crowded, such as Lion-XO (lionxo.aset.psu.edu). A detailed list of available systems and their specifications is available here.

University of Colorado: We have access to a computer called Janus. For more information, click here. Researchers in Joe Kasprzyk's group have access to several computing allocations; for more information, email joseph.kasprzyk "at" colorado.edu.

Cornell: The Reed group cluster, "TheCube", is currently coming online. There may be an additional post about this once it's operational. In the meantime, contact Jon Herman at jdh366 "at" cornell.edu for more information.

When you connect, you will be prompted for your password (this is the same as your university logon). If you have already been approved for access to your chosen system, your login should be almost immediate.

Congratulations, you are now on the cluster! You should see a prompt like [username@hostname ~]$ with a blinking cursor to the right. The ~ symbol means that you are currently in your home directory. This is the UNIX command line interface. Windows also has a command line too, but we rarely use it because the Windows graphical interface is so convenient.

Let’s try out some basic commands. Commands allow you to move around your file system and move, copy, edit, and run your files. Some good ones to know starting out:

  • ls (List contents of current directory)
  • pwd (Print working directory)
  • cd newFolder (Change directory to newFolder)
  • cp filename newLocation (Copy a file to a new location)
  • mv filename newLocation (Move a file to a new location)
  • rm filename (Delete a file. This is permanent, use with care.)
  • tar -cvf zippedFolder.tar oldFolder (Bundle a directory into a single .tar archive. Tar is the UNIX version of zip)
  • tar -xvf zippedFolder.tar (Extract a tar archive, recreating the directory that was originally archived)

When moving around, remember that UNIX uses the forward slash (‘/’) to denote directories rather than the Windows backslash (‘\’). The current directory is denoted as a dot (‘.’) and the parent directory is denoted by two dots (‘..’).
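
Here is a short hypothetical session tying these commands together (the folder and file names are made up):

 [username@hostname ~]$ pwd
 /home/username
 [username@hostname ~]$ ls
 scratch  work
 [username@hostname ~]$ cd work
 [username@hostname work]$ cp results.txt ../scratch/
 [username@hostname work]$ cd ..
 [username@hostname ~]$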

From your home directory (~), the two main directories are your work and scratch folders. These are both associated with your username. Your work folder is where you will store and run most of your programs. The scratch folder offers unlimited file storage and is sometimes useful for holding large result files.

  2. Transferring files between your computer and the cluster.  The first choice is the sftp command, covered in our post about using the cluster on a Mac and sketched briefly after this list.  The second choice is a program called WinSCP, which provides a graphical drag-and-drop interface for transferring files.  The instructions to do so are below.
  • Open WinSCP and connect to any of the clusters, similar to how you did with the SSH client. Note that your home directory is accessible from any of the clusters, so it doesn’t really matter which one you use with WinSCP.
  • The transfer protocol on the first screen can remain at the default, "SFTP", with the default port being 22.  Then simply type your user ID and the remote host like you did with the SSH client.
  • If you are prompted to choose an interface style, use the “Commander” interface. This shows you both the local and remote directories at the same time.
  • The right-hand window will show your file system on the cluster. Use WinSCP to drag and drop files between local folders and your cluster folders. You can also drag and drop to/from your regular Windows folders.
  • WinSCP also has a simple (but useful) text editor. If you have a text file on the cluster, right-click it in WinSCP and select “Edit”. A window will open that allows you to edit the file directly on the remote machine.
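
If you prefer the command line instead, the sftp route mentioned above looks something like this (the host is one of the Penn State examples; the file names are made up):

 $ sftp username@cyberstar.psu.edu
 sftp> put myprogram.tar         # upload a local file to the cluster
 sftp> get output/results.out    # download a file from the cluster
 sftp> exit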

This should get you started with using the cluster. You will use the SSH client for compiling and running programs, and WinSCP to transfer files with your local machine.