Automate remote tasks with Paramiko

This is a short blog post to demonstrate the Paramiko Python package. Paramiko allows you to establish SSH, SCP, or SFTP connections within Python scripts, which is handy when you'd like to automate repetitive tasks on a remote server or cluster from your local machine or from another cluster you're running on.

It is often used for server management tasks, but for research applications consider situations where you have a large dataset stored at a remote location and are executing a script that needs to transfer some of that data depending on results or new information. Instead of manually establishing SSH or SFTP connections, those processes can be wrapped and automated within your existing Python script.

To begin a connection, all you need is a couple of lines:

import paramiko

ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect(hostname='remotehost', username='yourusername', password='yourpassword')

The first line creates a paramiko SSH client object. The second line tells paramiko what to do if the host is not a known host (i.e., whether this host should be trusted or not)—think of when you’re setting up an SSH connection for the first time and get the message:

The authenticity of host ‘name’ can’t be established. RSA key fingerprint is ‘gibberish’. Are you sure you want to continue connecting (yes/no)?

The third line is what makes the connection; the hostname, username, and password are usually the only things you need to define.

Once a connection is established, commands can be executed with exec_command(), which returns three objects:

stdin, stdout, stderr = ssh_client.exec_command("ls")

stdin is a write-only file which can be used for commands requiring input, stdout contains the output of the command, and stderr contains any errors produced by the command—if there are no errors it will be empty.

To print out what's returned by the command, you can use stdout.readlines(). To pass input to stdin, use the write() function:

stdin, stdout, stderr = ssh_client.exec_command("sudo -S ls")
stdin.write('password\n')

(The -S flag makes sudo read the password from standard input rather than from a terminal.)

Importantly: don’t forget to close your connection, especially if this is an automated script that opens many of them: ssh_client.close().

To transfer files, you need to establish an SFTP or an SCP connection, in much the same way:

ftp_client=ssh_client.open_sftp()
ftp_client.get('/remote/path/to/file/filename','/local/path/to/file/filename')
ftp_client.close()

get() will transfer a remote file to a local path; put(), used in the same way but with the local path first and the remote path second, will transfer a local file to a remote directory.

More Terminal Schooling

You are probably asking yourself "and why do I need more terminal schooling?". The short answer is: so you don't have to spend as much time as you do on the terminal, most of which is spent (1) pushing arrow keys thousands of times per afternoon to move through a command or history of commands, (2) waiting for a command that takes forever to finish running before you can run anything else, (3) clicking all over the place on MobaXterm and still feeling lost, (4) manually running the same command multiple times with different inputs, (5) typing the two-step verification token every time you want to change a "+" to a "-" in a file on a supercomputer, (6) waiting forever for a time-consuming run done in serial on a single core, and (7, 8, …) other useless and horribly frustrating chores. Below are some tricks to make your Linux work more efficient and reduce the time you spend on the terminal. From now on, I will use a "$" sign to indicate that what follows is a command typed in the terminal.

The tab autocomplete is your best friend

When trying to do something with that file whose name is 5480458 characters long, be smart and don't type the whole thing. Just type the first few letters and hit tab. If it doesn't complete all the way, it's because there are multiple files whose names begin with the same sequence of characters. In this case, hitting tab twice will return the names of all such files. The tab autocomplete works for commands as well.

Ctrl+r to search through previous commands

When on the terminal, hit ctrl+r to switch to reverse search mode. This works like a simple search function in a text document, except it looks in your bash history file for commands you used over the last weeks or months. For example, if you hit ctrl+r and type sbatch, it will fill the line with the last command you ran that contained the word sbatch. If you hit ctrl+r again, it will find the second-to-last such command, and so on.

Vim basics to edit files on a system that requires two-step authentication

Vim is one of the most useful things I have come across when it comes to working on supercomputers with two-step identity verification, in which case using MobaXterm or VS Code requires typing a different security code all the time. Instead of uploading a new version of a code file every time you want to make a simple change, just edit the file on the remote computer itself using Vim. To make simple edits on your files, there are very few commands you need to know.

To open a file with Vim from the terminal: $ vim <file name> or $ vim +10 <file name>, if you want to open the file and go straight to line 10.

Vim has two modes of operation: text-edit (for you to type whatever you want in the file) and command (a replacement for clicking on File, Edit, View, etc. on the top bar of Notepad). When you open Vim, it will be in command mode.

To switch to text-edit mode, just hit either "a" or "i" (you should then see "– INSERT –" at the bottom of the screen). To return to command mode, hit escape (Esc). When in text-edit mode, the keys "Home," "End," "Pg Up," "Pg Dn," "Backspace," and "Delete" work just like in Notepad and MS Word.

When in command mode, save your file by typing :w + Enter, save and quit with :wq, and quit without saving with :q!. Commands for selecting, copying and pasting, finding and replacing, replacing just one character, deleting a line, and other more advanced tasks can be found here. There's also a great cheatsheet for Vim here. Hint: once you learn five to ten more commands, making complex edits to your file with Vim becomes blazingly fast.

Perform repetitive tasks on the terminal using one-line Bash for-loops.

Instead of manually typing a command for each operation you want to perform on a subset of files in a directory (e.g., cp file<i>.csv directory300-400 for i from 300 to 399, tar -xzvf myfile<i>.tar.gz, etc.), you can use a Bash for-loop when using the wildcard "*" is not possible.

Consider a situation in which you have 10,000 files and want to move files number 200 to 299 to a certain directory. Using the wildcard "*" in this case wouldn't be possible, as result_2*.csv would also return result_2.csv, result_20.csv to result_29.csv, and result_2000.csv to result_2999.csv. Sometimes you may be able to use Regex, but that's another story. To move a subset of result files to a directory using a Bash for-loop, you can use the following syntax:

$ for i in {200..299}; do cp result_$i.csv results_200s/; done

Keep in mind that you can have multiple commands inside a for-loop by separating them with “;” and also nest for-loops.
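
For example, here is a sketch of a nested loop (directory and file names made up for illustration) that creates a backup directory for each run, copies two files into it, and prints a message when each run is done:

$ for d in run1 run2 run3; do mkdir -p backup_$d; for f in config.txt input.csv; do cp $d/$f backup_$d/; done; echo "backed up $d"; done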

Run a time-intensive command in the background with an "&" and keep doing your terminal work

Some commands may take a long time to run and render the terminal unusable until they are complete. Instead of opening another instance of the terminal and logging in again, you can send a command to the background by adding "&" at the end of it. For example, if you want to extract a tar file with tens of thousands of files in it and keep doing your work as the files are extracted, just run:

$ tar -xzf my_large_file.tar.gz &

If you have a directory with several tar files and want to extract a few of them in parallel while doing your work, you can use the for-loop described above and add "&" to the end of the tar command inside the loop, as sketched below. BE CAREFUL: if your for-loop iterates over dozens or more files, you may end up with your terminal trying to run dozens or more tasks at once. I accidentally crashed the Cube once doing this.
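
For example, the sketch below (file names hypothetical) extracts three tar files at once; the trailing wait is optional, but it makes the command return only after all extractions have finished:

$ for f in results_part1.tar.gz results_part2.tar.gz results_part3.tar.gz; do tar -xzf $f & done; wait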

Check what is currently running on the terminal using ps

To make sure you are not overloading the terminal by throwing too many processes at it, you can check what is currently running with the command ps. For example, if I run a program with MPI that creates two processes and run ps before my program is done, it will return the following:

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
$ mpirun -n 2 ./triangleSimulation -I Tests/test_input_file_borg.wp &
[1] 6129     <-- this is the process ID
bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ ps
 PID TTY TIME CMD
 8 tty1 00:00:00 bash
 6129 tty1 00:00:00 mpirun    <-- notice the process ID 6129 again
 6134 tty1 00:00:00 triangleSimulat
 6135 tty1 00:00:00 triangleSimulat
 6136 tty1 00:00:00 ps

Check the output of a command running in the background

If you run a program in the background, its output will not be printed on the screen. To know what's happening with your program, redirect its output to a text file using the ">" symbol, which will be updated continuously as your program runs, and check it with cat <file name>, less +F <file name>, tail -n <number of lines> <file name>, or something similar. For example, if test_for_background.sh is a script that prints a number on the screen every second, you could do the following (note the "> pipe.csv" in the first command):

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ ./test_for_background.sh > pipe.csv &
 [1] 6191

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ cat pipe.csv
 1
 2

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ cat pipe.csv
 1
 2
 3

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ tail -3 pipe.csv
 8
 9
 10

This is also extremely useful in situations where you want to run a command that takes a long time but whose output is normally displayed all at once on the screen. For example, if you want to check the contents of a directory with thousands of files to search for a few specific files, you can pipe the output of ls to a file and send it to the background with ls > directory_contents.txt & and then search the resulting text file for the files of interest.
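
For example (output file name and search term hypothetical):

$ ls > directory_contents.txt &
$ grep result_254 directory_contents.txt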

System monitor: check core and memory usage with htop, or top if htop is not available

If ps does not provide enough information given your needs, such as when you're trying to check whether your multi-threaded application is using the number of cores it should, you can try running htop instead. This will show on your screen something along the lines of the Performance view of Windows' Task Manager, but without the time plot. It will also show how much memory is being used, so that you do not accidentally shut down a node on an HPC system. If htop is not available, you can try top.
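
On a shared login node it can also help to filter the view down to your own processes, for example (using the same username as in the sacct example further below):

$ htop -u bct52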

Running make in parallel with make -j for much shorter compiling time

If a C++ code is properly modularized, make can compile certain source code files in parallel. To do that, run make -j<number of cores> <rule in makefile>. For example, the following command would compile WaterPaths in parallel over four cores:

$ make -j4 gcc

For WaterPaths, make gcc takes 54s on the Cube, make -j4 gcc takes 15s, and make -j8 gcc takes 9s, so the time and patience savings are real if you have to compile the code several times per day. To make your life simpler, you can add an alias to .bash_aliases such as alias make='make -j4' (see the section about the .bash_aliases file below). DO NOT USE MAKE -J ON NSF HPC SYSTEMS: it is against the rules. On the Cube, keep it to four cores or fewer so as not to disturb other users, but use all available cores if on cloud or interactive sessions.

Check the size of files and directories using du -hs

The title above is quite self-explanatory. Running du -hs <file or directory name> will tell you its total size.
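
For example, to check one directory or everything in the current directory (directory name hypothetical):

$ du -hs my_results_directory/
$ du -hs *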

Check the date and time a file was created or last modified using the stat command

Also rather self-explanatory. Running stat <file name> is really useful if you cannot remember in which file you saved the output the last time you ran your program.

Split large files into smaller chunks with the split command and put them back together with cat

This works for splitting a large text file into files with fewer lines, as well as for splitting large binary files (such as large tar files) so that you can, for example, upload them to GitHub or e-mail them to someone. To split a text file with 10,000 lines into ten files with 1,000 lines each, use:

$ split -d -l 1000 myfile myfile_part

This will result in ten files called myfile_part00, myfile_part01, and so on, each with 1,000 lines (the -d flag asks for numeric suffixes). Similarly, the command below would break a binary file into 50 MB parts:

$ split -d -b 50m myfile myfile_part

To put all files back together in either case, run:

$ cat myfile_part* > myfile

More information about the split command can be found in Joe’s post about it.

Checking your HPC submission history with sacct

Another quite self-explanatory title. If you want to remember when you submitted something, such as to check whether an output file resulted from this or that submission (see the stat command), just run the command below on one line:

$ sacct -S 2019-09-18 -u bct52 --format=User,JobID,Jobname,start,end,elapsed,nnodes,nodelist,state

This will result in an output similar to the one below:

User   JobID   JobName     Start                End                  Elapsed   NNodes  NodeList      State
bct52  979     my_job      2019-09-10T21:48:30  2019-09-10T21:55:08  00:06:38  1       c0001         COMPLETED
bct52  980     skx_test_1  2019-09-11T01:44:08  2019-09-11T01:44:09  00:00:01  1       c0001         FAILED
bct52  981     skx_test_1  2019-09-11T01:44:33  2019-09-11T01:56:45  00:12:12  1       c0001         CANCELLED
bct52  1080    skx_test_4  2019-09-11T22:07:03  2019-09-11T22:08:39  00:01:36  4       c[0001-0004]  COMPLETED
       1080.0  orted       2019-09-11T22:08:38  2019-09-11T22:08:38  00:00:00  3       c[0002-0004]  COMPLETED

Compare files with meld, fldiff, or diff

There are several programs to show the differences between text files. This is particularly useful when you want to see what changed between different versions of the same file, normally a source code file. If you are on a computer running a Linux OS or have an X server like Xming installed, you can use meld and kdiff3 for pretty outputs in a nice GUI, or fldiff to quickly handle files with a huge number of differences. Otherwise, diff will show you the differences in a cruder, pure-terminal, but still very much functional manner. The syntax for all of them is:

$ <command> <file1> <file2>

Except for diff, which is worth calling with the --color option:

$ diff --color <file1> <file2>

If you cannot run a graphical user interface but are feeling fancy today, you can install the ydiff Python package (done just once) with:

$ python3 -m pip install --user ydiff 

and pipe diff’s output to it with the following:

$ diff -u <file1> <file2> | python3 -m ydiff -s

This will show you the differences between two versions of a code file in a crystal clear, side by side, and colorized way.

Creating a .bashrc file for a terminal that’s easy to work with and good (or better) to look at

When you first log in to many Linux systems, the terminal is all black with white characters, it's difficult to find the commands you typed amidst all the output printed on the screen, and autocomplete and history search are limited. In short, it's a real pain and it makes you long for Windows as much as you long for your mother's weekend dinner. There is, however, a way of making the terminal less of a pain to work with, which is by creating a file called .bashrc with the right contents in your home directory. Below is an example of a .bashrc file with the following features for you to just copy and paste into your home directory (e.g., /home/username/, or ~/ for short):

  • Colorize your username and show the directory you're currently in, so that it's easy to see where the output of a command ends and the next one begins, as in the section "Check the output of a command running in the background."
  • Allow for a search function with the up and down arrow keys. This way, if you’re looking for all the times you typed a command starting with sbatch, you can just type “sba” and hit up arrow until you find the call you’re looking for.
  • A function called extract that uncompresses a file regardless of its type, so there's no more need to remember the options for tar, unzip, unrar, etc., as long as you have them installed.
  • Colored man pages. This means that when you look for the documentation of a program using man, such as man cat to see all available options for the cat command, the output will be colorized.
  • A function called pretty_csv to let you see csv files in a convenient, organized, and clean way from the terminal, without having to download them to your computer.
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# Load aliases
if [ -f ~/.bash_aliases ]; then
    . ~/.bash_aliases
fi

# Automatically added by module
shopt -s expand_aliases

if [ ! -z "$PS1" ]; then
PS1='\[\033[G\]\[\e]0;\w\a\]\n\[\e[1;32m\]\u@\h \[\e[33m\]\w\[\e[0m\]\n\$ '
bind '"\e[A":history-search-backward'
bind '"\e[B":history-search-forward'
fi

bind "set show-all-if-ambiguous on"
bind "set completion-ignore-case on"
export PATH=/usr/local/gcc-7.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/gcc-7.1/lib64:$LD_LIBRARY_PATH

history -a
export DISPLAY=localhost:0.0

sshd_status=$(service ssh status)
if [[ $sshd_status = *"is not running"* ]]; then
    sudo service ssh --full-restart
fi

HISTSIZE=-1
HISTFILESIZE=-1

extract () {
    if [ -f $1 ] ; then
        case $1 in
            *.tar.bz2)   tar xvjf $1    ;;
            *.tar.gz)    tar xvzf $1    ;;
            *.bz2)       bunzip2 $1     ;;
            *.rar)       unrar x $1     ;;
            *.gz)        gunzip $1      ;;
            *.tar)       tar xvf $1     ;;
            *.tbz2)      tar xvjf $1    ;;
            *.tgz)       tar xvzf $1    ;;
            *.zip)       unzip $1       ;;
            *.Z)         uncompress $1  ;;
            *.7z)        7z x $1        ;;
            *)           echo "don't know how to extract '$1'..." ;;
        esac
    else
        echo "'$1' is not a valid file!"
    fi
}

# Colored man pages
export LESS_TERMCAP_mb=$'\E[01;31m'
export LESS_TERMCAP_md=$'\E[01;31m'
export LESS_TERMCAP_me=$'\E[0m'
export LESS_TERMCAP_se=$'\E[0m'
export LESS_TERMCAP_so=$'\E[01;44;33m'
export LESS_TERMCAP_ue=$'\E[0m'
export LESS_TERMCAP_us=$'\E[01;32m'

# Combine multiline commands into one in history
shopt -s cmdhist

# Ignore duplicates, ls without options and builtin commands
HISTCONTROL=ignoredups
export HISTIGNORE="&:ls:[bf]g:exit"

pretty_csv () {
    cat "$1" | column -t -s, | less -S
}
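
Once the file is in place and you have logged back in (or run source ~/.bashrc), the two functions above can be used directly from the terminal, for example (file names hypothetical):

$ extract my_archive.tar.gz
$ pretty_csv my_results.csv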

There are several .bashrc example files online with all sorts of functionalities. Believe me, a nice .bashrc will make your life A LOT BETTER. Just copy and paste the above into a text file called .bashrc and send it to your home directory on your local or HPC system.

Make the terminal far more user-friendly and less arcane by setting up a .bash_aliases file

You should also have a .bash_aliases file to significantly reduce typing and to colorize the output of commands you often use, for ease of navigation. Just copy all of the below into a file called .bash_aliases and copy it into your home directory (e.g., /home/username/, or ~/ for short). This way, every time you run the command between the word "alias" and the "=" sign, the command after the "=" sign will be run.

alias ls='ls --color=tty'
alias ll='ls -l --color=auto'
alias lh='ls -al --color=auto'
alias lt='ls -alt --color=auto'
alias uu='sudo apt-get update && sudo apt-get upgrade -y'
alias q='squeue -u '
alias qkill='scancel $(qselect -u bct52)'
alias csvd="awk -F, 'END {printf \"Number of Rows: %s\\nNumber of Columns: %s\\n\", NR, NF}'"
alias grep='grep --color=auto'                          #colorize grep output
alias gcc='gcc -fdiagnostics-color=always'                           #colorize gcc output
alias g++='g++ -fdiagnostics-color=always'                          #colorize g++ output
alias paper='cd /my/directory/with/my/beloved/paper/'
alias res='cd /my/directory/with/my/ok/results/'
alias diss='cd /my/directory/of/my/@#$%&/dissertation/'
alias aspell='aspell --lang=en --mode=tex check'
alias aspellall='find . -name "*.tex" -exec aspell --lang=en --mode=tex check "{}" \;'
alias make='make -j4'
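
To start using the aliases without logging out and back in, you can reload the file and try one of them, for example the csvd alias defined above (csv file name hypothetical):

$ source ~/.bash_aliases
$ csvd my_results.csv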

Check for spelling mistakes in your Latex files using aspell

Command-line spell checker, you know what this is.

$ aspell --lang=en --mode=tex check <file name>.tex

To run an aspell check on all the Latex files in a directory and its subdirectories, run:

find . -name "*.tex" -exec aspell --lang=en --mode=tex check "{}" \;

Easily share a directory on certain HPC systems with others working on the same project [Hint from Stampede 2]

Here’s a great way to set permissions recursively to share a directory named projdir with your research group:

$ lfs find projdir | xargs chmod g+rX

Using lfs is faster and less stressful on Lustre than a recursive chmod. The capital "X" assigns group execute permission to directories and only to those files that already have execute permission.

Run find and replace in all files in a directory [Hint from Stampede 2]

Suppose you wish to remove all trailing blanks in your *.c and *.h files. You can use the find command together with the sed command, with in-place editing and regular expressions, to do this. Starting in the current directory you can do:

$ find . -name "*.[ch]" -exec sed -i -e 's/ \+$//' {} \;

The find command locates all the *.c and *.h files in the current directory and below. The -exec option runs the sed command, replacing {} with the name of each file. The -i option tells sed to make the changes in place. The s/ \+$// tells sed to replace one or more blanks at the end of the line with nothing. The \; is required to let find know where the text for the -exec option ends. Being an effective user of sed and find can make a great difference in your productivity, so be sure to check Tina's post about them.
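
If you want to see what would change before editing anything in place, one option is a quick sketch like the one below (file name hypothetical), which runs the same substitution on a single file without -i and diffs the result against the original:

$ sed -e 's/ \+$//' myfile.c | diff myfile.c -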

Other posts in this blog

Be sure to look at other posts in this blog, such as Jon Herman’s post about ssh, Bernardo’s post about other useful Linux commands organized by task to be performed, and Joe’s posts about grep (search inside multiple files) and cut.

Remote terminal environment using VS Code for Windows and Mac

On Windows machines, the application MobaXterm is a valuable tool for computing on virtual machines and working through SSH clients. David Gold’s blog post walks through the installation and use of this app, which works well in Windows environments.

Working remotely on my Mac laptop, I have been struggling to achieve the same workflow I have in the office with a Windows machine. Unfortunately, MobaXterm is not available for download on Mac OS. Looking for alternatives, I discovered that using VS Code with the "Remote – SSH" extension is a great replacement with significant advantages over MobaXterm, as it is an SSH client interface and code editor in one.

A screenshot from my VS Code remote interface, with the graphical file browser on the left panel, the SSH server terminal on the bottom-right, and the VS Code editor on the top-right.

Here’s how you can set up a remote session on Mac (and Windows) using VS Code: 

  1. Install the VS Code application here. For installation help and a brief overview of the app, check out this video.
  2. With VS Code opened, go to View -> Extensions, and search “Remote – SSH.” Click on the extension and press the green “Install” button. You should see the message “This extension is enabled globally” appear. Check out this extension’s description below (I’ll run through the basics in this post).
  3. On the bottom left of your screen, there should be a small green box with two opposite pointing arrow heads. Click this.
The green box is the Remote – SSH extension.
  4. Choose the first pop-up option "Remote-SSH: Connect to host…" and then select "Add New SSH Host…".
Click the first box and then the “Add New SSH Host” button to connect to your SSH client.
  5. Here, enter your remote SSH username@serverid (here at Cornell, this would be yournetid@thecube.cac.cornell.edu to connect to our remote computing cluster, the Cube).
  6. In the same pop-up window, click the remote server that you just added. A new window will open and prompt you to enter your password for the server.
  7. Now you are in your remote SSH environment. Click "Open folder…" and select "OK" to see your remote directory on the left. You can navigate through these files on your remote machine the same way as in MobaXterm. Click View -> Terminal to see your SSH command line at the bottom of the screen (here's where you can actually run the programs on your cluster).

Now using VS Code, you can install other extensions to aid in code editing in different languages (here’s an article with a few good ones for various uses). This environment has the same functionality as MobaXterm, without having to switch applications for editing code. Run your cluster programs in the terminal window and edit the code in the main VS Code editor!

Establishing an Effective Data Backup Strategy for Your Workstation

When determining the data management strategy for your workflow, it is paramount to consider a range of backup options for your data beyond just a single copy on your workstation or your external hard drive. Creating a seamless workspace that easily transitions between workstations while maintaining durability and availability is achievable once you know what resources might be available and follow a general guideline.

Considering how you will be managing and sharing data is crucial, especially for collaborative projects where files must often be accessible in real time.

Considering how long you might need to retain data and how often you might need to access it will drastically change your approach to your storage strategy.

3-2-1 Data Backup Rule

If you walk away from this with nothing else, remember the 3-2-1 rule. The key to ensuring durability of your data—preventing loss due to hardware or software malfunction, fire, viruses, and institutional changes or uproars—is following the 3-2-1 Rule: maintain three or more copies on two or more different media (e.g., cloud and HDD), with at least one off-site copy.

Source: https://cactus-it.co.uk/the-3-2-1-backup-rule/

An example of this would be to have a primary copy of your data on your desktop that is backed up continuously via Dropbox and nightly via an external hard drive. There are three copies of your data between your local workstation, external hard drive (HD), and Dropbox. By having your media saved on hard drive disks (HDDs) on your workstation and external HD in addition to ‘the cloud’ (Dropbox), you have accomplished spreading your data across exactly two different mediums. Lastly, since cloud storage is located on external servers connected via the internet, you have successfully maintained at least one off-site copy. Additionally, with a second external HD, you could create weekly/monthly/yearly backups and store this HD offsite.

Version Control Versus Data Backup

Maintaining a robust version control protocol does not ensure your data will be properly backed up and vice versa. Notably, you should not be relying on services such as GitHub to back up your data, only your code (and possibly very small datasets, i.e. <50 MB). However, you should still maintain an effective strategy for version control.

  • Code Version Control
  • Large File Version Control
    • GitHub is not the place to be storing and sharing large datasets, only the code to produce large datasets
    • Git Large File Storage (LFS) can be used for a Git-based version-control on large files

Data Storage: Compression

Compressing data reduces the amount of storage required (thereby reducing cost), but ensuring the data’s integrity is an extremely complex topic that is continuously changing. While standard compression techniques (e.g. .ZIP and HDF5) are generally effective at compression without issues, accessing such files requires additional steps before having the data in a usable format (i.e. decompressing the files is required).  It is common practice (and often a common courtesy) to compress files prior to sharing them, especially when emailed.
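
For example, to bundle and compress a directory before e-mailing or sharing it (directory name hypothetical):

$ tar -czf my_dataset.tar.gz my_dataset/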

7-Zip is a great open-source tool for standard compression file types (.ZIP, .RAR) and has its own compression file type. Additionally, a couple of guides looking into using HDF5/zlib for NetCDF files are located here and here.

Creating Your Storage Strategy

To comply with the 3-2-1 strategy, you must actively choose where you wish to back up your files. In addition to pushing your code to GitHub, you need to choose how best to push your files to be backed up. However, you must also consider any requirements you might have for your data handling (see the list of requirement categories below).

My personal strategy costs approximately $120 per year. For my workstation on campus, I primarily utilize Dropbox with a now-outdated version control history plugin that allows me to access files up to one year after deletion. Additionally, I instantaneously sync these files to Google Drive (guide to syncing). Beyond these cloud services, I utilize an external HDD that backs up select directories nightly (refer below to my script that works with Windows 7).

It should be noted that Cornell could discontinue its contracts with Google so that unlimited storage on Google Drive is no longer available. Additionally, it is likely that Cornell students will lose access to Google Drive and Cornell Box upon graduation, rendering these options impractical for long-term or permanent storage.

  • Minimal Cost (Cornell Students)
    • Cornell Box
    • Google Drive
    • Local Storage
    • TheCube
  • Accessibility and Sharing
    • DropBox
    • Google Drive
    • Cornell Box (for sharing within Cornell, horrid for external sharing)
  • Minimal Local Computer Storage Availability
    Access Via Web Interface (Cloud Storage) or File Explorer

    • DropBox
    • Google Drive (using Google Stream)
    • Cornell Box
    • TheCube
    • External HDD
  • Reliable (accessibility through time)
    • Local Storage (especially an external HDD if you will be relocating)
    • Dropbox
    • TheCube
  • Always Locally Accessible
    • Local Storage (notably where you will be utilizing the data, e.g. keep data on TheCube if you plan to utilize it there)
    • DropBox (with all files saved locally)
    • Cornell Box (with all files saved locally)
  • Large Capacity (~2 TB total)
    • Use Cornell Box or Google Drive
  • Extremely Large Capacity (or unlimited file size)

Storage Option Details and Tradeoffs

Working with large datasets between workstations can be challenging, changing the problem from simply incorporating the files directly within your workflow to interacting with the files from afar (e.g., keeping and utilizing files on TheCube).

But on a personal computer level, the most significant differentiator between storage types is whether you can (almost) instantaneously update and access files across computers (cloud-based storage with desktop file access) or whether manual/automated backups occur. I personally like to have the majority of my files easily accessible, so I utilize Dropbox and Google Drive to constantly update between computers. I also back up all of my files from my workstation to an external hard drive just to maintain an extra layer of data protection in case something goes awry.

  • Requirements for Data Storage
  • Local Storage: The Tried and True
    • Internal HDD
      • Installed on your desktop or laptop
      • Can most readily access data for constant use, making interactions with files the fastest
      • Likely the most at-risk version due to potential exposure to viruses in addition to nearly-constant uptime (and bumps for laptops)
      • Note that Solid State Drives (SSDs) do not have the same lifespan in terms of the number of read/write cycles as an HDD, leading to slowdowns or even failures if improperly managed. However, newer SSDs are less prone to these issues due to a combination of firmware and hardware advances.
      • A separate data drive (a secondary HDD that stores data and not the primary operating system) is useful for expanding easily-accessible space. However, it is not nearly as isolated as data contained within a user’s account on a computer and must be properly configured to ensure privacy of files
    • External Hard Drive Disk (HDD)
      • One-time cost ($50-200), depending on portability/size/speed
      • Can allow for off-line version of data to be stored, avoiding newly introduced viruses from preventing access or corrupting older versions (e.g. ransomware)—requires isolation from your workflow
      • May back up data instantaneously or as often as desired: general practice is to back up nightly or weekly
      • Software provided with external hard drives is generally less effective than self-generated scripts (e.g. Robocopy in Windows)
      • Unless properly encrypted, can be easily accessed by anyone with physical access
      • May be used without internet access, only requiring physical access
      • High quality (and priced) HDDs generally increase capacity and/or write/read speeds
    • Alternative Media Storage
      • Flash Thumb Drive
        • Don’t use these for data storage, only temporary transfer of files (e.g. for a presentation)
        • Likely to be lost
        • Likely to malfunction/break
      • Outdated Methods
        • DVD/Blu-Ray
        • Floppy Disks
        • Magnetic Tapes
      • M-Discs
        • Requires a Blu-Ray or DVD reader/writer
        • Supposedly lasts multiple lifetimes
        • 375 GB for $67.50
  •  Dropbox
    • My experience is that Dropbox is the easiest cloud-storage solution to use
    • Free Version includes 2 GB of space without bells and whistles
    • 1 TB storage for $99.00/year
    • Maximum file size of 20 GB
    • Effective (and standard) for filesharing
    • 30-day version history (extended version history for one year can be purchased for an additional $39.00/year)
    • Professional, larger plans with additional features (e.g. collaborative document management) also available
    • Can easily create collaborative folders, but storage counts against all individuals added (an issue if individuals are sharing large datasets)
    • Can be used both through a web interface and through desktop clients across operating systems
    • Fast upload/download speeds
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup, but has offline access
  • Google Drive
    • My experience is that Google Drive is relatively straightforward
    • Unlimited data/email storage for Cornell students, staff, and faculty
    • Costs $9.99/mo for 1 TB
    • Maximum file size of 5 GB
    • Easy access to G Suite, which allows for real-time collaboration on browser-based documents
    • Likely to lose access to storage capabilities upon graduation
    • Google Drive is migrating over to Google Stream which stores less commonly used files online as opposed to on your hard drive
    • Google File Stream (used to sync files with desktop) requires a constant internet connection except for recently-used files
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup
  • Cornell Box
    • My experiences are that Cornell Box is not easy to use relative to other options
    • Unlimited storage space, 15 GB file-size limit
    • Free for Cornell students, staff, and faculty, but alumni lose access once graduating
    • Can only be used for university-related activities (e.g. classwork, research)
    • Sharable links for internal Cornell users; however, accessing files is cumbersome for external users (requires making an account)
    • Version history retains the 100 most recent versions for each file
    • Can connect with Google Docs
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup, but has offline access
  • TheCube

Long-Term (5+ Years) Data Storage

It should be noted that most local media types degrade over time. Utilizing the 3-2-1 strategy is most important for long-term storage (with an emphasis on multiple media types and off-site storage). Notably, even if stored offline and never used, external HDDs, CDs, and Blu-Ray disks can only be expected to last at most around five years. Other options, such as magnetic tapes (10 years) or floppy disks (10-20 years), may last longer, but there is no truly permanent storage strategy (source of lifespans).

M-Discs are a write-once (i.e. read-only, cannot be modified) storage strategy that is projected to last many lifetimes and up to 1,000 years. If you're able to dust off an old Blu-Ray disk reader/writer, M-Discs are likely the best long-term data strategy to survive the test of time; making two copies stored in two locations is definitely worthwhile. However, the biggest drawback is that M-Discs are relatively difficult to access compared to plugging in an external HD.

Because of the range of lifespans and how cheap storage has become, I would recommend maintaining your old (and likely relatively small) data archives within your regular storage strategy which is likely to migrate between services through time.

For larger datasets that you are required to retain and would like to easily access, I would maintain them on at least two offline external hard drives stored in separate locations (e.g., at home and at your office), while occasionally (i.e. every six months) checking the health of the hard drives in perpetuity and replacing them as required.

Relying only on cloud storage for long-term storage is not recommended due to the possibility of companies closing their doors or simply deactivating your account. However, they can be used as an additional layer of protection in addition to having physical copies (i.e. external HD, M-Discs).

Windows 7 Robocopy Code

The script I use for backing up specific directories from my workstation (Windows 7) to an external HD is shown below. To set up my hard drive, I first formatted it to a format compatible with multiple operating systems using this guide. Note that your maximum file size and operating system requirements may dictate different formats. Following this, I used the following guide to implement a nightly backup of all of my data while keeping a log on my C: drive. Note that only new files and new versions of files are copied over, ensuring that the backup does not take ages.

@echo off
robocopy C:\Users\pqs4\Desktop F:\Backups\Desktop /E /XA:H /W:0 /R:3 /REG > C:\externalbackup.log
robocopy E:\Dropbox F:\Backups\Dropbox /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4\Downloads F:\Backups\Downloads /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4\Documents F:\Backups\Documents /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4 F:\Backups\UserProfile /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy "E:\Program Files\Zotero" F:\Backups\Zotero /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log

Converting Latex to MS Word docx (almost perfectly)

Many students and academics write academic papers and reports in Latex, mainly for the ease of formatting and of managing citations and bibliography. However, collaborative Latex editing tools have not reached the level of versatility and convenience of Microsoft Word. We therefore often end up moving the paper to Word when major comments, additions, and changes are being made, and later reorganizing it in Latex: a waste of sharp people's time.

A way around this hurdle is to use Pandoc to convert Latex files to Word documents. If you are using Linux and want to install it with apt-get, you should run sudo apt-get install pandoc pandoc-citeproc. To run Pandoc on a Latex document:

  1. Open a terminal (on Windows, hold the Windows key and press “r,” then type “cmd” in the command bar)
  2. Use the "cd" command to navigate to the folder where your Latex document is.
  3. Type pandoc -s my_latex_document.tex --bibliography=my_bib_file.bib -o resulting_word_document.docx.

Now you should have a Word document with all your bitmap (png, jpeg, bmp, etc.) figures, your equations in Word format, and a bibliography built from your citations and bib file, as you would get from Latex. If the latest version of Pandoc does not work, try version 1.19.1.

There are some known limitations, though. Numbering and referencing equations, figures, and tables, and referencing sections, will not work, so you will have to number those by hand. This limitation seems to apply only to Word documents; other output formats work with Pandoc reference filters plus the appropriate Pandoc calls and specific Latex syntax (see the filters' pages for details). Also, vectorized figures will not be included in the document, although bitmaps (png, jpeg, bmp, etc.) and the corresponding captions will.

Hopefully some skilled programmer who feels like generously volunteering their time for the Latex academic cause will address these limitations soon. Until we can thank someone for taking care of them with clever coding, we will have to handle them by hand.

 

 

Using HDF5/zlib compression in NetCDF4, part 2: testing the compression settings

There has been a previous post, courtesy of Greg Garner, on why HDF5/zlib compression matters for NetCDF4. That post featured a plot that showed how much you could compress your data when increasing the compression level. But the fine print also acknowledged that this data was for a pretty idealized dataset. So how much should you compress your data in a real-world application? How can you test what your trade-off really is between compression and computing time?

Follow this 4-step process to find out!

I'll be illustrating this post using my own experience with the Water Balance Model (WBM), a model developed at the University of New Hampshire that has served for several high-profile papers over the years (including in Nature and Science). This is the first time that this model, written in Perl, is being ported to another research group, with the goal of exploring its behavior when running large ensembles of inputs (which I am starting to do! Exciting, but a story for another post).

Step 1. Read the manual

There is a lot of different software for creating NetCDF data. Depending on the situation, you may have a say in which to use, or you may already be using the tool that comes with the software suite you are working with. Of course, in the latter case, you can always change the tools. But a reasonable first step before that is to test them. Ergo, look up the documentation for the software you are using to see how you can control compression with it.

Here, WBM uses the PDL::NetCDF Perl library, which has useful functions for adding data to a NetCDF file after every time step the model runs. Contrary to Greg's post, which uses C and where there are two flags ("shuffle" and "deflate") and a compression level parameter ("deflate_level"), for PDL::NetCDF there are only two parameters. The SHUFFLE flag is the Perl equivalent of the "shuffle" flag in C. The DEFLATE Perl parameter has integer values from 0 to 9, with a value of 0 being equivalent to the C flag "deflate" being turned off, and any value from 1 to 9 being equivalent to the "deflate" C flag being on, the value of DEFLATE then being equivalent to the value of the "deflate_level" parameter in Greg's post. Therefore, the DEFLATE variable from the PDL::NetCDF library in Perl lumps together the parameters "deflate" and "deflate_level" used in C.

I then located the DEFLATE and SHUFFLE variables within the auxiliary functions of the WBM. In the function write_nc, the following two lines of code are key:

my $deflate = set_default($$options{DEFLATE},1); # NetCDF4 deflate (compression) parameter
my $shuffle = set_default($$options{SHUFFLE},0); # NetCDF4 shuffle parameter

Step 2. Set up a test protocol

This builds on Greg's idea of recording the run time and resulting file size for each compression level. Here we are interested in these quantities for full-scale model runs, and not just for the generation of a single NetCDF dataset.

In this case, therefore, we want to contrast the default settings above with stronger compression settings, for ensemble runs of WBM on the Cube (the local HPC cluster). For a better comparison, let us place ourselves in the conditions in which ensemble runs will be made. Runs will use all 16 cores of a Cube node; therefore, for each compression setting, this experiment runs 16 instances of the WBM on a single node. Each of the 16 instances runs on a single core. All WBM runs are identical, so the only differences between run times and result file sizes come from the compression settings.

Compression settings for (SHUFFLE,DEFLATE) are (0,1) by default, and we compare that with all settings from (1,1) to (1,9).

Step 3. Run experiment, get results

Here are the results from this experiment. Results consider 47 output fields for WBM runs with a daily time step for 8 years (2009-2016), plus 5 years of warmup (this is pretty common for hydrological models). All this on a spatial mesh of 148,500 grid cells. A folder containing binaries for a single input variable, for this time span and spatial coverage, has a size of 3.1 GB. Therefore, the expected size for 47 variables in binary format is 146 GB. Let us compare with our results:

(Figure: run times and result file sizes for the different compression settings.)

As one can see, neither the presence of the shuffle flag nor the value of the deflate parameter has much influence on the size of the result files. Compressed results are 3 to 4 times smaller than the binaries, which highlights the value of compressing, but it also means we do not get the order(s)-of-magnitude differences reported in Greg's blog post. This is mainly because the binary format used for WBM inputs is much more efficient than the uncompressed ASCII that Greg used in his experiment. For a deflate parameter of 9, there is an apparent problem within the PDL library, and no output (note that a single-core run with shuffle=0 and deflate=9 did not lead to a similar problem).

Step 4. Conclude on compression parameters

Here the experimental setup has shown that carefully selecting the output fields will save more space than fine-tuning the NetCDF compression parameters. For instance, some of the 47 output fields above are fully redundant with others. Others are residual fields, and the only reason to look them up is to verify that a major development within the WBM code did not mess up the overall water balance.

More generally, the effects of compression are situation-specific and are not as great when there is no obvious regularity in the data (as is often the case with outputs from large models), or when the binary format used is already much better than ASCII. This said, NetCDF still occupies much less space than binaries, and is much easier to handle: WBM outputs are contained in one file per year (8 files total) with very useful metadata info…


 

 

Using HDF5/zlib Compression in NetCDF4

Not too long ago, I posted an entry on writing NetCDF files in C and loading them in R.  In that post, I mentioned that the latest and greatest version of NetCDF includes HDF5/zlib compression, but I didn’t say much more beyond that.  In this post, I’ll explain briefly how to use this compression feature in your NetCDF4 files.

Disclaimer: I’m not an expert in any sense on the details of compression algorithms.  For more details on how HDF5/zlib compression is integrated into NetCDF, check out the NetCDF Documentation.  Also, I’ll be assuming that the NetCDF4 library was compiled on your machine to enable HDF5/zlib compression.  Details on building and installing NetCDF from source code can be found in the documentation too.

I will be using code similar to what was in my previous post.  The code generates three variables (x, y, z) each with 3 dimensions.  I’ve increased the size of the dimensions by an order of magnitude to better accentuate the compression capabilities.

  // Loop control variables
  int i, j, k;
  
  // Define the dimension sizes for
  // the example data.
  int dim1_size = 100;
  int dim2_size = 50;
  int dim3_size = 200;
  
  // Define the number of dimensions
  int ndims = 3;
  
  // Allocate the 3D vectors of example data
  float x[dim1_size][dim2_size][dim3_size]; 
  float y[dim1_size][dim2_size][dim3_size];
  float z[dim1_size][dim2_size][dim3_size];
  
  // Generate some example data
  for(i = 0; i < dim1_size; i++) {
        for(j = 0; j < dim2_size; j++) {
                for(k = 0; k < dim3_size; k++) {
                        x[i][j][k] = (i+j+k) * 0.2;
                        y[i][j][k] = (i+j+k) * 1.7;
                        z[i][j][k] = (i+j+k) * 2.4;
                }
          }
        }

Next is to setup the various IDs, create the NetCDF file, and apply the dimensions to the NetCDF file.  This has not changed since the last post.

  // Allocate space for netCDF dimension ids
  int dim1id, dim2id, dim3id;
  
  // Allocate space for the netcdf file id
  int ncid;
  
  // Allocate space for the data variable ids
  int xid, yid, zid;
  
  // Setup the netcdf file
  int retval;
  if((retval = nc_create(ncfile, NC_NETCDF4, &ncid))) { ncError(retval); }
  
  // Define the dimensions in the netcdf file
  if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
  if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }
  
  // Gather the dimids into an array for defining variables in the netcdf file
  int dimids[ndims];
  dimids[0] = dim1id;
  dimids[1] = dim2id;
  dimids[2] = dim3id;

Here’s where the magic happens.  The next step is to define the variables in the NetCDF file.  The variables must be defined in the file before you tag it for compression.

  // Define the netcdf variables
  if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
  if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

Now that we’ve defined the variables in the NetCDF file, let’s tag them for compression.

  // OPTIONAL: Compress the variables
  int shuffle = 1;
  int deflate = 1;
  int deflate_level = 4;
  if((retval = nc_def_var_deflate(ncid, xid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, yid, shuffle, deflate, deflate_level))) { ncError(retval); }
  if((retval = nc_def_var_deflate(ncid, zid, shuffle, deflate, deflate_level))) { ncError(retval); }

The function nc_def_var_deflate() performs this.  It takes the following parameters:

  • int ncid – The NetCDF file ID returned from the nc_create() function
  • int varid – The variable ID associated with the variable you would like to compress.  This is returned from the nc_def_var() function
  • int shuffle – Enables the shuffle filter before compression.  Any non-zero integer enables the filter.  Zero disables the filter.  The shuffle filter rearranges the byte order in the data stream to enable more efficient compression. See this performance evaluation from the HDF group on integrating a shuffle filter into the HDF5 algorithm.
  • int deflate – Enable compression at the compression level indicated in the deflate_level parameter.  Any non-zero integer enables compression.
  • int deflate_level – The level to which the data should be compressed.  Levels are integers in the range [0-9].  Zero results in no compression whereas nine results in maximum compression.

The rest of the code doesn’t change from the previous post.

  // OPTIONAL: Give these variables units
  if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
  if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }
  
  // End "Metadata" mode
  if((retval = nc_enddef(ncid))) { ncError(retval); }
  
  // Write the data to the file
  if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
  if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }
  
  // Close the netcdf file
  if((retval = nc_close(ncid))) { ncError(retval); }

So the question now is whether or not it’s worth compressing your data.  I performed a simple experiment with the code presented here and the resulting NetCDF files:

  1. Generate the example NetCDF file from the code above using each of the available compression levels.
  2. Time how long the code takes to generate the file.
  3. Note the final file size of the NetCDF.
  4. Time how long it takes to load and extract data from the compressed NetCDF file.

Below is a figure illustrating the results of the experiment (points 1-3).

(Figure: file generation time and final file size for each compression level.)

Before I say anything about these results, note that individual results may vary.  I used a highly stylized data set to produce the NetCDF file which likely benefits greatly from the shuffle filtering and compression.  These results show a compression of 97% – 99% of the original file size.  While the run time did increase, it barely made a difference until hitting the highest compression levels (8,9).  As for point 4, there was only a small difference in load/read times (0.2 seconds) between the uncompressed and any of the compressed files (using ncdump and the ncdf4 package in R).  There’s no noticeable difference among the load/read times for any of the compressed NetCDF files.  Again, this could be a result of the highly stylized data set used as an example in this post.

For something more practical, I can only offer anecdotal evidence about the compression performance.  I recently included compression in my current project due to the large possible number of multiobjective solutions and states-of-the-world (SOW).  The uncompressed file my code produced was on the order of 17.5 GB (for 300 time steps, 1000 SOW, and about 3000 solutions).  I enabled compression of all variables (11 variables – 5 with three dimensions and 6 with two dimensions – compression level 4).  The next run produced just over 7000 solutions, but the compressed file size was 9.3 GB.  The down side is that it took nearly 45 minutes to produce the compressed file, as opposed to 10 minutes with the previous run.  There are many things that can factor into these differences that I did not control for, but the results are promising…if you’ve got the computer time.

I hope you found this post useful in some fashion.  I’ve been told that compression performance can be increased if you also “chunk” your data properly.  I’m not too familiar with chunking data for writing in NetCDF files…perhaps someone more clever than I can write about this?

Acknowledgement:  I would like to acknowledge Jared Oyler for his insight and helpful advice on some of the more intricate aspects of the NetCDF library.

Symbolic Links

A symbolic link (symlink) is a file that's a pointer to another file (or a directory). One use for symlinks is to have a big file appear in two places without using up twice as much space:

$ ln -s /gpfs/home/abc123/scratch/big_file /gpfs/home/abc123/work/big_file

You can see another use in your home directory on the cluster: ls -l scratch gives us this directory listing:

lrwxrwxrwx 1 mjw5407 mjw5407 21 Jun 22 2011 scratch -> /gpfs/scratch/mjw5407.

The l right at the beginning of the line tells us that scratch is a symlink to /gpfs/scratch/mjw5407.

And if for some reason you see that scratch is not a symlink but a regular directory, something has gone wrong. (This happened recently to a cluster user. Check your directory listings!)

What’s taking up space on your cluster account?

A quick tip on finding and deleting big files on your cluster account. Use mmlsquota to inspect your quota usage on GPFS filesystems. This will tell you whether you really need to clean up. Use du -hs * to figure out how big your subdirectories are, and ls -lh to inspect the files in a directory. Use (with great caution) rm -rf <directory> to remove a big directory, and rm <filename> to delete a file (both commands are irreversible).
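
For example, a minimal clean-up sequence might look like this (directory and file names hypothetical):

$ mmlsquota
$ du -hs *
$ rm -rf old_failed_runs/
$ rm huge_temporary_output.csv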

Some ideas for your Bash submission scripts

I’ve been playing around with some design options for PBS submission scripts that may help people doing cluster work.  Some things to look for in the source code:

  • You can use a list in bash that contains multiple text entries, and then access those text entries to create strings for your submissions.  Note that you can actually display the text first (see the ‘echo ${PBS}’) before you do anything; that way you aren’t requesting thousands of jobs that have a typo in them!
  • Using “read” allows the bash programmer to interact with the user.  Well, in reality you are usually both the programmer and the user.  But lots of times, I want to write a script and try it out first, before I submit hundreds of hours of time on the cluster.  The flags below can help with that process.
  • I added commands to compile the source code before actually submitting the jobs.  Plus, by using flags and pauses intelligently, you can bail out of the script if there’s a problem with compilation.
#!/bin/bash
NODES=32
WALLHOURS=5

PROBLEMS=("ProblemA" "ProblemB")
NSEEDS=10
SEEDS=$(seq 1 ${NSEEDS}) #note there are multiple ways to declare lists and sequences in bash

NFES=1000000
echo "NFEs is ${NFES}" #echo statements can improve usability of the script, especially if you're modifying it a lot for various trials

ASSUMEPERMISSIONFLAG=No #This is for pausing the submission script later

echo "Compile? Y or N."

read COMPILEFLAG
if [ "$COMPILEFLAG" = "Y" ]; then
    echo "Cleaning.."
    make clean -f MakefileParallel
    echo "Compiling.."
    make -f MakefileParallel
else
        echo "Not compiling."
fi

for PROBINDEX in ${!PROBLEMS[*]}
do
    PROBLEM=${PROBLEMS[$PROBINDEX]} #note the syntax to pull a list member out here
    echo "Problem is ${PROBLEM}"

    for SEED in ${SEEDS}
    do
        NAME=${PROBLEM}_${SEED} #Bash is really nice for manipulating strings like this
        echo "Submitting: ${NAME}"

        #Here is the actual PBS command, with bash variables used in place of different experimental parameters.  Note the use of getopt-style command line parsing to pass different arguments into the myProgram executable.  This implementation is also designed for parallel processing, but it can also be used for serial jobs too.

        PBS="#PBS -l nodes=32\n\
        #PBS -N ${NAME}\n\
        #PBS -l walltime=05:00:00\n\
        #PBS -j oe\n\
        #PBS -o ${NAME}.out\n\
        cd \$PBS_O_WORKDIR\n\
        module load openmpi/intel\n\
        mpirun ./myProgram -b ${PROBLEM} -c combined -f ${NFES} -s ${SEED}"

        #The first echo shows the user what is about to be passed to PBS.  The second echo then pipes it to the command qsub, and actually submits the job.

        echo ${PBS}

        if [ "$ASSUMEPERMISSIONFLAG" = "No" ]; then

            echo "Continue submitting? Y or N."

            read SUBMITFLAG

            #Here, the code is designed to just keep going after the user says Y once.  You can redesign this for your own purposes.  Also note that this code is fairly brittle in that the user MUST say Y, not y or yes.  You can build that functionality into the if statements if you'd like it.

            if [ "$SUBMITFLAG" = "Y" ]; then
                 ASSUMEPERMISSIONFLAG=Yes #this way, the user won't be asked again
                 echo -e ${PBS} | qsub
                 sleep 0.5
                 echo "done."
            fi
        else
            echo -e ${PBS} | qsub
            sleep 0.5
            echo "done."
         fi
    done
done