More Terminal Schooling

You are probably asking yourself “and why do I need more terminal schooling?”. The short answer is: so you do not have to spend as much time as you currently do on the terminal, most of which is spent (1) pushing arrow keys thousands of times per afternoon to move through a command or history of commands, (2) waiting for a long-running command to finish before you can run anything else, (3) clicking all over the place on MobaXTerm and still feeling lost, (4) manually running the same command multiple times with different inputs, (5) typing the two-step verification token every time you want to change a “+” to a “-” in a file on a supercomputer, (6) waiting forever for a time-consuming run done in serial on a single core, and (7, 8, …) other useless and horribly frustrating chores. Below are some tricks to make your Linux work more efficient and reduce the time you spend on the terminal. From now on, I will use a “$” sign to indicate that what follows is a command typed in the terminal.

The tab autocomplete is your best friend

When trying to do something with that file whose name is 5480458 characters long, be smart and don’t type the whole thing. Just type the first few letters and hit Tab. If it doesn’t complete all the way, it’s because there are multiple files whose names begin with that sequence of characters. In this case, hitting Tab twice will list the names of all such files. The Tab autocomplete works for commands as well.

Ctrl+r to search through previous commands

When on the terminal, hit Ctrl+r to switch to reverse search mode. This works like a simple search function on a text document, but instead looks through your Bash history file for commands you used over the last weeks or months. For example, if you hit Ctrl+r and type sbatch, it will fill the line with the last command you ran that contained the word sbatch. If you hit Ctrl+r again, it will find the second-to-last such command, and so on.

Vim basics to edit files on a system that requires two-step authentication

Vim is one of the most useful things I have come across when it comes to working on supercomputers with two-step identity verification, in which case using MobaXTerm or VS Code requires typing a different security code all the time. Instead of uploading a new version of a code file every time you want to make a simple change, just edit the file on the remote computer itself using Vim. To make simple edits to your files, there are very few commands you need to know.

To open a file with Vim from the terminal: $ vim <file name> or $ vim +10 <file name>, if you want to open the file and go straight to line 10.

Vim has two modes of operation: text-edit (for you to type whatever you want in the file) and command (replacement to clicking on file, edit, view, etc. on the top bar of notepad). When you open Vim, it will be in command mode.

To switch to text-edit mode, just hit either “a” or “i” (you should then see “– INSERT –” at the bottom of the screen). To return to command mode, hit escape (Esc). When in text-edit mode, the keys “Home,” “End,” “Pg Up,” “Pg Dn,” “Backspace,” and “Delete” work just like in Notepad and MS Word.

When in command mode, save your file by typing :w + Enter, save and quit with :wq, and quit without saving with :q!. Commands for selecting, copying and pasting, finding and replacing, replacing just one character, deleting a line, and other more advanced tasks can be found here. There’s also a great cheatsheet for Vim here. Hint: once you learn five to ten more commands, making complex edits to your file with Vim becomes blazingly fast.

Perform repetitive tasks on the terminal using one-line Bash for-loops

Instead of manually typing a command for each operation you want to perform on a subset of files in a directory (e.g., cp file<i>.csv directory300-400 for i from 300 to 399, tar -xzvf myfile<i>.tar.gz, etc.), you can use a Bash for-loop when using the wildcard “*” is not possible.

Consider a situation in which you have 10,000 files and want to move files number 200 to 299 to a certain directory. Using the wildcard “*” in this case wouldn’t work, as cp result_2*.csv would match result_2.csv, result_20.csv through result_29.csv, and result_2000.csv through result_2999.csv as well (sometimes you may be able to use regex, but that’s another story). To move a subset of result files to a directory using a Bash for-loop, you can use the following syntax:

$ for i in {00..99}; do cp result_2$i.csv results_200s/; done

Keep in mind that you can have multiple commands inside a for-loop by separating them with “;” and also nest for-loops.
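As a minimal sketch of both tricks (all file and directory names below are hypothetical throwaways), the loop here uses two commands per iteration, separated by “;”, inside a nested loop that sorts result files into per-block directories:

```shell
cd "$(mktemp -d)"                  # work in a scratch directory so nothing real is touched
mkdir results_200s results_300s
for block in 2 3; do
    for i in {00..99}; do
        # Two commands in one iteration, separated by ";": create a dummy
        # result file, then copy it into its block directory.
        touch result_$block$i.csv; cp result_$block$i.csv results_${block}00s/
    done
done
ls results_200s | wc -l            # each block directory receives 100 files
```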

Run a time-intensive command in the background with an “&” and keep doing your terminal work

Some commands may take a long time to run and render the terminal unusable until they complete. Instead of opening another instance of the terminal and logging in again, you can send a command to the background by adding “&” at the end of it. For example, if you want to extract a tar file with tens of thousands of files in it and keep doing your work as the files are extracted, just run:

$ tar -xzf my_large_file.tar.gz &

If you have a directory with several tar files and want to extract a few of them in parallel while doing your work, you can use the for-loop described above and add “&” to the end of the tar command inside the loop. BE CAREFUL: if your for-loop iterates over dozens or more files, you may end up with your terminal trying to run dozens or more tasks at once. I accidentally crashed the Cube once doing this.
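One way to keep such a loop from overwhelming the machine is to launch a small batch of background jobs and then call wait, which blocks until they have all finished. A minimal sketch, with three tiny throwaway archives standing in for real tar files:

```shell
cd "$(mktemp -d)"                          # scratch directory
# Create three small archives to stand in for real tar files.
for n in 1 2 3; do
    mkdir d$n; echo hello > d$n/file.txt
    tar -czf archive$n.tar.gz d$n; rm -r d$n
done

for n in 1 2 3; do
    tar -xzf archive$n.tar.gz &            # each extraction runs in the background
done
wait                                       # block until all three are done
ls d1/file.txt d2/file.txt d3/file.txt
```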

Check what is currently running on the terminal using ps

To make sure you are not overloading the terminal by throwing too many processes at it, you can check what is currently running with the command ps. For example, if I run a program with MPI that creates two processes and run ps before my program is done, it will return the following:

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
$ mpirun -n 2 ./triangleSimulation -I Tests/test_input_file_borg.wp &
[1] 6129     <-- this is the process ID
bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ ps
 PID TTY TIME CMD
 8 tty1 00:00:00 bash
 6129 tty1 00:00:00 mpirun    <-- notice the process ID 6129 again
 6134 tty1 00:00:00 triangleSimulat
 6135 tty1 00:00:00 triangleSimulat
 6136 tty1 00:00:00 ps
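If ps reveals a background job you regret starting, you can stop it by its process ID. A minimal sketch, with sleep standing in for a real long-running task:

```shell
sleep 30 &       # hypothetical long-running background job
pid=$!           # "$!" holds the PID of the most recent background job
ps               # the sleep process shows up here with that PID
kill "$pid"      # stop it by PID if it is misbehaving
```

kill %1 also works, using the job number printed in brackets when the job was launched.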

Check the output of a command running in the background

If you run a program in the background, its output will not be printed on the screen. To know what’s happening with your program, redirect its output to a text file using the “>” symbol, which will be updated continuously as your program runs, and check it with cat <file name>, less +F <file name>, tail -f <file name>, or something similar. For example, if test_for_background.sh is a script that prints a number on the screen every second, you could do the following (note the “> pipe.csv” in the first command):

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ ./test_for_background.sh > pipe.csv &
 [1] 6191

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ cat pipe.csv
 1
 2

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ cat pipe.csv
 1
 2
 3

bernardoct@DESKTOP-J6145HK /mnt/c/Users/Bernardo/CLionProjects/WaterPaths
 $ tail -3 pipe.csv
 8
 9
 10

This is also extremely useful in situations where you want to run a command that takes a long time but whose output is normally just displayed once on the screen. For example, if you want to check the contents of a directory with thousands of files to search for a few specific ones, you can redirect the output of ls to a file and send the command to the background with ls > directory_contents.txt & and then search the resulting text file for the files of interest.

System monitor: check core and memory usage with htop, or top if htop is not available

If ps does not provide enough information given your needs, such as when you’re trying to check whether your multi-threaded application is using the number of cores it should, you can try running htop instead. This will show on your screen something along the lines of the Performance view of Windows’ Task Manager, but without the time plot. It will also show how much memory is being used, so that you do not accidentally shut down a node on an HPC system. If htop is not available, you can try top.

Running make in parallel with make -j for much shorter compiling time

If a C++ code is properly modularized, make can compile certain source code files in parallel. To do that, run make -j<number of cores> <rule in makefile>. For example, the following command would compile WaterPaths in parallel over four cores:

$ make -j4 gcc

For WaterPaths, make gcc takes 54 s on the Cube, make -j4 gcc takes 15 s, and make -j8 gcc takes 9 s, so the time and patience savings are real if you have to compile the code several times per day. To make your life simpler, you can add an alias to .bash_aliases such as alias make='make -j4' (see the section about the .bash_aliases file below). DO NOT USE make -j ON NSF HPC SYSTEMS: it is against the rules. On the Cube, keep it to four cores or fewer so as not to disturb other users, but use all available cores if on the cloud or in an interactive session.
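You can see the -j flag in action without a big project. A minimal sketch with a throwaway makefile whose two rules are independent, so make -j is free to run them concurrently (note the literal tab before each recipe):

```shell
cd "$(mktemp -d)"                           # scratch directory
# Toy makefile with two independent rules; "\t" produces the required tab.
printf 'all: a b\na:\n\ttouch a\nb:\n\ttouch b\n' > Makefile
make -j2 all                                # both rules may run at the same time
ls a b
```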

Check the size of files and directories using du -hs

The title above is quite self-explanatory. Running du -hs <file or directory name> will tell you its size.
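A variation I find handy combines du with sort to rank directories by size. The directories below are throwaway examples created just for the demonstration:

```shell
cd "$(mktemp -d)"
mkdir small big
head -c 1024 /dev/zero > small/a.bin      # 1 KB file
head -c 1048576 /dev/zero > big/a.bin     # 1 MB file
du -hs */                                 # human-readable size of each directory
du -s */ | sort -n                        # ranked by size, largest last
```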

Check the date and time a file was created or last modified using the stat command

Also rather self-explanatory. Running stat <file name> is really useful if you cannot remember on which file you saved the output last time you ran your program.
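On GNU/Linux, stat’s -c flag prints just the fields you care about, which makes it easy to scan many files at once (the file name below is a throwaway example):

```shell
cd "$(mktemp -d)"
touch results_run1.csv
stat -c '%y  %n' results_run1.csv   # last-modification time, then file name
stat -c '%y  %n' * | sort           # every file in the directory, oldest first
```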

Split large files into smaller chunks with the split command and put them back together with cat

This works for splitting a large text file into files with fewer lines, as well as for splitting large binary files (such as large tar files) so that you can, for example, upload them to GitHub or e-mail them to someone. To split a text file with 10,000 lines into ten files with 1,000 lines each, use:

 $ split -l 1000 myfile myfile_part

This will result in ten files called myfile_partaa, myfile_partab, and so on, each with 1,000 lines (pass the -d flag to split if you prefer numeric suffixes such as myfile_part00). Similarly, the command below would break a binary file into parts of 50 MB each:

 $ split -b 50m myfile myfile_part

To put all files back together in either case, run:

$ cat myfile_part* > myfile

More information about the split command can be found in Joe’s post about it.
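Here is a quick round trip you can run to convince yourself nothing is lost (myfile is a 10,000-line stand-in generated with seq; cmp exits silently when the two files are identical):

```shell
cd "$(mktemp -d)"
seq 1 10000 > myfile                  # 10,000-line stand-in for a large file
split -l 1000 myfile myfile_part      # ten 1,000-line pieces
ls myfile_part* | wc -l               # -> 10
cat myfile_part* > reassembled        # note the ">" when putting them back together
cmp myfile reassembled && echo "files are identical"
```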

Checking your HPC submission history with `sacct`

Another quite self-explanatory title. If you want to remember when you submitted something, such as to check whether an output file resulted from this or that submission (see the stat command above), just run the command below on one line:

$ sacct -S 2019-09-18 -u bct52 --format=User,JobID,Jobname,start,end,elapsed,nnodes,nodelist,state

This will result in an output similar to the one below:

bct52 979 my_job 2019-09-10T21:48:30 2019-09-10T21:55:08 00:06:38 1 c0001 COMPLETED
bct52 980 skx_test_1 2019-09-11T01:44:08 2019-09-11T01:44:09 00:00:01 1 c0001 FAILED
bct52 981 skx_test_1 2019-09-11T01:44:33 2019-09-11T01:56:45 00:12:12 1 c0001 CANCELLED
bct52 1080 skx_test_4 2019-09-11T22:07:03 2019-09-11T22:08:39 00:01:36 4 c[0001-0004] COMPLETED
1080.0 orted 2019-09-11T22:08:38 2019-09-11T22:08:38 00:00:00 3 c[0002-0004] COMPLETED

Compare files with meld, fldiff, or diff

There are several programs to show the differences between text files. This is particularly useful when you want to see what changed between different versions of the same file, normally a source code file. If you are on a computer running a Linux OS or have an X server like Xming installed, you can use meld and kdiff3 for pretty outputs on a nice GUI, or fldiff to quickly handle files with a huge number of differences. Otherwise, diff will show you the differences in a cruder, pure-terminal, but still very much functional manner. The syntax for all of them is:

$ <command> <file1> <file2>

The exception is diff, which is worth calling with the --color option:

$ diff --color <file1> <file2>
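To get a feel for diff’s output, try it on two throwaway files; lines prefixed with “<” come from the first file and “>” from the second:

```shell
cd "$(mktemp -d)"
printf 'alpha\nbeta\n' > v1.txt
printf 'alpha\ngamma\n' > v2.txt
diff v1.txt v2.txt || true    # diff exits non-zero when the files differ
```

Here diff reports 2c2, meaning line 2 changed, followed by the two versions of that line.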

If you cannot run a graphical user interface but are feeling fancy today, you can install the ydiff Python package with (done just once):

$ python3 -m pip install --user ydiff 

and pipe diff’s output to it with the following:

$ diff -u <file1> <file2> | python3 -m ydiff -s

This will show you the differences between two versions of a code file in a crystal-clear, side-by-side, colorized way.

Creating a .bashrc file for a terminal that’s easy to work with and good (or better) to look at

When we first log in to several Linux systems, the terminal is all black with white characters, it’s difficult to find the commands you typed amidst all the output printed on the screen, and autocomplete and history search are limited. In short, it’s a real pain, and it makes you long for Windows as much as you long for your mother’s weekend dinner. There is, however, a way of making the terminal less of a pain to work with: creating a file called .bashrc with the right contents in your home directory (e.g., /home/username/, or ~/ for short). Below is an example of a .bashrc file with the following features for you to just copy and paste into your home directory:

  • Colorize your username and show the directory you’re currently in, so that it’s easy to see where the output of a command ends and the next one begins (as in the section “Check the output of a command running in the background”).
  • Allow for a search function with the up and down arrow keys. This way, if you’re looking for all the times you typed a command starting with sbatch, you can just type “sba” and hit up arrow until you find the call you’re looking for.
  • A function that lets you type extract <file name> and have the compressed file extracted, with no more need to remember the options of tar, unzip, unrar, etc., so long as you have all of them installed.
  • Colored man pages. This means that when you look for the documentation of a program using man, such as man cat to see all available options for the cat command, the output will be colorized.
  • A function called pretty_csv to let you see csv files in a convenient, organized and clean way from the terminal, without having to download them to your computer.
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi

# Load aliases
if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi

# Automatically added by module
shopt -s expand_aliases

if [ ! -z "$PS1" ]; then
PS1='\[\033[G\]\[\e]0;\w\a\]\n\[\e[1;32m\]\u@\h \[\e[33m\]\w\[\e[0m\]\n\$ '
bind '"\e[A":history-search-backward'
bind '"\e[B":history-search-forward'
fi

bind 'set show-all-if-ambiguous on'
bind 'set completion-ignore-case on'
export PATH=/usr/local/gcc-7.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/gcc-7.1/lib64:$LD_LIBRARY_PATH

history -a
export DISPLAY=localhost:0.0

sshd_status=$(service ssh status)
if [[ $sshd_status = *"is not running"* ]]; then
sudo service ssh --full-restart
fi

HISTSIZE=-1
HISTFILESIZE=-1

extract () {
    if [ -f "$1" ]; then
        case "$1" in
            *.tar.bz2)   tar xvjf "$1"    ;;
            *.tar.gz)    tar xvzf "$1"    ;;
            *.bz2)       bunzip2 "$1"     ;;
            *.rar)       unrar x "$1"     ;;
            *.gz)        gunzip "$1"      ;;
            *.tar)       tar xvf "$1"     ;;
            *.tbz2)      tar xvjf "$1"    ;;
            *.tgz)       tar xvzf "$1"    ;;
            *.zip)       unzip "$1"       ;;
            *.Z)         uncompress "$1"  ;;
            *.7z)        7z x "$1"        ;;
            *)           echo "don't know how to extract '$1'..." ;;
        esac
    else
        echo "'$1' is not a valid file!"
    fi
}

# Colored man pages
export LESS_TERMCAP_mb=$'\E[01;31m'
export LESS_TERMCAP_md=$'\E[01;31m'
export LESS_TERMCAP_me=$'\E[0m'
export LESS_TERMCAP_se=$'\E[0m'
export LESS_TERMCAP_so=$'\E[01;44;33m'
export LESS_TERMCAP_ue=$'\E[0m'
export LESS_TERMCAP_us=$'\E[01;32m'

# Combine multiline commands into one in history
shopt -s cmdhist

# Ignore duplicates, ls without options and builtin commands
HISTCONTROL=ignoredups
export HISTIGNORE="&:ls:[bf]g:exit"

pretty_csv () {
    column -t -s, < "$1" | less -S
}

There are several .bashrc example files online with all sorts of functionalities. Believe me, a nice .bashrc will make your life A LOT BETTER. Just copy and paste the above into a text file called .bashrc and send it to your home directory on your local or HPC system.

Make the terminal far more user-friendly and less arcane by setting up a .bash_aliases file

You should also have a .bash_aliases file to significantly reduce typing and to colorize the output of commands you often use for ease of reading. Just copy all of the below into a file called .bash_aliases and place it in your home directory (e.g., /home/username/, or ~/ for short). This way, every time you run the command between the word “alias” and the “=” sign, the command after the “=” sign will be run instead.

alias ls='ls --color=tty'
alias ll='ls -l --color=auto'
alias lh='ls -al --color=auto'
alias lt='ls -alt --color=auto'
alias uu='sudo apt-get update && sudo apt-get upgrade -y'
alias q='squeue -u '
alias qkill='scancel $(qselect -u bct52)'
alias csvd="awk -F, 'END {printf \"Number of Rows: %s\\nNumber of Columns: %s\\n\", NR, NF}'"
alias grep='grep --color=auto'                          #colorize grep output
alias gcc='gcc -fdiagnostics-color=always'                           #colorize gcc output
alias g++='g++ -fdiagnostics-color=always'                          #colorize g++ output
alias paper='cd /my/directory/with/my/beloved/paper/'
alias res='cd /my/directory/with/my/ok/results/'
alias diss='cd /my/directory/of/my/@#$%&/dissertation/'
alias aspell='aspell --lang=en --mode=tex check'
alias aspellall='find . -name "*.tex" -exec aspell --lang=en --mode=tex check "{}" \;'
alias make='make -j4'

Check for spelling mistakes in your Latex files using aspell

Command-line spell checker: you know what this is.

aspell --lang=en --mode=tex check <file name>

To run aspell on all the LaTeX files in a directory and its subdirectories, run:

find . -name "*.tex" -exec aspell --lang=en --mode=tex check "{}" \;

Easily share a directory on certain HPC systems with others working on the same project [Hint from Stampede 2]

Here’s a great way to set permissions recursively to share a directory named projdir with your research group:

$ lfs find projdir | xargs chmod g+rX

Using lfs is faster and less stressful on Lustre than a recursive chmod. The capital “X” assigns group execute permissions only to files and directories for which the owner has execute permissions.
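Note that lfs is specific to Lustre file systems; on an ordinary file system, a recursive chmod shows the same capital-X behavior. In the throwaway example below, only the owner-executable script (and the directory itself) gains group execute permission:

```shell
cd "$(mktemp -d)"
mkdir projdir
touch projdir/script.sh projdir/data.csv
chmod u+x projdir/script.sh      # only the script is owner-executable
chmod -R g+rX projdir            # ordinary-chmod equivalent of the lfs pipeline
ls -l projdir                    # script.sh gains g+x; data.csv stays non-executable
```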

Run find and replace in all files in a directory [Hint from Stampede 2]

Suppose you wish to remove all trailing blanks in your *.c and *.h files. You can use the find command together with sed, with in-place editing and regular expressions, to do this. Starting in the current directory, you can do:

$ find . -name '*.[ch]' -exec sed -i -E 's/ +$//' {} \;

The find command locates all the *.c and *.h files in the current directory and below. The -exec option runs the sed command, replacing {} with the name of each file. The -i option tells sed to make the changes in place. The s/ +$// expression tells sed to replace one or more blanks at the end of the line with nothing. The \; is required to let find know where the text for the -exec option ends. Being an effective user of sed and find can make a great difference in your productivity, so be sure to check Tina’s post about them.
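A quick way to convince yourself the pattern works before unleashing it on real source files (demo.c is a throwaway file; cat -A prints a “$” at each end of line so trailing blanks become visible):

```shell
cd "$(mktemp -d)"
printf 'int x;   \nint y;\n' > demo.c          # first line ends in trailing blanks
find . -name '*.[ch]' -exec sed -i -E 's/ +$//' {} \;
cat -A demo.c                                  # -> int x;$  then  int y;$
```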

Other posts in this blog

Be sure to look at other posts in this blog, such as Jon Herman’s post about ssh, Bernardo’s post about other useful Linux commands organized by task to be performed, and Joe’s posts about grep (search inside multiple files) and cut.

Parallelization of C/C++ and Python on Clusters

There are multiple ways of parallelizing C/C++ and Python codes to be used on clusters. The quick and dirty way, which often gets the job done, is to parallelize the code with MPI and launch one process per core across multiple nodes. Though easy to implement, this strategy may result in a fair waste of resources because:

  • With too many processes on one node, inefficient use of memory may leave the code memory-bottlenecked, meaning that a lot of time will be spent accessing memory while the cores sit twiddling their thumbs.
  • If you have a 48-core node with 192 GB of memory (like the Skylake nodes on Stampede 2) and your code uses 10 GB of RAM per process, you will not be able to have more than 19 of the 48 cores working at the same time, because using more would exceed the node’s memory, which would crash the node, which would get you kicked out of the cluster for a while without a warning. If you do not know how much RAM your code requires, you should look into the Remora tool.

If you are working on big runs on systems that charge per usage time, or if you are creating a code you will have to run multiple times, it may be worth looking into more clever parallelization strategies. Some of the techniques described in the next subsections will also make your code two to four times faster on your desktop/laptop and are rather straightforward to implement. Most of the code in this blog post can be freely downloaded from this repository. Lastly, if the #include statements in the examples below are not displayed as followed by a library between “<>”, please refer to the aforementioned online repository.

Hybrid Parallelization in C/C++

The word hybrid refers in this context to a mixed use of shared- and distributed-memory parallelization techniques. In distributed-memory parallelization techniques, such as the Message Passing Interface (MPI), each process has its own dedicated chunk of memory within a RAM chip. This means that each process will have its own copy of the input data (repeated across processes), of the variables within the program, etc. Conversely, shared-memory parallelization techniques such as OpenMP allow all threads (the parallel units of computation) to access the same block of memory, which can make a mess in your code, with variables being changed seemingly out of nowhere by various parallel runs, but which lessens the amount of RAM needed and in some cases allows for more efficient memory utilization.

If running the Borg Master-Slave multiobjective evolutionary optimization algorithm (Borg MS) with hybrid parallelization, Borg MS will have each function evaluation performed by one MPI process, which could itself run in parallel with OpenMP (if your model is set up for that) across the cores designated for that process. As soon as a function evaluation is complete, the slave MPI process in charge of it will submit the results back to Borg MS’s master process and be given another set of decision variables to work on. The figure below is a schematic drawing of such a hybrid parallelization scheme:

[Figure: BorgHybridParallelization.png, a schematic of the hybrid MPI + OpenMP parallelization scheme described above]

OpenMP

Parallelizing a for-loop in C or C++ in which each iteration is independent, such as in Monte Carlo simulations, is straightforward:

#include <omp.h>
#include <vector>
#include "path/to/my_model.h"

using namespace std;

int main(int argc, char *argv[])
{
    int n_monte_carlo_simulations = omp_get_max_threads(); // one model run per available thread (typically one per core)
    vector<double> system_inputs = ;
    vector<double> results(n_monte_carlo_simulations, 0);
#pragma omp parallel for
    for (int i = 0; i < n_monte_carlo_simulations; ++i)
    {
        results[i] = RunMonteCarloSimulation(system_inputs[i]);
    }
}

The only caveat is that you have to be sure that the independent simulations are not writing over each other. For example, if by mistake you had written results[0] instead of results[i], the final value of results[0] would be that of whichever thread (parallel simulation) finished last, while the rest of the vector would be all zeros. It is important to notice here that the vector system_inputs is stored only once in memory and all n_monte_carlo_simulations threads read from that same copy.

Though this example is not rocket science, there is a lot more to OpenMP that can be found here and here.

MPI

MPI stands for Message Passing Interface. It creates processes that pass messages (information packages) to each other. The most basic MPI code is one that creates independent processes, each with its own allocated chunk of memory that cannot be accessed by any other process, and just has them running independently in parallel without communicating with each other. This is great for distributing Monte Carlo runs across various nodes, although it may lead to the limitations mentioned at the beginning of this post. Below is an example of a very basic MPI code to run Monte Carlo simulations in parallel:

#include <mpi.h>
#include <vector>
#include "path/to/my_model.h"

using namespace std;

int main(int argc, char *argv[])
{
    MPI_Init(NULL, NULL);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    vector<double> system_inputs = ;
    RunMonteCarloSimulation(system_inputs[world_rank]);

    MPI_Finalize();
    return 0;
}

I will not cover in this post how to pass the output of the RunMonteCarloSimulation function back to a results vector, as in the previous example, because the details of MPI are outside the scope of this post (a good MPI tutorial can be found here, and great examples, including ones for Monte Carlo runs, can be found here). To run this code, you would have to compile it with mpicc mpi_demo.c -o MPIDemo and call it with the mpirun or mpiexec commands followed by the -n flag to specify the number of processes to be created, e.g., mpirun -n 4 ./MPIDemo. The code above would run as many Monte Carlo simulations as there are processes (four if -n 4 was used), each with its own rank (here, a type of ID), and can be spread across as many nodes as you can get on a cluster. Each process of the code above would run on one core, which may result in the memory and/or processing limitations listed at the beginning of this post.

Mixing OpenMP and MPI

The ideal situation in most cases is to use MPI to distribute processes across nodes and OpenMP to make sure that as many cores within each node as it makes sense to use are actually being used. To do this, you can use the structure of the code below.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>
#include "path/to/my_model.cpp"

int main(int argc, char *argv[])
{
    MPI_Init(NULL, NULL);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int n_cores = omp_get_max_threads(); // number of threads available to this MPI process.
    vector<double> system_inputs = ;
    vector<double> results(n_cores, 0);
#pragma omp parallel for
    for (int i = 0; i < n_cores; ++i)
    {
        // printf("C hybrid. Process-thread: %d-%d\n", world_rank, omp_get_thread_num());
        results[i] = RunMonteCarloSimulation(system_inputs[i]);
    }

    MPI_Finalize();
    return 0;

}

If you comment lines 4 and 20 and uncomment line 19, compile the code with mpicc hybrid.c -o Hybrid -fopenmp, set the number of threads to eight by running export OMP_NUM_THREADS=8 in the terminal, and call the resulting executable with the command mpirun -n 4 ./Hybrid, the executable will spawn four processes, each of which spawns eight threads and prints the following output:

C hybrid. Process-thread: 0-0
C hybrid. Process-thread: 0-1
C hybrid. Process-thread: 0-4
C hybrid. Process-thread: 0-5
C hybrid. Process-thread: 0-7
C hybrid. Process-thread: 0-6
C hybrid. Process-thread: 0-2
C hybrid. Process-thread: 0-3
C hybrid. Process-thread: 1-6
C hybrid. Process-thread: 1-0
C hybrid. Process-thread: 1-1
C hybrid. Process-thread: 1-3
C hybrid. Process-thread: 1-5
C hybrid. Process-thread: 1-4
C hybrid. Process-thread: 1-7
C hybrid. Process-thread: 1-2
C hybrid. Process-thread: 2-6
C hybrid. Process-thread: 2-0
C hybrid. Process-thread: 2-3
C hybrid. Process-thread: 2-7
C hybrid. Process-thread: 2-2
C hybrid. Process-thread: 2-4
C hybrid. Process-thread: 2-5
C hybrid. Process-thread: 2-1
C hybrid. Process-thread: 3-0
C hybrid. Process-thread: 3-1
C hybrid. Process-thread: 3-3
C hybrid. Process-thread: 3-4
C hybrid. Process-thread: 3-6
C hybrid. Process-thread: 3-7
C hybrid. Process-thread: 3-2
C hybrid. Process-thread: 3-5

The code above can be used to have one process per node (and therefore one copy of system_inputs per node) using as many threads as there are cores in that node. Depending on the number of cores, it is sometimes more efficient to have two or three MPI processes per node, each using half or a third of the cores in that node. Below is a PBS job submission script that can be used to generate the output above; job submission scripts for Slurm to be used on Stampede 2 can be found here.

#!/bin/bash
#PBS -N test_hybrid
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:00:30
#PBS -o ./output/hybrid_c.out
#PBS -e ./error/hybrid_c.err
module load openmpi-1.10.7-gnu-x86_64
export OMP_NUM_THREADS=8

set -x
cd "$PBS_O_WORKDIR"

# Construct a copy of the hostfile with only TASKS_PER_NODE entries per node.
# MPI can use this to run TASKS_PER_NODE tasks on each node.
export TASKS_PER_NODE=2
uniq "$PBS_NODEFILE" | awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i<TASKS_PER_NODE;i++) print $0}' > nodefile

#cat nodefile
mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS ./Hybrid

Parallelizing Python on a Cluster

Python was not designed with shared-memory parallelization in mind, so its native parallelization library, the Multiprocessing library, does distributed-memory parallelization instead of shared. The issue with the Multiprocessing library is that it does not work across nodes, so you still need something like MPI. Here, we will use MPI4Py.

The code below parallelizes function calls across cores within the same node; please refer to this blog post for details about this and other ways to use the Multiprocessing library.

from multiprocessing import Process, cpu_count

def do_something_useful(rank, shared_process_number):
    # Do something useful here.
    print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number)

# Create node-local processes
shared_processes = []
#for i in range(cpu_count()):
for i in range(8):
    p = Process(target=do_something_useful, args=(0, i))  # no MPI here, so 0 stands in for the rank
    shared_processes.append(p)

# Start processes
for sp in shared_processes:
    sp.start()

# Wait for all processes to finish
for sp in shared_processes:
    sp.join()

By adding MPI4Py to the code above, following the same logic as the C/C++ example, we get the following code:

from mpi4py import MPI
from multiprocessing import Process, cpu_count

def do_something_useful(rank, shared_process_number):
    # Do something useful here.
    print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

print 'Calling Python multiprocessing process from Python MPI rank {}'.format(rank)

# Create shared-ish processes
shared_processes = []
#for i in range(cpu_count()):
for i in range(8):
    p = Process(target=do_something_useful, args=(rank, i))
    shared_processes.append(p)

# Start processes
for sp in shared_processes:
    sp.start()

# Wait for all processes to finish
for sp in shared_processes:
    sp.join()

comm.Barrier()

Calling the code above with mpirun -n 4 python hybrid_pure_python.py will first result in four MPI processes being spawned with ranks 0 through 3, each spawning eight local processes. The output should look similar to the one below:

Calling Python multiprocessing process from Python MPI rank 0
Calling Python multiprocessing process from Python MPI rank 1
Python hybrid, MPI_Process-local_process (not quite a thread): 1-0
Python hybrid, MPI_Process-local_process (not quite a thread): 0-0
Python hybrid, MPI_Process-local_process (not quite a thread): 1-1
Python hybrid, MPI_Process-local_process (not quite a thread): 0-1
Python hybrid, MPI_Process-local_process (not quite a thread): 1-2
Python hybrid, MPI_Process-local_process (not quite a thread): 0-2
Python hybrid, MPI_Process-local_process (not quite a thread): 0-3
Python hybrid, MPI_Process-local_process (not quite a thread): 1-3
Calling Python multiprocessing process from Python MPI rank 2
Calling Python multiprocessing process from Python MPI rank 3
Python hybrid, MPI_Process-local_process (not quite a thread): 1-4
Python hybrid, MPI_Process-local_process (not quite a thread): 0-4
Python hybrid, MPI_Process-local_process (not quite a thread): 0-5
Python hybrid, MPI_Process-local_process (not quite a thread): 1-5
Python hybrid, MPI_Process-local_process (not quite a thread): 1-6
Python hybrid, MPI_Process-local_process (not quite a thread): 0-6
Python hybrid, MPI_Process-local_process (not quite a thread): 1-7
Python hybrid, MPI_Process-local_process (not quite a thread): 0-7
Python hybrid, MPI_Process-local_process (not quite a thread): 2-0
Python hybrid, MPI_Process-local_process (not quite a thread): 3-0
Python hybrid, MPI_Process-local_process (not quite a thread): 3-1
Python hybrid, MPI_Process-local_process (not quite a thread): 2-1
Python hybrid, MPI_Process-local_process (not quite a thread): 3-2
Python hybrid, MPI_Process-local_process (not quite a thread): 2-2
Python hybrid, MPI_Process-local_process (not quite a thread): 3-3
Python hybrid, MPI_Process-local_process (not quite a thread): 2-3
Python hybrid, MPI_Process-local_process (not quite a thread): 3-4
Python hybrid, MPI_Process-local_process (not quite a thread): 2-4
Python hybrid, MPI_Process-local_process (not quite a thread): 2-5
Python hybrid, MPI_Process-local_process (not quite a thread): 3-5
Python hybrid, MPI_Process-local_process (not quite a thread): 3-6
Python hybrid, MPI_Process-local_process (not quite a thread): 3-7
Python hybrid, MPI_Process-local_process (not quite a thread): 2-6
Python hybrid, MPI_Process-local_process (not quite a thread): 2-7

Below is a PBS submission script that can be used with the code above:

#!/bin/bash
#PBS -N test_hybrid
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:01:00
#PBS -o ./output/hybrid_python.out
#PBS -e ./error/hybrid_python.err
module load python-2.7.5
export OMP_NUM_THREADS=8

set -x
cd "$PBS_O_WORKDIR"

# Construct a copy of the hostfile with only TASKS_PER_NODE entries per node.
# MPI can use this to run TASKS_PER_NODE tasks on each node.
export TASKS_PER_NODE=2
uniq "$PBS_NODEFILE" | awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for (i = 0; i < TASKS_PER_NODE; i++) print $0}' > nodefile

#cat nodefile
mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_pure_python.py
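The uniq/awk hostfile trick in the script above can be tried on its own. Below is a minimal sketch with a fabricated hostfile (the node names and file names are made up for illustration) showing how each unique node name gets repeated TASKS_PER_NODE times:

```shell
# Fake PBS nodefile: PBS lists each node once per allocated core
printf 'node1\nnode1\nnode2\nnode2\n' > pbs_nodefile_demo

TASKS_PER_NODE=2
# Keep one line per node, then repeat it TASKS_PER_NODE times
uniq pbs_nodefile_demo | awk -v TASKS_PER_NODE="$TASKS_PER_NODE" \
    '{for (i = 0; i < TASKS_PER_NODE; i++) print $0}' > nodefile_demo

cat nodefile_demo
```

With two unique nodes and TASKS_PER_NODE=2, the resulting file has four entries, which is what mpiexec -n 4 expects to find in the hostfile.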

Parallelizing Monte Carlo Runs of C/C++ (or anything else) with Python

Very often we have to run the same model dozens to thousands of times for Monte-Carlo-style analyses. One option is to go on a cluster and submit dozens to thousands of jobs, but this becomes a mess for the scheduler to manage, and cluster administrators tend not to be very fond of such an approach. A cleaner way of performing Monte Carlo runs in parallel is to write code in Python or another language that spawns MPI processes which then call the model to be run. This can be done in Python with the subprocess library, as in the example below:

from mpi4py import MPI
import subprocess

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

print('Calling OpenmpTest from Python MPI rank {}'.format(rank))
function_call = ['./OpenmpTest', '{}'.format(rank)]
subprocess.call(function_call)

comm.Barrier()

Here, OpenmpTest is an executable compiled from the C-OpenMP code below:

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
#pragma omp parallel for
    for (int i = 0; i < omp_get_num_threads(); ++i)
    {
        printf("C OpenMP. Process-thread: %s-%d\n", argv[1], i);
    }
    return 0;
}

Submitting the code above with the following PBS job submission script should result in the output below.

#!/bin/bash
#PBS -N test_hybrid
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:01:00
#PBS -o ./output/hybrid_python_c.out
#PBS -e ./error/hybrid_python_c.err
module load python-2.7.5
export OMP_NUM_THREADS=8

set -x
cd "$PBS_O_WORKDIR"

# Construct a copy of the hostfile with only TASKS_PER_NODE entries per node.
# MPI can use this to run TASKS_PER_NODE tasks on each node.
export TASKS_PER_NODE=2
uniq "$PBS_NODEFILE" | awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for (i = 0; i < TASKS_PER_NODE; i++) print $0}' > nodefile

#cat nodefile
mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_calls_c.py

Output:

Calling OpenmpTest from Python MPI rank 0
Calling OpenmpTest from Python MPI rank 1
Calling OpenmpTest from Python MPI rank 2
Calling OpenmpTest from Python MPI rank 3
C OpenMP. Process-thread: 0-7
C OpenMP. Process-thread: 0-5
C OpenMP. Process-thread: 0-0
C OpenMP. Process-thread: 0-2
C OpenMP. Process-thread: 0-4
C OpenMP. Process-thread: 0-1
C OpenMP. Process-thread: 0-3
C OpenMP. Process-thread: 0-6
C OpenMP. Process-thread: 1-1
C OpenMP. Process-thread: 1-2
C OpenMP. Process-thread: 1-6
C OpenMP. Process-thread: 1-4
C OpenMP. Process-thread: 1-5
C OpenMP. Process-thread: 1-0
C OpenMP. Process-thread: 1-3
C OpenMP. Process-thread: 1-7
C OpenMP. Process-thread: 2-3
C OpenMP. Process-thread: 2-7
C OpenMP. Process-thread: 2-0
C OpenMP. Process-thread: 2-4
C OpenMP. Process-thread: 2-6
C OpenMP. Process-thread: 2-5
C OpenMP. Process-thread: 2-1
C OpenMP. Process-thread: 2-2
C OpenMP. Process-thread: 3-5
C OpenMP. Process-thread: 3-1
C OpenMP. Process-thread: 3-4
C OpenMP. Process-thread: 3-2
C OpenMP. Process-thread: 3-7
C OpenMP. Process-thread: 3-3
C OpenMP. Process-thread: 3-0
C OpenMP. Process-thread: 3-6
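The subprocess pattern used in the Python driver above can be exercised on a laptop without MPI or a compiled model. In the sketch below, a trivial Python one-liner stands in for the OpenmpTest executable, and the rank is hard-coded since there is no MPI here:

```python
import subprocess
import sys

rank = 0  # stand-in for comm.Get_rank() in the MPI version

# Build the command as a list of strings, exactly as in the MPI example,
# but launch a short Python command instead of the compiled model
function_call = [sys.executable, '-c',
                 "print('model run for rank {}')".format(rank)]
return_code = subprocess.call(function_call)
print('model exited with code {}'.format(return_code))
```

subprocess.call blocks until the child process finishes and returns its exit code, so a return code of 0 tells the driver the model run completed normally.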

End of the Story

As highlighted in the examples above (all of which can be downloaded from here), creating parallel code is not rocket science and requires very few lines of code. The OpenMP and multiprocessing parallelization tricks can save hours if you have to run a model several times on your personal computer, and paired with MPI these tricks can save you from hours to days in wall-clock time and thousands of node-hours on clusters. I hope you make good use of them.

Intro to Boosting

Introduction

Once upon a time, in a machine learning class project, student Michael Kearns asked if it is possible to convert a weak classifier (high bias classifier whose outputs are correct only slightly over 50% of the time) into a classifier of arbitrary accuracy by using an ensemble of such classifiers. This question was posed in 1988 and, two years later, in 1990, Robert Schapire answered that it is possible (Schapire, 1990). And so boosting was born.

The idea of boosting is to train an ensemble of weak classifiers, each of which is an expert in a specific region of the space. The ensemble of classifiers has the form below:
H(\vec{x}) = \sum_{t=1}^T \alpha_th_t(\vec{x})
where H is the ensemble of classifiers, \alpha_t is the weight assigned to weak classifier h_t around samples \vec{x} at iteration t, and T is the number of classifiers in the ensemble.

Boosting creates such an ensemble in a similar fashion to gradient descent. However, instead of:
\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \alpha \nabla \ell(\boldsymbol{x}_t)
as in gradient descent in real space, where \ell is a loss function, Boosting is trained via gradient descent in functional space, so that:
H_{t+1}(\boldsymbol{X}) = H_t(\boldsymbol{X}) + \alpha_t \nabla \ell(h_t(\boldsymbol{X}))

The question then becomes how to find the \alpha_t's and the classifiers h_t. Before answering these questions, we should first get a geometric intuition for boosting. The presentation in the following sections is based on these course notes.

Geometric Intuition

In the example below we have a set of blue crosses and red circles that we would like our ensemble of weak classifiers to correctly classify (panel “a”). Our weak classifier for this example will be a line, orthogonal to one of the axes, dividing the blue-cross region from the red-circle region. Such a classifier can also be called a CART tree (Breiman et al., 1984) with a depth of 1 — hereafter called a tree stump. For now, let’s assume all tree stumps will have the same weight \alpha_t in the final ensemble.

[Figure: boosting iterations on the toy data set, panels “a” through “d”]

The first tree stump, a horizontal divide in panel “b,” classified ten out of thirteen points correctly but failed to classify the remaining three. Since it incorrectly classified a few points in the last attempt, we would like the next classifier to correctly classify these points. To make sure that will be the case, boosting will increase the weight of the points that were misclassified earlier before training the new tree stump. The second tree stump, a vertical line on panel “c,” correctly classifies the two blue crosses that were originally misclassified, although it now incorrectly classifies three crosses that were originally classified correctly. For the third classifier, boosting will increase the weight of the three bottom misclassified crosses, as well as of the other two misclassified crosses and the circle, because they are still not correctly classified — technically speaking they are tied, with each of the two classifiers classifying them differently, but here we consider a tie a wrong classification. The third iteration will then prioritize correcting the high-weight points again, and will end up as the vertical line on the right of panel “d.” Now all points are correctly classified.

There are different ways to mathematically approach Boosting. But before getting to Boosting, it is a good idea to go over gradient descent, which is a basis of Boosting. Following that, this post will cover, as an example, the AdaBoost algorithm, which assumes an exponential loss function.

Minimizing a Loss Function

Standard gradient descent

Before getting into boosting per se, it is worth going over the standard gradient descent algorithm. Gradient descent is a minimization algorithm (hence, descent). The goal is to move from an initial x_0 to the value of x with the minimum value of f(x), which in machine learning is a loss function, henceforth called \ell(x). Gradient descent does that by moving one step of length s at a time, starting from x_0, in the direction of the steepest downhill slope at location x_t, the value of x at the t^{th} iteration. This idea is formalized by the first-order Taylor expansion below:
\ell(x_{t + 1}) = \ell(x_t + s) \approx \ell(x_t) + sg(x_t)
where g(x_t)=\nabla\ell(x_t). Taking a step against the gradient,
s=-\alpha g(x_t)
which leads to:
\ell\left[x_t -\alpha g(x_t)\right] \approx \ell(x_t)-\alpha g(x_t)^Tg(x_t)
where \alpha, called the learning rate, must be positive and can be set as a fixed parameter. The dot product in the last term, g(x_t)^Tg(x_t), will always be positive, which means that the loss should always decrease — the reason for the italics is that too high a value of \alpha may make the algorithm diverge, so small values around 0.1 are recommended.
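As a quick numerical illustration of the update above (a toy example of mine, not from the course notes), here is gradient descent minimizing \ell(x) = (x - 3)^2, whose minimizer is x = 3:

```python
def loss(x):
    return (x - 3.0) ** 2

def gradient(x):
    # g(x) = dl/dx = 2(x - 3)
    return 2.0 * (x - 3.0)

x = 0.0          # starting point x_0
alpha = 0.1      # learning rate; too large a value would make this diverge

for t in range(100):
    x = x - alpha * gradient(x)   # x_{t+1} = x_t - alpha * g(x_t)

print(x)  # approaches the minimizer x = 3
```

Each iteration shrinks the distance to the minimizer by a constant factor here, so the iterates converge to 3 quickly; setting alpha to, say, 1.5 would instead make the iterates oscillate and blow up.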

Gradient Descent in Functional Space

What if x is actually a function instead of a real number? This would mean that the loss function \ell(\cdot) would be a function of a function, say \ell(H(\boldsymbol{x})), instead of a function of a real number x. As mentioned, gradient descent in real space works by adding small quantities to x_0 to find the final x_{min}, which amounts to an ensemble of small \Delta x's added together. By analogy, gradient descent in functional space works by adding functions to an ever-growing ensemble of functions. Using the definition of a functional gradient, which is beyond the scope of this post, this leads us to:
\ell(H+\alpha h) \approx \ell(H) + \alpha \langle\nabla \ell(H),h\rangle
where H is an ensemble of functions, h is a single function, and the \langle f, g\rangle notation denotes a dot product between f and g. Gradient descent in functional space is an important component of boosting. Based on that, the next section will talk about AdaBoost, a specific boosting algorithm.

AdaBoost

Basic Definitions

The goal of AdaBoost is to find an ensemble function H of weak classifiers h that minimizes the exponential loss function below for a binary classification problem:
\ell(H)=\sum_{i=1}^ne^{-y(x_i)H(x_i)}
where (x_i, y(x_i)) is the i^{th} data point in the training set. The step size \alpha can be interpreted as the weight of each classifier in the ensemble, and is optimized for each function h added to the ensemble. AdaBoost is an algorithm for binary classification, meaning the independent variables \boldsymbol{x} have a corresponding vector of dependent variables \boldsymbol{y}, in which each label y(x_i) \in \{-1, 1\} denotes the class to which a point belongs (say, -1 for red circles and 1 for blue crosses). The weak classifiers h in AdaBoost also return h(x) \in \{-1, 1\}.

Setting the Weights of Each Classifier

The weight \alpha_t of each weak classifier h_t can be found by performing the following minimization:
\alpha=argmin_{\alpha}\ell(H+\alpha h)
Since the loss function is defined as the summation of the exponential loss of each point in the training set, the minimization problem above can be expanded into:
\alpha=argmin_{\alpha}\sum_{i=1}^ne^{-y(\boldsymbol{x}_i)\left[H(\boldsymbol{x}_i)+\alpha h(\boldsymbol{x}_i)\right]}
Differentiating the loss w.r.t. \alpha, equating the result with zero, and performing a few steps of algebra leads to:
\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}
where \epsilon is the classification error of weak classifier h_t. The error \epsilon can be calculated for AdaBoost as shown next.
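For completeness, the omitted algebra can be sketched as follows. Up to a constant factor, the loss splits into the correctly classified points, for which y(\boldsymbol{x}_i)h(\boldsymbol{x}_i)=+1 and whose normalized weights sum to 1-\epsilon, and the misclassified points, for which y(\boldsymbol{x}_i)h(\boldsymbol{x}_i)=-1 and whose weights sum to \epsilon:

```latex
\frac{\partial \ell(H+\alpha h)}{\partial \alpha}
  \propto \frac{\partial}{\partial \alpha}\left[(1-\epsilon)e^{-\alpha} + \epsilon e^{\alpha}\right]
  = -(1-\epsilon)e^{-\alpha} + \epsilon e^{\alpha} = 0
\quad\Longrightarrow\quad
e^{2\alpha} = \frac{1-\epsilon}{\epsilon}
\quad\Longrightarrow\quad
\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}
```

Note that \alpha is positive whenever \epsilon < \frac{1}{2}, i.e., whenever the weak classifier does better than random guessing.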

The Classification Error of one Weak Classifier

Following from gradient descent in functional space, the next best classifier h_{t+1} will be the one that minimizes the term \langle\nabla \ell(H),h\rangle, which equals zero when the minimum of the loss function has been reached — the loss being convex for a problem such as this. Replacing the dot product by a summation, this minimization problem can be written as:
h(\boldsymbol{x}_i)=argmin_h\epsilon=argmin_h\langle\nabla \ell(H),h\rangle
h(\boldsymbol{x}_i)=argmin_h\sum_{i=1}^n\frac{\partial e^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}}{\partial H(\boldsymbol{x}_i)}h(\boldsymbol{x}_i)
which after some algebra becomes:
h(\boldsymbol{x}_i)=argmin_h\sum_{i:h(\boldsymbol{x}_i)\neq y(\boldsymbol{x}_i)}w_i
(the summation of the weights of misclassified points)

Comparing the last with the first expression, we have that the error \epsilon for iteration t is simply the summation of the weights of the points misclassified by h(\boldsymbol{x}_i)_t — e.g., in panel “b” the error would be the summation of the weights of the two crosses on the upper left corner and of the circle at the bottom right corner. Now let’s get to these weights.
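In code, with a vector of normalized point weights and the stump's predictions, this error is a one-liner (the toy numbers below are mine):

```python
import numpy as np

# Toy set-up: labels, current stump predictions, and normalized point weights
y = np.array([1, 1, -1, -1, 1])
h_pred = np.array([1, -1, -1, 1, 1])  # the stump gets the 2nd and 4th points wrong
w = np.full(5, 1.0 / 5)               # uniform weights, as in the first iteration

# The error is simply the summation of the weights of the misclassified points
epsilon = w[h_pred != y].sum()
print(epsilon)  # 0.4
```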

The Weights of the Points

There are multiple ways we can think of for setting the weights of each data point at each iteration. Schapire (1990) found a great way of doing so:
\boldsymbol{w}_{t+1}=\frac{\boldsymbol{w}_t}{Z}e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}
where \boldsymbol{w} is a vector containing the weights of each point and Z is a normalization factor that ensures the weights sum up to 1. Be sure not to confuse \alpha_t, the weight of classifier t, with the weights of the points at iteration t, represented by \boldsymbol{w}_t. For the weights to sum up to 1, Z needs to be the sum of their pre-normalization values, which is actually identical to the loss function, so
Z=\sum_{i=1}^ne^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}
Using the definition of the error \epsilon, the update for Z can be shown to be:
Z_{t+1}=Z_t\cdot2\sqrt{\epsilon(1-\epsilon)}
so that the complete update is:
w_{t+1}=w_t \frac{e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}}{2\sqrt{\epsilon(1-\epsilon)}}
Now AdaBoost is ready for implementation following the pseudo-code below.

AdaBoost Pseudo-code

Below is a pseudo-code of AdaBoost. Note that it can be used with any weak learner (high-bias) classifier. Again, shallow decision trees are a common choice given their simplicity and good performance.

[Figure: AdaBoost pseudo-code]
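Following the pseudo-code, below is a minimal NumPy sketch of AdaBoost with tree stumps on a single feature. This is my own illustration of the derivation above, not production code — in practice one would reach for, e.g., scikit-learn's AdaBoostClassifier:

```python
import numpy as np

def train_stump(x, y, w):
    """Exhaustively pick the (threshold, sign) stump with lowest weighted error."""
    best_thr, best_sign, best_eps = x[0], 1, np.inf
    for thr in x:
        for sign in (1, -1):
            pred = np.where(x <= thr, sign, -sign)
            eps = w[pred != y].sum()   # summation of weights of misclassified points
            if eps < best_eps:
                best_thr, best_sign, best_eps = thr, sign, eps
    return best_thr, best_sign, best_eps

def adaboost(x, y, T=10):
    n = len(x)
    w = np.full(n, 1.0 / n)            # initial point weights 1/n
    stumps, alphas = [], []
    for _ in range(T):
        thr, sign, eps = train_stump(x, y, w)
        eps = max(eps, 1e-12)          # guard against log(0) for a perfect stump
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # classifier weight
        pred = np.where(x <= thr, sign, -sign)
        w = w * np.exp(-alpha * pred * y)  # up-weight misclassified points
        w = w / w.sum()                    # Z normalization
        stumps.append((thr, sign))
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, x):
    # H(x) = sum_t alpha_t * h_t(x), classified by its sign
    H = sum(a * np.where(x <= thr, sign, -sign)
            for (thr, sign), a in zip(stumps, alphas))
    return np.sign(H)

x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([1, 1, -1, -1, 1, 1])   # not separable by any single stump
stumps, alphas = adaboost(x, y, T=10)
print(predict(stumps, alphas, x))    # training error reaches zero
```

No single stump can separate this interval-shaped data set, but after a few boosting rounds the weighted vote of three recurring stumps classifies every point correctly, exactly as in the geometric intuition above.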

Boosting Will Converge, and Fast

One fascinating fact about boosting is that any boosted algorithm is guaranteed to converge independently of which weak classifier is chosen. Recall that Z, the normalization factor for the point-weights update, equals the loss function. That being the case, we get the following relation:
\ell(H)=Z=n\prod_{t=1}^T2\sqrt{\epsilon_t(1-\epsilon_t)}
where n is the normalizing factor for all weights at step 0 (all of the weights are initially set to 1/n). To derive an expression for the upper bound of the error, let’s assume that the errors at all steps t equal their highest value, \epsilon_{max}. We have that:
\ell(H)\leq n\left[2\sqrt{\epsilon_{max}(1-\epsilon_{max})}\right]^T
Given that necessarily \epsilon_{max} \leq \frac{1}{2}, we have that
\epsilon_{max}(1-\epsilon_{max})\leq\frac{1}{4}
or
\epsilon_{max}(1-\epsilon_{max})=\frac{1}{4}-\gamma^2
for any \gamma in the interval \left[-\frac{1}{2},\frac{1}{2}\right]. Therefore, replacing the equation above in the first loss inequality written as a function of \epsilon_{max}, we have that:
\ell(H)\leq n(1-4\gamma^2)^{T/2}
which means that the training error is bounded by an exponential decay as you add classifiers to the ensemble. This is a fantastic result and applies to any boosted algorithm!
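This bound is easy to evaluate numerically. In the sketch below (toy numbers of mine), n = 1000 and the weak learners are assumed to always have an edge of \gamma = 0.1 over random guessing (i.e., \epsilon_{max} = 0.4):

```python
n = 1000        # number of training points; weights start at 1/n
gamma = 0.1     # weak learner's edge over random guessing (eps_max = 0.4)

def loss_bound(T):
    """Upper bound n(1 - 4*gamma^2)^(T/2) on the training loss after T classifiers."""
    return n * (1.0 - 4.0 * gamma ** 2) ** (T / 2.0)

for T in (10, 100, 340):
    print(T, loss_bound(T))
# Once the bound drops below 1, the 0/1 training error must be exactly zero
```

Even with such modest weak learners, the bound decays exponentially, and by roughly T = 340 iterations it falls below 1, at which point not a single training point can be misclassified.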

(Next to) No Overfitting

Lastly, boosted algorithms are remarkably resistant to overfitting. According to Murphy (2012), a possible reason is that boosting can be seen as a form of \ell_1 regularization, which is prone to eliminating irrelevant features and thus reducing overfitting. Another explanation is related to the concept of margins: at least certain boosting algorithms force a classification on a point only if it can be made by a certain margin, thus also preventing overfitting.

Final Remarks

In this post, I presented the general idea of boosting a weak classifier, emphasizing its use with shallow CART trees, and used the AdaBoost algorithm as an example. However, other loss functions can be used and boosting can also be used for non-binary classifiers and for regression. The Python package scikit-learn in fact allows the user to use boosting with different loss functions and with different weak classifiers.

Despite the theoretical proof that boosting does not overfit, researchers running it for extremely long times on rather big supercomputers found that at some point it starts to overfit, although still very slowly. Still, that is not likely to happen in your application. Lastly, boosting with shallow decision trees is also a great way to have fine control over how much bias there will be in your model, as all you need to do is choose the number of iterations T.

Bibliography

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees (Monterey, California: Wadsworth).
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective (Cambridge, Massachusetts: MIT Press).
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197-227.

Introduction to Unit Testing

Motivation

Here is something that has happened to most or all of us engineers who code. We write a function or a class and make sure it works by running it and printing some internal state output on the console. The following week, after adding or making improvements on seemingly unrelated parts of the code, the results start to look weird. After a few hours of investigation (a.k.a. wasted time) we find out the reason for the weird results is that that first function or class is either not giving us the right numbers anymore or not behaving as expected with different inputs. I wonder how many papers would have been written, how many deadlines would have been met, and how many “this $@#%& does not @#$% work!!!” thoughts and comments to friends and on our codes themselves would have been avoided if we managed to get rid of such frustrating and lengthy situations.

In this post I will introduce a solution to this rather pervasive issue: unit testing, which amounts to ways of creating standard tests throughout the development of a code to avoid the mentioned setbacks and simplify debugging. I will also get into some of the specifics of unit testing for Python and demonstrate it with a reservoir routing example. Lastly, I will briefly mention a unit testing framework for C++.

The General Idea

Unit test frameworks are libraries with functions (or macros) that allow for the creation of standard tests for individual parts of code, such as functions and classes, that can be run separately from the rest of the code. The advantages of using unit testing are mainly that (1) it eliminates the need to run the entire code with a full set of inputs simply to see if a specific function or class is working, which in some cases can get convoluted and take a really long time, (2) it allows for easy and fast debugging, given that if a certain test fails you know where the problem is, (3) it prevents you or someone else from breaking the code down the road, provided the tests are run frequently as the code is developed, and (4) it prevents an error in a function or class from only becoming noticeable in another, which complicates the debugging process. Of course, the more modular a code is, the easier it is to separately test its parts — a code comprised of just one 1,000-line-long function cannot be properly unit-tested, or rather can only have one unit test for the entire function.

Example Code to be Tested

The first thing we need in order to perform unit testing is a code to be tested. Let’s then consider the reservoir routing code below, comprised of a reservoir class, a synthetic stream flow generation function, a function to run the simulation over time, and the main script:

import numpy as np
import matplotlib.pyplot as plt

class Reservoir:
    __capacity = -1
    __storage_area_curve = np.array([[], []])
    __evaporation_series = np.array([])
    __inflow_series = np.array([])
    __demand_series = np.array([])
    __stored_volume = np.array([])

    def __init__(self, storage_area_curve, evaporations, inflows, demands):
        """ Constructor for reservoir class

        Arguments:
            storage_area_curve {2D numpy matrix} -- Matrix with a
            row of storages and a row of areas
            evaporations {array/list} -- array of evaporations
            inflows {array/list} -- array of inflows
            demands {array/list} -- array of demands
        """

        self.__storage_area_curve = storage_area_curve
        assert(storage_area_curve.shape[0] == 2)
        assert(len(storage_area_curve[0]) == len(storage_area_curve[1]))

        self.__capacity = storage_area_curve[0, -1]

        n_weeks = len(demands)
        self.__stored_volume = np.ones(n_weeks, dtype=float) * self.__capacity
        self.__evaporation_series = evaporations
        self.__inflow_series = inflows
        self.__demand_series = demands

    def calculate_area(self, stored_volume):
        """ Calculates reservoir area based on its storave vs. area curve

        Arguments:
            stored_volume {float} -- current stored volume

        Returns:
            float -- reservoir area
        """

        storage_area_curve_T = self.__storage_area_curve.T.astype(float)

        if stored_volume < 0. or stored_volume > self.__capacity:
            print("Storage volume {} outside the range 0 to capacity {}.".format(
                stored_volume, self.__capacity))
            raise ValueError

        for i in range(1, len(storage_area_curve_T)):
            s, a = storage_area_curve_T[i]
            if stored_volume < s:
                sm, am = storage_area_curve_T[i - 1]
                return am + (stored_volume - sm) / (s - sm) * (a - am)

        return a

    def mass_balance(self, upstream_flow, week):
        """ Performs mass balance on reservoir

        Arguments:
            upstream_flow {float} -- release from upstream reservoir
            week {int} -- week

        Returns:
            double, double -- reservoir release and unfulfilled
            demand (in case reservoir gets empty)
        """

        evaporation = self.__evaporation_series[week] *\
            self.calculate_area(self.__stored_volume[week - 1])
        new_stored_volume = self.__stored_volume[week - 1] + upstream_flow +\
            self.__inflow_series[week] - evaporation - self.__demand_series[week]

        release = 0
        unfulfiled_demand = 0

        if (new_stored_volume > self.__capacity):
            release = new_stored_volume - self.__capacity
            new_stored_volume = self.__capacity
        elif (new_stored_volume < 0.):
            unfulfiled_demand = -new_stored_volume
            new_stored_volume = 0.

        self.__stored_volume[week] = new_stored_volume

        return release, unfulfiled_demand

    def get_stored_volume_series(self):
        return self.__stored_volume

def run_mass_balance(reservoirs, n_weeks):
    release = 0
    total_unfulfilled_demand = 0
    for week in range(1, n_weeks):
        for reservoir in reservoirs:
            release, unfulfilled_demand = reservoir.mass_balance(release, week)
            total_unfulfilled_demand += unfulfilled_demand
    return total_unfulfilled_demand

def generate_streamflow(n_weeks, sin_amplitude, log_mu, log_sigma):
    """Log-normally distributed stream flow generator.

    Arguments:
        n_weeks {int} -- number of weeks of stream flows
        sin_amplitude {double} -- amplitude of log-mean sinusoid fluctuation
        log_mu {double} -- mean log-mean
        log_sigma {double} -- log-sigma

    Returns:
        {Numpy array} -- stream flow series
    """

    streamflows = np.zeros(n_weeks)

    for i in range(n_weeks):
        streamflow = np.random.randn() * log_sigma + log_mu *\
            (1. + sin_amplitude * np.sin(2. * np.pi / 52 * (i % 52)))
        streamflows[i] = np.exp(streamflow)

    return streamflows

if __name__ == "__main__":
    np.random.seed(41)
    n_weeks = 522
    sin_amplitude, log_mean, log_std = 1., 2.0, 1.8
    streamflows = generate_streamflow(n_weeks, sin_amplitude, log_mean, log_std)
    storage_area_curve = np.array([[0, 1000, 3000, 4000], [0, 400, 600, 900]])

    reseroir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks) / 10,\
        streamflows, np.random.rand(n_weeks) * 60)

    run_mass_balance([reseroir1], n_weeks)

    plt.plot(reseroir1.get_stored_volume_series())
    plt.xlabel("Weeks [-]")
    plt.ylabel("Stored Volume [MG]")
    plt.title("Storage Over Time")
    plt.ylim(0, 4100)
    plt.show()

We have quite a few functions to be tested in the code above, but let’s concentrate on Reservoir.calculate_area and generate_streamflow. The first is a method of a class (Reservoir), while the second is a standalone function. Another difference between the two is that we know exactly what output to expect from Reservoir.calculate_area (an area given a storage), while, since generate_streamflow is set up to return random numbers, we can only test the estimated statistical properties of its output, which will change slightly every time we run the function. We will use the PyTest framework to perform unit testing on the reservoir routing code above.

PyTest

PyTest is one among many available unit test frameworks for Python. I chose it because it is commonly used by Python developers. You can find the API (list of commands) here and more examples here.

Basic Work Flow

Using PyTest is pretty simple. All you have to do is:

  • Install PyTest with pip or conda (pip install pytest or conda install pytest)
  • Create a file (preferably in the same directory as the file with the code you want to test) whose name either begins with “test_” or ends with “_test.py” — this is a “must.” In this example, the main code is called “reservoir_routing.py,” so I created the PyTest file in the same directory and named it “test_reservoir_routing.py.”
  • In the unit test file (“test_reservoir_routing.py”), import from the main code file the functions and classes you want to test, the PyTest module pytest and any other packages you may use to write the tests, such as Numpy.
  • Write the tests. Tests in PyTest are simply regular Python functions whose names begin with “test_” and that have assertions (“assert,” will get to that below) in them.
  • On the Command Prompt, PowerShell, or Linux terminal, run the command “pytest.” PyTest will then look for all files whose names begin with “test_” or end in “_test.py,” scan them for functions beginning with “test_,” which indicate test functions, run them, and give you the results.

Assertions

Assertions are the actual checks, the cores of the tests. Assertions are performed with the assert command and will return “pass” if the condition being asserted is true and “fail” otherwise. Below is an example of an assertion, which will return “pass” if the function being tested returns the value on the right-hand side for the given arguments and “fail” otherwise:

assert function_calculate_something(1, 'a', 5) == 100

Sometimes we are not sure of the value a function will return but have an approximate idea. This is the case for functions that perform stochastic calculations or calculate numerically approximate solutions — e.g. sampling from distributions, the Newton-Raphson minimum finder, the E-M algorithm, finite-difference PDE solvers, etc. In such cases, we should perform our assertion against an approximate value with:

# assert if function returns 100 +- 2
assert function_calculate_something(1, 'a', 5) == pytest.approx(100., abs=2.)

or with:

# assert if function returns 100 +- 0.1 (or 100 +- 100 * 1e-3)
assert function_calculate_something(1, 'a', 5) == pytest.approx(100., rel=1e-3)

Sometimes (although, unfortunately, rarely) we add code to a function to check the validity of the arguments and intermediate calculations, or to handle exceptions during the function’s execution. We can also test all of that with:

# Test if an exception (such as Exception) is properly raised
with pytest.raises(Exception):
    function_calculate_something(1345336, 'agfhdhdh', 54784574)

Test File Example

Putting all the above together, below is a complete PyTest file with two tests (two test functions, each with whatever number of assertions):

from reservoir_routing import Reservoir, generate_streamflow
import numpy as np
import pytest

def test_calculate_area():
    n_weeks = 522
    storage_area_curve = np.array([[0, 500, 800, 1000], [0, 400, 600, 900]])

    reseroir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks),
                            np.zeros(n_weeks), np.random.rand(n_weeks) * 20)

    # Test specific values of storages and areas
    assert reseroir1.calculate_area(500) == 400
    assert reseroir1.calculate_area(650) == 500
    assert reseroir1.calculate_area(1000) == 900

    # Test if exceptions are properly raised (one call per "with" block,
    # since pytest.raises stops at the first exception)
    with pytest.raises(ValueError):
        reseroir1.calculate_area(-10)
    with pytest.raises(ValueError):
        reseroir1.calculate_area(1e6)

def test_generate_streamflow():
    n_weeks = 100000
    log_mu = 7.8
    log_sigma = 0.5
    sin_amplitude = 1.
    streamflows = generate_streamflow(n_weeks, sin_amplitude, log_mu, log_sigma)

    whitened_log_mu = (np.log(streamflows[range(0, n_weeks, 52)]) - log_mu) / log_sigma

    # Test if whitened mean in log space is 0 +- 0.05
    assert np.mean(whitened_log_mu) == pytest.approx(0., abs=0.05)

If I open my Command Prompt, navigate to the folder where both the main code and the test Python files are located and run the command pytest I will get the output below:

F:\Dropbox\Bernardo\Blog posts>pytest
============================= test session starts =============================
platform win32 -- Python 2.7.13, pytest-4.2.0, py-1.7.0, pluggy-0.8.1
rootdir: F:\Dropbox\Bernardo\Blog posts, inifile:
collected 2 items

test_reservoir_routing.py F.                                             [100%]

================================== FAILURES ===================================
_____________________________ test_calculate_area _____________________________

    def test_calculate_area():
        n_weeks = 522
        storage_area_curve = np.array([[0, 500, 800, 1000], [0, 400, 600, 900]])

        reseroir1 = Reservoir(storage_area_curve, np.random.rand(n_weeks),
                                np.zeros(n_weeks), np.random.rand(n_weeks) * 20)

        # Test specific values of storages and areas
        assert reseroir1.calculate_area(500) == 400
>       assert reseroir1.calculate_area(650) == 500
E       assert 400 == 500
E        +  where 400 = <bound method Reservoir.calculate_area of <Reservoir object>>(650)
E        +    where <bound method Reservoir.calculate_area of <Reservoir object>> = <Reservoir object>.calculate_area

test_reservoir_routing.py:14: AssertionError
========================== deprecated python version ==========================
You are using Python 2.7.13, which will no longer be supported in pytest 5.0
For more information, please read:
  https://docs.pytest.org/en/latest/py27-py34-deprecation.html
===================== 1 failed, 1 passed in 1.07 seconds ======================

F:\Dropbox\Bernardo\Blog posts>

This seems like a lot just to say that one test passed and the other failed, but there is actually some quite useful information in this output. The lines below (20 and 21 of the output above) indicate which assertion failed, so you can go ahead and debug the corresponding part of the code with a debugger:

>       assert reseroir1.calculate_area(650) == 500
E       assert 400 == 500

I actually found this error with PyTest as I was writing the reservoir routing code for this post and used VS Code, which has a debugger integrated with PyTest, to debug and fix the error. The error is happening because of an int to float conversion that is not being done correctly in line 45 of the original code, which can be replaced by the version below to fix the problem:

storage_area_curve_T = self.__storage_area_curve.T.astype(float)
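The failure mode is easy to reproduce in isolation. The sketch below uses Python 3's floor-division operator (//) to mimic what Python 2's "/" did to the integer-typed curve, and shows how casting to float restores the intended interpolation:

```python
import numpy as np

# Same storage-area curve as in the test above (note the integer dtype)
curve = np.array([[0, 500, 800, 1000], [0, 400, 600, 900]])

# Under Python 2, "/" on these integer values floor-divides, so the
# interpolation fraction (650 - 500) / (800 - 500) collapses to 0.
# Here we reproduce that with "//", Python 3's floor-division operator:
frac_int = (650 - curve[0, 1]) // (curve[0, 2] - curve[0, 1])       # 0

# Casting the curve to float first gives the intended fraction:
curve_f = curve.astype(float)
frac = (650 - curve_f[0, 1]) / (curve_f[0, 2] - curve_f[0, 1])      # 0.5
area = curve_f[1, 1] + frac * (curve_f[1, 2] - curve_f[1, 1])       # 500.0
```

With the float cast, the interpolated area at a storage of 650 comes out as 500, which is exactly what the failing assertion expected.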

The last piece of important information, contained in the last few lines of the output, is that support for Python 2.7 will be discontinued, as is happening with other Python packages as well. It may be time to finally switch to Python 3.

Practice Exercise

As a practice exercise, try writing a test function for the Reservoir.mass_balance function, including cases in which the reservoir spills, empties, and is partly full.

Unit Testing in C++

A framework for unit testing in C++ that is quite popular is the Catch2 framework. The idea is exactly the same: you have code you want to test, and you use the framework to write test functions. An example of Catch2 being used to check a factorial function is shown below, with both the main and the test codes in the same file:

#define CATCH_CONFIG_MAIN  // This tells Catch to provide a main() - only do this in one cpp file
#include "catch.hpp"

unsigned int Factorial( unsigned int number ) {
    return number <= 1 ? 1 : Factorial(number-1)*number;  // returning 1 here also covers 0! == 1
}

TEST_CASE( "Factorials are computed", "[factorial]" ) {
    REQUIRE( Factorial(1) == 1 );
    REQUIRE( Factorial(2) == 2 );
    REQUIRE( Factorial(3) == 6 );
    REQUIRE( Factorial(10) == 3628800 );
}

Conclusion

Setting up and running unit testing is not rocket science. Also, with my test functions I do not have to use my actual reservoir system test case in order to check if the surface area calculations are correct. If I find they are not, I can just re-run the test function with a debugger (VS Code is great for that) and fix my function without having to deal with the rest of the code. You would be surprised by how many hours of debugging can be saved by taking the unit test approach. Lastly, be sure to run your tests after any change to your code you consider minimally significant (I know, this sounds vague). You never know when you mess something up by trying to fix or improve something else.

Introduction to Gaussian Processes

In this post, I will cover the very basics of Gaussian Processes following the presentations in Murphy (2012) and Rasmussen and Williams (2006) — though hopefully easier to digest. I will approach the explanation in two ways: (1) derived from Bayesian Ordinary Linear Regression and (2) derived from the definition of Gaussian Processes in functional space (it is easier than it sounds). I am not sure why the mentioned books do not use “^” to denote estimation (if you know why, please leave a comment), but I will stick to their notation assuming you may want to look into them, despite the risk of a statistical sin. Lastly, I would like to thank Jared Smith for reviewing this post and providing highly statistically significant insights.

If you are not familiar with Bayesian regression (see my previous post), skip the section below and begin reading from “Succinct derivation of Gaussian Processes from functional space.”

From Bayesian Ordinary Linear Regression to Gaussian Processes

Recapitulation of Bayesian Ordinary Regression

In my previous blog post, about Bayesian Ordinary Linear Regression (OLR), I showed that the predictive distribution for a point \boldsymbol{x}_* for which we are trying to estimate y_* with Bayesian OLR is

\begin{aligned}p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2) &= \int_{\boldsymbol{\beta}}\underbrace{p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2,\boldsymbol{\beta})}_\text{Likelihood}\underbrace{p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2)}_\text{\parbox{1.6cm}{Parameter Posterior}}d\boldsymbol{\beta}\\  &=\mathcal{N}(y_*|\boldsymbol{\beta}_N^T\boldsymbol{x}_*, \sigma_N^2(\boldsymbol{x}_*))\end{aligned}

where we are trying to regress a model over the data set \mathcal{D}=\{\boldsymbol{X},\boldsymbol{y}\}, \boldsymbol{X} being the independent variables and \boldsymbol{y} the dependent variable, \boldsymbol{\beta} is a vector of model parameters and \sigma_{\epsilon}^2 is the model error variance. The unusual notation for the normal distribution \mathcal{N} means “a normal distribution of y_* with (or given) the regression mean \boldsymbol{\beta}_N^T\boldsymbol{x}_* and variance \sigma_N^2(\boldsymbol{x}_*),” where N is the number of points in the data set. The estimated variance of \boldsymbol{y}_*, \sigma_N^2(\boldsymbol{x}_*) and the estimated regression parameters \boldsymbol{\beta}_N can be calculated as

\boldsymbol{V}_N = \sigma_{\epsilon}^2(\sigma_{\epsilon}^2\boldsymbol{V}_0^{-1}+\boldsymbol{X}^T\boldsymbol{X})^{-1}
\boldsymbol{\beta}_N=\boldsymbol{V}_N\boldsymbol{V}_0^{-1}\boldsymbol{\beta}_0 + \frac{1}{\sigma_{\epsilon}^2}\boldsymbol{V}_N\boldsymbol{X}^T\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{x}_*^{T}\boldsymbol{V}_N\boldsymbol{x}_*

where \boldsymbol{\beta}_0 and \boldsymbol{V}_0 are the mean and the covariance of the parameter prior distribution. All the above can be used to estimate y_* as

y_* = f(\boldsymbol{x}_*) = \boldsymbol{\beta}_N^T\boldsymbol{x}_* with an error variance of \sigma_N^2(\boldsymbol{x}_*)

If we assume a prior mean of 0 (\boldsymbol{\beta}_0=0), replace \boldsymbol{V}_N and \boldsymbol{\beta}_N by their definitions, and apply a function \phi(\boldsymbol{x}), e.g. \phi(\boldsymbol{x})=(1, x_1, x_2, x_1x_2, x_1^2 ...)^T, to add features to \boldsymbol{X} so that we can use a linear regression approach to fit a non-linear model over our data set \mathcal{D}, we would instead have that

\boldsymbol{\beta}_N = \boldsymbol{\phi}_*^T( \sigma_{\epsilon}^2\boldsymbol{V}_0^{-1} + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_X\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \sigma_{\epsilon}^2\boldsymbol{\phi}_*^{T}( \sigma_{\epsilon}^2\boldsymbol{V}_0^{-1} + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_*

where \boldsymbol{\phi}_* = \phi(\boldsymbol{x}_*) and \boldsymbol{\phi}_X = \phi(\boldsymbol{X}). One problem with this approach is that as we add more terms (also called features) to function \phi(\cdot) to better capture non-linearities, the dimension of the matrix we will have to invert increases. It is also not always obvious what features we should add to our data. Both problems can be handled with Gaussian Processes.

Now to Gaussian Processes

This is where we transition from Bayesian OLR to Gaussian Processes. What we want to accomplish in the rest of this derivation is to find a way of: (1) adding as many features to the data as we need to calculate \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) without increasing the size of the matrix \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T we have to invert (remember the dimension of this matrix equals the number of features in \phi(\cdot)), and (2) implicitly adding features to \boldsymbol{X} and \boldsymbol{x}_* without having to do so manually — manually adding features to the data may take a long time, especially if we decide to add (literally) an infinite number of them to have an interpolator.

The first step is to make the dimensions of the matrix we want to invert equal to the number of data points instead of the number of features in \phi(\cdot). For this, we can re-arrange the equations for \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) as

\boldsymbol{\beta}_N =\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_X( \sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_X)^{-1}\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = \boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_*-\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_X(\sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_X)^{-1}\boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_*

We now have our feature-expanded data and the points for which we want point estimates always appearing in the form of \phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x}'), where \boldsymbol{x}' is just another point \boldsymbol{x}_i, either in the data set or one for which we want to estimate y_*. From now on, \kappa(\boldsymbol{x}, \boldsymbol{x}') = \phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x}') will be called a covariance or kernel function. Since the prior covariance \boldsymbol{V}_0 is positive semi-definite (as is any covariance matrix), we can write

\kappa(\boldsymbol{x}, \boldsymbol{x}') = \phi(\boldsymbol{x})^T (\boldsymbol{V}_0^{1/2})^T(\boldsymbol{V}_0^{1/2})\phi(\boldsymbol{x}')=\psi(\boldsymbol{x})^T\psi(\boldsymbol{x}')

where \psi(\boldsymbol{x})=\boldsymbol{V}_0^{1/2}\phi(\boldsymbol{x}). Using \kappa(\boldsymbol{x}, \boldsymbol{x}') and assuming a prior covariance \boldsymbol{V}_0=\boldsymbol{I}, \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) then take a new shape in the predictive posterior of a Gaussian Process:

\boldsymbol{\beta}_N =\boldsymbol{k}_*( \sigma_{\epsilon}^2\boldsymbol{I} + K)^{-1}\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = k_{**}-\boldsymbol{k}_*(\sigma_{\epsilon}^2\boldsymbol{I}+K)^{-1}\boldsymbol{k}_*^T

where K = \kappa(\boldsymbol{\phi}_X,\boldsymbol{\phi}_X), \boldsymbol{k}_* = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_X) and k_{**} = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_*). Now the features of the data are absorbed within the inner products between observed \boldsymbol{x}'s and/or \boldsymbol{x}_*, so we can add as many features as we want without impacting the dimensions of the matrix we will have to invert. Also, after this transformation, instead of representing the mean and covariance between model parameters, \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) represent the mean and covariance of and among data points and/or points for which we want to get point estimates. The posterior would now be used to sample values of y_* directly instead of model parameters. And now we have a Gaussian Process with which to estimate y_*. For plots of samples of \boldsymbol{y}_* from the prior and from the predictive posterior, of the mean plus or minus the error variances, and of models with different kernel parameters, see the figures at the end of the post.

Kernel (or Covariance) matrices

The convenience of having \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) written in terms of \kappa(\cdot, \cdot), is that \kappa does not have to represent simply a matrix-matrix or vector-matrix multiplication of \boldsymbol{X} and \boldsymbol{x}_*. In fact, function \kappa can be any function that corresponds to an inner-product after some transformation from \boldsymbol{x}\rightarrow\phi(\boldsymbol{x}), which will necessarily return a positive semi-definite matrix and therefore be a valid covariance matrix. There are several kernels (or kernel functions or kernel shapes) available, each accounting in a different way for the nearness or similarity (at least in Gaussian Processes) between values of x_i:

  • the linear kernel: \kappa(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T\boldsymbol{x}'. This kernel is used when trying to fit a line going through the origin.
  • the polynomial kernel: \kappa(\boldsymbol{x}, \boldsymbol{x}') = (1 + \boldsymbol{x}^T\boldsymbol{x}')^d, where d is the dimension of the polynomial function. This kernel is used when trying to fit a polynomial function (such as the fourth order polynomial in the blog post about Bayesian OLR), and
  • the Radial Basis Function, or RBF (aka Squared Exponential, or SE), \kappa(\boldsymbol{x}, \boldsymbol{x}') = e^{-\frac{||\boldsymbol{x}-\boldsymbol{x}'||^2}{2\sigma_{RBF}^2}}, where \sigma_{RBF} is a kernel parameter denoting the characteristic length scale of the function to be approximated. Using the RBF kernel is equivalent to adding an infinite (!) number of features to the data, meaning \phi(\boldsymbol{x})=(1, x_1, ..., x_d, x_1x_2, x_1x_3,..., x_{d-1}x_d, x_1x_2x_3, ...)^T.
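For reference, the three kernels above take only a few lines of Python each. The sketch below evaluates each kernel for a single pair of points (d and sigma_rbf are the kernel parameters defined above; the function names are mine, not from any library):

```python
import numpy as np

def linear_kernel(x, x_prime):
    # kappa(x, x') = x^T x'
    return float(np.dot(x, x_prime))

def polynomial_kernel(x, x_prime, d=2):
    # kappa(x, x') = (1 + x^T x')^d
    return float((1.0 + np.dot(x, x_prime)) ** d)

def rbf_kernel(x, x_prime, sigma_rbf=2.0):
    # kappa(x, x') = exp(-||x - x'||^2 / (2 sigma_rbf^2))
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma_rbf ** 2)))
```

Note how the RBF kernel depends only on the distance between the two points, which is why it encodes "nearness" directly: it equals 1 when the points coincide and decays toward 0 as they move apart.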

But this is all quite algebraic. Luckily, there is a more straightforward derivation, shown next.

Succinct derivation of Gaussian Processes from functional space

Most regression we study in school is a way of estimating the parameters \boldsymbol{\beta} of a model. In Bayesian OLR, for example, we have a distribution (the parameter posterior) from which to sample model parameters. What if we instead derived a distribution from which to sample functions themselves? The first (and second and third) time I heard about the idea of sampling functions I thought right away of opening a box and from it drawing exponential and sinusoid functional forms. However, what is meant here is a function of an arbitrary shape without functional form. How is this possible? The answer is: instead of sampling a functional form itself, we sample the values of y_*=f(\boldsymbol{x}_*) at discrete points, with a collection of \boldsymbol{y}_i for all \boldsymbol{x}_i in our domain called a function \boldsymbol{f}. After all, don’t we want a function more often than not solely for the purpose of getting point estimates and associated errors for given values of \boldsymbol{x}_*?

In Gaussian Process regression, we sample functions \boldsymbol{f} from a distribution by considering each value of x_i in a discretized range along the x-axis as a random variable. That means that if we discretize our x-axis range into 100 equally spaced points, we will have 100 random variables x_1,..,x_{100}. The mean of all possible functions at \boldsymbol{x}_i therefore represents the mean value of y_i at \boldsymbol{x}_i, and each term in the covariance matrix (kernel matrix, see “Kernel (or Covariance) matrices” section above) represents how similar the values of y_i and y_j will be for each pair \boldsymbol{x}_i and \boldsymbol{x}_j based on the value of a distance metric between \boldsymbol{x}_i and \boldsymbol{x}_j: if \boldsymbol{x}_i and \boldsymbol{x}_j are close to each other, \boldsymbol{y}_i and \boldsymbol{y}_j tend to be similar, and vice-versa. We can therefore turn the sampled functions y = f(\boldsymbol{x}) into whatever functional form we want by choosing the appropriate covariance matrix (kernel) — e.g. a linear kernel will return a linear model, a polynomial kernel will return a polynomial function, and an RBF kernel will return a function with a linear term and an infinite number of non-linear terms. A Gaussian Process can then be written as

f(\boldsymbol{x}) \sim \mathcal{GP}(\mu(\boldsymbol{x}), \kappa(\boldsymbol{x}, \boldsymbol{x}'))

where \mu(\boldsymbol{x}) is the mean function (e.g. a horizontal line along the x-axis at y = 0 would mean that \mu(\boldsymbol{x}) = [0,..,0]^T) and \kappa(\boldsymbol{x}, \boldsymbol{x}') is the covariance matrix, which here expresses how similar the values of y_i will be based on the distance between two values of x_i.

To illustrate this point, assume we want to sample functions over 200 discretizations \boldsymbol{X}_* = x_{*0}, ..., x_{*199} along the x-axis from x=-5 to x=5 (\Delta x \approx 0.05) using a Gaussian Process. Assume our Gaussian Process has mean 0 and an RBF kernel for its covariance matrix with parameter \sigma_{RBF}=2 (not to be confused with a standard deviation), meaning that k_{i,i}=1.0 ,\:\forall i \in [0, 199] and k_{i,j}=e^{-\frac{||x_i-x_j||^2}{2\cdot 2^2}},\: \forall i,j \in [0, 199], or

\begin{aligned} p(\boldsymbol{f}_*|\boldsymbol{X_*})&=\mathcal{N}(\boldsymbol{f}|\boldsymbol{\mu},\boldsymbol{K_{**}})\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[\mu_{*0},..,\mu_{*199}], \begin{bmatrix} k_{*0,0} & k_{*0,1} & \cdots & k_{*0,199} \\ k_{*1,0} & k_{*1, 1} & \cdots & k_{*1,199}\\ \vdots & \vdots & \ddots & \vdots \\ k_{*199,0} & k_{*199,1} & \cdots & k_{*199,199} \end{bmatrix} \right)\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[0,..,0], \begin{bmatrix} 1 & 0.9997 & \cdots & 4.2\cdot 10^{-6} \\ 0.9997 & 1 & \cdots & 4.8\cdot 10^{-6}\\ \vdots & \vdots & \ddots & \vdots \\ 4.2\cdot 10^{-6} & 4.8\cdot 10^{-6} & \cdots & 1 \end{bmatrix} \right) \end{aligned}

Each sample from the normal distribution above will return 200 values: \boldsymbol{f}_* = [y_0, ..., y_{199}]. A draw of 3 such sample functions would look generally like the following:

Figure_1

The functions above look incredibly smooth because of the RBF kernel, which adds an infinite number of non-linear features to the data. The shaded region represents the prior distribution of functions. The grey line in the middle represents the mean of an infinite number of function samples, while the upper and lower bounds represent the corresponding prior mean plus or minus standard deviations.
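The prior sampling just described can be sketched in a few lines of NumPy. This is a sketch, not the code behind the figure: a small jitter is added to the covariance diagonal purely for numerical stability (the equations above do not need it), and the Cholesky factor is used to turn standard normal draws into correlated samples:

```python
import numpy as np

np.random.seed(1)  # for reproducible samples

x = np.linspace(-5, 5, 200)  # 200 discretizations from x = -5 to x = 5
sigma_rbf = 2.0

# RBF covariance matrix: k_ij = exp(-(x_i - x_j)^2 / (2 sigma_rbf^2))
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * sigma_rbf ** 2))

# Sample three functions from the zero-mean prior N(0, K):
# if L L^T = K, then L z ~ N(0, K) for z ~ N(0, I)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))  # jitter for stability
f_samples = (L @ np.random.randn(len(x), 3)).T     # each row is one function
```

Plotting each row of f_samples against x reproduces curves like those in the figure above: smooth, because nearby x values are almost perfectly correlated under this kernel.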

These functions look quite nice but are pretty useless. We are actually interested in regressing our data assuming a prior, that is, in a posterior distribution, not in the prior by itself. Another way of formulating this is to say that we are interested only in the functions that go through or close to certain known values of \boldsymbol{x} and y. All we have to do is to add random variables corresponding to our data and condition the resulting distribution on them, which means keeping only the sampled functions that go exactly through our data points. The resulting distribution would then look like p(\boldsymbol{f}_*|\mathcal{D}, \boldsymbol{X}_*). After adding our data (\boldsymbol{X}, \boldsymbol{f}), or (\boldsymbol{X}, \boldsymbol{y}), our distribution would look like

\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}K(\boldsymbol{X},\boldsymbol{X}) & K(\boldsymbol{X},\boldsymbol{X}_*) \\ K(\boldsymbol{X}_*,\boldsymbol{X}) & K(\boldsymbol{X}_*,\boldsymbol{X}_*)\end{bmatrix}\right)

In this distribution we have random variables representing the \boldsymbol{X} from our data set \mathcal{D} and the \boldsymbol{X}_* from which we want to get point estimates and corresponding error standard deviations. The covariance matrix above accounts for correlations of all observations with all other observations, and correlations of all observations with all points to be predicted. We now just have to condition the multivariate normal distribution above over the points in our data set, so that the distribution accounts for the infinite number of functions that go through our \boldsymbol{y} at the corresponding \boldsymbol{X}. This yields

p(\boldsymbol{f}_*|\boldsymbol{X}_*,\boldsymbol{X},\boldsymbol{y})=\mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}})

\bar{\boldsymbol{\mu}}=\boldsymbol{K}_*^T\boldsymbol{K}^{-1}\boldsymbol{y}

\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T\boldsymbol{K}^{-1}\boldsymbol{K}_*

\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)

where \boldsymbol{K} = K(\boldsymbol{X},\boldsymbol{X}), \boldsymbol{K}_* = K(\boldsymbol{X},\boldsymbol{X}_*), \boldsymbol{K}_{**} = K(\boldsymbol{X_*},\boldsymbol{X_*}), and diag\left(\bar{\boldsymbol{\Sigma}}\right) is a vector containing the elements in the diagonal of the covariance matrix \bar{\boldsymbol{\Sigma}} — the variances \bar{\sigma}^2_i of each random variable x_i. If you do not want to use a prior with \boldsymbol{\mu} = \boldsymbol{\mu}_* = [0, ... , 0]^T, the mean of the Gaussian Process becomes \bar{\boldsymbol{\mu}}=\boldsymbol{\mu}_* + \boldsymbol{K}_*^T\boldsymbol{K}^{-1}(\boldsymbol{f} - \boldsymbol{\mu}). A plot of \boldsymbol{f}_* = \bar{\boldsymbol{\mu}} estimated for 200 values of x_i — so that the resulting curve looks smooth — plus or minus the associated uncertainty (\pm\bar{\boldsymbol{\sigma}}), regressed on 5 random data points \boldsymbol{X}, \boldsymbol{y}, should look generally like

Figure_1-1_2

 

A plot of three sampled functions (colored lines below, the grey line is the mean) \boldsymbol{f}_* \sim \mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}}) should look generally like

Figure_1-1-sampling

 
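The conditioning equations above translate almost line by line into NumPy. The sketch below uses five made-up data points and an assumed kernel parameter, so the numbers are for illustration only; a small jitter is added to K before inversion for numerical stability:

```python
import numpy as np

def rbf(a, b, sigma_rbf=2.0):
    # Kernel matrix between two 1-D arrays of points
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma_rbf ** 2))

# Hypothetical observed points and a prediction grid
X = np.array([-4.0, -2.5, 0.0, 1.5, 3.0])
y = np.array([-1.0, 0.5, 1.0, -0.5, 0.2])
X_star = np.linspace(-5, 5, 200)

K = rbf(X, X) + 1e-10 * np.eye(len(X))   # K = K(X, X) plus jitter
K_star = rbf(X, X_star)                  # K_* = K(X, X_*)
K_star_star = rbf(X_star, X_star)        # K_** = K(X_*, X_*)

K_inv = np.linalg.inv(K)
mu_bar = K_star.T @ K_inv @ y                        # posterior mean
Sigma_bar = K_star_star - K_star.T @ K_inv @ K_star  # posterior covariance
sigma_bar = np.sqrt(np.clip(np.diag(Sigma_bar), 0.0, None))  # error band
```

Plotting mu_bar plus or minus sigma_bar against X_star gives a figure like the one above; without observation noise, the posterior mean passes exactly through the five data points and the error band pinches to zero there.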

Several kernels have parameters that can strongly influence the regressed model and that can be estimated — see Murphy (2012) for a succinct introduction to kernel parameter estimation. One example is the parameter \sigma_{RBF}^2 of the RBF kernel, which determines the correlation length scale of the fitted model and therefore how quickly the model defaults back to the prior (here, with \boldsymbol{\mu}=0) as the distance from the data points increases. The figure below exemplifies this effect.

Figure_kparams

All figures have the same data points but the resulting models look very different. Reasonable values for \sigma_{RBF}^2 depend on the spacing between the data points and are therefore data-specific. Also, if this plot made you think of fitting RBF functions as interpolators, that was no coincidence: the equations solved when fitting RBFs to data as interpolators are the same as those solved when using Gaussian Processes with RBF kernels.

Lastly, in the case of noisy observations of y in \mathcal{D}, the Gaussian Process distribution and the expressions for the mean, covariance, and standard deviation become

\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I} & \boldsymbol{K}_* \\ \boldsymbol{K}_*^T & \boldsymbol{K}_{**}\end{bmatrix}\right)

\bar{\boldsymbol{\mu}}=\boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{y}

\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{K}_*

\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)

where \sigma_{\epsilon}^2 is the error variance and \boldsymbol{I} is an identity matrix. The mean and error predictions would look like the following

Figure_1-1
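In code, the only change relative to the noise-free case is adding \sigma_{\epsilon}^2\boldsymbol{I} to \boldsymbol{K} before inverting. The data points and error variance below are made up for illustration:

```python
import numpy as np

def rbf(a, b, sigma_rbf=2.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma_rbf ** 2))

X = np.array([-4.0, -2.5, 0.0, 1.5, 3.0])   # hypothetical noisy observations
y = np.array([-1.0, 0.5, 1.0, -0.5, 0.2])
X_star = np.linspace(-5, 5, 200)
sigma_eps2 = 0.1                             # assumed error variance

A = rbf(X, X) + sigma_eps2 * np.eye(len(X))  # K + sigma_eps^2 I
A_inv = np.linalg.inv(A)
K_star = rbf(X, X_star)

mu_bar = K_star.T @ A_inv @ y                                # mean
Sigma_bar = rbf(X_star, X_star) - K_star.T @ A_inv @ K_star  # covariance
sigma_bar2 = np.diag(Sigma_bar)                              # error variances
```

With noise, the mean no longer interpolates the observations exactly, and the predictive variance no longer drops to zero at the data points, as the figure above shows.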

Concluding Remarks

What I presented is the very basics of Gaussian Processes, leaving several important topics for the interested reader to look into. Examples are numerically more stable formulations, computationally more efficient formulations, kernel parameter estimation, more kernels, using a linear instead of a zero-mean prior (semi-parametric Gaussian Processes), Gaussian Processes for classification and Poisson regression, and the connection between Gaussian Processes and other methods, like kriging and kernel methods for nonstationary stochastic processes. Some of these can be found in great detail in Rasmussen and Williams (2006), and Murphy (2012) has a succinct presentation of all of the listed topics.

I was really confused about Gaussian Processes the first time I studied them, so I tried to make this blog post as accessible as possible. Please leave a comment if you think the post would benefit from any particular clarification.

 

References

Murphy, Kevin P., 2012. Machine Learning: A Probabilistic Perspective. The MIT Press, ISBN 978-0262018029.

Rasmussen, Carl Edward and Williams, Christopher K. I., 2006. Gaussian Processes for Machine Learning. The MIT Press, ISBN 978-0262182539.

From Frequentist to Bayesian Ordinary Linear Regression

In this post, I will briefly review the basic theory about Ordinary Linear Regression (OLR) using the frequentist approach and from it introduce the Bayesian approach. This will lay the terrain for a later post about Gaussian Processes. The reader may ask why “Ordinary Linear Regression” instead of “Ordinary Least Squares.” The answer is that least squares refers to the objective function to be minimized, the sum of the squared errors, which is not used in the Bayesian approach presented here. I want to thank Jared Smith and Jon Lamontagne for reviewing and improving this post. This post closely follows the presentation in Murphy (2012), which for some reason does not use “^” to denote estimated quantities (if you know why, please leave a comment).

Ordinary Linear Regression

OLR is used to fit a model that is linear in the parameters to a data set \mathcal{D}=\{\boldsymbol{X},\boldsymbol{y}\}, where \boldsymbol{X} represents the independent variables and \boldsymbol{y} the dependent variable, and has the form

p(y|\boldsymbol{x}, \boldsymbol{\beta}) = \mathcal{N}(y|\boldsymbol{\beta}^T\boldsymbol{x}, \sigma_{\epsilon}^2)

where \boldsymbol{\beta} is a vector of model parameters and \sigma_{\epsilon}^2 is the model error variance. The unusual notation for the normal distribution \mathcal{N} should be read as “a normal distribution of \boldsymbol{y} with (or given) mean \boldsymbol{\beta}^T\boldsymbol{x} and variance \sigma_{\epsilon}^2.” One frequentist method of estimating the parameters of a linear regression model is the method of maximum likelihood estimation (MLE). MLE provides point estimates \boldsymbol{\hat{\beta}} for each of the regression model parameters \boldsymbol{\beta} by maximizing the likelihood function with respect to \boldsymbol{\beta} and \sigma_{\epsilon}^2 by solving the maximization problem

\boldsymbol{\hat{\beta}} = arg\,\underset{\boldsymbol{\beta}}{max}\;\mathcal{L}(\boldsymbol{\beta})

where the likelihood function \mathcal{L}(\boldsymbol{\beta}) is defined as

\mathcal{L}(\boldsymbol{\beta})=p(\mathcal{D}|\boldsymbol{\beta})={\displaystyle \prod_{i=1}^{N} p(y_i|\boldsymbol{x}_i, \boldsymbol{\beta}, \mu, \sigma_{\epsilon}^2)} = \mathcal{N}(\boldsymbol{y}|\mu + \boldsymbol{X}\boldsymbol{\beta},\sigma_{\epsilon}^2\boldsymbol{I})

where \mu is the mean of the model error (normally \mu=0), N is the number of points in the data set, \boldsymbol{I} is an identity matrix, and \boldsymbol{x} is a vector of values of the independent variables for an individual data point of matrix \boldsymbol{X}. OLR assumes independence and the same model error variance among observations, which is expressed by the covariance matrix \sigma_{\epsilon}^2\boldsymbol{I}. Digression: in Generalized Linear Regression (GLR), observations may not be independent and may have different variances (heteroscedasticity), so the covariance matrix may have off-diagonal terms and the diagonal terms may not be equal.

If a non-linear function shape is sought, linear regression can be used to fit a model linear in the parameters over a function \phi(\cdot) of the data — I am not sure why the machine learning community chose \phi for this function, but do not confuse it with a normal distribution. This procedure is called basis function expansion and has a likelihood function of the form

\mathcal{L}(\boldsymbol{\beta}) = \mathcal{N}(\boldsymbol{y}|\mu + \phi(\boldsymbol{X})\boldsymbol{\beta},\sigma_{\epsilon}^2\boldsymbol{I})

where \phi(\boldsymbol{x}) can have, for example, the form

\phi(\boldsymbol{x})=\begin{pmatrix}1\\ x_1\\ x_1^2\end{pmatrix}

for fitting a parabola to a one-dimensional \boldsymbol{X} over \boldsymbol{y}. When minimizing the squared residuals, we get the famous Ordinary Least Squares regression. Linear regression can be further developed, for example, into Ridge Regression to better handle multicollinearity by introducing bias to the parameter estimates, and into Kernel Ridge regression to implicitly add non-linear terms to the model. These formulations are beyond the scope of this post.
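Basis function expansion is short to sketch in code: expand a one-dimensional x into \phi(x)=(1, x, x^2)^T and run a least-squares fit on the expanded features. The data below are synthetic and noise-free, purely for illustration:

```python
import numpy as np

def phi(x):
    # Basis function expansion for a parabola: one row (1, x, x^2) per point
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x ** 2])

# Synthetic data generated from y = 2 - x + 0.5 x^2 (no noise, for clarity)
x = np.linspace(-3, 3, 50)
y = 2.0 - 1.0 * x + 0.5 * x ** 2

# The model is still linear in the parameters, so ordinary least squares
# on the expanded design matrix phi(x) fits the parabola exactly
beta, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
# beta recovers [2.0, -1.0, 0.5]
```

The fit is linear in \boldsymbol{\beta} even though it is non-linear in x, which is the whole point of the expansion.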

What is important to notice is that the standard approaches for linear regression described here, although able to fit linear and non-linear functions, do not provide much insight into the model errors. That is when Bayesian methods come to play.

Bayesian Ordinary Linear Regression

In Bayesian Linear regression (and in any Bayesian approach), the parameters \boldsymbol{\beta} are treated themselves as random variables. This allows for the consideration of model uncertainty, given that now instead of having the best point estimates for \boldsymbol{\beta} we have their full distributions from which to sample models — a sample of  \boldsymbol{\beta} corresponds to a model. The distribution of parameters \boldsymbol{\beta}, P(\boldsymbol{\beta}|\mathcal{D}), called the parameter posterior distribution, is calculated by multiplying the likelihood function used in MLE by a prior distribution for the parameters. A prior distribution is assumed from knowledge prior to analyzing new data. In Bayesian Linear Regression, a Gaussian distribution is commonly assumed for the parameters. For example, the prior on \boldsymbol{\beta} can be \mathcal{N}(\boldsymbol{\beta_0},\boldsymbol{V}_0) for algebraic simplicity. The parameter posterior distribution assuming a known \sigma_{\epsilon}^2 then has the form

p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2) =\frac{p(\mathcal{D}|\boldsymbol{\beta}, \sigma_{\epsilon}^2)p(\boldsymbol{\beta})}{p(\mathcal{D})} \propto p(\mathcal{D}|\boldsymbol{\beta}, \sigma_{\epsilon}^2)p(\boldsymbol{\beta})  ,        given p(\mathcal{D}) is a constant depending only on the data.

We now have an expression from which to derive our parameter posterior for our linear model from which to sample \boldsymbol{\beta}. If the likelihood and the prior are Gaussian, the parameter posterior will also be a Gaussian, given by

p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2) \propto \mathcal{N}(\boldsymbol{\beta}|\boldsymbol{\beta}_0, \boldsymbol{V}_0)\mathcal{N}(\boldsymbol{y}|\boldsymbol{X\beta},\sigma_{\epsilon}^2\boldsymbol{I})=\mathcal{N}(\boldsymbol{\beta}|\boldsymbol{\beta}_N,\boldsymbol{V}_N)
where
\boldsymbol{V}_N = \sigma_{\epsilon}^2(\sigma_{\epsilon}^2\boldsymbol{V}_0^{-1}+\boldsymbol{X}^T\boldsymbol{X})^{-1}
\boldsymbol{\beta}_N=\boldsymbol{V}_N\boldsymbol{V}_0^{-1}\boldsymbol{\beta}_0 + \frac{1}{\sigma_{\epsilon}^2}\boldsymbol{V}_N\boldsymbol{X}^T\boldsymbol{y}

If we calculate the parameter posterior (distribution over parameters \boldsymbol{\beta}) for a simple linear model f(x) = \beta_0 + \beta_1x for a data set \mathcal{D} in which \boldsymbol{X} and \boldsymbol{y} are approximately linearly related given some noise \sigma_{\epsilon}^2, we can use it to sample values for \boldsymbol{\beta}. This is equivalent to sampling linear models f(x) for data set \mathcal{D}. As the number of data points increases, the variability of the sampled models should decrease, as in the figure below

linear_model
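The parameter posterior above can be computed and sampled with a few lines of NumPy. The sketch below uses synthetic data (an assumed "true" line plus noise) and the prior \boldsymbol{\beta}_0 = \boldsymbol{0}, \boldsymbol{V}_0 = \boldsymbol{I}:

```python
import numpy as np

np.random.seed(0)

# Synthetic data: y = 1 + 2x plus noise with variance sigma_eps^2
N = 50
sigma_eps2 = 0.25
X = np.column_stack([np.ones(N), np.linspace(0.0, 1.0, N)])  # design matrix [1, x]
y = X @ np.array([1.0, 2.0]) + np.sqrt(sigma_eps2) * np.random.randn(N)

# Gaussian prior on the parameters: beta_0 = 0, V_0 = I
beta_0 = np.zeros(2)
V_0 = np.eye(2)

# Parameter posterior N(beta_N, V_N), as in the equations above
V_N = sigma_eps2 * np.linalg.inv(sigma_eps2 * np.linalg.inv(V_0) + X.T @ X)
beta_N = V_N @ np.linalg.inv(V_0) @ beta_0 + (V_N @ X.T @ y) / sigma_eps2

# Sampling one linear model = drawing one beta from the posterior
beta_sample = np.random.multivariate_normal(beta_N, V_N)
```

Repeating the last line many times and plotting each sampled line reproduces the spread shown in the figure above; as N grows, V_N shrinks and so does the spread.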

This is all interesting but the parameter posterior per se is not of much use. We can, however, use the parameter posterior to find the posterior predictive distribution, which can be used to both get point estimates of y and the associated error. This is done by multiplying the likelihood by the parameter posterior and marginalizing the result over \boldsymbol{\beta}. This is equivalent to performing infinite sampling of blue lines in the example before to form density functions around a point estimate of \mu_{y_*}=f(\boldsymbol{x}_*), with (\boldsymbol{x}_*,y_*) denoting a new point that is not in \mathcal{D}. If the likelihood and the parameter posterior are Gaussian, the posterior predictive then takes the form below and will also be Gaussian (in Bayesian parlance, this means that the Gaussian distribution is conjugate to itself)!

\begin{aligned}p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2) &= \int_{\boldsymbol{\beta}}\underbrace{p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2,\boldsymbol{\beta})}_\text{Likelihood}\underbrace{p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2)}_\text{\parbox{1.6cm}{Parameter Posterior}}d\boldsymbol{\beta}\\  &=\mathcal{N}(y_*|\boldsymbol{\beta}_N^T\boldsymbol{x}_*, \sigma_N^2(\boldsymbol{x}_*))\end{aligned}

where

\sigma_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{x}_*^{T}\boldsymbol{V}_N\boldsymbol{x}_*

or, to put it simply,

y_* = f(\boldsymbol{x}_*) = \boldsymbol{\beta}_N^T\boldsymbol{x}_* \pm \sigma_N(\boldsymbol{x}_*)
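These expressions translate directly into a few lines of numpy. The sketch below (the synthetic data set, prior values, and function names are mine, for illustration only) computes \boldsymbol{\beta}_N and \boldsymbol{V}_N, samples models from the parameter posterior, and evaluates the posterior predictive at a new point:

```python
import numpy as np

def parameter_posterior(X, y, beta_0, V_0, sigma_eps2):
    """Return beta_N, V_N of the Gaussian parameter posterior."""
    V0_inv = np.linalg.inv(V_0)
    # V_N = sigma_eps^2 (sigma_eps^2 V_0^{-1} + X^T X)^{-1}
    V_N = sigma_eps2 * np.linalg.inv(sigma_eps2 * V0_inv + X.T @ X)
    # beta_N = V_N V_0^{-1} beta_0 + (1 / sigma_eps^2) V_N X^T y
    beta_N = V_N @ V0_inv @ beta_0 + V_N @ X.T @ y / sigma_eps2
    return beta_N, V_N

def posterior_predictive(x_star, beta_N, V_N, sigma_eps2):
    """Return mean and variance of p(y_* | x_*, D, sigma_eps^2)."""
    mu = beta_N @ x_star
    # sigma_N^2(x_*) = sigma_eps^2 + x_*^T V_N x_*
    var = sigma_eps2 + x_star @ V_N @ x_star
    return mu, var

# Synthetic data from y = 1 + 2x plus Gaussian noise with sigma_eps = 0.5
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 200)
X = np.column_stack([np.ones_like(x), x])   # design matrix with rows [1, x]
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, x.size)

# Weak zero-mean prior on the parameters
beta_N, V_N = parameter_posterior(X, y, np.zeros(2), 10.0 * np.eye(2), 0.25)

# Sampling linear models f(x) for D amounts to sampling betas from the posterior
sampled_betas = rng.multivariate_normal(beta_N, V_N, size=5)
```

Plotting the rows of sampled_betas as lines over the data reproduces the kind of spaghetti plot of sampled models shown earlier.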

The posterior predictive, meaning the final linear model and its associated errors (the Bayesian counterpart of frequentist confidence intervals), is shown below

[Figure: linear_regression (posterior predictive mean and associated errors)]

If, instead, we have a data set \mathcal{D} in which \boldsymbol{X} and \boldsymbol{y} are related approximately according to a 4th order polynomial, we can use a \phi(\cdot) function to artificially create more random variables (or features, in machine learning parlance) corresponding to the non-linear terms of the polynomial. Our function \phi(\cdot) would be \phi(x) = [1, x, x^2, x^3, x^4]^T, and the resulting X would therefore have five columns instead of two ([1, x]^T), so now the task is to find \boldsymbol{\beta}=[\beta_0, \beta_1, \beta_2, \beta_3, \beta_4]^T. Following the same logic as before for X'=\phi(X) and x_*'=\phi(x_*), where the prime denotes the new set of random variables created by \phi(\cdot), we get the following plots

[Figure: poly_model (sampled 4th order polynomial models)]

[Figure: poly_regression (posterior predictive for the 4th order polynomial model)]
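The feature expansion step itself can be sketched in a couple of lines of numpy (np.vander is my choice of convenience here; any function returning [1, x, x^2, x^3, x^4] per data point would do):

```python
import numpy as np

def phi(x, degree=4):
    """Map each scalar x to the feature vector [1, x, x^2, ..., x^degree]."""
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0])
X_prime = phi(x)  # shape (3, 5): one row [1, x, x^2, x^3, x^4] per data point
```

With X' in hand, the posterior and posterior predictive formulas above apply unchanged, just with five columns instead of two.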
This looks great, but there is a problem: what if we do not know the functional form of the model we are supposed to fit (e.g. a simple linear function or a 4th order polynomial)? This is often the case, such as when modeling the reliability of a water reservoir system contingent on stored volumes, inflows and evaporation rates, or when modeling topography based on sampled surveyed points (we do not have detailed elevation information about the terrain, e.g. a DEM file). Gaussian Processes (GPs) provide a way of getting around this difficulty.

References

Murphy, Kevin P., 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Converting Latex to MS Word docx (almost perfectly)

Many students and academics write papers and reports in Latex due to the ease of formatting and of managing citations and bibliography. However, collaborative Latex editing tools have not reached the level of versatility and convenience of Microsoft Word and Google Docs. We therefore often begin writing the paper in Word, where major comments, additions and changes are made, and later manually translate it to Latex, which is a tedious task and a waste of time.

A way around this hurdle is to use Pandoc to automatically convert Latex files to Word documents. To install Pandoc on Linux, open a terminal and run sudo apt-get install pandoc pandoc-citeproc texlive; Windows executables and MacOS instructions are available here. To run Pandoc on a Latex document:

  1. Open a terminal (on Windows, hold the Windows key and press “r,” then type “cmd” in the command bar)
  2. Use the “cd” command to navigate to the folder where your Latex document is.
  3. Type pandoc -s latex_document.tex --bibliography=bib_file.bib -o output_word_document.docx.

Now you should have a Word document with all your bitmap figures (png, jpeg, bmp, etc.), equations in Word format, and a bibliography based on your citations and bib file. If the latest version of Pandoc does not work, try version 1.19.1.

There are some known limitations, though. Numbering and referencing equations, figures and tables, and referencing sections will not work, so you will have to number those by hand. This limitation seems to apply only when outputting Word documents, and you can circumvent it by outputting to other formats with Pandoc reference filters plus the appropriate Pandoc calls and specific Latex syntax (see the filters’ pages for details). Vectorized figures (svg, eps, pdf, etc.) will not be included in the output document. Bitmaps, on the other hand, will appear in the output document and will preserve their captions.

Hopefully some skilled programmer will feel like generously volunteering their time for the Latex academic cause and address the mentioned limitations soon. Until then, we will have to work around them by hand.

Creating parallel axis plots with multiple datasets, color gradients, and brushing in Python

Parallel axis plots (here is a good description of what they are) are a relatively recent development in the plotting world, so it is no surprise that there are no implementations of them with more than basic functionality in the major plotting packages available online. Pandas’ and Plot.ly’s implementations get cumbersome when the user tries to apply brushing and multiple color gradients, so over the past couple of days I created my own implementation of parallel axis plots in Python using Matplotlib, with the goal of making versatile, high-resolution, story-telling plots for my next papers and presentations. This implementation allows for:

  • Plotting multiple datasets,
  • Displaying dataset names,
  • Choosing the columns to be plotted,
  • Coloring each dataset based on a column and a different Matplotlib color map,
  • Specifying ranges to be plotted,
  • Inverting multiple axes,
  • Brushing by intervals in multiple axes,
  • Choosing different fonts for the title and the rest of the plot, and
  • Exporting the result as a figure file or viewing the plot in Matplotlib’s interactive window.

The source code can be found here, and below is an example of how to use it:

import numpy as np
from plotting.parallel_axis import paxis_plot
from matplotlib.colors import LinearSegmentedColormap
from matplotlib import cm

bu_cy = LinearSegmentedColormap.from_list('BuCy', [(0, 0, 1), (0, 1, 1)])
bu_cy_r = bu_cy.reversed()

data1 = np.random.normal(size=(100, 8))
data2 = np.random.normal(size=(100, 8))
columns_to_plot = [0, 1, 3, 5, 7]
color_column = 0
axis_labels = ['axes ' + str(i) for i in range(8)]
dataset_names = ['Data set 1', 'Data set 2']
plot_ranges = [[-3.5, 3.5]] * 3 + [[-2.9, 3.1]] + [[-3.5, 3.5]] * 4
axis_to_invert = [1, 5]
brush_criteria = {1: [-10., 0.], 7: [10., 0.]}

paxis_plot((data1, data2),
           columns_to_plot,
           color_column,
           [bu_cy_r, cm.get_cmap('autumn_r')],
           axis_labels,
           'Title Here',
           dataset_names,
           axis_ranges=plot_ranges,
           fontname_title='Gill Sans MT',
           fontname_body='CMU Bright',
           file_name='test.png',
           axis_to_invert=axis_to_invert,
           brush_criteria=brush_criteria)

The output of this script should be a file named “test.png” that looks similar to the plot below:

[Figure: test.png (example parallel axis plot produced by the script above)]

Making Valgrind Easy

Some of this blog’s readers and authors (most notably, Joe Kasprzyk) read the title of this post and thought “wait, there already is a post about Valgrind on this blog.” And you are right, so in this post I will build on the legacy Joe has left us in his post about Valgrind and get into the details of how to use its basic functionalities to get your code right.

Common mistakes when coding in C++

Suppose we have the following code:

#include <stdio.h>

int main() {
    int *var = new int[5]; // you wouldn't do this if the size was always 5, but this makes the argument clear.
    int n = 5;
    int m;

    if (m > n) {
        printf("Got into if statement.\n");
        for (int i = 0; i < 6; ++i) {
            var[i] = i;
        }
    }

    printf("var[5] equals %d\n", var[n]);
}

Saving the code above in a file called test.cpp, compiling it with g++ to create an executable called "test," and running it with "./test" will return the following output:

bernardoct@DESKTOP-J6145HK ~
$ g++ test.cpp -o test

bernardoct@DESKTOP-J6145HK ~
$ ./test
Got into if statement.
var[5] equals 5

Great, it ran and did not crash (in such simple code, gcc's -Wall flag would have issued a warning saying m was not initialized, but in more complex code such a warning may not be issued). However, it would actually have been better if this code had crashed, because that would make us look into it and figure out that it has 3 problems:

  1. We did not assign a value to variable m (it was created but not initialized), so how did the code determine that m was greater than n to get into the code inside the if statement?
  2. The pointer array var was created with length 5, meaning its elements are numbered 0 to 4. If the for-loop runs from 0 to 5 but element 5 does not exist, how did the code fill it in with the value of variable i when i was 5 in the loop? From the printf statement that returned 5 we know var[5] equals 5.
  3. The pointer array var was not destroyed after the code no longer needed it. This is not necessarily a problem in this case, but if this were a function that is called over and over within a model, there is a chance the RAM would be filled with these seemingly inoffensive pointer arrays and the computer would freeze (or the node, if running on a cluster, would possibly crash and have to be rebooted).

Given that C++ will not crash even in the presence of such errors, one way of making sure your code is clean is by running it through Valgrind. Most people who have used Valgrind on a model with a few hundred or thousand lines of code have gotten discouraged by its possibly long and cryptic-looking output. However, do not let this intimidate you: the output is actually fairly easy to read once you either learn what to look for or use Valkyrie, a graphical user interface for Valgrind.

Generating and interpreting Valgrind’s output

The first thing that needs to be done for Valgrind to give you a meaningful output is to re-compile your code with the -O0 and -g flags: the former to prevent the compiler from modifying your code to make it more efficient but unintelligible to Valgrind (or to debuggers), and the latter for Valgrind (and debuggers) to be able to pinpoint the lines of code where issues happen and originate. Therefore, the code should be compiled as shown below:

bernardoct@DESKTOP-J6145HK ~
$ g++ -O0 -g test.cpp -o test

Now it is time to run your code with Valgrind to perform some memory checking. Valgrind itself takes flags that dictate the type of analysis to be performed. Here we are interested in checking for memory misuse (instead of profiling, checking for thread safety, etc.), so the first flag (not required, since memcheck is the default tool, but good for being explicit) should be --tool=memcheck. Now that we have specified that we want Valgrind to run a memory check, we should specify that we want it to look in detail for memory leaks and tell us where the errors are happening and originating, which can be done by passing the flags --leak-check=full and --track-origins=yes. This way, the complete call to run Valgrind on our test program is:

bernardoct@DESKTOP-J6145HK ~
$ valgrind --tool=memcheck --leak-check=full --track-origins=yes ./test

Important: beware that your code will take orders of magnitude longer to run with Valgrind than it would otherwise. This means that you should run something as small as possible but still representative. For example, instead of running your stochastic model with 1,000 realizations and a simulation time of 50 years, consider running 2 realizations simulating 2 years, so that Valgrind analyzes the year-long simulation and the transitions between realizations and years. Also, if running your code on a cluster, load the Valgrind module with module load valgrind-xyz on your submission script and replace the call to your model on the submission script with the Valgrind call above (you can find the exact name of the Valgrind module by running module avail on the terminal). If running Valgrind with code that uses MPI, use mpirun valgrind ./mycode -flags.

When called, Valgrind will instrument the execution of our test program and, based on the collected information, print the following on the screen:

==385== Memcheck, a memory error detector
==385== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==385== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==385== Command: ./test
==385==
==385== Conditional jump or move depends on uninitialised value(s)
==385==    at 0x4006A9: main (test.cpp:9)
==385==  Uninitialised value was created by a stack allocation
==385==    at 0x400686: main (test.cpp:3)
==385==
Got into if statement.
==385== Invalid write of size 4
==385==    at 0x4006D9: main (test.cpp:12)
==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==385==    by 0x400697: main (test.cpp:5)
==385==
==385== Invalid read of size 4
==385==    at 0x4006F5: main (test.cpp:16)
==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==385==    by 0x400697: main (test.cpp:5)
==385==
var[5] equals 5
==385==
==385== HEAP SUMMARY:
==385==     in use at exit: 20 bytes in 1 blocks
==385==   total heap usage: 3 allocs, 2 frees, 73,236 bytes allocated
==385==
==385== 20 bytes in 1 blocks are definitely lost in loss record 1 of 1
==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==385==    by 0x400697: main (test.cpp:5)
==385==
==385== LEAK SUMMARY:
==385==    definitely lost: 20 bytes in 1 blocks
==385==    indirectly lost: 0 bytes in 0 blocks
==385==      possibly lost: 0 bytes in 0 blocks
==385==    still reachable: 0 bytes in 0 blocks
==385==         suppressed: 0 bytes in 0 blocks
==385==
==385== For counts of detected and suppressed errors, rerun with: -v
==385== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)

Seeing Valgrind’s output being 5 times as long as the test code itself can be somewhat disheartening, but the information contained in it is really useful. The first block of the output is the header; it will always be printed so that you know the version of Valgrind you have been using, the command it ran, and so on. In our example, the header is:

==385== Memcheck, a memory error detector
==385== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==385== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==385== Command: ./test

After that, Valgrind reports the errors it found during the execution of your code. Errors are always reported as a description of the error in good old English, followed by where it happens in your code. Let’s look at the first error found by Valgrind:

==385== Conditional jump or move depends on uninitialised value(s)
==385==    at 0x4006A9: main (test.cpp:9)
==385==  Uninitialised value was created by a stack allocation
==385==    at 0x400686: main (test.cpp:3)

This tells us that there is an if statement (conditional statement) on line 9 of test.cpp in which at least one side of the logical test involves an uninitialized variable. As pointed out by Valgrind, line 9 of test.cpp has our problematic if statement, which compares the initialized variable n to the uninitialized variable m, whose value is whatever was last left in that memory address by the computer.

The second error block is the following:

==385== Invalid write of size 4
==385==    at 0x4006D9: main (test.cpp:12)
==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==385==    by 0x400697: main (test.cpp:5)

This means that your code is writing something in a location of memory that it did not allocate for its use. This block says that the illegal write, so to speak, happened on line 12 of test.cpp through a variable created on line 5 of test.cpp using the new[] operator. These lines correspond to var[i] = i; and to int *var = new int[5];. With this, we learn that either var was created too short on line 5 of test.cpp or the for loop that assigns values to var goes one or more steps too far.

Similarly, the next block tells us that our printf statement used to print the value of var[5] on the screen has read past the amount of memory that was allocated to var in its declaration on line 5 of test.cpp, as shown below:

==385== Invalid read of size 4
==385==    at 0x4006F5: main (test.cpp:16)
==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==385==    by 0x400697: main (test.cpp:5)

The last thing Valgrind will report is the information about memory leaks, which are accounted for when the program is done running. The output about memory leaks for our example is:

==385== HEAP SUMMARY:
==385==     in use at exit: 20 bytes in 1 blocks
==385==   total heap usage: 3 allocs, 2 frees, 73,236 bytes allocated
==385==
==385== 20 bytes in 1 blocks are definitely lost in loss record 1 of 1
==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
==385==    by 0x400697: main (test.cpp:5)
==385==
==385== LEAK SUMMARY:
==385==    definitely lost: 20 bytes in 1 blocks
==385==    indirectly lost: 0 bytes in 0 blocks
==385==      possibly lost: 0 bytes in 0 blocks
==385==    still reachable: 0 bytes in 0 blocks
==385==         suppressed: 0 bytes in 0 blocks

The important points to take away from this last block are that:

  1. there were 20 bytes of memory leaked, meaning that if this were a function in your code, every time it was called it would leave 20 bytes of garbage sitting in the RAM. This may not sound like a big deal, but imagine if your code left 1 MB of garbage in the RAM for each of the 100,000 times a function is called. With this, there go 100 GB of RAM and everything else you were doing on your computer at the time, because the computer will likely freeze and have to go through a hard reset.
  2. the memory you allocated and did not free was allocated on line 5 of test.cpp, when you used the operator new[] to allocate the integer pointer array.

It is important to notice here that if we increased the amount of memory allocated by the new[] operator on line 5 from 5 to 6 integers, the last two errors (invalid read and invalid write) would disappear. This means that if you run your code with Valgrind and see hundreds of errors, chances are that modifying just a few lines of code will get rid of most of them.

Valkyrie — a graphical user interface for Valgrind

Another way of going through Valgrind’s output is by using Valkyrie (now installed on the login node of Reed’s cluster, The Cube). If you are analyzing your code from your own computer with a Linux terminal (it does not work with Cygwin, but you can install a native Ubuntu terminal on Windows 10 by following the instructions posted here) and do not have Valkyrie installed yet, you can install it by running the following on your terminal:

bernardoct@DESKTOP-J6145HK ~
$ sudo apt-get install valkyrie

Valkyrie works by reading an xml file exported by Valgrind containing the information about the errors it found. To export this file, you need to pass the flags --xml=yes and --xml-file=valgrind_output.xml (or whatever name you want to give the file) to Valgrind, which makes the call to Valgrind become:

bernardoct@DESKTOP-J6145HK ~
$ valgrind --tool=memcheck --leak-check=full --track-origins=yes --xml=yes --xml-file=valgrind_output.xml ./test

Now, you should have a file called “valgrind_output.xml” in the directory you are calling Valgrind from. To open it with Valkyrie, first open Valkyrie by typing valkyrie on your terminal (if on Windows 10, you need to have Xming installed and running, which can be done by following the instructions at the end of this post). If on a cluster, besides having Xming open you also have to have ssh’ed into the cluster with the -X flag (e.g. by running ssh -X username@my.cluster.here) from either Cygwin or a native Linux terminal. After opening Valkyrie, click on the green folder button and select the xml file, as in the screenshot below.

[Screenshot: valkyrie_screenshot (opening the xml file with the green folder button)]

After opening the xml file generated by Valgrind, Valkyrie should look like the screenshot below:

[Screenshot: valkyrie_screenshot2 (Valkyrie displaying the parsed errors)]

Now you can begin from a collapsed list of errors and unfold each error to see its details. Keep in mind that Valkyrie is not your only GUI option for Valgrind, as IDEs like JetBrains’ CLion and Qt Creator come integrated with Valgrind. Now go check your code!

PS: Thanks to the folks on Reddit for the comments, which helped improve this post.

Installing a Native Ubuntu Terminal on Windows 10

Modern clusters and cloud platforms require users to log in through SSH on Unix terminals in order to have access to their functionalities. Such terminals also give their users access to a wide range of commands and programs, such as awk and Vim, which are not available on Windows. Linux and Mac users have such terminals installed by default, but Windows users normally have to install Cygwin or use a browser-based terminal, both with limited capabilities, such as difficulty installing new packages.

Fortunately, Microsoft has established a partnership with Canonical (Ubuntu’s parent company) which brought the Windows Subsystem for Linux to Windows 10, allowing users to install Ubuntu’s terminal on Windows through official means. Using Ubuntu’s terminal on Windows has the advantages of being able to use apt-get and dpkg to install new packages, which is not as straightforward with Cygwin, and of running Python and C/C++ code faster. Here are the steps to install the Ubuntu terminal on Windows 10:

  1. In Windows Settings, click on “Update & Security.” There, click on “For Developers” near the bottom of the left pane and turn on “Developer Mode.”
  2. In Windows Settings, click on “Apps” -> “Programs and Features” (right pane) -> “Turn Windows features on or off” (left pane) and check “Windows Subsystem for Linux.”

[Screenshot: turning on the Windows Subsystem for Linux feature]

  3. Restart your computer.
  4. Open Windows PowerShell as administrator, type the following line and press enter:
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
  5. Open the Microsoft Store (Windows’ app store), look for “Ubuntu,” and install it.
  6. Now you should have it installed and a shortcut on your quick-start bar.
  7. Open the Ubuntu terminal and type:
sudo apt-get update

In order to install programs such as the Intel compiler and profiler (free for students), pip, Vim, GNUPlot or the most recent version of GCC, just type:

sudo apt-get install program_to_be_installed

If the package you installed has graphical components, such as GNUPlot and Python/Matplotlib, you will need to install a program on Windows to display the graphical components from the Ubuntu terminal. One such option is Xming. To use Xming, follow these steps:

    1. Install Xming from here.
    2. Run it. Click “next” until you can click on “Finish.” This process will have to be repeated every time you open the terminal.
    3. Open the Ubuntu terminal.
    4. Type the following and press enter:
echo "export DISPLAY=localhost:0.0" >> .bashrc
    5. Close and re-open the terminal.
    6. To make sure you can run graphical applications, run the following two commands:
sudo apt-get install x11-apps
xeyes