# Parallelization of C/C++ and Python on Clusters

There are multiple ways of parallelizing C/C++ and Python codes to be used on clusters. The quick and dirty way, which often gets the job done, is to parallelize the code with MPI and launch one process per core across multiple nodes. Though easy to implement, this strategy may result in a fair waste of results because:

• With too many processes on one node, inefficient use of memory may make the code memory bottle-necked, meaning that a lot of time will be spent accessing memory and the cores will be sitting twiddling their thumbs.
• If you have a 48-core node with 192 GB of memory (as the Skylake nodes on Stampede 2) and your code uses 10 GB of RAM per process, you will not be able to have more than 19 out of 48 cores working at the same time because more than that would exceed the node’s memory, which would crash the node, which would get you kicked out of the cluster for a while without a warning — if you do not know how much RAM your code requires, you should look into the Remora tool.

If you are working on big runs on systems that charge per usage time or if you are creating a code you will have to run multiple times, it may be worth looking into more clever parallelization strategies. Some of the techniques described in the next subsections will also make your code two to four times faster on your desktop/laptop and are rather straight-forward to implement. Most of the code in this blog post can be freely downloaded from this repository. Lastly, if the #include statements in the examples below are not displayed as followed by a library between “<>”, please refer to the mentioned online repository.

## Hybrid Parallelization in C/C++

The word hybrid refers in this context to a mixed use of shared and distributed memory parallelization techniques. In distributed parallelization techniques, such as the Message Passing Interface (MPI), each process has its own dedicated chunk of memory within a RAM chip. This means that each process will have its own copy of the input data (repeated across processes), variables within the program, etc. Conversely, shared memory parallelization techniques such as OpenMP allow all threads (parallel unit of computation) to access the same block of memory, which can make a mess in your code with variables being changed seemingly out of nowhere by various parallel runs, but which lessens the amount of RAM needed and in some cases allow for more efficient memory utilization.

If running the Borg Master-Slave multiobjective evolutionary optimization algorithm (Borg MS) with hybrid parallelization, Borg MS will have each function evaluation performed by one MPI process, which could run in parallel with OpenMP (if your model is setup for that) across the cores designated for that process. As soon as a function evaluation is complete, the slave MPI process in charge of it will submit the results back to Borg MS’s master process and be given another set of decision variables to work on. The figure below is a schematic drawing of such hybrid parallelization scheme:

### OpenMP

Parallelizing a for-loop in C or C++ in which each iteration is independent — such as in Monte Carlo simulations — is straight-forward:

#include
#include "path/to/my_model.h"

int main(int argc, char *argv[])
{
int n_monte_carlo_simulations = omp_get_num_threads(); // One model run per core.
vector system_inputs = ;
vector results(n_monte_carlo_simulations, 0);
#pragma omp parallel for
for (int i = 0; i < n_monte_carlo_simulations; ++i)
{
results[i] = RunMonteCarloSimulation(system_inputs[i]);
}
}


The only caveat is that you have to be sure that the independent simulations are not writing over each other. For example, if by mistake you had written results[0] instead of results[i], the final value on results[0] would be that of whichever thread (parallel simulation) finished last, while the rest of the vector would be of zeros. It is important to notice here that the vector system_inputs is read only once by the parallel code and all n_monte_carlo_simulations threads would have access to it in memory.

Though this example is not rocket science, there is a lot more to OpenMP that can be found here and here.

### MPI

MPI stands for Message Passing Interface. It creates processes which pass messages (information packages) to each other. The most basic MPI code is one that creates independent processes — each with its own allocated chunk of memory that cannot be accessed by any other process — and just has them running independently in parallel without communicating with each other. This is great to distribute Monte Carlo runs across various nodes, although it may lead to the limitations mentioned in the beginning of this post. Below is an example of a very basic MPI code to run Monte Carlo simulations in parallel:

#include
#include "path/to/my_model.h"

int main(int argc, char *argv[])
{
MPI_Init(NULL, NULL);

int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

vector system_inputs = ;
RunMonteCarloSimulation(system_inputs[world_rank]);

MPI_Finalize();
return 0;
}


I will not cover in this post how to pass the output of the RunMonteCarloSimulation function back to a results vector, as in the previous example, because details of MPI is outside the scope of this post — a good MPI tutorial can be found here and great examples, including ones for Monte Carlo runs, can be found here. To run this code, you would have to compile it with mpicc mpi_demo.c -o MPIDemo and call it with the mpirun or mpiexec commands, which should be followed by the -n flag to specify the number of processes to be created, or mpirun -n 4 ./MPIDemo. The code above would run as many MonteCarlo simulations as there are processes (four processes if -n 4 was used), each with a rank (here a type of ID), and can be spread across as many nodes as you can get on a cluster. Each process of the code above would run on a core, which may result in the memory and/or processing limitations listed in the beginning of this post.

### Mixing OpenMP and MPI

The ideal situation for most cases is when you use MPI to distribute processes across nodes and OpenMP to make sure that as many cores within a node as it makes sense to use are being used. To do this, you can use the structure of the code below.

#include
#include
#include
#include "path/to/my_model.cpp"

int main(int argc, char *argv[])
{
MPI_Init(NULL, NULL);

int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

int n_cores = omp_get_num_threads(); // get number of cores available for MPI process.
vector system_inputs = ;
vector results(n_cores , 0);
#pragma omp parallel for
for (int i = 0; i < n_cores ; ++i)
{
results[i] = RunMonteCarloSimulation(system_inputs[i]);
}

MPI_Finalize();
return 0;

}


If you comment lines 4 and 20 and uncomment line 19, compile the code with mpicc hybrid.c -o Hybrid -fopenmp, set the number of threads in the terminal to eight by running set OMP_NUM_THREADS=8, and call the resulting executable with the command mpirun -n 4 ./Hybrid, the executable will spawn four processes, each of which spawning eight threads and printing the following output:

C hybrid. Process-thread: 0-0


The code above can be used to have one process per node (and therefore one copy of system_inputs per node) using as many threads as there are cores in that node — depending on the number of cores, it is sometimes more efficient to have two or three MPI processes per node, each using half or a third of the cores in that node. Below is a PBS job submission script that can be used to generate the output above — job submission scripts for Slurm to be used on Stampede 2 can be found here.

#!/bin/bash
#PBS -N test_hybrid
#PBS -l nodes=2:ppn=16
#PBS -l walltime=:00:30
#PBS -o ./output/hybrid_c.out
#PBS -e ./error/hybrid_c.err

set -x
cd "$PBS_O_WORKDIR" # Construct a copy of the hostfile with only 16 entries per node. # MPI can use this to run 16 tasks on each node. export TASKS_PER_NODE=2 uniq "$PBS_NODEFILE"|awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i nodefile #cat nodefile mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS ./Hybrid  ## Parallelizing Python on a Cluster Python was not designed with shared-memory parallelization in mind, so its native parallelization library, the Multiprocessing library, does distributed-memory parallelization instead of shared. The issue with the Multiprocessing library is that it does not work across nodes, so you still need something like MPI. Here, we will use MPI4Py. The code below parallelizes the function call across cores within the same node — please refer to this blog post for details about and other ways to use the Multiprocessing library. from multiprocessing import Process, cpu_count def do_something_useful(rank, shared_process_number): # Do something useful here. print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number) # Create node-local processes shared_processes = [] #for i in range(cpu_count()): for i in range(8): p = Process(target=do_something_useful, args=(i,)) shared_processes.append(p) # Start processes for sp in shared_processes: sp.start() # Wait for all processes to finish for sp in shared_processes: sp.join()  By adding MPI4Py to the code above, following the same logic of the C/C++ example, we get the following result: from mpi4py import MPI from multiprocessing import Process, cpu_count def do_something_useful(rank, shared_process_number): # Do something useful here. print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number) comm = MPI.COMM_WORLD rank = comm.Get_rank() print 'Calling Python multiprocessing process from Python MPI rank {}'.format(rank) # Create shared-ish processes shared_processes = [] #for i in range(cpu_count()): for i in range(8): p = Process(target=do_something_useful, args=(rank, i)) shared_processes.append(p) # Start processes for sp in shared_processes: sp.start() # Wait for all processes to finish for sp in shared_processes: sp.join() comm.Barrier()  Calling the code above with mpirun -n 4 python hybrid_pure_python.py will first result in four MPI processes being spawned with ranks 1 through 4, each spawning eight local processes. The output should look similar to the one below: Calling Python multiprocessing process from Python MPI rank 0 Calling Python multiprocessing process from Python MPI rank 1 Python hybrid, MPI_Process-local_process (not quite a thread): 1-0 Python hybrid, MPI_Process-local_process (not quite a thread): 0-0 Python hybrid, MPI_Process-local_process (not quite a thread): 1-1 Python hybrid, MPI_Process-local_process (not quite a thread): 0-1 Python hybrid, MPI_Process-local_process (not quite a thread): 1-2 Python hybrid, MPI_Process-local_process (not quite a thread): 0-2 Python hybrid, MPI_Process-local_process (not quite a thread): 0-3 Python hybrid, MPI_Process-local_process (not quite a thread): 1-3 Calling Python multiprocessing process from Python MPI rank 2 Calling Python multiprocessing process from Python MPI rank 3 Python hybrid, MPI_Process-local_process (not quite a thread): 1-4 Python hybrid, MPI_Process-local_process (not quite a thread): 0-4 Python hybrid, MPI_Process-local_process (not quite a thread): 0-5 Python hybrid, MPI_Process-local_process (not quite a thread): 1-5 Python hybrid, MPI_Process-local_process (not quite a thread): 1-6 Python hybrid, MPI_Process-local_process (not quite a thread): 0-6 Python hybrid, MPI_Process-local_process (not quite a thread): 1-7 Python hybrid, MPI_Process-local_process (not quite a thread): 0-7 Python hybrid, MPI_Process-local_process (not quite a thread): 2-0 Python hybrid, MPI_Process-local_process (not quite a thread): 3-0 Python hybrid, MPI_Process-local_process (not quite a thread): 3-1 Python hybrid, MPI_Process-local_process (not quite a thread): 2-1 Python hybrid, MPI_Process-local_process (not quite a thread): 3-2 Python hybrid, MPI_Process-local_process (not quite a thread): 2-2 Python hybrid, MPI_Process-local_process (not quite a thread): 3-3 Python hybrid, MPI_Process-local_process (not quite a thread): 2-3 Python hybrid, MPI_Process-local_process (not quite a thread): 3-4 Python hybrid, MPI_Process-local_process (not quite a thread): 2-4 Python hybrid, MPI_Process-local_process (not quite a thread): 2-5 Python hybrid, MPI_Process-local_process (not quite a thread): 3-5 Python hybrid, MPI_Process-local_process (not quite a thread): 3-6 Python hybrid, MPI_Process-local_process (not quite a thread): 3-7 Python hybrid, MPI_Process-local_process (not quite a thread): 2-6 Python hybrid, MPI_Process-local_process (not quite a thread): 2-7  Below is a PBS submission script that can be used with the code above: #!/bin/bash #PBS -N test_hybrid #PBS -l nodes=2:ppn=16 #PBS -l walltime=00:01:00 #PBS -o ./output/hybrid_python.out #PBS -e ./error/hybrid_python.err module load python-2.7.5 export OMP_NUM_THREADS=8 set -x cd "$PBS_O_WORKDIR"

# Construct a copy of the hostfile with only 16 entries per node.
# MPI can use this to run 16 tasks on each node.
uniq "$PBS_NODEFILE"|awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i nodefile

#cat nodefile
mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_pure_python.py


## Parallelizing Monte Carlo Runs of C/C++ (or anything else) with Python

Very often we have to run the same model dozens to thousands of times for Monte-Carlo-style analyses. One option is to go on a cluster and submit dozens to thousand of jobs, but this becomes a mess for the scheduler to manage and cluster managers tend to not be very fond of such approach. A cleaner way of performing Monte Carlo runs in parallel is to write a code in Python or other language that spawns MPI processes which then call the model to be run. This can be done in Python with the subprocess library, as in the example below:

from mpi4py import MPI
import subprocess

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

print 'Calling OpenmpTest from Python MPI rank {}'.format(rank)
function_call = ['./OpenmpTest', '{}'.format(rank)]
subprocess.call(function_call)

comm.Barrier()


Here, OpenmpTest is an executable compiled from the C-OpenMP code below:

#include
#include

int main(int argc, char *argv[])
{
#pragma omp parallel for
for (int i = 0; i < omp_get_num_threads(); ++i)
{
printf("C OpenMP. Process-thread: %s-%d\n", argv[1], i);
}
return 0;
}


Submitting the code above with following PBS job submission script should result in the output below.

#!/bin/bash
#PBS -N test_hybrid
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:01:00
#PBS -o ./output/hybrid_python_c.out
#PBS -e ./error/hybrid_python_c.err

set -x
cd "$PBS_O_WORKDIR" # Construct a copy of the hostfile with only 16 entries per node. # MPI can use this to run 16 tasks on each node. export TASKS_PER_NODE=2 uniq "$PBS_NODEFILE"|awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i nodefile #cat nodefile mpiexec –hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_calls_c.py Output: Calling OpenmpTest from Python MPI rank 0 Calling OpenmpTest from Python MPI rank 1 Calling OpenmpTest from Python MPI rank 2 Calling OpenmpTest from Python MPI rank 3 C OpenMP. Process-thread: 0-7 C OpenMP. Process-thread: 0-5 C OpenMP. Process-thread: 0-0 C OpenMP. Process-thread: 0-2 C OpenMP. Process-thread: 0-4 C OpenMP. Process-thread: 0-1 C OpenMP. Process-thread: 0-3 C OpenMP. Process-thread: 0-6 C OpenMP. Process-thread: 1-1 C OpenMP. Process-thread: 1-2 C OpenMP. Process-thread: 1-6 C OpenMP. Process-thread: 1-4 C OpenMP. Process-thread: 1-5 C OpenMP. Process-thread: 1-0 C OpenMP. Process-thread: 1-3 C OpenMP. Process-thread: 1-7 C OpenMP. Process-thread: 2-3 C OpenMP. Process-thread: 2-7 C OpenMP. Process-thread: 2-0 C OpenMP. Process-thread: 2-4 C OpenMP. Process-thread: 2-6 C OpenMP. Process-thread: 2-5 C OpenMP. Process-thread: 2-1 C OpenMP. Process-thread: 2-2 C OpenMP. Process-thread: 3-5 C OpenMP. Process-thread: 3-1 C OpenMP. Process-thread: 3-4 C OpenMP. Process-thread: 3-2 C OpenMP. Process-thread: 3-7 C OpenMP. Process-thread: 3-3 C OpenMP. Process-thread: 3-0 C OpenMP. Process-thread: 3-6  ## End of the Story As highlighted in the examples above (all of which can be download from here), creating parallel code is not rocket science and requires very few lines of code. The OpenMP and Multiprocessing parallelization tricks can save hours if you have to run a model several times on your personal computer, and paired with MPI these tricks can save you from hours to days in time and thousands of node-hours on clusters. I hope you make good use of it. # MOEAFramework Training Part 1: Connecting an External Problem The goal of this training is to step a user through becoming familiar with the capabilities of MOEAFramework, a free and open source Java library created by Dave Hadka, that allows the user to design, execute, and test out a variety of popular multi-objective evolutionary algorithms (MOEAs). In this series, we will demonstrate the capabilities of MOEAFramework in 4 parts. Part 1 demonstrates how to hook up an external optimization problem to MOEAFramework. Part 2 will cover the optimization of the problem using a variety of algorithms. Part 3 will illustrate how to calculate metrics to assess the performance of the algorithms. Finally, Part 4 will step through generation of relevant figures from the metrics that convey the effectiveness, efficiency, reliability, and controllability of the algorithms. The example test case that will be used throughout this tutorial is the DPS version of the Lake Problem. The code is written and adapted by Dr. Julianne Quinn and can be found here. However, the tutorial will be set up so that the user can easily swap in their own problem formulation. Finally, this guide is specifically built to connect an external problem that is written in C++ but Dave Hadka’s Beginner Guide to the MOEA Framework, has examples of how to create the Java executable for problems written in a language other than C++. ### Connecting Your Problem Before starting this training, it is recommended to set up a directory in a Linux environment with Java installed. If you are a student in Reed Research Group, create a folder in your Cube directory called “MOEA_Diagnostics_Tutorial”. Throughout the training, we will be adding various subdirectories and files to this folder. For this part of the tutorial, you will need the following in your directory: 1. MOEAFramework Demo .jar file which can be found by clicking on “Demo Application” on the far right. 2. The C++ version of the Lake Problem 3. SOWs_Type6.txt (natural inflows read in by the .cpp file) 4. moeaframework.c and moeaframework.h found here Once you have the .cpp file, the next step is to get it set up to be recognized by MOEAFramework. I have made these changes in this file in my GitHub repo. If you compare with the file in 2, you will see some changes. First, I have commented out all parts of the code related to Borg, which the problem was originally set up to be optimized with. For this tutorial, we will be optimizing with multiple algorithms and comparing their performance. Secondly, in lines 55-57, I have use the #define directive to declare the number of variables, objectives, and constraints to be constants. Lines 300-309 is where the direct connection to MOEAFramework comes in. MOEA_Init initializes the communication between C++ and MOEAFramework and takes the number of objectives and constraints as arguments. Then we can start reading and evaluating solutions. MOEA_Read_doubles extracts the decision variables and stores them in an array, vars. Line 304 calls the lake_problem function that we would like to evaluate, along with the variables, and constraints. This results in the objs array being filled. Finally, MOEA_Write writes the objectives back into the framework. When all solutions are done being read and written, MOEA_Terminate closes the connection. The important thing here is just to make sure that you are passing the names of the arguments of your lake_problem function correctly. Next, we must make an executable of the C++ file that Java can read. The makefile is located here and takes lake.cpp, moeaframework.c, moeaframework.h, and some relevant libraries and compiles them into an executable called “lake”. In order to run this file, simply type “make”. Finally, we must turn our executable into a java class. The relevant java file can be found here. This file can be found in the MOEAFramework documentation, but must be tailored to an external problem. Lines 21-25 import in the relevant tools from MOEAFramework that help to configure and solve the problem. Lines 40, 42, 43, and 48 show the first change from the original file. Here we insert the name of our executable that was generated in the last step, “lake”. In lines 53, 58, and 63, we state the number of decision variables, objectives, and constraints. Finally, in lines 67-75, we create a newSolution ( ) method to specify that our solution should have 6 real-valued decision variables between 0 and 1. In order to create a class file, simply type:  javac -classpath MOEAFramework-2.12-Demo.jar lake.java  A file called lake.class will be created. The first part of training is done! Next time, we will set up some scripts, call these executables, and optimize the lake problem with a variety of different algorithms. # Making Valgrind Easy Some of this blog’s readers and authors (most notably, Joe Kasprzik) read the title of this post and though “wait, there already is a post about Valgrind in this blog.” And you are right, so in this blog post I will build on the legacy Joe has left us on his post about Valgrind and get into the details of how to use its basic functionalities to get your code right. # Common mistakes when coding in C++ Suppose we have the following code: #include <stdio.h> int main() { int *var = new int[5]; // you wouldn't do this if the size was always 5, but this makes the argument clear. int n = 5; int m; if (m > n) { printf("Got into if statement.\n"); for (int i = 0; i < 6; ++i) { var[i] = i; } } printf("var[5] equals %d\n", var[n]); }  Saving the code above in a file called test.cpp, compiling it with g++ to create an executable called "test," and running it with "./test" will return the following output: bernardoct@DESKTOP-J6145HK ~$ g++ test.cpp -o test

bernardoct@DESKTOP-J6145HK ~
$./test Got into if statement. var[5] equals 5  Great, it ran and did not crash (in such a simple code gcc's flag -Wall would have issued a warning saying m was not initialized, but in more complex code such warning may not be issued). However, it would be great if this code had crashed because this would make us look into it and figure out it actually has 3 problems: 1. We did not assign a value to variable m (it was created but not initialized), so how did the code determine that m was greater than n to get into the code inside the if statement? 2. The pointer array var was created as having length 5, meaning its elements are numbered 0 to 4. If the for-loop runs from 0 to 5 but element 5 does not exist, how did the code fill it in with the value of variable i when i was 5 in the loop? From the printf statement that returned 5 we know vars[5] equals 5. 3. The pointer array var was not destroyed after the code did not need it any longer. This is not necessarily a problem in this case, but if this was a function that is supposed to be called over and over within a model there is a change the RAM would be filled with these seemingly inoffensive pointer arrays and the computer would freeze (or the node, if running on a cluster, would possibly crash and have to be rebooted). Given C++ will not crash even in the presence of such errors, one way of making sure your code is clean is by running it through Valgrind. However, most people who has used Valgrind on a model that has a few hundreds or thousands of lines of code has gotten discouraged by its possibly long and cryptic-looking output. However, do not let this intimidate you because the output is actually fairly easy to read once you either learn what to look for or use Valkyrie, a graphical user interface for Valgrind. # Generating and interpreting Valgrind’s output The first think that needs to be done for Valgrind to give you a meaningful output is to re-compile your code with the -O0 and -g flags, the former to prevent the compiler from modifying your code to make it more efficient but unintelligible to Valgrind (or to debuggers), and the latter for Valgrind (and debuggers) to be able to pinpoint the line of code where issues happen and are originated. Therefore, the code should be compiled as shown below: bernardoct@DESKTOP-J6145HK ~$ g++ -O0 -g test.cpp -o test



Now it is time to run your code with Valgrind to perform some memory checking. Valgrind itself will take flags that will dictate the type of analysis to be performed. Here we are interested in checking memory misuse (instead profiling, checking for thread safety, etc.), so the first flag (not required, but good to keep things for yourself) should be --tool=memcheck. Now that we specified that we want Valgrind to run a memory check, we should specify that we want it to look in detail for memory leaks and tell us where the erros are happening and originating, which can done by passing flags --leak-check=full and --track-origins-yes. This way, the complete function call to run Valgrind on our test program is:

bernardoct@DESKTOP-J6145HK ~
$valgrind --tool=memcheck --leak-check=full --track-origins=yes ./test  Important: Beware that your code will take orders of magnitude longer to run with Valgrind than it would otherwise. This means that you should run something as small as possible but still representative — e.g. instead of running your stochastic model with 1,000 realizations and a simulation time of 50 years, consider running 2 realizations simulating 2 years, so that Valgrind analyzes the year-long simulation and the transition between realizations and years. Also, if running your code on a cluster, load the valgrind module with module load valgrind-xyz on your submission script and replace the call to your model on the submission script by the valgrind call above you can find the exact name of the Valgrind module by running module avail on the terminal. If running valgrind with a code that used MPI, use mpirun valgrind ./mycode -flags. When called, valgrind will instrument our test.cpp and based on the collected information will print the following on the screen: ==385== Memcheck, a memory error detector ==385== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==385== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info ==385== Command: ./test ==385== ==385== Conditional jump or move depends on uninitialised value(s) ==385== at 0x4006A9: main (test.cpp:9) ==385== Uninitialised value was created by a stack allocation ==385== at 0x400686: main (test.cpp:3) ==385== Got into if statement. ==385== Invalid write of size 4 ==385== at 0x4006D9: main (test.cpp:12) ==385== Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd ==385== at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423) ==385== by 0x400697: main (test.cpp:5) ==385== ==385== Invalid read of size 4 ==385== at 0x4006F5: main (test.cpp:16) ==385== Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd ==385== at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423) ==385== by 0x400697: main (test.cpp:5) ==385== var[5] equals 5 ==385== ==385== HEAP SUMMARY: ==385== in use at exit: 20 bytes in 1 blocks ==385== total heap usage: 3 allocs, 2 frees, 73,236 bytes allocated ==385== ==385== 20 bytes in 1 blocks are definitely lost in loss record 1 of 1 ==385== at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423) ==385== by 0x400697: main (test.cpp:5) ==385== ==385== LEAK SUMMARY: ==385== definitely lost: 20 bytes in 1 blocks ==385== indirectly lost: 0 bytes in 0 blocks ==385== possibly lost: 0 bytes in 0 blocks ==385== still reachable: 0 bytes in 0 blocks ==385== suppressed: 0 bytes in 0 blocks ==385== ==385== For counts of detected and suppressed errors, rerun with: -v ==385== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)  Seeing Valgrind’s output being 5 times as long as the test code itself can be somewhat disheartening, but the information contained in the output is really useful. The first block of the output is the header it will always be printed so that you know the version of Valgrind you have been using, the call for your own code it used, and so on. In our example, the header is: ==385== Memcheck, a memory error detector ==385== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==385== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info ==385== Command: ./test  After that, Valgrind report the errors it found during the execution of your code. Errors are always reported as a description of the error in good old English, followed by where it happens in your code. Let’s look at the first error found by Valgrind: ==385== Conditional jump or move depends on uninitialised value(s) ==385== at 0x4006A9: main (test.cpp:9) ==385== Uninitialised value was created by a stack allocation ==385== at 0x400686: main (test.cpp:3)  This tells us that there is an if statement (conditional statement) on line 9 of test.cpp in which at least one of the sides of the logical test has at least one uninitialized variable. As pointed out by Valgrind, line 9 of test.cpp has our problematic if statement which compares initialized variable n to uninitialized variable m, which will have whatever was put last in that memory address by the computer. The second error block is the following: ==385== Invalid write of size 4 ==385== at 0x4006D9: main (test.cpp:12) ==385== Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd ==385== at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423) ==385== by 0x400697: main (test.cpp:5)  This means that your code is writing something in a location of memory that it did not allocated for its use. This block says that the illegal write, so to speak, happened in line 12 of test.cpp through a variable created in line 5 of test.cpp using the new[] operator. These lines correspond to var[i] = i; and to int *var = new int[5];. With this, we learned that either var was created too short on line 5 of test.cpp or that the for loop that assigns values to var goes one or more steps too far. Similarly, the next block tells us that our printf statement used to print the value of var[5] on the screen has read past the amount of memory that was allocated to var in its declaration on line 5 of test.cpp, as shown below: ==385== Invalid read of size 4 ==385== at 0x4006F5: main (test.cpp:16) ==385== Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd ==385== at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423) ==385== by 0x400697: main (test.cpp:5)  The last thing Valgrind will report is the information about memory leaks, which are accounted for when the program is done running. The output about memory leaks for our example is: ==409== HEAP SUMMARY: ==409== in use at exit: 20 bytes in 1 blocks ==409== total heap usage: 3 allocs, 2 frees, 73,236 bytes allocated ==409== ==409== 20 bytes in 1 blocks are definitely lost in loss record 1 of 1 ==409== at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423) ==409== by 0x400697: main (test.cpp:5) ==409== ==409== LEAK SUMMARY: ==409== definitely lost: 20 bytes in 1 blocks ==409== indirectly lost: 0 bytes in 0 blocks ==409== possibly lost: 0 bytes in 0 blocks ==409== still reachable: 0 bytes in 0 blocks ==409== suppressed: 0 bytes in 0 blocks  The important points to take away from this last block are that: 1. there were 20 bytes of memory leaks, meaning that if this were a function in your code every time it was run it would leave 20 bytes of garbage sitting in the RAM. This may not sound like a big deal but imagine if your code leaves 1 MB of garbage in the RAM for each of the 100,000 times a function is called. With this, there went 100 GB of RAM and everything else you were doing in your computer at that time because the computer will likely freeze and have to go through a hard-reset. 2. the memory you allocated and did not free was allocated in line line 5 of test.cpp when you used the operator new[] to allocate the integer pointer array. It is important to notice here that if we increase the amount of allocated memory by the new[] operator on line 5 to that corresponding to 6 instead of 5 integers, the last two errors (invalid read and invalid write) would disappear. This means that if you run your code with Valgrind and see hundreds of errors, chances are that it will take modifying a few lines of code to get rid of most of these errors. # Valkyrie — a graphical user interface for Valgrind Another way of going through Valgrind’s output is by using Valkyrie (now installed in the login node of Reed’s cluster, The Cube). If you are analyzing your code from your own computer with a Linux terminal (does not work with Cygwin, but you can install a native Ubuntu terminal on Windows 10 by following instructions posted here) and do not have Valkyrie installed yet, you can install it by running the following on your terminal: bernardoct@DESKTOP-J6145HK ~$ sudo apt-get install valkyrie


Valkyrie works by reading an xml file exported by Valgrind containing the information about the errors it found. To export this file, you need to pass the flags --xml=yes and --xml-file=valgring_output.xml (or whatever name you want to give the file) to Valgrind, which would make the call to Valgrind become:

bernardoct@DESKTOP-J6145HK ~
valgrind --tool=memcheck --leak-check=full --track-origins=yes --xml=yes --xml-file=valgring_output.xml ./test  Now, you should have a file called “valgrind_output.xml” in the directory you are calling Valgrind from. To open it with Valkyrie, first open Valkyrie by typing valkyrie on your terminal if on Windows 10 you need to have Xming installed and running, which can be done by following the instructions in the end of this post. If on a cluster, besides having Xming open you also have to have ssh’ed into the cluster with the -X flag (e.g. by running ssh -X username@my.cluster.here) with either Cygwin or from a native Linux terminal. After opening Valkyrie, click on the green folder button and select the xml file, as in the screenshot below. After opening the xml file generated by Valgrind, Valkyrie should look like in the screenshot below: Now you can begin from a collapsed list of errors and unfold each error to see its details. Keep in mind that Valkyrie is not your only option of GUI for Valgrind, as IDEs like JetBrains’ CLion and QTCreator come integrated with Valgrind. Now go check your code! PS: Thanks to folks on Redit for the comments which helped improve this post. # A completely non-exhaustive list of tutorial resources for scientific computing This is a short blog post to put together a list of resources with tutorials (similar to what’s usually found on this blog) for various programming languages. It is by no means exhaustive, so please comment if you feel there’s an important one I left out. ## Matlab: https://www.arnevogel.com/ Kinda new blog with Matlab tutorials on numerical methods https://blogs.mathworks.com/loren/ One of the MathWorks blogs, great tutorials. https://www.mathworks.com/support/learn-with-matlab-tutorials.html Matlab tutorials from MathWorks http://undocumentedmatlab.com/ More Matlab tutorials, a lot of material on many topics ## Python: https://glowingpython.blogspot.com/ Not updated very frequently, but good data analysis and visualization tutorials are posted https://pythonprogramming.net/ Updated regularly, some great data visualization and analysis tutorials. The tutorials come with videos. I really like this site. http://treyhunner.com/ Updated regularly (about once a month) with python tutorials and general python coding insights. I really like the writing style of this author. https://docs.scipy.org/doc/scipy/reference/tutorial/ Tutorials on using SciPy ## C and C++: https://www.cprogramming.com/ C and C++ programming tutorials, tips and tricks http://www.cplusplus.com/articles/ Not really updated anymore but some good basic tutorials are listed https://blog.knatten.org/ Hadn’t been updated in a while, but it looks like it’s been picked up again. Good for general C++ programming and good practice. http://www.bfilipek.com/ C++ tutorials ## General: https://towardsdatascience.com/ This is a great general resource not devoted to a particular language. They cover topics in data science, machine learning and programming in general. Some great tutorials for Python and R. https://projecteuler.net/ Mathematical programming problems to help with learning any language https://github.com/EbookFoundation/free-programming-books/blob/master/free-programming-books.md Free programming books repository ### Reddits (some of the bigger ones): /r/matlab (General on matlab, community provides help with coding) /r/programming (General on programming) /r/learnprogramming (Community provides help with debugging questions) /r/python (General on python, community provides help with coding) /r/learnpython (Community provides help with python questions, smaller than /r/python) /r/cpp (General on C++) /r/cpp_questions (Community provides help with C++ questions) I’ve also recently made /r/sci_comp which has very little activity for the moment, but the aim is to create a community with general resources on coding for scientific applications. # Simple tricks to make your C/C++ code run faster Us, engineers with no formal computer science training, for a myriad of good reasons tend to favor friendly programming languages such as Python over evil C/C++. However, when our application is performance sensitive and we cannot wait for results sitting in our chairs for as long as a Python code would require us to, we are sometimes forced to us C/C++. Why then not making the most of it when it this is the case? Here are some simple tips to make your C/C++ code run even faster, and how to get some advice about further performance improvements. The last idea (data locality) is transferable to Python and other languages. Most important trick Improve your algorithm. Thinking if there is a simpler way of doing what you coded may reduce your algorithm’s complexity (say, from say n3 to n*log(n)), which would: • yield great speed-up when you have lots of data or needs to run a computation several times in a row, and • make you look smarter. Compiler flags First and foremost, the those who know what they are doing — compiler developers — do the work for you by calling the compiler with the appropriate flags. There are an incredible amount of flags you can call for that, but I would say that the ones you should have on whenever possible are -O3 and –march=native. The optimization flags (-O1 to -O3, the latter more aggressive than the former) will perform a series of modification on your code behind the scenes to speed it up, some times by more than an order of magnitude. The issue is that this modifications may eventually make your code behave differently than you expected, so it’s always good to do a few smaller runs with -O0 and -O3 and compare their results before getting into production mode. The –march=native flag will make the compiler fine tune your code to the processor it is being compiled on (conversely, –march=haswell would fine tune it to haswell architectures, for example). This is great if you are only going to run your code on your own machine or in another machine known to have a similar architecture, but if you try to run the binary on a machine with a different architecture, specially if it is an older one, you may end up with illegal instruction errors. Restricted pointer array When declaring a pointer array which you are sure will not be subject to pointer aliasing — namely there will be no other pointer pointing to the same memory address –, you can declare that pointer as a restrict pointer, as below: • GCC: double* __restrict__ my_pointer_array • Intel compiler: double* restrict my_pointer_array This will let the compiler know that it can change order of certain operations involving my_pointer_array to make your code faster without changing some read/write order that may change your results. If you want to use the restricted qualifier with the intel compiler the flag -restrict must be passed as an argument to the compiler. Aligned pointer array By aligning an array, you are making sure the data is going to lie in the ideal location in memory for the processor to fetch it and perform the calculations it needs based on that array. To help your compiler optimize your data alignment, you need to (1) align your array when it is declared by a specific number of bytes and (2) tell your the compiler the array is aligned when using it to perform calculations — the compiler has no idea whether or not arrays received as arguments in function are aligned. Below are examples in C, although the same ideas apply to C++ as well. GCC #include <stdio.h> #include <omp.h> #define SIZE_ARRAY 100000 #define ALIGN 64 void sum(double *__restrict__ a, double *__restrict__ b, double *__restrict__ c, int n) { a = (double*) __builtin_assume_aligned(a, ALIGN); b = (double*) __builtin_assume_aligned(b, ALIGN); c = (double*) __builtin_assume_aligned(c, ALIGN); for (int i = 0; i < n; ++i) c[i] = a[i] + b[i]; } int main(void) { double a[SIZE_ARRAY] __attribute__((aligned(ALIGN ))); double b[SIZE_ARRAY] __attribute__((aligned(ALIGN ))); double c[SIZE_ARRAY] __attribute__((aligned(ALIGN ))); for (int i = 0; i < SIZE_ARRAY; ++i) { a[i] = 5.; b[i] = 2.; } double start_time = omp_get_wtime(); sum(a, b, c, SIZE_ARRAY); double time = omp_get_wtime() - start_time; printf("%0.6fs", time); }  Intel compiler #include <stdio.h> #include <omp.h> #define SIZE_ARRAY 100000 #define ALIGN 64 void sum(double* restrict a, double* restrict b, double* restrict c, int n) { __assume_aligned(a, ALIGN); __assume_aligned(b, ALIGN); __assume_aligned(c, ALIGN); for (int i = 0; i < n; ++i) c[i] = a[i] + b[i]; } int main(void) { __declspec(align(ALIGN )) float a[SIZE_ARRAY]; __declspec(align(ALIGN )) float b[SIZE_ARRAY]; __declspec(align(ALIGN )) float c[SIZE_ARRAY]; for (int i = 0; i < SIZE_ARRAY; ++i) { a[i] = 5.; b[i] = 2.; } double start_time = omp_get_wtime(); sum(a, b, c, SIZE_ARRAY); double time = omp_get_wtime() - start_time; printf("%0.6fs", time); }  Edit: In a comment to this post, Krister Walfridsson not only caught an issue with my GCC code, for which mention I thank him, but he also shows the differences in machine code generated with and without alignment. Data Locality Computers are physical things, which means that data is physically stored and needs to be physically moved around in memory and between cache and processor in order to be used in calculations. This means that, if your data is stored all over the place in memory — e.g. in multiple pointer arrays in different parts of memory –, the processor will have to reach out to several parts of memory to fetch all your data before performing any computations. By having the data intelligently laid out in memory you ensure all data required for each computation is stored close to each other and in cache at the same time, which becomes even more important if your code uses too much data to fit in the cache at once. In order to making your processor’s life easier, it is a good idea to ensure that all data required for a calculation step is close together. For example, if a given computation required three arrays of fixed sizes, it is always a good idea to merge them into one long array, as in the example below for the Intel compiler. #include <stdio.h> #include <omp.h> #define SIZE_ARRAY 100000 #define ALIGN 64 void sum(double* restrict a, double* restrict b, double* restrict c, int n) { __assume_aligned(a, ALIGN); __assume_aligned(b, ALIGN); __assume_aligned(c, ALIGN); for (int i = 0; i < n; ++i) c[i] = a[i] + b[i]; } int main(void) { __declspec(align(ALIGN )) float abc[3 * SIZE_ARRAY]; for (int i = 0; i < 2 * SIZE_ARRAY; ++i) { a[i] = 5.; b[i] = 2.; } double start_time = omp_get_wtime(); sum(abc, abc + SIZE_ARRAY, abc + 2 * ARRAY, SIZE_ARRAY); double time = omp_get_wtime() - start_time; printf("%0.6fs", time); }  or even, since c[i] depends only on b[i] and a[i], we can have the values of a, b and c intercalated to assure that all computations will be performed on data that is right next to each other in memory: #include <stdio.h> #include <omp.h> #define SIZE_ARRAY 100000 #define ALIGN 64 #define STRIDE 3 void sum(double* restrict abc, int n, int stride) { __assume_aligned(abc, ALIGN); for (int i = 0; i < n; i += stride) abc[i+2] = abc[i] + abc[i+1]; } int main(void) { __declspec(align(ALIGN )) double abc[3 * SIZE_ARRAY]; for (int i = 0; i < 3 * SIZE_ARRAY; i += STRIDE) { abc[i] = 5.; abc[i+1] = 2.; } double start_time = omp_get_wtime(); sum(abc, SIZE_ARRAY, STRIDE ); double time = omp_get_wtime() - start_time; printf("%0.6fs", time); }  Conclusion According a class project in which we had to write C code to perform matrix multiplication, the improvements suggested should may improve the performance of your code by 5 or 10 times. Also, the idea of data locality can be transferred to other languages, such as Python. # Reading CSV files in C++ If you are an engineer used to coding in Python or Matlab who is transitioning to C++, you will soon find out that even the most innocent task will now require several lines of code. A previous post has already shown how to export data to a CSV file. In order to facilitate your transition to C++, see below for an example of how to read your new CSV file. Utils.cpp #include <string> #include <vector> #include <sstream> //istringstream #include <iostream> // cout #include <fstream> // ifstream using namespace std; /** * Reads csv file into table, exported as a vector of vector of doubles. * @param inputFileName input file name (full path). * @return data as vector of vector of doubles. */ vector<vector<double>> parse2DCsvFile(string inputFileName) { vector<vector<double> > data; ifstream inputFile(inputFileName); int l = 0; while (inputFile) { l++; string s; if (!getline(inputFile, s)) break; if (s[0] != '#') { istringstream ss(s); vector<double> record; while (ss) { string line; if (!getline(ss, line, ',')) break; try { record.push_back(stof(line)); } catch (const std::invalid_argument e) { cout << "NaN found in file " << inputFileName << " line " << l << endl; e.what(); } } data.push_back(record); } } if (!inputFile.eof()) { cerr << "Could not read file " << inputFileName << "\n"; __throw_invalid_argument("File not found."); } return data; } int main() { vector<vector<double>> data = parse2DCsvFile("test.csv"); for (auto l : data) { for (auto x : l) cout << x << " "; cout << endl; } return 0; }  # Dynamic memory allocation in C++ To have success creating C++ programs, it’s essential to have a good understanding of the workings and implementation of dynamic memory, which is the “art” of performing manual memory management. When developing code in C++, you may not always know how much memory you will need before your program runs. For instance, the size of an array may be unknown until you execute the program. ## Introduction to pointers In order to understand dynamic memory allocation, we must first talk about pointers. A pointer is essentially a variable that stores the address in memory of another variable. The pointer data type is a long hexadecimal number representing the memory address. This address can be accessed using an ampersand (&) exemplified in the following script: //Accessing a pointer: int age=43; cout << &age << endl; // The output would be something like: 0x7fff567ecb68  Since a pointer is also a variable, it requires declaration before being used . The declaration consists on giving it a name, hopefully following the good code style convention and declaring the data type that it “points to” using an asterisk (*) like so: //Declaring pointers of different data types: int *integerPointer; double *doublePointer; float *floatPointer; char *charPointer;  ## The pointer’s operators In summary, the two pointer operators are: address-of operator(&) which returns the memory address and contents-of operator (*) which returns the value of the variable located at that address. // Example of pointer operators: float variable=25.6; float *pointer; pointer= &variable; cout << variable << endl; //outputs 25.6, the variable’s value cout << pointer << endl; //outputs 0x7fff5a774b68, the variable’s location in memory cout << *pointer << endl; // outputs 25.6, value of the variable stored in that location  This last operator is also called deference operator which enables you to access directly the variable the pointer points to, which you can then use for regular operations: float width = 5.0; float length = 10.0; float area; float *pWidth = &width; float *pLength = &length; //Both of the following operations are equivalent area = *pWidth * *pLength; area = width * length; //the output for both would be 50.  Deferencing the pointers *pWidth and *pLength represents exactly the same as the variables width and length, respectively. ## Memory allocation in C++ Now that you have some basic understanding of pointers, we can talk about memory allocation in C++. Memory in C++ is divided in tow parts: the stack and the heap. All variables declared inside the function use memory from the stack whereas unused memory that can be used to allocate memory dynamically is stored in the heap. You may not know in advance how much memory you need to store information in a variable. You can allocate memory within the heap of a variable of a determined data type using the new operator like so: new float;  This operator allocates the memory necessary for storing a float on the heap and returns that address. This address can also be stored in a pointer, which can then be deferenced to access the variable: float *pointer = new float; //requesting memory *pointer = 12.0; //store value cout << *pointer << endl; //use value delete pointer;// free up memory // this is now a dangling pointer pointer= new float // reuse for new address  Here, the pointer is stored in the stack as a local variable, and holds the allocated address in the heap as its value. The value of 12.0 is then stored at the address in the heap. You can then use this value for other operations. Finally, the delete statement releases the memory when it’s no longer needed; however, it does not delete the pointer since it was stored in the stack. These type of pointers that point to non-existent memory are called dangling pointers and can be reused. ## Dynamic memory and arrays Perhaps the most common use of dynamic memory allocation are arrays. Here’s a brief example of the syntax: int *pointer= NULL; // initialized pointer pointer= new int[10] // request memory delete[]pointer; //delete array pointed to by pointer  The NULL pointer has a value of zero, you can declare a null pointer when you do not have the address to be assigned. Finally, to allocate and release memory for multi-dimensional arrays, you basically use an array of pointers to arrays, it sounds confusing but you can do this using the following sample method: int row = 3; int col = 4; double **p = new double* [row]; // Allocate memory for rows // Then allocate memory for columns for(int i = 0; i < col; i++) { p[i] = new double[col]; } //Release memory for(int i = 0; i < row; i++) { delete[] p[i]; } delete [] p;  I hope this quick overview provides a starting point on tackling your C++ memory allocation challenges. # Reed Group’s basic C++ code style conventions It is always good practice for programmers to adopt some sort of style convention when developing new code. This helps keeping the code readable for both authors and collaborators, as well as for people that read your code on online repositories. Here I will set a precedent for a minimal C++ code style for Reed’s group encompassing C++ features we normally use based on the most common practices out there, so that we can more easily help each other with out codes and keep consistency when publishing them. This post may be updated if somebody sets precedents for C++ features didn’t account for (e.g. namespaces). ### Naming conventions • Classes: Uppercase first letter. If the class name is comprised of more than one word, all words should be written together (no dashes, underscores, etc.) and the first letter of each word should be capitalized. E.g.: MyAwesomeClass. • Functions: Lowercase first letter. If the function name is comprised of more than one word, all words should be written together (no dashes, underscores, etc.) and the first letter of each word except the first should be capitalized. E.g.: myGreatFunction. If you are creating a getter or a setter, be sure to follow the this standard. E.g. the getter for variable “this_variable” would be “getThisVariable.” • Variables: All in small letters with words separated by “_” (underscore). E.g. this_variable. • Constants: All letters capitalized and words separated by underscores. E.g. MY_GREAT_CONSTANT. ### Other naming rules Besides naming conventions, there are other good practices when it comes to coming up with names in your code: • Do not assign one letter names, unless it is a temporary variable such as i, j, k used as indexes. • Assign informative names to your classes, functions, variables and constants. If you have a variable called “length,” another called “thisLength” and a third one called “realLength” your code will be really hard to follow. • Being concise is great (nobody reads code for its poetic variable names) but avoid shortening your names too much. Calling a variable “catchmentFallCreekIthaca” makes it much easier for someone else to know the information contained in that variable than calling it “catfacreith.” • We all get really frustrated with our codes at times, and want to curse it really bad. It’s fine to do it in your office when nobody is hearing, but be sure to not let that leak into your code and to keep some decency: e.g. avoid having in your code “this&%*%DoesNot&$%*#@Work = true” or anything of the sort. ### Other rules • Avoid magic numbers (hard coded numbers). Codes like the one below not only are hard to understand but also make the reader question if the results of the code are actually right: if (312 * evaporation + inflow / 52 - 7 * demand) { // Do something here }  Now imagine if the value 212 is the value of an area and is used in 83 different parts of your code: that’s a problem. Instead, declaring those numbers as constants would be preferred: const double DRYVILLE_RESERVOIR_AREA = 312.0; const double NUMBER_OF_WEEKS_IN_YEAR = 52.14; const double NUMBER_OF_DAYS_IN_WEEK = 7.0; // Lots of code here, since constants are normally declared at the top of the code. if (DRYVILLE_RESERVOIR_AREA * evaporation + inflow / NUMBER_OF_WEEKS_IN_YEAR - NUMBER_OF_DAYS_IN_WEEK * demand) { // Do something here }  • Keep your cpp files shorter than 500 lines. If you start approaching 500 lines, it may be the case that your class can be broken into parent and multiple children classes, or into two completely different classes. • Have only the main.cpp file in the root directory. All other files, if any, should be in directories so that the code is easy to navigate through. • If there is an issue or simplification to be fixed at some point in the future, use the “//FIXME:” comment to indicate it, as in the code below: //FIXME: replace constant area below by storage vs. area curve. if (DRYVILLE_RESERVOIR_AREA * evaporation + inflow / NUMBER_OF_WEEKS_IN_YEAR - NUMBER_OF_DAYS_IN_WEEK * demand) { // Do something here }  Note that different languages have different standards. If coding in Python or Matlab, for example, be sure to follow the best practices for these languages. Also, if developing code in collaboration with another research group, be sure to negotiate a convention. # Profiling C++ code with Callgrind Often times, we have to write code to perform tasks whose complexity vary from mundane, such as retrieving and organizing data, to highly complex, such as simulations CFD simulations comprising the spine of a project. In either case, depending on the complexity of the task and amount of data to be processed, it may happen for the newborn code to leave us staring at an underscore marker blinking gracefully for hours on a command prompt during its execution until the results are ready, leading to project schedule delays and shortages of patience. Two standard and preferred approaches to the problem of time intensive codes are to simplify the algorithm and to make the code more efficient. In order to better select the parts of the code to work on, it is often useful to first find the parts of the code in which more time by profiling the code. In this post, I will show how to use Callgrind, part of Valgrind, and KCachegrind to profile C/C++ codes on Linux — unfortunately, Valgrind is not available for Windows or Mac, although it can be ran on cluster from which results can be downloaded and visualized on Windows with QCachegrind. The first step is to install Valgrind and KCachegrind by typing the following commands in the terminal of a Debian based distribution, such as Ubuntu (equivalent yum commands area available for Red Hat based distributions): $ sudo apt-get install valgrind
$sudo apt-get install kcachegrind  Now that the required tools are installed, the next step is to compile your code with GCC/G++ (with a make file, cmake, IDE or by running the compiler directly from the terminal) and then run the following command in a terminal (type ctrl+shift+T to open the terminal): $ valgrind --tool=callgrind path/to/your/compiled/program program_arguments


Callgrind will then run your program with some instrumentation added to its execution to measure time expenditures and cache use by each function in your code. Because of the instrumentation, Your code will take considerably longer to run under Callgrind than it typically would, so be sure to run a representative task that is as small as possible when profiling your code. During its execution, Callgrind will output a report similar to the one below on terminal itself:

==12345== Callgrind, a call-graph generating cache profiler
==12345== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==12345== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==12345== Command: path/to/your/compiled/program program_arguments
==12345==
==12345== For interactive control, run 'callgrind_control -h'.
IF YOUR CODE OUTPUTS TO THE TERMINAL, THE OUTPUT WILL BE SHOWN HERE.
==12345==
==12345== Events    : Ir
==12345== Collected : 4171789731
==12345==
==12345== I   refs:      4,171,789,731


The report above shows that it collected 4 billion events in order to generate the comprehensive report saved in the file callgrind.out.12345 — 12345 is here your process id, shown in the report above. Instead of submerging your soul into a sea of despair by trying to read the output file in a text editor, you should load the file into KCachegrind by typing:

$kcachegrind calgrind.out.12345  You should now see a screen like the one below: The screenshot above shows the profiling results for my code. The left panel shows the functions called by my code sorted by total time spent inside each function. Because functions call each other, callgrind shows two cost metrics as proxies for time spent in each function: Incl., showing the total cost of a function, and self, showing the time spent in each function itself discounting the callees. By clicking on “Self” to order to functions by the cost of the function itself, we sort the functions by the costs of their own codes, as shown below: Callgrind includes functions that are native to C/C++ in its analysis. If one of them appears in the highest positions of the left panel, it may be the case to try to use a different function or data structure that performs a similar task in a more efficient way. Most of the time, however, our functions are the ones in most of the top positions in the list. In the example above, we can see that a possible first step I can take to improve the time performance of my code is to make function “ContinuityModelROF::shiftStorage” more efficient. A few weeks ago, however, the function “ContinuityModel::continuityStep” was ranked first with over 30% of the cost, followed by a C++ map related function. I then replaced a map inside that function by a pointer vector, resulting in the drop of my function’s cost to less than 5% of the total cost of the code. In case KCachegrind shows that a given function that is called from multiple places in the code is costly, you may want to know which function is the main culprit behind the costly calls. To do this, click on the function of interest (in this case, “_memcpy_sse2_unalight”) in the left panel, and then click on “Callers” in the right upper panel and on “Call Graph” in the lower right panel. This will show in list and graph forms the calls made to the function by other functions, and the asociated percent costs. Unfortunately, I have only the function “ContinuityModelROF::calculateROF” calling “_memcpy_sse2_unalight,” hence the simple graph, but the graph would be more complex if multiple functions made calls to “_memcpy_sse2_unalight.” I hope this saves you at least the time spend reading this post! # Root finding in MATLAB, R, Python and C++ In dynamical systems, we are often interested in finding stable points, or equilibria. Some systems have multiple equilibria. As an example, take the lake problem, which is modeled by the equation below where Xt is the lake P concentration, at are the anthropogenic P inputs, Yt~LN(μ,σ2) are random natural P inputs, b is the P loss rate, and q is a shape parameter controlling the rate of P recycling from the sediment. The first three terms on the right hand side make up the “Inputs” in the figure, while the last term represents the “Outputs.” A lake is in equilibrium when the inputs are equal to the outputs and the lake P concentration therefore is not changing over time. For irreversible lakes this occurs at three locations, even in the absence of anthropogenic and natural inputs: an oligotrophic equilibrium, an unstable equilibrium (called the critical P threshold) and a eutrophic equilibrium (see figure below). The unstable equilibrium in this case is called the critical P threshold because once it is crossed, it is impossible to return to an oligotrophic equilibrium by reducing anthropogenic and natural P inputs alone. In irreversible lakes like this, we would therefore like to keep the lake P concentration below the critical P threshold. How do we find the critical P threshold? With a root finding algorithm! As stated earlier, the system above will be in equilibrium when the inputs are equal to the outputs and the P concentration is not changing over time, i.e. when $X_{t+1} - X_t = \frac{X^q_t}{1+X^q_t} - bX_t = 0$ Therefore we simply need to find the zero, or “root” of the above equation. Most of the methods for this require either an initial estimate or upper and lower bounds on the location of the root. These are important, since an irreversible lake will have three roots. If we are only interested in the critical P threshold, we have to make sure that we provide an estimate which leads to the unstable equilibrium, not either of the stable equilibria. If possible, you should plot the function whose root you are finding to make sure you are giving a good initial estimate or bounds, and check afterward to ensure the root that was found is the one you want! Here are several examples of root-finding methods in different programming languages. In MATLAB, roots can be found with the function fzero(fun,x0) where ‘fun’ is the function whose root you want to find, and x0 is an initial estimate. This function uses Brent’s method, which combines several root-finding methods: bisection, secant, and inverse quadratic interpolation. Below is an example using the lake problem. myfun = @(x,b,q) x^q/(1+x^q)-b*x; b = 0.42; q = 2.0; fun = @(x) myfun(x,b,q); pcrit = fzero(fun,0.75);  This returns pcrit = 0.5445, which is correct. If we had provided an initial estimate of 0.25 instead of 0.75, we would get pcrit = 2.6617E-19, basically 0, which is the oligotrophic equilibrium in the absence of anthropogenic and natural P inputs. If we had used 1.5 as an initial estimate, we would get pcrit = 1.8364, the eutrophic equilibrium. In R, roots can be found with the function uniroot, which also uses Brent’s method. Dave uses this on line 10 of the function lake.eval in his OpenMORDM example. Instead of taking in an initial estimate of the root, this function takes in a lower and upper bound. This is safer, as you at least know that the root estimate will lie within these bounds. Providing an initial estimate that is close to the true value should do well, but is less predictable; the root finding algorithm may head in the opposite direction from what is desired. b <- 0.42 q <- 2.0 pcrit <- uniroot(function(x) x^q/(1+x^q) - b*x, c(0.01, 1.5))$root


This returns pcrit = 0.5445145. Good, we got the same answer as we did with MATLAB! If we had used bounds of c(0.75, 2.0) we would have gotten 1.836426, the eutrophic equilibrium.

What if we had given bounds that included both of these equilibria, say c(0.5, 2.0)? In that case, R returns an error: ‘f() values at end points not of opposite sign’. That is, if the value returned by f(x) is greater than 0 for the lower bound, it must be less than 0 for the upper bound and vice versa. In this case both f(0.5) and f(2.0) are greater than 0, so the algorithm fails. What if we gave bounds for which one is greater than 0 and another less, but within which there are multiple roots, say c(-0.5,2.0)? Then R just reports the first one it finds, in this case pcrit = 0.836437, the eutrophic equilibrium. So it’s important to make sure you pick narrow enough bounds that include the root you want, but not roots you don’t!

In Python, you can use either scipy.optimize.root or scipy.optimize.brentq, which is what Jon uses on line 14 here. scipy.optimize.root can be used with several different algorithms, but the default is Powell’s hybrid method, also called Powell’s dogleg method. This function only requires an initial estimate of the root.

from scipy.optimize import root
b = 0.42
q = 2.0
pcrit = root(lambda x: x**(1+x**q) - b*x, 0.75)


scipy.optimize.root returns an object with several attributes. The attribute of interest to us is the root, represented by x, so we want pcrit.x. In this case, we get the correct value of 0.54454. You can play around with initial estimates to see how pcrit.x changes.

Not surprisingly, scipy.optimize.brentq uses Brent’s method and requires bounds as an input.

from scipy.optimize import brentq as root
b = 0.42
q = 2.0
pcrit = root(lambda x: x**(1+x**q) - b*x, 0.01, 1.5)


This just returns the root itself, pcrit = 0.5445. Again, you can play around with the bounds to see how this estimate changes.

In C++, Dave again shows how this can be done in the function ‘main-lake.cpp’ provided in the Supplementary Material to OpenMORDM linked from this page under the “Publications” section. On lines 165-168 he uses the bisect tool to find the root of the function given on lines 112-114. I’ve copied the relevant sections of his code into the function ‘find_Pcrit.cpp’ below.


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>
#include <boost/math/tools/roots.hpp>

namespace tools = boost::math::tools;
using namespace std;

double b, q, pcrit;

double root_function(double x) {
return pow(x,q)/(1+pow(x,q)) - b*x;
}

bool root_termination(double min, double max) {
return abs(max - min) <= 0.000001;
}

int main(int argc, char* argv[])
{
b = 0.42;
q = 2.0;

std::pair<double, double> result = tools::bisect(root_function, 0.01, 1.0, root_termination);
pcrit = (result.first + result.second)/2;
cout << pcrit << endl;
}



This yields the desired root of pcrit = 0.54454, but of course, changing the bounds may result in different estimates. In case you missed it, the take home message is to be careful about your initial estimate and bounds ;).