Parallel Programming Part 2: Demonstrating OpenMP constructs with a simple test case

In Parallel Programming Part 1, I introduced a number of basic OpenMP directives and constructs, as well as provided a quick guide to high-performance computing (HPC) terminology. In Part 2 (aka this post), we will apply some of the constructs previously learned to execute a few OpenMP examples. We will also view an example of incorrect use of OpenMP constructs and its implications.

TheCube Cluster Setup

Before we begin, there are a couple of steps to ensure that we will be able to run C++ code on the Cube. First, check the default version of the C++ compiler. This compiler essentially transforms your code into an executable file that can be easily understood and implemented by the Cube. You can do this by entering

g++ --version

into your command line. You should see that the Cube has G++ version 8.3.0 installed.

Now, create two files:

  • openmp-demo-print.cpp: this is the file where we will perform a print demo to show how tasks are distributed between threads
  • openmp-demo-access.cpp: this file will demonstrate the importance of synchronization and the implications of not implementing it properly

Example 1: Hello from Threads

Open the openmp-demo-print.cpp file, and copy and paste the following code into it:

#include <iostream>
#include <omp.h>
#include <stdio.h>

int main() {

    int num_threads = omp_get_max_threads(); // get max number of threads available on the Cube
    omp_set_num_threads(num_threads);

    printf("The CUBE has %d threads.\n", num_threads);


    #pragma omp parallel 
    {
        int thread_id = omp_get_thread_num();
        printf("Hello from thread num %d\n", thread_id);
    }

    printf("All threads have finished printing their IDs.\n");

    return 0;
}

Here is what the code is doing:

  • Lines 1 to 3 are the libraries that need to be included in this script to enable the use of OpenMP constructs and printing to the command line.
  • In main() on Lines 7 and 8, we first get the maximum number of threads that the Cube has available on one node and assign it to the num_threads variable. We then set the number of threads we want to use.
  • Next, we declare a parallel section using #pragma omp parallel beginning on Line 13. In this section, each of the num_threads threads we requested on Line 8 executes the printf() function independently of the others. Note that the threads are under no obligation to execute their tasks in ascending order.
  • Finally, we end the call to main() by returning 0.

Be sure to first save this file. Once you have done this, enter the following into your command line:

g++ openmp-demo-print.cpp -o openmp-demo-print -fopenmp

This tells the system to compile openmp-demo-print.cpp and create an executable object using the same file name. Next, in the command line, enter:

./openmp-demo-print

You should see the following (or some version of it) appear in your command line:
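Assuming, for illustration, a node with 8 threads (the thread count and the ordering on your machine will differ):

The CUBE has 8 threads.
Hello from thread num 3
Hello from thread num 0
Hello from thread num 5
Hello from thread num 1
Hello from thread num 7
Hello from thread num 2
Hello from thread num 6
Hello from thread num 4
All threads have finished printing their IDs.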

Notice that the thread numbers are printed out of order. This demonstrates that each thread takes a different amount of time to complete its task. If you re-run the executable, you will find that the order in which the threads print has changed, showing that there is a degree of uncertainty in the amount of time each one requires. This has implications for the order in which threads read, write, and access data, which will be demonstrated in the next example.

Example 2: Who’s (seg)Fault Is It?

In this example, we will execute a script in which each thread appends its thread ID to a shared vector. See the implementation below:

#include <iostream>
#include <omp.h>
#include <vector>

using namespace std;

void parallel_access_demo() {
    vector<int> data;
    int num_threads = omp_get_max_threads();

    // Parallel region for appending elements to the vector
    #pragma omp parallel for
    for (int i = 0; i < num_threads; ++i) {
        // Append the thread ID to the vector
        data.push_back(omp_get_thread_num()); // Potential race condition here
    }
}

int main() {
    // Execute parallel access demo
    parallel_access_demo();
    printf("Code ran successfully. \n");

    return 0;
}

As per the usual, here’s what’s going down:

  • After including the necessary libraries, we add using namespace std on Line 5 so that we can refer to standard library names (such as vector) without having to prepend std:: every time
  • Starting on Line 7, we write a simple parallel_access_demo() function that does the following:
    • Initialize a vector of integers very creatively called data
    • Get the maximum number of threads available per node on the Cube
    • Declare a parallel for region where we iterate through all threads and push_back the thread ID number into the data vector
  • Finally, we execute this function in main() beginning on Line 19.

If you execute this using a similar process shown in Example 1, you should see the following output in your command line:
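Because the behavior is undefined, the exact failure can vary from run to run, but a typical attempt ends with something like:

Segmentation fault (core dumped)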

Congratulations – you just encountered your first segmentation fault! Not to worry – this is by design. Take a few minutes to review the code above to identify why this might be the case.

Notice Line 15, where each thread attempts to insert a new integer into the data vector. Since push_back() is called inside a parallel region, multiple threads attempt to modify data simultaneously. This results in a race condition, in which the size of data changes unpredictably. There is a pretty simple fix: add a #pragma omp critical directly before Line 15.

As mentioned in Part 1, #pragma omp critical ensures that each thread executes its task one at a time, and not all at once. It essentially serializes a portion of the parallel for loop and prevents race conditions by ensuring that the size of data changes in a predictable and controlled manner. An alternative would be to declare #pragma omp single in place of #pragma omp parallel for on Line 12. This delegates the following for-loop to a single thread, forcing the loop to be completed in serial. Both of these options result in the blissful outcome of an error-free run that prints Code ran successfully.
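For reference, here is a minimal sketch of the critical-section fix; only the loop in parallel_access_demo() changes, and the rest of openmp-demo-access.cpp stays the same:

    // Parallel region for appending elements to the vector
    #pragma omp parallel for
    for (int i = 0; i < num_threads; ++i) {
        #pragma omp critical
        {
            // Only one thread at a time may modify the shared vector,
            // so its size changes in a controlled manner.
            data.push_back(omp_get_thread_num());
        }
    }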

Summary

In this post, I introduced two examples to demonstrate the use of OpenMP constructs. In Example 1, I demonstrated a parallel printing task to show that threads do not all complete their tasks at the same time. Example 2 showed how the lack of synchronization can lead to segmentation faults due to race conditions, and two methods to remedy this. In Part 3, I will introduce GPU programming using the Python PyTorch library and Google Colaboratory cloud infrastructure. Until then, happy coding!

Parallel Programming Part 1: A brief introduction to OpenMP directives and constructs

In my recent Applied High Performance Computing (HPC) lectures, I have been learning to better utilize OpenMP constructs and functions. If these terms mean little to nothing to you, welcome to the boat – it’ll be a collaborative learning experience. For full disclosure, this blog post assumes some basic knowledge of parallel programming and its related terminology. But just in case, here is a quick glossary:

  • Node: One computing unit within a supercomputing cluster. Can be thought of as one CPU unit on a traditional desktop/laptop.
  • Core: The physical processing unit within the CPU that enables computations to be executed. For example, an 11th-Gen Intel i5 CPU has four cores.
  • Processors/threads: Simple, lightweight “tasks” that can be executed within a core. Using the same example as above, this Intel i5 CPU has eight threads, indicating that each of its cores can run two threads simultaneously.
  • Shared memory architecture: This means that all cores share the same memory within a CPU, and are controlled by the same operating system. This architecture is commonly found in your run-of-the-mill laptops and desktops.
  • Distributed memory architecture: Multiple cores run their processes independently of each other. They communicate and exchange information via a network. This architecture is more commonly found in supercomputer clusters.
  • Serial execution: Tasks and computations are completed one at a time. Typically the safest and most error-resistant form of executing tasks, but also the least efficient.
  • Parallel execution: Tasks and computations are completed simultaneously. Faster and more efficient than serial execution, but runs the risk of errors where tasks are completed out of order.
  • Synchronization: The coordination of multiple tasks or computations to ensure that all processes align with each other and abide by a certain order of execution. Synchronization itself takes time away from computation, so an overly-cautious code execution with too many synchronization points can result in unnecessary slowdown. However, avoiding it altogether can result in erroneous output due to out-of-order computations.
  • Race condition: A race condition occurs when two threads attempt to access the same location in shared memory at the same time and at least one of them is writing. This is a common problem when attempting parallelized tasks on shared memory architectures.
  • Load balance: The distribution of work (tasks, computations, etc) across multiple threads. OpenMP implicitly assumes equal load balance across all its threads within a parallel region – an assumption that may not always hold true.

This is intended to be a two-part post where I will first elaborate on the OpenMP API and how to use basic OpenMP clauses to facilitate parallelization and code speedup. In the second post, I will provide a simple example where I will use OpenMP to delineate my code into serial and parallel regions, force serialization, and synchronize memory accesses and writes to prevent errors resulting from writing to the same memory location simultaneously.

What is OpenMP?

OpenMP stands for “Open Specification for Multi-Processing”. It is an application programming interface (API) that enables C/C++ or Fortran code to “talk” (or interface) with multiple threads within a computing node on shared memory architectures, therefore allowing parallelism. It should not be confused with MPI (Message Passing Interface), which is used to develop parallel programs on distributed memory systems (for a detailed comparison, please refer to this set of Princeton University slides here). Unlike a standalone coding language like C/C++, Python, R, etc., OpenMP relies on a fixed set of simple implementations using lightweight syntax (#pragma omp [clause 1] [clause 2]). To begin using OpenMP, you should first understand what it can and cannot do.

Things you can do with OpenMP

Explicit delineation of serial and parallel regions

One of the first tasks that OpenMP excels at is the explicit delineation of your code into serial and parallel regions. For example:

#include <stdio.h>
#include <omp.h>
void some_function() {
    #pragma omp parallel
    {
        do_a_parallel_task();
    }
    do_a_serial_task();
}

In the code block above, we are delineating an entire parallel region where the task performed by the do_a_parallel_task() function is performed independently by each thread on a single node. The number of tasks that can be completed at any one time depends on the number of threads that are available on a single node.

Hide stack management

In addition, using OpenMP allows you to automate memory allocation, or “hide stack management”. Most machines are able to read, write, and (de)allocate data to optimize performance, but often some help from the user is required to avoid errors caused by multiple memory accesses or writes. OpenMP allows the user to automate stack management without explicitly specifying memory reads or writes in certain cache locations via straightforward clauses such as collapse() (we will delve into the details of such clauses in later sections of this blog post).

Allow synchronization

OpenMP also provides synchronization constructs to enforce serial operations or wait times within parallel regions, avoiding conflicting or simultaneous read or write operations that might result in erroneous memory accesses or segmentation faults. For example, a sequential task (such as a memory write to update the value in a specific location) that is not carefully performed in parallel could result in output errors or unexpected memory-accessing behavior. Synchronization ensures that multiple threads wait until the last operation on the last thread is completed, or that only one thread is allowed to perform an operation at a time. It also protects access to shared data to prevent conflicting simultaneous accesses (and related problems such as false sharing). All this can be achieved via the critical, barrier, or single OpenMP clauses.

Creating a specific number of threads

Unless explicitly stated, the system will automatically select the number of threads (or processors) to use for a specific computing task. Note that while you can request a certain number of threads in your environment using export OMP_NUM_THREADS, the number of threads assigned to a specific task is set by the system unless otherwise specified. OpenMP allows you to control the number of threads you want to use for a specific task in one of two ways: (a) using omp_set_num_threads() and (b) using the num_threads() clause on a parallel directive (#pragma omp parallel num_threads(n)). The former sets the number of threads to be used by all subsequent parallel regions in the code, while the latter requests a specific number of threads for a single parallel region.
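As a quick illustration (the thread counts here are arbitrary), the two approaches look like this:

#include <omp.h>

void thread_count_demo() {
    omp_set_num_threads(8); // all subsequent parallel regions will use up to 8 threads

    #pragma omp parallel num_threads(4) // this particular region requests exactly 4 threads
    {
        // work shared among the 4 threads in this team
    }
}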

In the next section, we will list some commonly-used and helpful OpenMP constructs and provide examples of their use cases.

OpenMP constructs and their use cases

OMP_NUM_THREADS
This is an environment variable that sets the default number of threads for your parallel regions. It can be set either in your terminal session (e.g., export OMP_NUM_THREADS=8) or in the SLURM submission script you are using to execute your parallel code.

omp_get_num_threads()
This function returns the number of threads currently executing the parallel region from which it is called. Use this to gauge how many active threads are being used within a parallel region.

omp_set_num_threads()
This function specifies the number of threads to use for any following parallel regions. It overrides the OMP_NUM_THREADS environment variable. Be careful not to request more threads than are available across your cores; oversubscribing can degrade performance or cause errors.

omp_get_max_threads()
To avoid potential errors from allocating too many threads, you can use omp_set_num_threads() with omp_get_max_threads(). This returns the maximum number of threads that can be used to form a new “team” of threads in a parallel region.
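A common, safe pattern is to pair the two calls, so that you never request more threads than the node can provide:

int num_threads = omp_get_max_threads(); // maximum number of threads available on this node
omp_set_num_threads(num_threads);        // use at most that many threads in subsequent parallel regions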

#pragma omp parallel
This directive instructs your compiler to create a parallel region within which it instantiates a set (team) of threads to complete the computations within it. Without any further specifications, the parallelization of the code within this block will be automatically determined by the compiler.

#pragma omp for
Use this directive to instruct the compiler to parallelize for-loop iterations within the team of threads that it has instantiated. This directive is often used in combination with #pragma omp parallel to form the #pragma omp parallel for construct which both creates a parallel region and distributes tasks across all threads.

To demonstrate:

#include <stdio.h>
#include <omp.h>
void some_function() {
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < some_num; i++) {
            perform_task_here();
        }
    }
}

is equivalent to

#include <stdio.h>
#include <omp.h>
void some_function() {
    #pragma omp parallel for
    for (int i = 0; i < some_num; i++) {
        perform_task_here();
    }
}

#pragma omp barrier
This directive marks the start of a synchronization point. A synchronization point can be thought of as a “gate” at which all threads must wait until every thread has arrived. The code in the parallel region beyond omp barrier will not be executed until all threads before it have completed their explicit tasks.
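A minimal sketch of a barrier in use (the printf statements simply stand in for two stages of real work):

#include <omp.h>
#include <stdio.h>

void barrier_demo() {
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        printf("Thread %d finished stage one\n", thread_id);
        #pragma omp barrier // no thread starts stage two until every thread has finished stage one
        printf("Thread %d starting stage two\n", thread_id);
    }
}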

#pragma omp single
This is the first of three synchronization directives we will explore in this blog post. This directive identifies a section of code, typically contained within “{ }”, that is run by only one thread. It enforces serial execution, which can improve accuracy but decrease speed. The omp single directive imposes an implicit barrier: the remaining threads will not proceed past the single block until the one thread running it is done. This barrier can be removed by forming the #pragma omp single nowait construct.

#pragma omp master
This directive is similar to the #pragma omp single directive, but chooses only the master thread to run the task (the thread responsible for creating, managing, and discarding all other threads). Unlike #pragma omp single, it does not have an implicit barrier, resulting in faster implementations. Both #pragma omp single and #pragma omp master result in a section of code being run only once, and are useful for controlling sections of code that print statements or indicate signals.

#pragma omp critical
A section of code immediately following this directive can only be executed by one thread at a time. It should not be confused with #pragma omp single. The former requires that the code be executed by one thread at a time, and it will be executed once by every thread in the team (i.e., as many times as there are threads). The latter ensures that its corresponding section of code is executed only once. Use this directive to prevent race conditions and enforce serial execution where one-at-a-time computations are required.
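To make the difference concrete, here is a small sketch: the single block is executed once in total, while the critical block is executed by every thread, one thread at a time:

#include <omp.h>
#include <stdio.h>

void sync_demo() {
    int counter = 0; // shared by all threads

    #pragma omp parallel
    {
        #pragma omp single
        printf("Printed exactly once, by whichever thread reaches this first.\n");

        #pragma omp critical
        {
            counter++; // every thread increments, but only one at a time
        }
    }
    printf("counter equals the number of threads: %d\n", counter);
}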

Things you cannot (and shouldn’t) do with OpenMP

As all good things come to an end, all HPC APIs must also have limitations. In this case, here is a quick rundown in the case of OpenMP:

  • OpenMP is limited to shared memory machines and cannot be run on distributed memory architectures.
  • It also implicitly assumes equal load balance. Incorrectly making this assumption can lead to synchronization delays, as some threads may take significantly longer than others to reach a synchronization directive.
  • OpenMP may have lower parallel efficiency because it is confined to a single shared-memory node and is ultimately limited by Amdahl’s law.
  • It relies heavily on parallelizable for-loops instead of atomic operations.

    Summary

    In this post, we introduced OpenMP and a few of its key constructs. We also walked through brief examples of their use cases, as well as limitations of the use of OpenMP. In our next blog post, we will provide a tutorial on how to write, compile, and run a parallel for-loop on Cornell’s Cube Cluster. Until then, feel free to take a tour through the Cornell Virtual Workshop course on OpenMP to get a head start!

    References

    IBM documentation. (n.d.). https://www.ibm.com/docs/en/xl-c-aix/13.1.3

    Jones, M. D. (2009). Practical issues in OpenMP. University at Buffalo (UB) Department of Computer Science and Engineering. https://cse.buffalo.edu/~vipin/nsf/docs/Tutorials/OpenMP/omp-II-handout.pdf

    Pros and cons of OpenMP. (n.d.). https://www.dartmouth.edu/~rc/classes/intro_openmp/Pros_and_Cons.html

    What is load balancing? – load balancing algorithm explained – AWS. (n.d.). https://aws.amazon.com/what-is/load-balancing/

    Ziskovin, G. (2022, October 29). #pragma OMP single [explained with example]. OpenGenus IQ: Computing Expertise & Legacy. https://iq.opengenus.org/pragma-omp-single/

    Legacy Code Reborn: Wrapping C++ into MATLAB Simulink

    Did you ever need to include legacy code in your own projects without rewriting it in your new framework? In Matlab Simulink, there is a way to include C++ code within your system. This post will teach you how to fit legacy C++ code into your Simulink projects.

    Let’s assume you have been given C++ code in which neural networks are implemented and already trained (as an example, but this guide extends to any generic class/function). Let’s also assume that your goal is to include them within a larger control system. Recreating them from scratch and trying to translate the original code into Matlab could be an unwise choice, as implementation errors could be around the corner. The best solution is to include the neural networks directly as they are, wrapping them in a nice box. How to proceed?

    Requirements
    To fully appreciate the following post, you should have a basic understanding of object-oriented programming, C++, and Matlab Simulink.

    Disclaimer
    The approach I will show you is not the quickest solution you could think of. Indeed, there are easier ways to wrap your C++ code if it fulfills certain limitations declared in the C Function Block documentation. Still, this solution is the most complete one you can think of, and it always works.
    Below I recall the Matlab Simulink limitations from the official documentation.
    The C Function block cannot be used to directly access all C++ classes. These types of classes are not supported:

    • Template classes
    •  Standard Template Library (STL) containers
    •  Classes with private constructors and destructors

    If your code falls into one of the above cases, you should be interested in reading this post; otherwise, you can follow the more straightforward way.

    1. Installing the C++ compiler
    To execute C++ code, you need a C++ compiler. If you already have one, skip this section. Otherwise, you have two possibilities; in both cases, you will end up with a functioning C++ compiler on your device:

    1. Installing the standard open-source C++ compiler.
      In this case, I recommend MinGW (even if it is pretty old). You can install it simply by following the software installation guidelines.
    2. Installing MinGW as a Matlab add-on.
      Another (even easier) possibility is installing MinGW directly from Matlab as a standard Matlab toolbox.

    To let Matlab know that you will need to compile C++ code, you need to run (each time you execute the code) the following command in the Matlab terminal:

    mex -setup C++
    

    If the compiler has been correctly installed (in my case through Microsoft Visual Studio 2022), this should print:

    MEX configured to use 'Microsoft Visual C++ 2022' for C++ language compilation.
    

    or, if you installed it through the Matlab add-on:

    MEX configured to use 'MinGW64 Compiler (C++)' for C++ language compilation. 
    

    2. The C Function Block

    Here is where things start to get more challenging.
    The starting point is understanding which Simulink block we are going to use to wrap our C++ code. In this case, it is the C Function block.

    The block has the following structure:

    • Output code: here is where you will generate your output. The following is a candidate example.
      output = predict(myNN,input1,input2);
      Here we invoke the predict method, passing two inputs as arguments and returning the output (hopefully, our prediction). The special guest is myNN. This is the key to the solution and represents an instance of the class of the object we are going to wrap.
    • Start code: here, you instantiate the myNN object. It is the equivalent of the constructor in the C++ class.
      myNN = createNN();
    • Terminate code: here is where you destroy your object (equivalent to the destructor).
      deleteNN(myNN);
    • Symbols: here, you declare your inputs, output, and the myNN object. You can specify names (input1, input2, myNN) and data types (int32, double, …). Remark: it is fundamental that the type of the myNN object is VoidPointer. This is because we are building a wrapper.

    To conclude the inspection of the C Function block, we must look at the settings. Here we must specify all the source files (.cpp) and header files (.h) inside the corresponding sections. The files must be in the same folder as the Simulink file, unless you specify the directory in the corresponding field.

    An important remark: the settings are common to all the C Function blocks you will have inside the project. Thus, you can’t structure your application assuming that each C Function block can have separate and independent source files. For this reason, even if you need to instantiate multiple classes of different natures inside your scheme, in the end you will need one unique wrapper (with corresponding .cpp and .h files).

    3. Writing your C++ Wrapper

    After exploring the C Function block, we are ready to start writing our wrapper. Let’s assume you have legacy code that implements your neural network in source/header files. What you must do is: 

    1. Write your NN class, whose methods will be invoked by the wrapper. 
    2. Write the wrapper.

    3a. Writing the class

    We will call the files uniqueNN.cpp/uniqueNN.h.

    #include "uniqueNN.h"
    #include "ann.h"
    // include here all additional dependences
    
    myNN::myNN() {
    }
    
    myNN::~myNN() {
        delete ANN;
    }
    
    double myNN::predict(double input1, double input2) {
    // copy-paste here your legacy code
    }
    

    In this class, you can see constructor, destructor, and predict method. Inside that method, you would copy-paste your legacy code. In there, you can do whatever you want. Remark: you don’t need to fulfill the limitations of the C Function Block because this class will be invoked by a “protective layer” represented by our wrapper!

    Suppose your system has multiple components, each represented by a different network with a different structure. You can still use the very same class and implement one corresponding method for each of your original networks to produce that network’s prediction.

    Clearly, you must specify all the files needed to execute your legacy code in the source and header file settings.
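    For completeness, here is a minimal sketch of what uniqueNN.h could look like in this example (the commented-out member pointer and the exact signatures are assumptions; adapt them to your legacy code):

    #ifndef UNIQUENN_H
    #define UNIQUENN_H
    
    // Class wrapping the legacy neural network.
    class myNN {
    public:
        myNN();
        ~myNN();
        double predict(double input1, double input2);
    private:
        // Declare here the pointer to your legacy network object,
        // e.g. the ANN deleted in the destructor above.
        // ANN* ANN;
    };
    
    // Functions exposed to the Simulink C Function block.
    void* createNN();
    void deleteNN(void* obj);
    double predict(void* obj, double input1, double input2);
    
    #endif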

    3b. Implementing the wrapper

    #include "uniqueNN.h"
    // include here all additional standard libraries 
    
    // Neural Networks
    
    void* createNN()
    {
        return new myNN();
    }
    
    void deleteNN(void* obj)
    {
        delete static_cast<myNN*>(obj);
    }
    
    double predict(void* obj, double input1, double input2)
    {
        myNN* nn = static_cast<myNN*>(obj);
        return nn->predict(input1, input2);
    }
    
    

    Here we have three methods. These are the ones we have seen in the C Function Block! They will be invoked by Simulink.

    As you can see, we find again the constructor, the destructor, and the predict method. For the sake of conciseness, I suggest using the same name for the method implemented in the class and the method implemented in the wrapper, but it is not mandatory. Notice that the wrapper’s predict method must take the void pointer as an argument (the instance of your class, i.e., the one you declared in the C Function Block!).

    Remark: as you might have imagined, for each method implemented in the class (which in our example represents a different legacy neural network), you should create one corresponding method in the wrapper.

    A final remark: the wrapper needs only the header of your class (for us, uniqueNN.h), not all the headers from the legacy code, so do not include them again; it would be redundant.

    Below you can see the quality of the reconstruction of the legacy code through the C++ wrapper. The small differences that you might experience are due to internal data type differences between the legacy code simulator and the new one, and are absolutely negligible.

    The figure above shows that the predictions from the neural network are nearly identical for both the wrapper-based simulink and original C++ implementations.

    That’s it! Thank you for reading and if you have further questions, please feel free to contact me, here is my LinkedIn profile!

    Parallelization of C/C++ and Python on Clusters

    There are multiple ways of parallelizing C/C++ and Python codes to be used on clusters. The quick and dirty way, which often gets the job done, is to parallelize the code with MPI and launch one process per core across multiple nodes. Though easy to implement, this strategy may result in a fair waste of resources because:

    • With too many processes on one node, inefficient use of memory may make the code memory-bottlenecked, meaning that a lot of time will be spent accessing memory while the cores sit twiddling their thumbs.
    • If you have a 48-core node with 192 GB of memory (like the Skylake nodes on Stampede 2) and your code uses 10 GB of RAM per process, you will not be able to have more than 19 out of 48 cores working at the same time, because more than that would exceed the node’s memory, crash the node, and possibly get you kicked out of the cluster for a while without a warning — if you do not know how much RAM your code requires, you should look into the Remora tool.

    If you are working on big runs on systems that charge per usage time or if you are creating a code you will have to run multiple times, it may be worth looking into more clever parallelization strategies. Some of the techniques described in the next subsections will also make your code two to four times faster on your desktop/laptop and are rather straight-forward to implement. Most of the code in this blog post can be freely downloaded from this repository. Lastly, if the #include statements in the examples below are not displayed as followed by a library between “<>”, please refer to the mentioned online repository.

    Hybrid Parallelization in C/C++

    The word hybrid refers in this context to a mixed use of shared and distributed memory parallelization techniques. In distributed parallelization techniques, such as the Message Passing Interface (MPI), each process has its own dedicated chunk of memory within a RAM chip. This means that each process will have its own copy of the input data (repeated across processes), variables within the program, etc. Conversely, shared memory parallelization techniques such as OpenMP allow all threads (the parallel units of computation) to access the same block of memory, which can make a mess in your code, with variables being changed seemingly out of nowhere by various parallel runs, but which lessens the amount of RAM needed and in some cases allows for more efficient memory utilization.

    If running the Borg Master-Slave multiobjective evolutionary optimization algorithm (Borg MS) with hybrid parallelization, Borg MS will have each function evaluation performed by one MPI process, which could run in parallel with OpenMP (if your model is set up for that) across the cores designated for that process. As soon as a function evaluation is complete, the slave MPI process in charge of it will submit the results back to Borg MS’s master process and be given another set of decision variables to work on. The figure below is a schematic drawing of such a hybrid parallelization scheme:

    [Figure: BorgHybridParallelization.png — schematic of the hybrid (MPI + OpenMP) Borg MS parallelization scheme]

    OpenMP

    Parallelizing a for-loop in C or C++ in which each iteration is independent — such as in Monte Carlo simulations — is straight-forward:

    #include <omp.h>
    #include <vector>
    #include "path/to/my_model.h"
    
    using std::vector;
    
    int main(int argc, char *argv[])
    {
        // omp_get_max_threads() gives the number of threads available,
        // so that we run one model evaluation per core/thread.
        int n_monte_carlo_simulations = omp_get_max_threads();
        vector<double> system_inputs(n_monte_carlo_simulations); // fill with your model inputs (left unspecified in the original)
        vector<double> results(n_monte_carlo_simulations, 0);
    #pragma omp parallel for
        for (int i = 0; i < n_monte_carlo_simulations; ++i)
        {
            results[i] = RunMonteCarloSimulation(system_inputs[i]);
        }
    }
    

    The only caveat is that you have to be sure that the independent simulations are not writing over each other. For example, if by mistake you had written results[0] instead of results[i], the final value in results[0] would be that of whichever thread (parallel simulation) finished last, while the rest of the vector would be zeros. It is important to notice here that the vector system_inputs is read only once by the parallel code and all n_monte_carlo_simulations threads have access to it in memory.

    Though this example is not rocket science, there is a lot more to OpenMP that can be found here and here.

    MPI

    MPI stands for Message Passing Interface. It creates processes which pass messages (information packages) to each other. The most basic MPI code is one that creates independent processes — each with its own allocated chunk of memory that cannot be accessed by any other process — and just has them running independently in parallel without communicating with each other. This is great for distributing Monte Carlo runs across various nodes, although it may lead to the limitations mentioned at the beginning of this post. Below is an example of a very basic MPI code to run Monte Carlo simulations in parallel:

    #include <mpi.h>
    #include <vector>
    #include "path/to/my_model.h"
    
    using std::vector;
    
    int main(int argc, char *argv[])
    {
        MPI_Init(NULL, NULL);
    
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    
        vector<double> system_inputs; // fill with one input per MPI process (left unspecified in the original)
        RunMonteCarloSimulation(system_inputs[world_rank]);
    
        MPI_Finalize();
        return 0;
    }
    

    I will not cover in this post how to pass the output of the RunMonteCarloSimulation function back to a results vector, as in the previous example, because the details of MPI are outside the scope of this post — a good MPI tutorial can be found here and great examples, including ones for Monte Carlo runs, can be found here. To run this code, you would have to compile it with mpicc mpi_demo.c -o MPIDemo and call it with the mpirun or mpiexec commands, which should be followed by the -n flag to specify the number of processes to be created, e.g. mpirun -n 4 ./MPIDemo. The code above would run as many Monte Carlo simulations as there are processes (four processes if -n 4 was used), each with a rank (here, a type of ID), and can be spread across as many nodes as you can get on a cluster. Each process of the code above would run on a core, which may result in the memory and/or processing limitations listed at the beginning of this post.

    Mixing OpenMP and MPI

    The ideal situation for most cases is when you use MPI to distribute processes across nodes and OpenMP to make sure that as many cores within a node as it makes sense to use are being used. To do this, you can use the structure of the code below.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <vector>
    #include "path/to/my_model.cpp"
    
    using std::vector;
    
    int main(int argc, char *argv[])
    {
        MPI_Init(NULL, NULL);
    
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    
        // omp_get_max_threads() gives the number of threads available to this MPI process
        // (omp_get_num_threads() would return 1 outside a parallel region).
        int n_cores = omp_get_max_threads();
        vector<double> system_inputs(n_cores); // fill with your model inputs (left unspecified in the original)
        vector<double> results(n_cores, 0);
    #pragma omp parallel for
        for (int i = 0; i < n_cores ; ++i)
        {
            // printf("C hybrid. Process-thread: %d-%d\n", world_rank, omp_get_thread_num());
            results[i] = RunMonteCarloSimulation(system_inputs[i]);
        }
    
        MPI_Finalize();
        return 0;
    
    }
    

    If you comment out the #include "path/to/my_model.cpp" line and the results[i] = RunMonteCarloSimulation(system_inputs[i]); line, uncomment the printf line inside the loop, compile the code with mpicc hybrid.c -o Hybrid -fopenmp, set the number of threads in the terminal to eight by running export OMP_NUM_THREADS=8, and call the resulting executable with the command mpirun -n 4 ./Hybrid, the executable will spawn four processes, each of which spawns eight threads and prints the following output:

    C hybrid. Process-thread: 0-0
    C hybrid. Process-thread: 0-1
    C hybrid. Process-thread: 0-4
    C hybrid. Process-thread: 0-5
    C hybrid. Process-thread: 0-7
    C hybrid. Process-thread: 0-6
    C hybrid. Process-thread: 0-2
    C hybrid. Process-thread: 0-3
    C hybrid. Process-thread: 1-6
    C hybrid. Process-thread: 1-0
    C hybrid. Process-thread: 1-1
    C hybrid. Process-thread: 1-3
    C hybrid. Process-thread: 1-5
    C hybrid. Process-thread: 1-4
    C hybrid. Process-thread: 1-7
    C hybrid. Process-thread: 1-2
    C hybrid. Process-thread: 2-6
    C hybrid. Process-thread: 2-0
    C hybrid. Process-thread: 2-3
    C hybrid. Process-thread: 2-7
    C hybrid. Process-thread: 2-2
    C hybrid. Process-thread: 2-4
    C hybrid. Process-thread: 2-5
    C hybrid. Process-thread: 2-1
    C hybrid. Process-thread: 3-0
    C hybrid. Process-thread: 3-1
    C hybrid. Process-thread: 3-3
    C hybrid. Process-thread: 3-4
    C hybrid. Process-thread: 3-6
    C hybrid. Process-thread: 3-7
    C hybrid. Process-thread: 3-2
    C hybrid. Process-thread: 3-5
    

    The code above can be used to have one process per node (and therefore one copy of system_inputs per node) using as many threads as there are cores in that node — depending on the number of cores, it is sometimes more efficient to have two or three MPI processes per node, each using half or a third of the cores in that node. Below is a PBS job submission script that can be used to generate the output above — job submission scripts for Slurm to be used on Stampede 2 can be found here.

    #!/bin/bash
    #PBS -N test_hybrid
    #PBS -l nodes=2:ppn=16
    #PBS -l walltime=00:00:30
    #PBS -o ./output/hybrid_c.out
    #PBS -e ./error/hybrid_c.err
    module load openmpi-1.10.7-gnu-x86_64
    export OMP_NUM_THREADS=8
    
    set -x
    cd "$PBS_O_WORKDIR"
    
    # Construct a copy of the hostfile with TASKS_PER_NODE entries per node,
    # so MPI can run TASKS_PER_NODE tasks on each node.
    export TASKS_PER_NODE=2
    uniq "$PBS_NODEFILE" | awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i<TASKS_PER_NODE;i++) print}' > nodefile
    
    #cat nodefile
    mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS ./Hybrid
    

    Parallelizing Python on a Cluster

    Python was not designed with shared-memory parallelization in mind, so its native parallelization library, the Multiprocessing library, does distributed-memory parallelization instead of shared. The issue with the Multiprocessing library is that it does not work across nodes, so you still need something like MPI. Here, we will use MPI4Py.

    The code below parallelizes the function call across cores within the same node — please refer to this blog post for details about and other ways to use the Multiprocessing library.

    from multiprocessing import Process, cpu_count
    
    def do_something_useful(rank, shared_process_number):
        # Do something useful here.
        print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number)
    
    # Create node-local processes
    shared_processes = []
    #for i in range(cpu_count()):
    for i in range(8):
        p = Process(target=do_something_useful, args=(0, i))  # no MPI rank here, so pass 0 as a placeholder
        shared_processes.append(p)
    
    # Start processes
    for sp in shared_processes:
        sp.start()
    
    # Wait for all processes to finish
    for sp in shared_processes:
        sp.join()
    

    By adding MPI4Py to the code above, following the same logic of the C/C++ example, we get the following result:

    from mpi4py import MPI
    from multiprocessing import Process, cpu_count
    
    def do_something_useful(rank, shared_process_number):
        # Do something useful here.
        print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number)
    
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    
    print 'Calling Python multiprocessing process from Python MPI rank {}'.format(rank)
    
    # Create shared-ish processes
    shared_processes = []
    #for i in range(cpu_count()):
    for i in range(8):
        p = Process(target=do_something_useful, args=(rank, i))
        shared_processes.append(p)
    
    # Start processes
    for sp in shared_processes:
        sp.start()
    
    # Wait for all processes to finish
    for sp in shared_processes:
        sp.join()
    
    comm.Barrier()
    

    Calling the code above with mpirun -n 4 python hybrid_pure_python.py will first result in four MPI processes being spawned with ranks 0 through 3, each spawning eight local processes. The output should look similar to the one below:

    Calling Python multiprocessing process from Python MPI rank 0
    Calling Python multiprocessing process from Python MPI rank 1
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-0
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-0
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-1
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-1
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-2
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-2
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-3
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-3
    Calling Python multiprocessing process from Python MPI rank 2
    Calling Python multiprocessing process from Python MPI rank 3
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-4
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-4
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-5
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-5
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-6
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-6
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-7
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-7
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-0
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-0
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-1
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-1
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-2
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-2
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-3
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-3
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-4
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-4
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-5
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-5
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-6
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-7
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-6
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-7
    

    Below is a PBS submission script that can be used with the code above:

    #!/bin/bash
    #PBS -N test_hybrid
    #PBS -l nodes=2:ppn=16
    #PBS -l walltime=00:01:00
    #PBS -o ./output/hybrid_python.out
    #PBS -e ./error/hybrid_python.err
    module load python-2.7.5
    export OMP_NUM_THREADS=8
    
    set -x
    cd "$PBS_O_WORKDIR"
    
    # Construct a copy of the hostfile with TASKS_PER_NODE entries per node,
    # so MPI can run TASKS_PER_NODE tasks on each node.
    export TASKS_PER_NODE=2
    uniq "$PBS_NODEFILE" | awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i<TASKS_PER_NODE;i++) print}' > nodefile
    
    #cat nodefile
    mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_pure_python.py
    

    Parallelizing Monte Carlo Runs of C/C++ (or anything else) with Python

    Very often we have to run the same model dozens to thousands of times for Monte-Carlo-style analyses. One option is to go on a cluster and submit dozens to thousands of jobs, but this becomes a mess for the scheduler to manage, and cluster managers tend not to be very fond of such an approach. A cleaner way of performing Monte Carlo runs in parallel is to write a code in Python or another language that spawns MPI processes which then call the model to be run. This can be done in Python with the subprocess library, as in the example below:

    from mpi4py import MPI
    import subprocess
    
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    
    print 'Calling OpenmpTest from Python MPI rank {}'.format(rank)
    function_call = ['./OpenmpTest', '{}'.format(rank)]
    subprocess.call(function_call)
    
    comm.Barrier()
    

    Here, OpenmpTest is an executable compiled from the C-OpenMP code below:

    #include <stdio.h>
    #include <omp.h>
    
    int main(int argc, char *argv[])
    {
    #pragma omp parallel for
        for (int i = 0; i < omp_get_num_threads(); ++i)
        {
            printf("C OpenMP. Process-thread: %s-%d\n", argv[1], i);
        }
        return 0;
    }
    

    Submitting the code above with the following PBS job submission script should result in the output below.

    #!/bin/bash
    #PBS -N test_hybrid
    #PBS -l nodes=2:ppn=16
    #PBS -l walltime=00:01:00
    #PBS -o ./output/hybrid_python_c.out
    #PBS -e ./error/hybrid_python_c.err
    module load python-2.7.5
    export OMP_NUM_THREADS=8

    set -x
    cd "$PBS_O_WORKDIR"

    # Construct a copy of the hostfile with TASKS_PER_NODE entries per node,
    # so MPI can run TASKS_PER_NODE tasks on each node.
    export TASKS_PER_NODE=2
    uniq "$PBS_NODEFILE" | awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i<TASKS_PER_NODE;i++) print}' > nodefile

    #cat nodefile
    mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_calls_c.py

    Output:

    Calling OpenmpTest from Python MPI rank 0
    Calling OpenmpTest from Python MPI rank 1
    Calling OpenmpTest from Python MPI rank 2
    Calling OpenmpTest from Python MPI rank 3
    C OpenMP. Process-thread: 0-7
    C OpenMP. Process-thread: 0-5
    C OpenMP. Process-thread: 0-0
    C OpenMP. Process-thread: 0-2
    C OpenMP. Process-thread: 0-4
    C OpenMP. Process-thread: 0-1
    C OpenMP. Process-thread: 0-3
    C OpenMP. Process-thread: 0-6
    C OpenMP. Process-thread: 1-1
    C OpenMP. Process-thread: 1-2
    C OpenMP. Process-thread: 1-6
    C OpenMP. Process-thread: 1-4
    C OpenMP. Process-thread: 1-5
    C OpenMP. Process-thread: 1-0
    C OpenMP. Process-thread: 1-3
    C OpenMP. Process-thread: 1-7
    C OpenMP. Process-thread: 2-3
    C OpenMP. Process-thread: 2-7
    C OpenMP. Process-thread: 2-0
    C OpenMP. Process-thread: 2-4
    C OpenMP. Process-thread: 2-6
    C OpenMP. Process-thread: 2-5
    C OpenMP. Process-thread: 2-1
    C OpenMP. Process-thread: 2-2
    C OpenMP. Process-thread: 3-5
    C OpenMP. Process-thread: 3-1
    C OpenMP. Process-thread: 3-4
    C OpenMP. Process-thread: 3-2
    C OpenMP. Process-thread: 3-7
    C OpenMP. Process-thread: 3-3
    C OpenMP. Process-thread: 3-0
    C OpenMP. Process-thread: 3-6
    

    End of the Story

    As highlighted in the examples above (all of which can be downloaded from here), creating parallel code is not rocket science and requires very few lines of code. The OpenMP and Multiprocessing parallelization tricks can save hours if you have to run a model several times on your personal computer, and paired with MPI these tricks can save you from hours to days in time and thousands of node-hours on clusters. I hope you make good use of them.

    MOEAFramework Training Part 1: Connecting an External Problem

    The goal of this training is to step a user through becoming familiar with the capabilities of MOEAFramework, a free and open source Java library created by Dave Hadka, that allows the user to design, execute, and test out a variety of popular multi-objective evolutionary algorithms (MOEAs). In this series, we will demonstrate the capabilities of MOEAFramework in 4 parts. Part 1 demonstrates how to hook up an external optimization problem to MOEAFramework. Part 2 will cover the optimization of the problem using a variety of algorithms. Part 3 will illustrate how to calculate metrics to assess the performance of the algorithms. Finally, Part 4 will step through generation of relevant figures from the metrics that convey the effectiveness, efficiency, reliability, and controllability of the algorithms.

    The example test case that will be used throughout this tutorial is the DPS version of the Lake Problem. The code is written and adapted by Dr. Julianne Quinn and can be found here. However, the tutorial will be set up so that the user can easily swap in their own problem formulation.

    Finally, this guide is specifically built to connect an external problem that is written in C++, but Dave Hadka’s Beginner Guide to the MOEA Framework has examples of how to create the Java executable for problems written in a language other than C++.

    Connecting Your Problem

    Before starting this training, it is recommended to set up a directory in a Linux environment with Java installed. If you are a student in Reed Research Group, create a folder in your Cube directory called “MOEA_Diagnostics_Tutorial”. Throughout the training, we will be adding various subdirectories and files to this folder. For this part of the tutorial, you will need the following in your directory:

    1. MOEAFramework Demo .jar file which can be found by clicking on “Demo Application” on the far right.
    2. The C++ version of the Lake Problem
    3. SOWs_Type6.txt (natural inflows read in by the .cpp file)
    4. moeaframework.c and moeaframework.h found here

    Once you have the .cpp file, the next step is to get it set up to be recognized by MOEAFramework. I have made these changes in this file in my GitHub repo. If you compare it with the file in item 2 above, you will see some changes.

    First, I have commented out all parts of the code related to Borg, which the problem was originally set up to be optimized with. For this tutorial, we will be optimizing with multiple algorithms and comparing their performance. Secondly, in lines 55-57, I have used the #define directive to declare the number of variables, objectives, and constraints as constants. Lines 300-309 are where the direct connection to MOEAFramework comes in. MOEA_Init initializes the communication between C++ and MOEAFramework and takes the number of objectives and constraints as arguments. Then we can start reading and evaluating solutions. MOEA_Read_doubles extracts the decision variables and stores them in an array, vars. Line 304 calls the lake_problem function that we would like to evaluate, passing it the variables and constraints; this results in the objs array being filled. Finally, MOEA_Write writes the objectives back into the framework. When all solutions are done being read and written, MOEA_Terminate closes the connection. The important thing here is just to make sure that you are passing the names of the arguments of your lake_problem function correctly.
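    Putting those calls together, the evaluation loop in the modified lake.cpp follows roughly the pattern sketched below (the nvars/nobjs/nconsts values and the lake_problem signature are illustrative; check the actual file and moeaframework.h for the exact names and signatures in your version):

    #include "moeaframework.h"
    
    #define nvars   6
    #define nobjs   4
    #define nconsts 1
    
    // Your model function (this signature is illustrative).
    void lake_problem(double* vars, double* objs, double* consts);
    
    int main() {
        double vars[nvars], objs[nobjs], consts[nconsts];
    
        MOEA_Init(nobjs, nconsts);            // open the connection to MOEAFramework
    
        while (MOEA_Next_solution() == MOEA_SUCCESS) {
            MOEA_Read_doubles(nvars, vars);   // receive decision variables from the framework
            lake_problem(vars, objs, consts); // evaluate the model
            MOEA_Write(objs, consts);         // send objectives and constraints back
        }
    
        MOEA_Terminate();                     // close the connection
        return 0;
    }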

    Next, we must make an executable of the C++ file that Java can read. The makefile is located here and takes lake.cpp, moeaframework.c, moeaframework.h, and some relevant libraries and compiles them into an executable called “lake”. In order to build the executable, simply type “make”.

    Finally, we must turn our executable into a Java class. The relevant Java file can be found here. This file can be found in the MOEAFramework documentation, but must be tailored to an external problem. Lines 21-25 import the relevant tools from MOEAFramework that help to configure and solve the problem. Lines 40, 42, 43, and 48 show the first change from the original file. Here we insert the name of our executable that was generated in the last step, “lake”. In lines 53, 58, and 63, we state the number of decision variables, objectives, and constraints. Finally, in lines 67-75, we create a newSolution() method to specify that our solution should have 6 real-valued decision variables between 0 and 1.

    In order to create a class file, simply type:

    javac -classpath MOEAFramework-2.12-Demo.jar lake.java

    A file called lake.class will be created.

    The first part of training is done! Next time, we will set up some scripts, call these executables, and optimize the lake problem with a variety of different algorithms.

    Making Valgrind Easy

    Some of this blog’s readers and authors (most notably, Joe Kasprzik) read the title of this post and thought “wait, there already is a post about Valgrind on this blog.” And you are right, so in this blog post I will build on the legacy Joe has left us in his post about Valgrind and get into the details of how to use its basic functionalities to get your code right.

    Common mistakes when coding in C++

    Suppose we have the following code:

    #include <stdio.h>
    
    int main() {
        int *var = new int[5]; // you wouldn't do this if the size was always 5, but this makes the argument clear.
        int n = 5;
        int m;
    
        if (m > n) {
            printf("Got into if statement.\n");
            for (int i = 0; i < 6; ++i) {
                var[i] = i;
            }
        }
    
        printf("var[5] equals %d\n", var[n]);
    }
    

    Saving the code above in a file called test.cpp, compiling it with g++ to create an executable called "test," and running it with "./test" will return the following output:

    bernardoct@DESKTOP-J6145HK ~
    $ g++ test.cpp -o test
    
    bernardoct@DESKTOP-J6145HK ~
    $ ./test
    Got into if statement.
    var[5] equals 5
    

    Great, it ran and did not crash (in such a simple code gcc's flag -Wall would have issued a warning saying m was not initialized, but in more complex code such a warning may not be issued). However, it would be great if this code had crashed, because that would make us look into it and figure out it actually has 3 problems:

    1. We did not assign a value to variable m (it was created but not initialized), so how did the code determine that m was greater than n to get into the code inside the if statement?
    2. The pointer array var was created as having length 5, meaning its elements are numbered 0 to 4. If the for-loop runs from 0 to 5 but element 5 does not exist, how did the code fill it in with the value of variable i when i was 5 in the loop? From the printf statement that returned 5, we know var[5] equals 5.
    3. The pointer array var was not destroyed after the code no longer needed it. This is not necessarily a problem in this case, but if this were a function that is supposed to be called over and over within a model, there is a chance the RAM would be filled with these seemingly inoffensive pointer arrays and the computer would freeze (or the node, if running on a cluster, would possibly crash and have to be rebooted). A corrected version is sketched right after this list.
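    For reference, here is one way test.cpp could look with the three issues addressed (m initialized, the loop kept within the bounds of var, and the array freed); the rest of this post runs Valgrind on the buggy version above to show how it flags each problem:

    #include <stdio.h>
    
    int main() {
        int n = 5;
        int *var = new int[n]; // allocate exactly n elements (indices 0 to n-1)
        int m = 10;            // m is now initialized before being compared to n
    
        if (m > n) {
            printf("Got into if statement.\n");
            for (int i = 0; i < n; ++i) { // the loop stays within the bounds of var
                var[i] = i;
            }
        }
    
        printf("var[%d] equals %d\n", n - 1, var[n - 1]);
    
        delete[] var; // release the memory once it is no longer needed
        return 0;
    }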

    Given that C++ will not crash even in the presence of such errors, one way of making sure your code is clean is by running it through Valgrind. However, most people who have used Valgrind on a model with a few hundred or thousand lines of code have gotten discouraged by its possibly long and cryptic-looking output. Do not let this intimidate you, because the output is actually fairly easy to read once you either learn what to look for or use Valkyrie, a graphical user interface for Valgrind.

    Generating and interpreting Valgrind’s output

    The first thing that needs to be done for Valgrind to give you a meaningful output is to re-compile your code with the -O0 and -g flags: the former prevents the compiler from modifying your code to make it more efficient but unintelligible to Valgrind (or to debuggers), and the latter allows Valgrind (and debuggers) to pinpoint the lines of code where issues happen and originate. Therefore, the code should be compiled as shown below:

    bernardoct@DESKTOP-J6145HK ~
    $ g++ -O0 -g test.cpp -o test
    
    

    Now it is time to run your code with Valgrind to perform some memory checking. Valgrind itself takes flags that dictate the type of analysis to be performed. Here we are interested in checking memory misuse (instead of profiling, checking for thread safety, etc.), so the first flag (not required, since memcheck is the default, but good to keep things explicit) should be --tool=memcheck. Now that we have specified that we want Valgrind to run a memory check, we should specify that we want it to look in detail for memory leaks and tell us where the errors are happening and originating, which can be done by passing the flags --leak-check=full and --track-origins=yes. This way, the complete call to run Valgrind on our test program is:

    bernardoct@DESKTOP-J6145HK ~
    $ valgrind --tool=memcheck --leak-check=full --track-origins=yes ./test
    

Important: Beware that your code will take orders of magnitude longer to run with Valgrind than it would otherwise. This means that you should run something as small as possible but still representative; for example, instead of running your stochastic model with 1,000 realizations and a simulation time of 50 years, consider running 2 realizations simulating 2 years, so that Valgrind analyzes the year-long simulation and the transitions between realizations and years. Also, if running your code on a cluster, load the Valgrind module with module load valgrind-xyz in your submission script and replace the call to your model in the submission script with the Valgrind call above (you can find the exact name of the Valgrind module by running module avail on the terminal). If running Valgrind with a code that uses MPI, use mpirun valgrind ./mycode -flags.
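For reference, a minimal sketch of what the relevant lines of a submission script might look like is shown below. Here valgrind-xyz is a placeholder for whatever module avail reports on your cluster, and ./mymodel stands in for the call to your own executable:

module load valgrind-xyz
valgrind --tool=memcheck --leak-check=full --track-origins=yes ./mymodel
# or, for an MPI code:
# mpirun valgrind ./mymodel -flags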

When called, Valgrind will instrument our test executable and, based on the collected information, print the following on the screen:

    ==385== Memcheck, a memory error detector
    ==385== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
    ==385== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
    ==385== Command: ./test
    ==385==
    ==385== Conditional jump or move depends on uninitialised value(s)
    ==385==    at 0x4006A9: main (test.cpp:9)
    ==385==  Uninitialised value was created by a stack allocation
    ==385==    at 0x400686: main (test.cpp:3)
    ==385==
    Got into if statement.
    ==385== Invalid write of size 4
    ==385==    at 0x4006D9: main (test.cpp:12)
    ==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
    ==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==385==    by 0x400697: main (test.cpp:5)
    ==385==
    ==385== Invalid read of size 4
    ==385==    at 0x4006F5: main (test.cpp:16)
    ==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
    ==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==385==    by 0x400697: main (test.cpp:5)
    ==385==
    var[5] equals 5
    ==385==
    ==385== HEAP SUMMARY:
    ==385==     in use at exit: 20 bytes in 1 blocks
    ==385==   total heap usage: 3 allocs, 2 frees, 73,236 bytes allocated
    ==385==
    ==385== 20 bytes in 1 blocks are definitely lost in loss record 1 of 1
    ==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==385==    by 0x400697: main (test.cpp:5)
    ==385==
    ==385== LEAK SUMMARY:
    ==385==    definitely lost: 20 bytes in 1 blocks
    ==385==    indirectly lost: 0 bytes in 0 blocks
    ==385==      possibly lost: 0 bytes in 0 blocks
    ==385==    still reachable: 0 bytes in 0 blocks
    ==385==         suppressed: 0 bytes in 0 blocks
    ==385==
    ==385== For counts of detected and suppressed errors, rerun with: -v
    ==385== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)
    

Seeing Valgrind's output being five times as long as the test code itself can be somewhat disheartening, but the information contained in the output is really useful. The first block of the output is the header, which will always be printed so that you know the version of Valgrind you have been using, the command it used to call your code, and so on. In our example, the header is:

    ==385== Memcheck, a memory error detector
    ==385== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
    ==385== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
    ==385== Command: ./test
    

After that, Valgrind reports the errors it found during the execution of your code. Errors are always reported as a description of the error in good old English, followed by where it happens in your code. Let's look at the first error found by Valgrind:

    ==385== Conditional jump or move depends on uninitialised value(s)
    ==385==    at 0x4006A9: main (test.cpp:9)
    ==385==  Uninitialised value was created by a stack allocation
    ==385==    at 0x400686: main (test.cpp:3)
    

    This tells us that there is an if statement (conditional statement) on line 9 of test.cpp in which at least one of the sides of the logical test has at least one uninitialized variable. As pointed out by Valgrind, line 9 of test.cpp has our problematic if statement which compares initialized variable n to uninitialized variable m, which will have whatever was put last in that memory address by the computer.

    The second error block is the following:

    ==385== Invalid write of size 4
    ==385==    at 0x4006D9: main (test.cpp:12)
    ==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
    ==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==385==    by 0x400697: main (test.cpp:5)
    

This means that your code is writing something in a location of memory that it did not allocate for its use. This block says that the illegal write, so to speak, happened on line 12 of test.cpp through a variable created on line 5 of test.cpp using the new[] operator. These lines correspond to var[i] = i; and to int *var = new int[5];. With this, we learned that either var was created too short on line 5 of test.cpp or that the for loop that assigns values to var goes one or more steps too far.

    Similarly, the next block tells us that our printf statement used to print the value of var[5] on the screen has read past the amount of memory that was allocated to var in its declaration on line 5 of test.cpp, as shown below:

    ==385== Invalid read of size 4
    ==385==    at 0x4006F5: main (test.cpp:16)
    ==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
    ==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==385==    by 0x400697: main (test.cpp:5)
    

    The last thing Valgrind will report is the information about memory leaks, which are accounted for when the program is done running. The output about memory leaks for our example is:

    ==409== HEAP SUMMARY:
    ==409==     in use at exit: 20 bytes in 1 blocks
    ==409==   total heap usage: 3 allocs, 2 frees, 73,236 bytes allocated
    ==409==
    ==409== 20 bytes in 1 blocks are definitely lost in loss record 1 of 1
    ==409==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==409==    by 0x400697: main (test.cpp:5)
    ==409==
    ==409== LEAK SUMMARY:
    ==409==    definitely lost: 20 bytes in 1 blocks
    ==409==    indirectly lost: 0 bytes in 0 blocks
    ==409==      possibly lost: 0 bytes in 0 blocks
    ==409==    still reachable: 0 bytes in 0 blocks
    ==409==         suppressed: 0 bytes in 0 blocks
    

    The important points to take away from this last block are that:

1. there were 20 bytes of memory leaks, meaning that if this were a function in your code, every time it was run it would leave 20 bytes of garbage sitting in the RAM. This may not sound like a big deal, but imagine if your code leaves 1 MB of garbage in the RAM for each of the 100,000 times a function is called. With this, there go 100 GB of RAM and everything else you were doing on your computer at that time, because the computer will likely freeze and have to go through a hard reset.
2. the memory you allocated and did not free was allocated on line 5 of test.cpp, when you used the operator new[] to allocate the integer pointer array.

It is important to note here that if we increase the amount of memory allocated by the new[] operator on line 5 to hold 6 integers instead of 5, the last two errors (invalid read and invalid write) would disappear. This means that if you run your code with Valgrind and see hundreds of errors, chances are that modifying a few lines of code will get rid of most of them.
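For completeness, a corrected version of test.cpp could look like the sketch below. The exact declarations of n and m are assumptions chosen to reproduce the same output as before (with m initialized to any value greater than n so the if statement is still entered); adjust them to match your own version of the code:

#include <stdio.h>

int main() {
    int n = 5;
    int m = 6;             // m is now initialized, fixing the uninitialised-value error
    int *var = new int[6]; // room for elements 0 to 5, fixing the invalid write and read

    if (m > n) {
        printf("Got into if statement.\n");
        for (int i = 0; i < 6; ++i) {
            var[i] = i;
        }
    }

    printf("var[5] equals %d\n", var[n]);

    delete[] var; // release the memory, fixing the leak reported in the heap summary

    return 0;
}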

    Valkyrie — a graphical user interface for Valgrind

Another way of going through Valgrind's output is by using Valkyrie (now installed on the login node of Reed's cluster, The Cube). If you are analyzing your code from your own computer with a Linux terminal (it does not work with Cygwin, but you can install a native Ubuntu terminal on Windows 10 by following the instructions posted here) and do not have Valkyrie installed yet, you can install it by running the following on your terminal:

    bernardoct@DESKTOP-J6145HK ~
    $ sudo apt-get install valkyrie
    

Valkyrie works by reading an XML file exported by Valgrind containing the information about the errors it found. To export this file, you need to pass the flags --xml=yes and --xml-file=valgrind_output.xml (or whatever name you want to give the file) to Valgrind, which makes the call to Valgrind become:

    bernardoct@DESKTOP-J6145HK ~
$ valgrind --tool=memcheck --leak-check=full --track-origins=yes --xml=yes --xml-file=valgrind_output.xml ./test
    

Now, you should have a file called "valgrind_output.xml" in the directory you are calling Valgrind from. To open it with Valkyrie, first open Valkyrie by typing valkyrie on your terminal (if on Windows 10, you need to have Xming installed and running, which can be done by following the instructions at the end of this post). If on a cluster, besides having Xming open you also have to have ssh'ed into the cluster with the -X flag (e.g. by running ssh -X username@my.cluster.here) with either Cygwin or from a native Linux terminal. After opening Valkyrie, click on the green folder button and select the xml file, as in the screenshot below.

[Screenshot: selecting the Valgrind XML file in Valkyrie]

After opening the xml file generated by Valgrind, Valkyrie should look like the screenshot below:

[Screenshot: Valkyrie displaying the errors parsed from the Valgrind output]

Now you can begin from a collapsed list of errors and unfold each error to see its details. Keep in mind that Valkyrie is not your only GUI option for Valgrind, as IDEs like JetBrains' CLion and QtCreator come with Valgrind integration. Now go check your code!

PS: Thanks to the folks on Reddit for the comments, which helped improve this post.

    A completely non-exhaustive list of tutorial resources for scientific computing

    This is a short blog post to put together a list of resources with tutorials (similar to what’s usually found on this blog) for various programming languages. It is by no means exhaustive, so please comment if you feel there’s an important one I left out.

    Matlab:

    https://www.arnevogel.com/

    Kinda new blog with Matlab tutorials on numerical methods

    https://blogs.mathworks.com/loren/

    One of the MathWorks blogs, great tutorials.

    https://www.mathworks.com/support/learn-with-matlab-tutorials.html

    Matlab tutorials from MathWorks

    matlab

    More Matlab tutorials, a lot of material on many topics

    Python:

    https://glowingpython.blogspot.com/

    Not updated very frequently, but good data analysis and visualization tutorials are posted

    https://pythonprogramming.net/

    Updated regularly, some great data visualization and analysis tutorials. The tutorials come with videos. I really like this site.

    http://treyhunner.com/

    Updated regularly (about once a month) with python tutorials and general python coding insights. I really like the writing style of this author. 

    https://docs.scipy.org/doc/scipy/reference/tutorial/

    Tutorials on using SciPy

    C and C++:

    https://www.cprogramming.com/

    C and C++ programming tutorials, tips and tricks

    http://www.cplusplus.com/articles/

    Not really updated anymore but some good basic tutorials are listed

    https://blog.knatten.org/

    Hadn’t been updated in a while, but it looks like it’s been picked up again. Good for general C++ programming and good practice.

    http://www.bfilipek.com/

    C++ tutorials

    General:

    https://towardsdatascience.com/

    This is a great general resource not devoted to a particular language. They cover topics in data science, machine learning and programming in general. Some great tutorials for Python and R. 

    https://projecteuler.net/

    Mathematical programming problems to help with learning any language

    https://github.com/EbookFoundation/free-programming-books/blob/master/free-programming-books.md

    Free programming books repository

    Reddits (some of the bigger ones):

    /r/matlab (General on matlab, community provides help with coding)

    /r/programming (General on programming)

    /r/learnprogramming (Community provides help with debugging questions)

    /r/python (General on python, community provides help with coding)

    /r/learnpython (Community provides help with python questions, smaller than /r/python)

    /r/cpp (General on C++)

    /r/cpp_questions (Community provides help with C++ questions)

    I’ve also recently made /r/sci_comp which has very little activity for the moment, but the aim is to create a community with general resources on coding for scientific applications.

     

    Simple tricks to make your C/C++ code run faster

We engineers with no formal computer science training tend, for a myriad of good reasons, to favor friendly programming languages such as Python over evil C/C++. However, when our application is performance sensitive and we cannot wait for results sitting in our chairs for as long as a Python code would require us to, we are sometimes forced to use C/C++. Why not make the most of it when this is the case?

    Here are some simple tips to make your C/C++ code run even faster, and how to get some advice about further performance improvements. The last idea (data locality) is transferable to Python and other languages.

    Most important trick

Improve your algorithm. Thinking about whether there is a simpler way of doing what you coded may reduce your algorithm's complexity (say, from n³ to n*log(n)), which would:

• yield great speed-ups when you have lots of data or need to run a computation several times in a row, and
    • make you look smarter.

    Compiler flags

First and foremost, let those who know what they are doing (compiler developers) do the work for you by calling the compiler with the appropriate flags. There is an incredible number of flags you can choose from, but I would say that the ones you should have on whenever possible are -O3 and -march=native.

The optimization flags (-O1 to -O3, the latter more aggressive than the former) will perform a series of modifications on your code behind the scenes to speed it up, sometimes by more than an order of magnitude. The issue is that these modifications may occasionally make your code behave differently than you expected, so it is always good to do a few smaller runs with -O0 and -O3 and compare their results before getting into production mode.

The -march=native flag will make the compiler fine-tune your code to the processor it is being compiled on (conversely, -march=haswell would fine-tune it to the Haswell architecture, for example). This is great if you are only going to run your code on your own machine or on another machine known to have a similar architecture, but if you try to run the binary on a machine with a different architecture, especially if it is an older one, you may end up with illegal instruction errors.
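Putting the two flags together, a compilation call might look like the one below (mycode.cpp and the executable name are just placeholders for your own files):

bernardoct@DESKTOP-J6145HK ~
$ g++ -O3 -march=native mycode.cpp -o mycode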

    Restricted pointer array

When declaring a pointer array which you are sure will not be subject to pointer aliasing (namely, there will be no other pointer pointing to the same memory address), you can declare that pointer as a restrict pointer, as below:

    • GCC: double* __restrict__ my_pointer_array
    • Intel compiler: double* restrict my_pointer_array

This will let the compiler know that it can change the order of certain operations involving my_pointer_array to make your code faster, without the risk of some read/write reordering changing your results. If you want to use the restrict qualifier with the Intel compiler, the flag -restrict must be passed as an argument to the compiler.
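As a minimal sketch of what this looks like in practice (the function and variable names are just illustrative), restrict is most commonly applied to the pointer arguments of a function, telling the compiler that the arrays they point to do not overlap:

// GCC/Clang syntax; for the Intel compiler, replace __restrict__ with restrict
// and pass the -restrict flag to the compiler.
void scale(double* __restrict__ out, const double* __restrict__ in, double factor, int n) {
    // Because out and in are guaranteed not to alias, the compiler is free to
    // reorder and vectorize these loads and stores.
    for (int i = 0; i < n; ++i)
        out[i] = factor * in[i];
}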

    Aligned pointer array

By aligning an array, you are making sure the data is going to lie in the ideal location in memory for the processor to fetch it and perform the calculations it needs based on that array. To help your compiler optimize your data alignment, you need to (1) align your array by a specific number of bytes when it is declared and (2) tell the compiler the array is aligned when using it to perform calculations (the compiler has no idea whether or not arrays received as function arguments are aligned). Below are examples in C, although the same ideas apply to C++ as well.

    GCC

    #include <stdio.h>
    #include <omp.h>
    
    #define SIZE_ARRAY 100000
    #define ALIGN 64
    
    void sum(double *__restrict__ a, double *__restrict__ b, double *__restrict__ c, int n) {
        a = (double*) __builtin_assume_aligned(a, ALIGN);
        b = (double*) __builtin_assume_aligned(b, ALIGN);
        c = (double*) __builtin_assume_aligned(c, ALIGN);
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }
    
    int main(void) {
    
        double a[SIZE_ARRAY] __attribute__((aligned(ALIGN )));
        double b[SIZE_ARRAY] __attribute__((aligned(ALIGN )));
        double c[SIZE_ARRAY] __attribute__((aligned(ALIGN )));
    
        for (int i = 0; i < SIZE_ARRAY; ++i) {
            a[i] = 5.;
            b[i] = 2.;
        }
        double start_time = omp_get_wtime();
        sum(a, b, c, SIZE_ARRAY);
        double time = omp_get_wtime() - start_time;
        printf("%0.6fs", time);
    }
    

    Intel compiler

    #include <stdio.h>
    #include <omp.h>
    
    #define SIZE_ARRAY 100000
    #define ALIGN 64
    
    void sum(double* restrict a, double* restrict b, double* restrict c, int n) {
        __assume_aligned(a, ALIGN);
        __assume_aligned(b, ALIGN);
        __assume_aligned(c, ALIGN);
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }
    
    int main(void) {
    
    __declspec(align(ALIGN )) double a[SIZE_ARRAY];
    __declspec(align(ALIGN )) double b[SIZE_ARRAY];
    __declspec(align(ALIGN )) double c[SIZE_ARRAY];
    
        for (int i = 0; i < SIZE_ARRAY; ++i) {
            a[i] = 5.;
            b[i] = 2.;
        }
        double start_time = omp_get_wtime();
        sum(a, b, c, SIZE_ARRAY);
        double time = omp_get_wtime() - start_time;
    
        printf("%0.6fs", time);
    }
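As a side note, both examples above time the call to sum() with omp_get_wtime(), so they need to be compiled with OpenMP support enabled; with GCC, for example, that means adding the -fopenmp flag (e.g. gcc -O3 -march=native -fopenmp aligned_sum.c -o aligned_sum, where the file name is just a placeholder).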
    

Edit: In a comment on this post, Krister Walfridsson not only caught an issue with my GCC code, for which I thank him, but also showed the differences in machine code generated with and without alignment.

    Data Locality

Computers are physical things, which means that data is physically stored and needs to be physically moved around in memory and between cache and processor in order to be used in calculations. This means that, if your data is stored all over the place in memory (e.g. in multiple pointer arrays in different parts of memory), the processor will have to reach out to several parts of memory to fetch all your data before performing any computations. By having the data intelligently laid out in memory, you ensure that all data required for each computation is stored close together and is in cache at the same time, which becomes even more important if your code uses too much data to fit in the cache at once.

In order to make your processor's life easier, it is a good idea to ensure that all data required for a calculation step is close together. For example, if a given computation requires three arrays of fixed sizes, it is a good idea to merge them into one long array, as in the example below for the Intel compiler.

    #include <stdio.h>
    #include <omp.h>
    
    #define SIZE_ARRAY 100000
    #define ALIGN 64
    
    void sum(double* restrict a, double* restrict b, double* restrict c, int n) {
        __assume_aligned(a, ALIGN);
        __assume_aligned(b, ALIGN);
        __assume_aligned(c, ALIGN);
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }
    
    int main(void) {
    
    __declspec(align(ALIGN )) double abc[3 * SIZE_ARRAY];

    // fill the first third of abc (the "a" block) with 5. and the second third (the "b" block) with 2.
    for (int i = 0; i < SIZE_ARRAY; ++i) {
        abc[i] = 5.;
        abc[SIZE_ARRAY + i] = 2.;
    }
    double start_time = omp_get_wtime();
    sum(abc, abc + SIZE_ARRAY, abc + 2 * SIZE_ARRAY, SIZE_ARRAY);
        double time = omp_get_wtime() - start_time;
    
        printf("%0.6fs", time);
    }
    

or even, since c[i] depends only on a[i] and b[i], we can have the values of a, b and c interleaved to ensure that all computations will be performed on data that is right next to each other in memory:

    #include <stdio.h>
    #include <omp.h>
    
    #define SIZE_ARRAY 100000
    #define ALIGN 64
    #define STRIDE 3
    
    void sum(double* restrict abc, int n, int stride) {
        __assume_aligned(abc, ALIGN);
    
        for (int i = 0; i < n; i += stride)
            abc[i+2] = abc[i] + abc[i+1];
    }
    
    int main(void) {
        __declspec(align(ALIGN )) double abc[3 * SIZE_ARRAY];
    
    for (int i = 0; i < 3 * SIZE_ARRAY; i += STRIDE) {
        abc[i] = 5.;
        abc[i+1] = 2.;
    }
    double start_time = omp_get_wtime();
    sum(abc, 3 * SIZE_ARRAY, STRIDE);
        double time = omp_get_wtime() - start_time;
    
        printf("%0.6fs", time);
    }
    

    Conclusion

Based on a class project in which we had to write C code to perform matrix multiplication, the improvements suggested here may improve the performance of your code by a factor of 5 or 10. Also, the idea of data locality can be transferred to other languages, such as Python.

    Reading CSV files in C++

    If you are an engineer used to coding in Python or Matlab who is transitioning to C++, you will soon find out that even the most innocent task will now require several lines of code. A previous post has already shown how to export data to a CSV file. In order to facilitate your transition to C++, see below for an example of how to read your new CSV file.

    Utils.cpp

    #include <string>
    #include <vector>
    #include <sstream> //istringstream
    #include <iostream> // cout
#include <fstream> // ifstream
#include <stdexcept> // invalid_argument
    
    using namespace std;
    
    /**
     * Reads csv file into table, exported as a vector of vector of doubles.
     * @param inputFileName input file name (full path).
     * @return data as vector of vector of doubles.
     */
    vector<vector<double>> parse2DCsvFile(string inputFileName) {
    
        vector<vector<double> > data;
        ifstream inputFile(inputFileName);
        int l = 0;
    
        while (inputFile) {
            l++;
            string s;
            if (!getline(inputFile, s)) break;
            if (s[0] != '#') {
                istringstream ss(s);
                vector<double> record;
    
                while (ss) {
                    string line;
                    if (!getline(ss, line, ','))
                        break;
                    try {
                    record.push_back(stod(line));
                    }
                catch (const std::invalid_argument &e) {
                        cout << "NaN found in file " << inputFileName << " line " << l
                             << endl;
                        e.what();
                    }
                }
    
                data.push_back(record);
            }
        }
    
        if (!inputFile.eof()) {
            cerr << "Could not read file " << inputFileName << "\n";
        throw invalid_argument("File not found.");
        }
    
        return data;
    }
    
    int main()
    {
        vector<vector<double>> data = parse2DCsvFile("test.csv");
    
        for (auto l : data) {
        	for (auto x : l)
        		cout << x << " ";
        	cout << endl;
        }
    
        return 0;
    }
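To try this out, you can compile the file and run it on a small CSV file. The file contents, file names, and output below are just an illustration (note that lines starting with # are skipped as comments by the parser above, and that main() expects the file to be named test.csv):

bernardoct@DESKTOP-J6145HK ~
$ cat test.csv
# year,flow,demand
1,10.5,3.2
2,11.0,3.4

bernardoct@DESKTOP-J6145HK ~
$ g++ -std=c++11 Utils.cpp -o readcsv

bernardoct@DESKTOP-J6145HK ~
$ ./readcsv
1 10.5 3.2
2 11 3.4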
    

    Dynamic memory allocation in C++

To be successful in creating C++ programs, it is essential to have a good understanding of the workings and implementation of dynamic memory, which is the "art" of performing manual memory management. When developing code in C++, you may not always know how much memory you will need before your program runs. For instance, the size of an array may be unknown until you execute the program.

    Introduction to pointers

In order to understand dynamic memory allocation, we must first talk about pointers. A pointer is essentially a variable that stores the address in memory of another variable. The pointer data type is a long hexadecimal number representing a memory address. This address can be accessed using an ampersand (&), as exemplified in the following script:

    //Accessing a pointer:
    int age=43;
    cout << &age << endl;
    // The output would be something like: 0x7fff567ecb68
    

Since a pointer is also a variable, it requires declaration before being used. The declaration consists of giving it a name, hopefully following good code style conventions, and declaring the data type that it "points to" using an asterisk (*), like so:

    //Declaring pointers of different data types:
    int *integerPointer;
    double *doublePointer;
    float *floatPointer;
    char *charPointer;
    

    The pointer’s operators

In summary, the two pointer operators are: the address-of operator (&), which returns the memory address, and the contents-of operator (*), which returns the value of the variable located at that address.

    // Example of pointer operators:
    float variable=25.6;
    float *pointer;
    pointer= &variable;
    cout << variable << endl; //outputs 25.6, the variable’s value
    cout << pointer << endl; //outputs 0x7fff5a774b68, the variable’s location in memory
    cout << *pointer << endl; // outputs 25.6, value of the variable stored in that location
    

This last operator is also called the dereference operator, which enables you to directly access the variable the pointer points to, which you can then use for regular operations:

    float width = 5.0;
    float length = 10.0;
    float area;
    float *pWidth = &width;
    float *pLength = &length;
    
    //Both of the following operations are equivalent
    area = *pWidth * *pLength;
    area = width * length;
    //the output for both would be 50.
    

Dereferencing the pointers pWidth and pLength (i.e. *pWidth and *pLength) yields exactly the same values as the variables width and length, respectively.

    Memory allocation in C++

Now that you have some basic understanding of pointers, we can talk about memory allocation in C++. Memory in C++ is divided into two parts: the stack and the heap. Variables declared inside a function use memory from the stack, whereas memory that is allocated dynamically at run time comes from the heap.
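As a minimal sketch of the difference between the two (the function and variable names are just illustrative):

#include <iostream>

void example() {
    int onStack = 5;            // lives on the stack; released automatically when the function returns
    int *onHeap = new int(5);   // allocated on the heap; lives until we explicitly delete it
    std::cout << onStack << " " << *onHeap << std::endl;
    delete onHeap;              // without this, the heap allocation would leak
}

int main() {
    example();
    return 0;
}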

You may not know in advance how much memory you will need to store information in a variable. You can allocate memory on the heap for a variable of a given data type using the new operator, like so:

    new float;
    

This operator allocates the memory necessary for storing a float on the heap and returns its address. This address can be stored in a pointer, which can then be dereferenced to access the variable:

    float *pointer = new float; //requesting memory
    *pointer = 12.0; //store value
    cout << *pointer << endl; //use value
    delete pointer;// free up memory
    // this is now a dangling pointer
pointer = new float; // reuse pointer for a new address
    

Here, the pointer is stored on the stack as a local variable and holds the allocated address on the heap as its value. The value 12.0 is then stored at that address on the heap. You can then use this value for other operations. Finally, the delete statement releases the memory when it is no longer needed; however, it does not delete the pointer itself, since the pointer lives on the stack. Pointers that point to memory that has been freed are called dangling pointers, and they can be reused by assigning them a new address.

    Dynamic memory and arrays

Perhaps the most common use of dynamic memory allocation is arrays. Here's a brief example of the syntax:

int *pointer = NULL; // initialize a null pointer
pointer = new int[10]; // request memory for an array of 10 ints
delete[] pointer; // delete the array pointed to by pointer
    

The NULL pointer has a value of zero; you can declare a null pointer when you do not yet have an address to assign to it (in modern C++, nullptr is preferred for this).

Finally, to allocate and release memory for multi-dimensional arrays, you basically use an array of pointers to arrays. It sounds confusing, but you can do it using the following sample method:

    int row = 3;
    int col = 4;
    double **p  = new double* [row]; // Allocate memory for rows
    
    // Then allocate memory for columns
for(int i = 0; i < row; i++) {
        p[i] = new double[col];
    }
    
    //Release memory
    for(int i = 0; i < row; i++) {
       delete[] p[i];
    }
    delete [] p;
    

    I hope this quick overview provides a starting point on tackling your  C++ memory allocation challenges.