Parallel Programming Part 1: A brief introduction to OpenMP directives and constructs

In my recent Applied High Performance Computing (HPC) lectures, I have been learning to better utilize OpenMP constructs and functions. If these terms mean little to nothing to you, welcome to the boat – it’ll be a collaborative learning experience. For full disclosure, this blog post assumes some basic knowledge of parallel programming and its related terminology. But just in case, here is a quick glossary:

  • Node: One computing unit within a supercomputing cluster. Can be thought of as one CPU unit in a traditional desktop/laptop.
  • Core: The physical processing unit within the CPU that enables computations to be executed. For example, an 11th-Gen Intel i5 CPU has four cores.
  • Processors/threads: Lightweight units of execution that run within a core. Using the same example as above, this Intel i5 CPU supports eight threads, indicating that each of its cores can run two threads simultaneously.
  • Shared memory architecture: This means that all cores share the same memory within a CPU, and are controlled by the same operating system. This architecture is commonly found in your run-of-the-mill laptops and desktops.
  • Distributed memory architecture: Multiple cores run their processes independently of each other. They communicate and exchange information via a network. This architecture is more commonly found in supercomputer clusters.
  • Serial execution: Tasks and computations are completed one at a time. Typically the safest and most error-resistant form of executing tasks, but is the least efficient.
  • Parallel execution: Tasks and computations are completed simultaneously. Faster and more efficient than serial execution, but runs the risk of errors where tasks are completed out of order.
  • Synchronization: The coordination of multiple tasks or computations to ensure that all processes align with each other and abide by a certain order of execution. Synchronization adds overhead on top of computation, so an overly cautious code with too many synchronization points can suffer unnecessary slowdown. However, avoiding synchronization altogether can result in erroneous output due to out-of-order computations.
  • Race condition: A race condition occurs when two or more threads access the same shared-memory location at the same time and at least one of them writes to it. This is a common problem when parallelizing tasks on shared memory architectures.
  • Load balance: The distribution of work (tasks, computations, etc) across multiple threads. OpenMP implicitly assumes equal load balance across all its threads within a parallel region – an assumption that may not always hold true.

This is intended to be a two-part post. Here I will first elaborate on the OpenMP API and how to use basic OpenMP clauses to facilitate parallelization and code speedup. In the second post, I will provide a simple example where I use OpenMP to delineate my code into serial and parallel regions, force serialization, and synchronize memory accesses and writes to prevent errors that result from writing to the same memory location simultaneously.

What is OpenMP?

OpenMP stands for “Open Specification for Multi-Processing”. It is an application programming interface (API) that enables C/C++ or Fortran code to “talk” (or interface) with multiple threads within a single computing node on shared memory architectures, therefore allowing parallelism. It should not be confused with MPI (Message Passing Interface), which is used to develop parallel programs on distributed memory systems that span multiple nodes (for a detailed comparison, please refer to this set of Princeton University slides here). Unlike a standalone coding language such as C/C++, Python, or R, OpenMP is a fixed set of compiler directives, library routines, and environment variables with lightweight syntax (#pragma omp [directive] [clause]). To begin using OpenMP, you should first understand what it can and cannot do.

Things you can do with OpenMP

Explicit delineation of serial and parallel regions

One of the first tasks that OpenMP excels at is the explicit delineation of your code into serial and parallel regions. For example:

#include <stdio.h>
#include <omp.h>
void some_function() {
	// parallel region: every thread in the team runs do_a_parallel_task()
	#pragma omp parallel
	{
		do_a_parallel_task();
	}
	// execution continues serially on a single thread once the region ends
	do_a_serial_task();
}

In the code block above, we are delineating an entire parallel region where the task performed by the do_a_parallel_task() function is performed independently by each thread on a single node. The number of tasks that can be completed at any one time depends on the number of threads that are available on a single node.

Hide stack management

In addition, using OpenMP allows you to automate memory allocation, or “hide stack management”. Most machines are able to read, write, and (de)allocate data to optimize performance, but some help from the user is often required to avoid errors caused by simultaneous memory accesses or writes. OpenMP lets the user hide stack management, without explicitly specifying memory reads or writes to particular cache locations, via straightforward clauses such as collapse() (we will delve into the details of such clauses in later sections of this blog post).
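As a quick illustration, below is a minimal sketch of the collapse() clause (the function name, matrix size, and scaling operation are made up for this example): the two nested loops are folded into a single parallel iteration space, and OpenMP keeps each thread’s loop counters private on its own stack without any manual bookkeeping.

#include <omp.h>

#define N 100

void scale_matrix(double matrix[N][N], double factor) {
    // collapse(2) merges the two nested loops into one iteration space that is
    // split across the thread team; the counters i and j are automatically
    // private to each thread, so no explicit stack management is needed.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            matrix[i][j] *= factor;
        }
    }
}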

Allow synchronization

OpenMP also provides synchronization constructs that enforce serial operations or wait times within parallel regions, avoiding conflicting or simultaneous read and write operations that might otherwise result in erroneous memory accesses or segmentation faults. For example, a sequential task (such as a memory write to update the value in a specific location) that is not carefully performed in parallel could result in output errors or unexpected memory access behavior. Synchronization ensures either that all threads wait until the last operation on the last thread is completed, or that only one thread is allowed to perform an operation at a time. It also protects access to shared data, preventing conflicting simultaneous reads and writes. All this can be achieved via the critical, barrier, or single OpenMP directives.

Creating a specific number of threads

Unless explicitly stated, the system will automatically select the number of threads (or processors) to use for a specific computing task. Note that while you can request a certain number of threads in your environment using export OMP_NUM_THREADS, the number of threads assigned to a specific task is otherwise set by the system. OpenMP allows you to control the number of threads used for a specific task in one of two ways: (a) using omp_set_num_threads(), or (b) adding the num_threads() clause to #pragma omp parallel. The former sets the default number of threads that can be assigned to parallel regions throughout the code, while the latter specifies the number of threads to use for the operations within a particular parallel region.
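To make this concrete, here is a hedged sketch of both approaches (the printed message is purely illustrative):

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(4);   // default team size for the parallel regions that follow

    #pragma omp parallel num_threads(2)   // this particular region uses only 2 threads
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}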

In the next section, we will list some commonly-used and helpful OpenMP constructs and provide examples of their use cases.

OpenMP constructs and their use cases

OMP_NUM_THREADS
This is an environment variable that sets the default number of threads for your program. It can be exported in the shell session from which you launch your executable, or set in the SLURM submission script you are using to execute your parallel code.

omp_get_num_threads()
This function returns the number of threads currently executing the parallel region from which it is called (it returns 1 when called from a serial region). Use it to gauge how many active threads are being used within a parallel region.

omp_set_num_threads()
This function specifies the number of threads to use for the parallel regions that follow. It overrides the OMP_NUM_THREADS environment variable. Be careful not to request more threads than the hardware threads available across your cores; oversubscribing them can degrade performance or cause errors.

omp_get_max_threads()
To avoid potential errors from allocating too many threads, you can use omp_set_num_threads() together with omp_get_max_threads(), which returns the maximum number of threads available to form a new “team” of threads in a parallel region.
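A small sketch of how these runtime functions can be combined (the printout is only illustrative):

#include <stdio.h>
#include <omp.h>

int main() {
    // Cap the team size at whatever the runtime reports as available,
    // rather than hard-coding a thread count.
    omp_set_num_threads(omp_get_max_threads());

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            printf("Parallel region running with %d threads\n",
                   omp_get_num_threads());
        }
    }
    return 0;
}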

#pragma omp parallel
This directive instructs your compiler to create a parallel region within which it instantiates a set (team) of threads to complete the computations within it. Without any further specifications, the parallelization of the code within this block will be automatically determined by the compiler.

#pragma omp for
Use this directive to instruct the compiler to parallelize for-loop iterations within the team of threads that it has instantiated. This directive is often used in combination with #pragma omp parallel to form the #pragma omp parallel for construct which both creates a parallel region and distributes tasks across all threads.

To demonstrate:

#include <stdio.h>
#include <omp.h>
void some_function() {
	#pragma omp parallel
	{
		#pragma omp for
		for (int i = 0; i < some_num; i++) {
			perform_task_here();
		}
	}
}

is equivalent to

#include <stdio.h>
#include <omp.h>
void some_function() {
	#pragma omp parallel for
	for (int i = 0; i < some_num; i++) {
		perform_task_here();
	}
}

#pragma omp barrier
This directive marks a synchronization point, which can be thought of as a “gate” at which all threads must wait until every thread has completed its individual computations. The code in the parallel region beyond the omp barrier will not be executed until all threads before it have completed their assigned tasks.
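For example, a minimal sketch (the two “phases” below are placeholders for real work):

#include <stdio.h>
#include <omp.h>

int main() {
    int partial[4] = {0};

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        partial[tid] = tid * tid;      // phase 1: each thread writes its own slot

        #pragma omp barrier            // wait until every thread finishes phase 1

        if (tid == 0) {                // phase 2: now safe to read every slot
            int total = 0;
            for (int i = 0; i < omp_get_num_threads(); i++)
                total += partial[i];
            printf("Sum of partial results: %d\n", total);
        }
    }
    return 0;
}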

#pragma omp single
This is the first of three synchronization directives we will explore in this blog post. It identifies a section of code, typically contained within “{ }”, that will be run by only one thread. This enforces serial execution, which can improve correctness but decreases speed. The omp single directive imposes an implicit barrier: the threads that skip the single block wait at its end until the one thread running it is done. This barrier can be removed by forming the #pragma omp single nowait construct.
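A brief sketch of the directive in action (the printed messages are placeholders):

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp single
        {
            // Run by exactly one thread (whichever arrives first); the implicit
            // barrier at the end of this block holds the other threads here.
            printf("One-time setup done by thread %d\n", omp_get_thread_num());
        }
        printf("Thread %d continues past the single region\n", omp_get_thread_num());
    }
    return 0;
}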

#pragma omp master
This directive is similar to #pragma omp single, but the section is run only by the master thread (the thread responsible for creating, managing, and discarding all other threads). Unlike #pragma omp single, it does not impose an implicit barrier, resulting in faster implementations. Both #pragma omp single and #pragma omp master result in a section of code being run only once (or only by one thread), and both are useful for controlling sections of code that print statements or signal progress.
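The analogous sketch for the master thread (again with placeholder print statements):

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp master
        {
            // Only thread 0 executes this block, and there is no implicit
            // barrier, so the other threads move on immediately.
            printf("Progress message printed by the master thread only\n");
        }
        printf("Thread %d keeps working without waiting\n", omp_get_thread_num());
    }
    return 0;
}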

#pragma omp critical
A section of code immediately following this directive can be executed by only one thread at a time. It should not be confused with #pragma omp single: a critical section is executed by every thread in the team (i.e., as many times as there are threads), just never by two threads simultaneously, whereas a single section is executed only once. Use this directive to prevent race conditions and enforce serial execution where one-at-a-time computations are required.
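As a minimal sketch (the shared counter is illustrative), every thread executes the critical block, but only one at a time, which prevents a race condition on the shared variable:

#include <stdio.h>
#include <omp.h>

int main() {
    int shared_counter = 0;

    #pragma omp parallel
    {
        #pragma omp critical
        {
            // Executed once per thread, but never by two threads at once.
            shared_counter++;
        }
    }
    printf("Counter after the parallel region: %d\n", shared_counter);
    return 0;
}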

Things you cannot (and shouldn’t) do with OpenMP

As all good things come to an end, all HPC APIs must also have limitations. Here is a quick rundown of OpenMP’s:

  • OpenMP only functions on shared memory machines and cannot, by itself, parallelize across distributed memory architectures.
  • It also implicitly assumes equal load balance. When this assumption does not hold, threads that finish early sit idle at the next synchronization directive while slower threads catch up.
  • OpenMP may have lower parallel efficiency because it is confined to a single shared memory node, and its speedup is ultimately limited by Amdahl’s law.
  • It relies heavily on parallelizable for-loops instead of atomic operations.

    Summary

    In this post, we introduced OpenMP and a few of its key constructs. We also walked through brief examples of their use cases, as well as limitations of the use of OpenMP. In our next blog post, we will provide a tutorial on how to write, compile, and run a parallel for-loop on Cornell’s Cube Cluster. Until then, feel free to take a tour through the Cornell Virtual Workshop course on OpenMP to get a head start!


    Parallelization of C/C++ and Python on Clusters

    There are multiple ways of parallelizing C/C++ and Python codes to be used on clusters. The quick and dirty way, which often gets the job done, is to parallelize the code with MPI and launch one process per core across multiple nodes. Though easy to implement, this strategy may result in a fair waste of resources because:

    • With too many processes on one node, inefficient use of memory may make the code memory bottle-necked, meaning that a lot of time will be spent accessing memory and the cores will be sitting twiddling their thumbs.
    • If you have a 48-core node with 192 GB of memory (like the Skylake nodes on Stampede2) and your code uses 10 GB of RAM per process, you will not be able to have more than 19 of the 48 cores working at the same time, because using more would exceed the node’s memory, crash the node, and get you kicked out of the cluster for a while without warning — if you do not know how much RAM your code requires, you should look into the Remora tool.

    If you are working on big runs on systems that charge per usage time or if you are creating a code you will have to run multiple times, it may be worth looking into more clever parallelization strategies. Some of the techniques described in the next subsections will also make your code two to four times faster on your desktop/laptop and are rather straight-forward to implement. Most of the code in this blog post can be freely downloaded from this repository. Lastly, if the #include statements in the examples below are not displayed as followed by a library between “<>”, please refer to the mentioned online repository.

    Hybrid Parallelization in C/C++

    The word hybrid refers in this context to a mixed use of shared and distributed memory parallelization techniques. In distributed memory parallelization techniques, such as the Message Passing Interface (MPI), each process has its own dedicated chunk of memory within a RAM chip. This means that each process will have its own copy of the input data (repeated across processes), of the variables within the program, and so on. Conversely, shared memory parallelization techniques such as OpenMP allow all threads (the parallel units of computation) to access the same block of memory, which can make a mess in your code, with variables being changed seemingly out of nowhere by various parallel runs, but which lessens the amount of RAM needed and in some cases allows for more efficient memory utilization.

    If running the Borg Master-Slave multiobjective evolutionary optimization algorithm (Borg MS) with hybrid parallelization, Borg MS will have each function evaluation performed by one MPI process, which could itself run in parallel with OpenMP (if your model is set up for that) across the cores designated for that process. As soon as a function evaluation is complete, the slave MPI process in charge of it will submit the results back to Borg MS’s master process and be given another set of decision variables to work on. The figure below is a schematic drawing of such a hybrid parallelization scheme:

    BorgHybridParallelization.png

    OpenMP

    Parallelizing a for-loop in C or C++ in which each iteration is independent — such as in Monte Carlo simulations — is straight-forward:

    #include <omp.h>
    #include <vector>
    #include "path/to/my_model.h"
    
    using namespace std;
    
    int main(int argc, char *argv[])
    {
        // omp_get_max_threads() is used here because omp_get_num_threads()
        // returns 1 when called outside a parallel region.
        int n_monte_carlo_simulations = omp_get_max_threads(); // One model run per core.
        vector<double> system_inputs; // fill with your model inputs (type and values elided in the original post)
        vector<double> results(n_monte_carlo_simulations, 0);
    #pragma omp parallel for
        for (int i = 0; i < n_monte_carlo_simulations; ++i)
        {
            results[i] = RunMonteCarloSimulation(system_inputs[i]);
        }
    }
    

    The only caveat is that you have to be sure that the independent simulations are not writing over each other. For example, if by mistake you had written results[0] instead of results[i], the final value of results[0] would be that of whichever thread (parallel simulation) finished last, while the rest of the vector would be zeros. It is important to notice here that the vector system_inputs is read only once by the parallel code and all n_monte_carlo_simulations threads have access to it in memory.
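    If, instead of storing one result per simulation, you wanted to combine all simulations into a single number (say, their sum), one way to avoid threads racing on a shared accumulator is OpenMP’s reduction clause. Below is a hedged sketch of the idea — the loop body is a stand-in for a call like RunMonteCarloSimulation(), not the actual model:
    
    #include <stdio.h>
    #include <omp.h>
    
    int main()
    {
        const int n_simulations = 100;
        double total = 0.0;
    
        // Each thread accumulates into its own private copy of "total";
        // OpenMP safely combines the copies at the end of the loop.
    #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < n_simulations; ++i)
        {
            total += i * 0.5; // stand-in for the result of one simulation
        }
        printf("Total across all simulations: %f\n", total);
        return 0;
    }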

    Though this example is not rocket science, there is a lot more to OpenMP that can be found here and here.

    MPI

    MPI stands for Message Passing Interface. It creates processes which pass messages (information packages) to each other. The most basic MPI code is one that creates independent processes — each with its own allocated chunk of memory that cannot be accessed by any other process — and just has them running independently in parallel without communicating with each other. This is great to distribute Monte Carlo runs across various nodes, although it may lead to the limitations mentioned in the beginning of this post. Below is an example of a very basic MPI code to run Monte Carlo simulations in parallel:

    #include <mpi.h>
    #include <vector>
    #include "path/to/my_model.h"
    
    using namespace std;
    
    int main(int argc, char *argv[])
    {
        MPI_Init(NULL, NULL);
    
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    
        vector<double> system_inputs; // fill with your model inputs (left out in the original post)
        RunMonteCarloSimulation(system_inputs[world_rank]);
    
        MPI_Finalize();
        return 0;
    }
    

    I will not cover in this post how to pass the output of the RunMonteCarloSimulation function back to a results vector, as in the previous example, because the details of MPI are outside the scope of this post — a good MPI tutorial can be found here and great examples, including ones for Monte Carlo runs, can be found here. To run this code, you would have to compile it with mpicc mpi_demo.c -o MPIDemo and call it with the mpirun or mpiexec commands followed by the -n flag to specify the number of processes to be created, e.g. mpirun -n 4 ./MPIDemo. The code above would run as many Monte Carlo simulations as there are processes (four processes if -n 4 was used), each with a rank (here, a type of ID), and can be spread across as many nodes as you can get on a cluster. Each process of the code above would run on a core, which may result in the memory and/or processing limitations listed in the beginning of this post.

    Mixing OpenMP and MPI

    The ideal situation for most cases is when you use MPI to distribute processes across nodes and OpenMP to make sure that as many cores within a node as it makes sense to use are being used. To do this, you can use the structure of the code below.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <vector>
    #include "path/to/my_model.cpp"
    
    using namespace std;
    
    int main(int argc, char *argv[])
    {
        MPI_Init(NULL, NULL);
    
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    
        int n_cores = omp_get_max_threads(); // cores available to this MPI process (omp_get_num_threads() returns 1 outside a parallel region).
        vector<double> system_inputs; // fill with your model inputs (left out in the original post)
        vector<double> results(n_cores, 0);
    #pragma omp parallel for
        for (int i = 0; i < n_cores ; ++i)
        {
            // printf("C hybrid. Process-thread: %d-%d\n", world_rank, omp_get_thread_num());
            results[i] = RunMonteCarloSimulation(system_inputs[i]);
        }
    
        MPI_Finalize();
        return 0;
    
    }
    

    If you comment out the #include of my_model.cpp and the results[i] assignment (and any other model-specific lines, such as the vector declarations), uncomment the printf line, compile the code with mpicc hybrid.c -o Hybrid -fopenmp, set the number of threads in the terminal to eight by running export OMP_NUM_THREADS=8, and call the resulting executable with the command mpirun -n 4 ./Hybrid, the executable will spawn four processes, each of which spawns eight threads and prints the following output:

    C hybrid. Process-thread: 0-0
    C hybrid. Process-thread: 0-1
    C hybrid. Process-thread: 0-4
    C hybrid. Process-thread: 0-5
    C hybrid. Process-thread: 0-7
    C hybrid. Process-thread: 0-6
    C hybrid. Process-thread: 0-2
    C hybrid. Process-thread: 0-3
    C hybrid. Process-thread: 1-6
    C hybrid. Process-thread: 1-0
    C hybrid. Process-thread: 1-1
    C hybrid. Process-thread: 1-3
    C hybrid. Process-thread: 1-5
    C hybrid. Process-thread: 1-4
    C hybrid. Process-thread: 1-7
    C hybrid. Process-thread: 1-2
    C hybrid. Process-thread: 2-6
    C hybrid. Process-thread: 2-0
    C hybrid. Process-thread: 2-3
    C hybrid. Process-thread: 2-7
    C hybrid. Process-thread: 2-2
    C hybrid. Process-thread: 2-4
    C hybrid. Process-thread: 2-5
    C hybrid. Process-thread: 2-1
    C hybrid. Process-thread: 3-0
    C hybrid. Process-thread: 3-1
    C hybrid. Process-thread: 3-3
    C hybrid. Process-thread: 3-4
    C hybrid. Process-thread: 3-6
    C hybrid. Process-thread: 3-7
    C hybrid. Process-thread: 3-2
    C hybrid. Process-thread: 3-5
    

    The code above can be used to have one process per node (and therefore one copy of system_inputs per node) using as many threads as there are cores in that node — depending on the number of cores, it is sometimes more efficient to have two or three MPI processes per node, each using half or a third of the cores in that node. Below is a PBS job submission script that can be used to generate the output above — job submission scripts for Slurm to be used on Stampede 2 can be found here.

    #!/bin/bash
    #PBS -N test_hybrid
    #PBS -l nodes=2:ppn=16
    #PBS -l walltime=00:00:30
    #PBS -o ./output/hybrid_c.out
    #PBS -e ./error/hybrid_c.err
    module load openmpi-1.10.7-gnu-x86_64
    export OMP_NUM_THREADS=8
    
    set -x
    cd "$PBS_O_WORKDIR"
    
    # Construct a copy of the hostfile with only TASKS_PER_NODE entries per node.
    # MPI can use this to run that many tasks on each node.
    export TASKS_PER_NODE=2
    uniq "$PBS_NODEFILE" | awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i<TASKS_PER_NODE;i++) print}' > nodefile
    
    #cat nodefile
    mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS ./Hybrid
    

    Parallelizing Python on a Cluster

    Python was not designed with shared-memory parallelization in mind, so its native parallelization library, the Multiprocessing library, does distributed-memory parallelization instead of shared. The issue with the Multiprocessing library is that it does not work across nodes, so you still need something like MPI. Here, we will use MPI4Py.

    The code below parallelizes the function call across cores within the same node — please refer to this blog post for details about and other ways to use the Multiprocessing library.

    from multiprocessing import Process, cpu_count
    
    def do_something_useful(rank, shared_process_number):
        # Do something useful here.
        print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number)
    
    # Create node-local processes
    shared_processes = []
    #for i in range(cpu_count()):
    for i in range(8):
        p = Process(target=do_something_useful, args=(0, i))  # no MPI rank in this node-local example, so 0 is passed as a placeholder
        shared_processes.append(p)
    
    # Start processes
    for sp in shared_processes:
        sp.start()
    
    # Wait for all processes to finish
    for sp in shared_processes:
        sp.join()
    

    By adding MPI4Py to the code above, following the same logic of the C/C++ example, we get the following result:

    from mpi4py import MPI
    from multiprocessing import Process, cpu_count
    
    def do_something_useful(rank, shared_process_number):
        # Do something useful here.
        print 'Python hybrid, MPI_Process-local_process (not quite a thread): {}-{}'.format(rank, shared_process_number)
    
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    
    print 'Calling Python multiprocessing process from Python MPI rank {}'.format(rank)
    
    # Create shared-ish processes
    shared_processes = []
    #for i in range(cpu_count()):
    for i in range(8):
        p = Process(target=do_something_useful, args=(rank, i))
        shared_processes.append(p)
    
    # Start processes
    for sp in shared_processes:
        sp.start()
    
    # Wait for all processes to finish
    for sp in shared_processes:
        sp.join()
    
    comm.Barrier()
    

    Calling the code above with mpirun -n 4 python hybrid_pure_python.py will first result in four MPI processes being spawned with ranks 0 through 3, each spawning eight local processes. The output should look similar to the one below:

    Calling Python multiprocessing process from Python MPI rank 0
    Calling Python multiprocessing process from Python MPI rank 1
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-0
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-0
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-1
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-1
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-2
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-2
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-3
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-3
    Calling Python multiprocessing process from Python MPI rank 2
    Calling Python multiprocessing process from Python MPI rank 3
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-4
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-4
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-5
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-5
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-6
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-6
    Python hybrid, MPI_Process-local_process (not quite a thread): 1-7
    Python hybrid, MPI_Process-local_process (not quite a thread): 0-7
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-0
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-0
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-1
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-1
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-2
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-2
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-3
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-3
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-4
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-4
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-5
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-5
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-6
    Python hybrid, MPI_Process-local_process (not quite a thread): 3-7
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-6
    Python hybrid, MPI_Process-local_process (not quite a thread): 2-7
    

    Below is a PBS submission script that can be used with the code above:

    #!/bin/bash
    #PBS -N test_hybrid
    #PBS -l nodes=2:ppn=16
    #PBS -l walltime=00:01:00
    #PBS -o ./output/hybrid_python.out
    #PBS -e ./error/hybrid_python.err
    module load python-2.7.5
    export OMP_NUM_THREADS=8
    
    set -x
    cd "$PBS_O_WORKDIR"
    
    # Construct a copy of the hostfile with only TASKS_PER_NODE entries per node.
    # MPI can use this to run that many tasks on each node.
    export TASKS_PER_NODE=2
    uniq "$PBS_NODEFILE" | awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i<TASKS_PER_NODE;i++) print}' > nodefile
    
    #cat nodefile
    mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_pure_python.py
    

    Parallelizing Monte Carlo Runs of C/C++ (or anything else) with Python

    Very often we have to run the same model dozens to thousands of times for Monte-Carlo-style analyses. One option is to go on a cluster and submit dozens to thousands of jobs, but this becomes a mess for the scheduler to manage, and cluster managers tend not to be very fond of such an approach. A cleaner way of performing Monte Carlo runs in parallel is to write a code in Python or another language that spawns MPI processes which then call the model to be run. This can be done in Python with the subprocess library, as in the example below:

    from mpi4py import MPI
    import subprocess
    
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    
    print 'Calling OpenmpTest from Python MPI rank {}'.format(rank)
    function_call = ['./OpenmpTest', '{}'.format(rank)]
    subprocess.call(function_call)
    
    comm.Barrier()
    

    Here, OpenmpTest is an executable compiled from the C-OpenMP code below:

    #include <stdio.h>
    #include <omp.h>
    
    int main(int argc, char *argv[])
    {
    #pragma omp parallel for
        for (int i = 0; i < omp_get_num_threads(); ++i)
        {
            printf("C OpenMP. Process-thread: %s-%d\n", argv[1], i);
        }
        return 0;
    }
    

    Submitting the code above with the following PBS job submission script should result in the output below.

    #!/bin/bash
    #PBS -N test_hybrid
    #PBS -l nodes=2:ppn=16
    #PBS -l walltime=00:01:00
    #PBS -o ./output/hybrid_python_c.out
    #PBS -e ./error/hybrid_python_c.err
    module load python-2.7.5
    export OMP_NUM_THREADS=8

    set -x
    cd "$PBS_O_WORKDIR"

    # Construct a copy of the hostfile with only TASKS_PER_NODE entries per node.
    # MPI can use this to run that many tasks on each node.
    export TASKS_PER_NODE=2
    uniq "$PBS_NODEFILE" | awk -v TASKS_PER_NODE="$TASKS_PER_NODE" '{for(i=0;i<TASKS_PER_NODE;i++) print}' > nodefile

    #cat nodefile
    mpiexec --hostfile nodefile -n 4 -x OMP_NUM_THREADS python hybrid_calls_c.py

    Output:

    Calling OpenmpTest from Python MPI rank 0
    Calling OpenmpTest from Python MPI rank 1
    Calling OpenmpTest from Python MPI rank 2
    Calling OpenmpTest from Python MPI rank 3
    C OpenMP. Process-thread: 0-7
    C OpenMP. Process-thread: 0-5
    C OpenMP. Process-thread: 0-0
    C OpenMP. Process-thread: 0-2
    C OpenMP. Process-thread: 0-4
    C OpenMP. Process-thread: 0-1
    C OpenMP. Process-thread: 0-3
    C OpenMP. Process-thread: 0-6
    C OpenMP. Process-thread: 1-1
    C OpenMP. Process-thread: 1-2
    C OpenMP. Process-thread: 1-6
    C OpenMP. Process-thread: 1-4
    C OpenMP. Process-thread: 1-5
    C OpenMP. Process-thread: 1-0
    C OpenMP. Process-thread: 1-3
    C OpenMP. Process-thread: 1-7
    C OpenMP. Process-thread: 2-3
    C OpenMP. Process-thread: 2-7
    C OpenMP. Process-thread: 2-0
    C OpenMP. Process-thread: 2-4
    C OpenMP. Process-thread: 2-6
    C OpenMP. Process-thread: 2-5
    C OpenMP. Process-thread: 2-1
    C OpenMP. Process-thread: 2-2
    C OpenMP. Process-thread: 3-5
    C OpenMP. Process-thread: 3-1
    C OpenMP. Process-thread: 3-4
    C OpenMP. Process-thread: 3-2
    C OpenMP. Process-thread: 3-7
    C OpenMP. Process-thread: 3-3
    C OpenMP. Process-thread: 3-0
    C OpenMP. Process-thread: 3-6
    

    End of the Story

    As highlighted in the examples above (all of which can be downloaded from here), creating parallel code is not rocket science and requires very few lines of code. The OpenMP and Multiprocessing parallelization tricks can save hours if you have to run a model several times on your personal computer, and paired with MPI these tricks can save you from hours to days in time and thousands of node-hours on clusters. I hope you make good use of them.

    Making Valgrind Easy

    Some of this blog’s readers and authors (most notably, Joe Kasprzik) read the title of this post and thought, “wait, there already is a post about Valgrind on this blog.” And you are right, so in this blog post I will build on the legacy Joe has left us in his post about Valgrind and get into the details of how to use its basic functionalities to get your code right.

    Common mistakes when coding in C++

    Suppose we have the following code:

    #include <stdio.h>
    
    int main() {
        int *var = new int[5]; // you wouldn't do this if the size was always 5, but this makes the argument clear.
        int n = 5;
        int m;
    
        if (m > n) {
            printf("Got into if statement.\n");
            for (int i = 0; i < 6; ++i) {
                var[i] = i;
            }
        }
    
        printf("var[5] equals %d\n", var[n]);
    }
    

    Saving the code above in a file called test.cpp, compiling it with g++ to create an executable called "test," and running it with "./test" will return the following output:

    bernardoct@DESKTOP-J6145HK ~
    $ g++ test.cpp -o test
    
    bernardoct@DESKTOP-J6145HK ~
    $ ./test
    Got into if statement.
    var[5] equals 5
    

    Great, it ran and did not crash (in such a simple code, g++'s -Wall flag would have issued a warning saying m was not initialized, but in more complex code such a warning may not be issued). However, it would be great if this code had crashed, because that would make us look into it and figure out it actually has three problems:

    1. We did not assign a value to variable m (it was created but not initialized), so how did the code determine that m was greater than n to get into the code inside the if statement?
    2. The pointer array var was created with length 5, meaning its elements are numbered 0 to 4. If the for-loop runs from 0 to 5 but element 5 does not exist, how did the code fill it in with the value of variable i when i was 5 in the loop? From the printf statement that returned 5 we know var[5] equals 5.
    3. The pointer array var was not destroyed after the code no longer needed it. This is not necessarily a problem in this case, but if this were a function that is supposed to be called over and over within a model, there is a chance the RAM would be filled with these seemingly inoffensive pointer arrays and the computer would freeze (or the node, if running on a cluster, would possibly crash and have to be rebooted).

    Given that C++ will not crash even in the presence of such errors, one way of making sure your code is clean is by running it through Valgrind. However, most people who have used Valgrind on a model with a few hundred or thousand lines of code have gotten discouraged by its possibly long and cryptic-looking output. Do not let this intimidate you, because the output is actually fairly easy to read once you either learn what to look for or use Valkyrie, a graphical user interface for Valgrind.

    Generating and interpreting Valgrind’s output

    The first thing that needs to be done for Valgrind to give you a meaningful output is to re-compile your code with the -O0 and -g flags, the former to prevent the compiler from modifying your code to make it more efficient but unintelligible to Valgrind (or to debuggers), and the latter for Valgrind (and debuggers) to be able to pinpoint the lines of code where issues happen and originate. Therefore, the code should be compiled as shown below:

    bernardoct@DESKTOP-J6145HK ~
    $ g++ -O0 -g test.cpp -o test
    
    

    Now it is time to run your code with Valgrind to perform some memory checking. Valgrind itself takes flags that dictate the type of analysis to be performed. Here we are interested in checking memory misuse (instead of profiling, checking for thread safety, etc.), so the first flag (not strictly required, since memcheck is the default tool, but good to keep things explicit) should be --tool=memcheck. Now that we have specified that we want Valgrind to run a memory check, we should specify that we want it to look in detail for memory leaks and tell us where the errors are happening and originating, which can be done by passing the flags --leak-check=full and --track-origins=yes. This way, the complete call to run Valgrind on our test program is:

    bernardoct@DESKTOP-J6145HK ~
    $ valgrind --tool=memcheck --leak-check=full --track-origins=yes ./test
    

    Important: Beware that your code will take orders of magnitude longer to run with Valgrind than it would otherwise. This means that you should run something as small as possible but still representative — e.g. instead of running your stochastic model with 1,000 realizations and a simulation time of 50 years, consider running 2 realizations simulating 2 years, so that Valgrind analyzes the year-long simulation and the transitions between realizations and years. Also, if running your code on a cluster, load the Valgrind module with module load valgrind-xyz in your submission script and replace the call to your model in the submission script by the Valgrind call above; you can find the exact name of the Valgrind module by running module avail on the terminal. If running Valgrind with a code that uses MPI, use mpirun valgrind ./mycode -flags.

    When called, Valgrind will instrument the execution of our test program and, based on the collected information, print the following on the screen:

    ==385== Memcheck, a memory error detector
    ==385== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
    ==385== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
    ==385== Command: ./test
    ==385==
    ==385== Conditional jump or move depends on uninitialised value(s)
    ==385==    at 0x4006A9: main (test.cpp:9)
    ==385==  Uninitialised value was created by a stack allocation
    ==385==    at 0x400686: main (test.cpp:3)
    ==385==
    Got into if statement.
    ==385== Invalid write of size 4
    ==385==    at 0x4006D9: main (test.cpp:12)
    ==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
    ==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==385==    by 0x400697: main (test.cpp:5)
    ==385==
    ==385== Invalid read of size 4
    ==385==    at 0x4006F5: main (test.cpp:16)
    ==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
    ==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==385==    by 0x400697: main (test.cpp:5)
    ==385==
    var[5] equals 5
    ==385==
    ==385== HEAP SUMMARY:
    ==385==     in use at exit: 20 bytes in 1 blocks
    ==385==   total heap usage: 3 allocs, 2 frees, 73,236 bytes allocated
    ==385==
    ==385== 20 bytes in 1 blocks are definitely lost in loss record 1 of 1
    ==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==385==    by 0x400697: main (test.cpp:5)
    ==385==
    ==385== LEAK SUMMARY:
    ==385==    definitely lost: 20 bytes in 1 blocks
    ==385==    indirectly lost: 0 bytes in 0 blocks
    ==385==      possibly lost: 0 bytes in 0 blocks
    ==385==    still reachable: 0 bytes in 0 blocks
    ==385==         suppressed: 0 bytes in 0 blocks
    ==385==
    ==385== For counts of detected and suppressed errors, rerun with: -v
    ==385== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)
    

    Seeing Valgrind’s output be five times as long as the test code itself can be somewhat disheartening, but the information contained in the output is really useful. The first block of the output is the header; it will always be printed so that you know the version of Valgrind you have been using, the command it ran for your own code, and so on. In our example, the header is:

    ==385== Memcheck, a memory error detector
    ==385== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
    ==385== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
    ==385== Command: ./test
    

    After that, Valgrind reports the errors it found during the execution of your code. Errors are always reported as a description of the error in good old English, followed by where it happened in your code. Let’s look at the first error found by Valgrind:

    ==385== Conditional jump or move depends on uninitialised value(s)
    ==385==    at 0x4006A9: main (test.cpp:9)
    ==385==  Uninitialised value was created by a stack allocation
    ==385==    at 0x400686: main (test.cpp:3)
    

    This tells us that there is an if statement (conditional statement) on line 9 of test.cpp in which at least one of the sides of the logical test has at least one uninitialized variable. As pointed out by Valgrind, line 9 of test.cpp has our problematic if statement which compares initialized variable n to uninitialized variable m, which will have whatever was put last in that memory address by the computer.

    The second error block is the following:

    ==385== Invalid write of size 4
    ==385==    at 0x4006D9: main (test.cpp:12)
    ==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
    ==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==385==    by 0x400697: main (test.cpp:5)
    

    This means that your code is writing to a location in memory that it did not allocate for its use. This block says that the illegal write, so to speak, happened on line 12 of test.cpp through a variable created on line 5 of test.cpp using the new[] operator. These lines correspond to var[i] = i; and to int *var = new int[5];. With this, we learn that either var was created too short on line 5 of test.cpp or that the for loop that assigns values to var goes one or more steps too far.

    Similarly, the next block tells us that our printf statement used to print the value of var[5] on the screen has read past the amount of memory that was allocated to var in its declaration on line 5 of test.cpp, as shown below:

    ==385== Invalid read of size 4
    ==385==    at 0x4006F5: main (test.cpp:16)
    ==385==  Address 0x5ab4c94 is 0 bytes after a block of size 20 alloc'd
    ==385==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==385==    by 0x400697: main (test.cpp:5)
    

    The last thing Valgrind will report is the information about memory leaks, which are accounted for when the program is done running. The output about memory leaks for our example is:

    ==409== HEAP SUMMARY:
    ==409==     in use at exit: 20 bytes in 1 blocks
    ==409==   total heap usage: 3 allocs, 2 frees, 73,236 bytes allocated
    ==409==
    ==409== 20 bytes in 1 blocks are definitely lost in loss record 1 of 1
    ==409==    at 0x4C2E8BB: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==409==    by 0x400697: main (test.cpp:5)
    ==409==
    ==409== LEAK SUMMARY:
    ==409==    definitely lost: 20 bytes in 1 blocks
    ==409==    indirectly lost: 0 bytes in 0 blocks
    ==409==      possibly lost: 0 bytes in 0 blocks
    ==409==    still reachable: 0 bytes in 0 blocks
    ==409==         suppressed: 0 bytes in 0 blocks
    

    The important points to take away from this last block are that:

    1. there were 20 bytes of memory leaked, meaning that if this were a function in your code, every time it was run it would leave 20 bytes of garbage sitting in the RAM. This may not sound like a big deal, but imagine if your code leaves 1 MB of garbage in the RAM for each of the 100,000 times a function is called. That is 100 GB of RAM gone, along with everything else you were doing on your computer at the time, because the computer will likely freeze and have to go through a hard reset.
    2. the memory you allocated and did not free was allocated on line 5 of test.cpp, where you used the operator new[] to allocate the integer pointer array.

    It is important to notice here that if we increased the amount of memory allocated by the new[] operator on line 5 from 5 to 6 integers, the last two errors (invalid read and invalid write) would disappear. This means that if you run your code with Valgrind and see hundreds of errors, chances are that modifying just a few lines of code will get rid of most of them.
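    For reference, a possible corrected version of test.cpp that addresses all three problems could look like the sketch below: m is initialized before being tested, the array is value-initialized and the loop stays within bounds, and the memory is released with delete[] at the end.
    
    #include <stdio.h>
    
    int main() {
        int n = 5;
        int m = 0;                        // problem 1: initialize m before testing it
        int *var = new int[n]();          // value-initialize so no element is read uninitialized
    
        if (m > n) {
            printf("Got into if statement.\n");
            for (int i = 0; i < n; ++i) { // problem 2: loop stays within indices 0 to n-1
                var[i] = i;
            }
        }
    
        printf("var[%d] equals %d\n", n - 1, var[n - 1]); // last valid element is var[n-1]
        delete[] var;                     // problem 3: free the memory allocated with new[]
        return 0;
    }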

    Valkyrie — a graphical user interface for Valgrind

    Another way of going through Valgrind’s output is by using Valkyrie (now installed in the login node of Reed’s cluster, The Cube). If you are analyzing your code from your own computer with a Linux terminal (does not work with Cygwin, but you can install a native Ubuntu terminal on Windows 10 by following instructions posted here) and do not have Valkyrie installed yet, you can install it by running the following on your terminal:

    bernardoct@DESKTOP-J6145HK ~
    $ sudo apt-get install valkyrie
    

    Valkyrie works by reading an xml file exported by Valgrind containing the information about the errors it found. To export this file, you need to pass the flags --xml=yes and --xml-file=valgrind_output.xml (or whatever name you want to give the file) to Valgrind, which would make the call to Valgrind become:

    bernardoct@DESKTOP-J6145HK ~
    $ valgrind --tool=memcheck --leak-check=full --track-origins=yes --xml=yes --xml-file=valgrind_output.xml ./test
    

    Now you should have a file called “valgrind_output.xml” in the directory you are calling Valgrind from. To open it with Valkyrie, first open Valkyrie by typing valkyrie in your terminal. If on Windows 10, you need to have Xming installed and running, which can be done by following the instructions at the end of this post. If on a cluster, besides having Xming open you also have to have ssh’ed into the cluster with the -X flag (e.g. by running ssh -X username@my.cluster.here) from either Cygwin or a native Linux terminal. After opening Valkyrie, click on the green folder button and select the xml file, as in the screenshot below.

    valkyrie_screenshot.png

    After opening the xml file generated by Valgrind, Valkyrie should look like the screenshot below:
    
    valkyrie_screenshot2.png

    Now you can begin from a collapsed list of errors and unfold each error to see its details. Keep in mind that Valkyrie is not your only GUI option for Valgrind, as IDEs like JetBrains’ CLion and Qt Creator come with integrated Valgrind support. Now go check your code!

    PS: Thanks to the folks on Reddit for the comments which helped improve this post.

    A completely non-exhaustive list of tutorial resources for scientific computing

    This is a short blog post to put together a list of resources with tutorials (similar to what’s usually found on this blog) for various programming languages. It is by no means exhaustive, so please comment if you feel there’s an important one I left out.

    Matlab:

    https://www.arnevogel.com/

    Kinda new blog with Matlab tutorials on numerical methods

    https://blogs.mathworks.com/loren/

    One of the MathWorks blogs, great tutorials.

    https://www.mathworks.com/support/learn-with-matlab-tutorials.html

    Matlab tutorials from MathWorks

    matlab

    More Matlab tutorials, a lot of material on many topics

    Python:

    https://glowingpython.blogspot.com/

    Not updated very frequently, but good data analysis and visualization tutorials are posted

    https://pythonprogramming.net/

    Updated regularly, some great data visualization and analysis tutorials. The tutorials come with videos. I really like this site.

    http://treyhunner.com/

    Updated regularly (about once a month) with python tutorials and general python coding insights. I really like the writing style of this author. 

    https://docs.scipy.org/doc/scipy/reference/tutorial/

    Tutorials on using SciPy

    C and C++:

    https://www.cprogramming.com/

    C and C++ programming tutorials, tips and tricks

    http://www.cplusplus.com/articles/

    Not really updated anymore but some good basic tutorials are listed

    https://blog.knatten.org/

    Hadn’t been updated in a while, but it looks like it’s been picked up again. Good for general C++ programming and good practice.

    http://www.bfilipek.com/

    C++ tutorials

    General:

    https://towardsdatascience.com/

    This is a great general resource not devoted to a particular language. They cover topics in data science, machine learning and programming in general. Some great tutorials for Python and R. 

    https://projecteuler.net/

    Mathematical programming problems to help with learning any language

    https://github.com/EbookFoundation/free-programming-books/blob/master/free-programming-books.md

    Free programming books repository

    Reddits (some of the bigger ones):

    /r/matlab (General on matlab, community provides help with coding)

    /r/programming (General on programming)

    /r/learnprogramming (Community provides help with debugging questions)

    /r/python (General on python, community provides help with coding)

    /r/learnpython (Community provides help with python questions, smaller than /r/python)

    /r/cpp (General on C++)

    /r/cpp_questions (Community provides help with C++ questions)

    I’ve also recently made /r/sci_comp which has very little activity for the moment, but the aim is to create a community with general resources on coding for scientific applications.

     

    Simple tricks to make your C/C++ code run faster

    We engineers with no formal computer science training tend, for a myriad of good reasons, to favor friendly programming languages such as Python over evil C/C++. However, when our application is performance sensitive and we cannot wait for results sitting in our chairs for as long as a Python code would require us to, we are sometimes forced to use C/C++. Why not make the most of it when that is the case?

    Here are some simple tips to make your C/C++ code run even faster, and how to get some advice about further performance improvements. The last idea (data locality) is transferable to Python and other languages.

    Most important trick

    Improve your algorithm. Thinking about whether there is a simpler way of doing what you coded may reduce your algorithm’s complexity (say, from n^3 to n*log(n)), which would:

    • yield great speed-up when you have lots of data or need to run a computation several times in a row, and
    • make you look smarter.

    Compiler flags

    First and foremost, let those who know what they are doing — compiler developers — do the work for you by calling the compiler with the appropriate flags. There is an incredible number of flags you can pass, but I would say that the ones you should have on whenever possible are -O3 and -march=native.

    The optimization flags (-O1 to -O3, the latter more aggressive than the former) will perform a series of modifications on your code behind the scenes to speed it up, sometimes by more than an order of magnitude. The issue is that these modifications may occasionally make your code behave differently than you expected, so it is always good to do a few smaller runs with -O0 and -O3 and compare their results before getting into production mode.

    The -march=native flag will make the compiler fine-tune your code to the processor it is being compiled on (conversely, -march=haswell would fine-tune it to Haswell architectures, for example). This is great if you are only going to run your code on your own machine or on another machine known to have a similar architecture, but if you try to run the binary on a machine with a different architecture, especially an older one, you may end up with illegal instruction errors.

    Restricted pointer array

    When declaring a pointer array that you are sure will not be subject to pointer aliasing — namely, there will be no other pointer pointing to the same memory address — you can declare it as a restrict pointer, as below:

    • GCC: double* __restrict__ my_pointer_array
    • Intel compiler: double* restrict my_pointer_array

    This lets the compiler know that it can change the order of certain operations involving my_pointer_array to make your code faster, without the risk of changing a read/write order that would change your results. If you want to use the restrict qualifier with the Intel compiler, the flag -restrict must be passed as an argument to the compiler.

    Aligned pointer array

    By aligning an array, you are making sure the data lies in the ideal location in memory for the processor to fetch it and perform the calculations it needs based on that array. To help your compiler optimize your data alignment, you need to (1) align your array to a specific number of bytes when it is declared and (2) tell the compiler the array is aligned when using it to perform calculations — the compiler has no idea whether or not arrays received as arguments in a function are aligned. Below are examples in C, although the same ideas apply to C++ as well.

    GCC

    #include <stdio.h>
    #include <omp.h>
    
    #define SIZE_ARRAY 100000
    #define ALIGN 64
    
    void sum(double *__restrict__ a, double *__restrict__ b, double *__restrict__ c, int n) {
        a = (double*) __builtin_assume_aligned(a, ALIGN);
        b = (double*) __builtin_assume_aligned(b, ALIGN);
        c = (double*) __builtin_assume_aligned(c, ALIGN);
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }
    
    int main(void) {
    
        double a[SIZE_ARRAY] __attribute__((aligned(ALIGN )));
        double b[SIZE_ARRAY] __attribute__((aligned(ALIGN )));
        double c[SIZE_ARRAY] __attribute__((aligned(ALIGN )));
    
        for (int i = 0; i < SIZE_ARRAY; ++i) {
            a[i] = 5.;
            b[i] = 2.;
        }
        double start_time = omp_get_wtime();
        sum(a, b, c, SIZE_ARRAY);
        double time = omp_get_wtime() - start_time;
        printf("%0.6fs", time);
    }
    

    Intel compiler

    #include <stdio.h>
    #include <omp.h>
    
    #define SIZE_ARRAY 100000
    #define ALIGN 64
    
    void sum(double* restrict a, double* restrict b, double* restrict c, int n) {
        __assume_aligned(a, ALIGN);
        __assume_aligned(b, ALIGN);
        __assume_aligned(c, ALIGN);
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }
    
    int main(void) {
    
        __declspec(align(ALIGN)) double a[SIZE_ARRAY];
        __declspec(align(ALIGN)) double b[SIZE_ARRAY];
        __declspec(align(ALIGN)) double c[SIZE_ARRAY];
    
        for (int i = 0; i < SIZE_ARRAY; ++i) {
            a[i] = 5.;
            b[i] = 2.;
        }
        double start_time = omp_get_wtime();
        sum(a, b, c, SIZE_ARRAY);
        double time = omp_get_wtime() - start_time;
    
        printf("%0.6fs", time);
    }
    

    Edit: In a comment to this post, Krister Walfridsson not only caught an issue with my GCC code, for which mention I thank him, but he also shows the differences in machine code generated with and without alignment.

    Data Locality

    Computers are physical things, which means that data is physically stored and needs to be physically moved around in memory and between cache and processor in order to be used in calculations. This means that, if your data is stored all over the place in memory — e.g. in multiple pointer arrays in different parts of memory — the processor will have to reach out to several parts of memory to fetch all your data before performing any computations. By having the data intelligently laid out in memory you ensure all data required for each computation is stored close together and in cache at the same time, which becomes even more important if your code uses too much data to fit in the cache at once.

    In order to make your processor’s life easier, it is a good idea to ensure that all data required for a calculation step is close together. For example, if a given computation requires three arrays of fixed sizes, it is often a good idea to merge them into one long array, as in the example below for the Intel compiler.

    #include <stdio.h>
    #include <omp.h>
    
    #define SIZE_ARRAY 100000
    #define ALIGN 64
    
    void sum(double* restrict a, double* restrict b, double* restrict c, int n) {
        __assume_aligned(a, ALIGN);
        __assume_aligned(b, ALIGN);
        __assume_aligned(c, ALIGN);
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }
    
    int main(void) {
    
        __declspec(align(ALIGN)) double abc[3 * SIZE_ARRAY];
    
        /* the first SIZE_ARRAY entries hold a, the next SIZE_ARRAY hold b,
           and the last SIZE_ARRAY receive c; each block stays 64-byte aligned
           because SIZE_ARRAY * sizeof(double) is a multiple of ALIGN */
        for (int i = 0; i < SIZE_ARRAY; ++i) {
            abc[i] = 5.;
            abc[SIZE_ARRAY + i] = 2.;
        }
        double start_time = omp_get_wtime();
        sum(abc, abc + SIZE_ARRAY, abc + 2 * SIZE_ARRAY, SIZE_ARRAY);
        double time = omp_get_wtime() - start_time;
    
        printf("%0.6fs", time);
    }
    

    or even, since c[i] depends only on a[i] and b[i], we can interleave the values of a, b and c so that every computation is performed on data that sits right next to each other in memory:

    #include <stdio.h>
    #include <omp.h>
    
    #define SIZE_ARRAY 100000
    #define ALIGN 64
    #define STRIDE 3
    
    void sum(double* restrict abc, int n, int stride) {
        __assume_aligned(abc, ALIGN);
    
        for (int i = 0; i < n; i += stride)
            abc[i+2] = abc[i] + abc[i+1];
    }
    
    int main(void) {
        __declspec(align(ALIGN)) double abc[3 * SIZE_ARRAY];
    
        /* abc stores interleaved triplets: a at i, b at i+1, c at i+2 */
        for (int i = 0; i < 3 * SIZE_ARRAY; i += STRIDE) {
            abc[i] = 5.;
            abc[i+1] = 2.;
        }
        double start_time = omp_get_wtime();
        sum(abc, 3 * SIZE_ARRAY, STRIDE);
        double time = omp_get_wtime() - start_time;
    
        printf("%0.6fs", time);
    }
    

    Conclusion

    Based on a class project in which we had to write C code to perform matrix multiplication, the improvements suggested here may speed your code up by a factor of 5 to 10. The idea of data locality also carries over to other languages, such as Python.

    Let your Makefile make your life easier

    This semester I’m taking my first official CS class here at Cornell, CS 5220 Applications of Parallel Computers taught by Dave Bindel (for those of you in the Reed group or at Cornell, I would definitely recommend taking this class, which is offered every other year, once you have some experience coding in C or C++). In addition to the core material we’ve been learning in the class, I’ve been learning a lot by examining the structure and syntax of the code snippets written by the instructor and TA that are offered as starting points and examples for our class assignments. One element in particular that stood out to me from our first assignment was the many function calls made through the makefile. This post will first take a closer look at the typical composition of a makefile and then examine how we can harness the structure of a makefile to improve workflow on complicated projects.

    Dissecting a typical makefile

    On the most basic level, a makefile simply consists of a series of rules, each with an associated set of actions. Makefiles are how you use the “make” utility, a software package installed on essentially all Linux systems. Make has its own syntax, similar to bash but with some distinct idiosyncrasies. For example, make allows you to store a snippet of code in what’s called a “macro” (these are pretty much analogous to variables in most other languages). A macro to store the flags you would like to pass to your compiler could be defined like this:

    CFLAGS  = -g -Wall

    To reference the CFLAGS macro, use a dollar sign and parentheses, like this:

     $(CFLAGS)

    There is also a set of “special” predefined macros that can be used in any makefile and come up frequently; you can find them here.

    Now that we’ve discussed makefile syntax, let’s take a look at how rules are structured within a makefile. A rule specified by a makefile has the following shape:

    target: prerequisites
        recipe
        ...
        ...

    The target is usually the name of the file that is generated by a program, for example an executable or object file. A prerequisite is a specified input used to create the target (a target can often depend on several files). The recipe is the action that make carries out to build the target (note: every line of the recipe must be indented with a tab).

    For example, a rule to build an executable called myProg from a c file called myProg.c using the gcc compiler with flags defined in CFLAGS might look like this:

    myProg: myProg.c
        gcc $(CFLAGS) -o $@ $<
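
    The $< in that recipe is one of make’s predefined “automatic” macros. Here is a quick reference, plus a variant of the rule above that links a second source file (helpers.c is a hypothetical name used only for illustration):

    # Common automatic macros, usable inside any recipe:
    #   $@   the target of the rule (myProg above)
    #   $<   the first prerequisite (myProg.c above)
    #   $^   all prerequisites, separated by spaces
    #   $*   the stem matched by a pattern (%) rule
    
    # e.g., linking two source files at once:
    myProg: myProg.c helpers.c
        gcc $(CFLAGS) -o $@ $^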

    Make the makefile do the work

    The most common rules within makefiles call the compiler to build code (hence the name “makefile”), and many basic makefiles are used for this sole purpose. However, a rule simply sends the series of commands specified by its recipe to the command line, so a rule can actually specify any action or series of actions that you want. A ubiquitous example of a rule that specifies an action is “clean”, which may be implemented like this:

    clean:
        rm -rf *.o $(PROGRAM)

    Where PROGRAM is a macro containing the name of the executable compiled by the makefile.

    In this example, the rule’s target is an action rather than an output file. You can call “clean” by simply typing “make clean” into the command line and you will remove all .o files and the executable PROGRAM from your working directory.

    Utilizing this application of rules, we can have our makefile do a lot of our command line work for us. For example, we could create a rule called “run” which submits a series of pbs jobs to a cluster.

    run:
        for job in job-*.pbs ; do qsub $$job ; done

    We can then enter “make run” into the command line to execute this rule, which will submit each of the job-*.pbs files in the working directory for us (note that this will not run any of the other rules defined in the makefile). Using this rule can be handy when we have a large number of jobs to submit.
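
    If you instead want to submit jobs one at a time, a pattern rule works nicely: the % in the target matches a stem (here a job number; the naming scheme is hypothetical), and the predefined macro $* expands to whatever the % matched.

    # "make run-3" would submit job-3.pbs
    run-%:
        qsub job-$*.pbs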

    Alternatively, we could make a rule called “plot” which calls a plotting script written in Python:

    plot: 
        python plotter.py $(PLOTFILES)

    Where PLOTFILES is a macro containing the names of the files to be plotted and plotter.py is a Python script that takes those file names as input.

    Those are just two examples (loosely based on a makefile given in CS 5220) of how you can use a makefile to do your command line work for you, but the possibilities are endless!! Ok, maybe that’s getting a bit carried away, but I do find this functionality to be a simple and elegant way to improve the efficiency of your workflow on complex projects.

    For some background on makefiles, I’d recommend taking a look at Billy’s post from last year. I also found the GNU make user manual helpful, as well as this tutorial from Swarthmore that has some nice example makefiles.

    Compiling Code using Makefiles

    For people new to coding or to using supercomputers to submit jobs, compiling can be a new concept. In programming, a compiler takes source code (e.g. written in C/C++ or Fortran) and translates it into a lower-level language (e.g. assembly language or machine code). When a compiler runs successfully, the source code is converted into an executable program which the computer understands how to run. For more info on compilers, check out this video.
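
    As a small, purely illustrative example (the file and program names are made up), compiling a single C source file by hand looks like this:

    /* hello.c: compile and run with
     *     gcc -o hello hello.c
     *     ./hello
     */
    #include <stdio.h>
    
    int main(void) {
        printf("Hello from a compiled program!\n");
        return 0;
    }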

    To compile a program, we use the ‘make’ command. When we have multiple source files (which we often do when running complex water management models), a makefile helps organize the directions we give to the compiler. If you are not creating a model from scratch, you may already have an existing makefile, conventionally named ‘makefile’ or ‘Makefile’. If that is the case, compiling is easy as long as all the files are in the proper directory: simply type ‘make’ in the command line interface to compile your code.
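
    A bare-bones makefile for the hello.c sketch above might look like the following (again, purely illustrative); typing ‘make’ builds the first target, and ‘make clean’ removes the executable:

    # build the executable "hello" from hello.c
    hello: hello.c
        gcc -o hello hello.c
    
    # remove the executable
    clean:
        rm -f hello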

    If you would like to edit your makefile, create one from scratch, or just want to learn more about the ‘make’ command and makefiles, check out the resources below:

    Introduction to ‘make’ and compiler options:

    Introductory tutorials for makefiles:

    Makefile naming:

    Makefile macros:

    Compiler flags:

    For a list of compiler flags, see the manual or ‘man’ page for the compiler, e.g. $ man <compiler_name>. The conventional names of the makefile macros holding compiler flags for gcc and g++ are CFLAGS and CPPFLAGS, respectively.
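
    For instance (the flag choices below are just an illustration, not a recommendation), those macros might be defined as:

    # flags for gcc (C); see "man gcc" for the full list
    CFLAGS = -O3 -Wall
    # flags for g++ (C++), following the naming convention used below
    CPPFLAGS = -O3 -Wall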

    Example Makefile (C or C++):

    The conventional file organization for this work is to create a src (or source) and a bin (or binary) directory. The source code goes in /src, while the makefile and any input files go in /bin. Once the executable is created, it will be located in /bin as well.

    Below is a truncated version of a makefile I made for a water treatment plant model (written in C) based on a makefile I found for LRGV (written in C++). In general, there is little difference between a makefile for a program written in C and one written in C++, so this template can be used for either language by simply commenting or uncommenting the makefile macros for the language you want to compile (e.g. CC, CXX, CFLAGS, CPPFLAGS). One notable difference between compiling C and C++ is that a common library, math.h, is automatically linked for C++ but must be explicitly linked for C. You’ll notice that I’ve explicitly linked it in the code below, so as long as the C code I’ve written is also valid C++, I should be able to use either compiler, since C is (for the most part) a subset of C++. Enjoy!

    # from within /bin, go up to the main directory, which contains /bin and /src
    MAIN_DIR = ./..
    
    # directory containing the source code
    SOURCE_DIR = $(MAIN_DIR)/src
    
    # I'll leave out some files for the sake of space
    SOURCES = \
       $(SOURCE_DIR)/basin.c    \
       $(SOURCE_DIR)/breakpt.c  \
       $(SOURCE_DIR)/unittype.c \
       $(SOURCE_DIR)/uptable.c  \
       $(SOURCE_DIR)/writewtp.c
    
    # name object files based on the .c files
    OBJECTS = $(SOURCES:.c=.o)
    
    # select the compiler: this one is for C
    CC = gcc
    #CXX = g++
    
    # set flags for gcc
    CFLAGS = -c -O3 -Wall -I. -I$(SOURCE_DIR)
    # set flags for g++
    #CPPFLAGS = -c -O3 -Wall -I. -I$(SOURCE_DIR)
    
    # link libraries (e.g. math.h with -lm: goo.gl/Bgq75V)
    LIBS = -lm
    
    # name of the executable file
    EXECUTABLE = wtp_v2-2_borg-mp
    
    all: $(SOURCES) $(EXECUTABLE)
        rm -rf $(SOURCE_DIR)/*.o
    
    $(EXECUTABLE): $(OBJECTS)
        $(CC) $(OBJECTS) -o $@ $(LIBS)
    
    .c.o:
        $(CC) $(CFLAGS) $< -o $@
    
    # 'make clean' will remove all compilation files
    clean:
        rm -f $(SOURCE_DIR)/*.o $(EXECUTABLE)