Get your research workflow on

I have done some research on research workflow and that includes interviewing some of my peers at Cornell grad school to get a sense of what increases their productivity and what are their strategies for accomplishing long-term research goals.  In addition to this, I also gathered good advice from my PI, who is the most ultra-efficient human that I know.  Had I taken the following advice, I would’ve written this blog post a week ago.

The Get your research workflow on series, consists of two parts:

Part 1 covers General research workflow tips and Part 2. Setting up your technical workflow for people training in with the Decision Analytics crew (a.k.a. the best crew in town).

General research workflow tips

Disclosure: some of the contents of this list may be easier said than done.

First of all, a research workflow can be very personal and it is definitely tailored to each person’s requirements, personality and interests,  but here are some general categories that I think every researcher can relate to:

Taking notes, organizing and reflecting on ideas

I was gifted with the memory of a tuna fish, so I need to take notes for everything.  Unfortunately, taking notes with paper notebooks resulted in disaster for me in the past, it was very hard to  keep information organized, and occasionally my notebooks would either disappear or get  coffee stains all over.   Luckily, my office mate Dave, introduced me to the best application for note taking ever: Evernote,  this app allows you to keep your notes categorized, so you can keep the information that you need indexed and searchable across every single platform you have, that means that you can keep your notes synchronized  with your smartphone, laptop, desktop, etc, and have it accessible anywhere you go.

In addition, the Evernote web clipper tool allows you to save and categorize articles or webpages within your notes and make annotations on them.  Additionally, you can tag your notes, this is useful if you have notes that could fit into multiple notebooks.  You can also share  and invite people to edit notes and you can connect it with Google drive.  I would probably still flock to Google docs or Dropbox Paper for collaborative documents, but for personal notes, I prefer the Evernote interface.   There’s no limit on the amount of notebooks that you can have.  I’ve found this app very useful for brainstorming and developing ideas, I also use it to keep a research log to track my research progress.

Reading journal papers and reference management 

Keeping up with the scientific literature can be very challenging specially with the overwhelming amount of journal papers out there, but you can make things manageable for yourself if you find a reference manager that allows you to build a library that makes it easy to find, add, organize, read, prioritize and annotate papers  that you can later cite.  Additionally, you may want to set up smart notifications about new papers on topics that interest you, and get notified via e-mail. A couple of popular free and open source reference managers that allow you to do the previous are Zotero and Mendeley,  also Endnote basic, its free but you would need to upgrade to Endnote Desktop for unlimited storage.  These reference managers also allow you to export BibTex files for its integration with LaTeX.  You can check out the Grand Reference Management Comparison  table for all the reference management software available out there.

In addition to reference manager software,  a couple of popular subscription-based multidisciplinary databases are Web of Science and Scopus, they  differ from Google scholar,  by the fact that these are human curated databases, they are selected by scholarly and quality criteria by literature review committees, and they let you build connections between topics.

Finally,  I came across this article on How to keep up with the scientific literature, where a number of scientists were interviewed on the subject, and they all agree that it can be overwhelming but it is key to stay up to date with the literature as its the only way to contextualize your work and identify the knowledge gaps.  The article provided advice on how to prioritize what to read despite the overwhelming amount of scientific literature.

 

Time management and multi-tasking

This is my Achilles heel, and its a skill that requires a lot of practice and discipline.   Sometimes, progress in research can seem hard to accomplish, specially when you are involved in several projects, dealing with hard deadlines, taking many classes, TA-ing, or you’re simply busy being a socialité,  but there are several tricks to be on top of research while avoiding getting overwhelmed in a multi-tasking world.   Some, or most of this tips came from a time-manager master-mind:

Tip # 1.  Schedule everything and time everything

Schedule everything from hard, set-in-stone deadlines to casual meetings, that way you’ll know for sure how much time you’ll have to spare on different projects and you can block time for those projects in a weekly basis.  Keep track of the time that you spend on different projects/tasks.   There’s a very popular app among 3 economists, Julie’s brother and my friend Justyna called be focused that allows you to manage tasks and time them.  You can use it to keep track, for instance, of the time it takes you to read a paper,  some people use it to time the time it takes them to write a paper till completion, right now I’m tracking the amount of time its taking me to write this blogpost.  Timing everything will allow you to get better at predicting the time it will take you to accomplish something and reflect on how you can improve.   I always tend to underestimate my timings but this app is giving me a reality check.. very annoying.

Tip # 2. Different mindsets for different time slots

When your schedule is full of small time gaps, fill them doing tasks that involve less concentration, probably reading, answering e-mails, organizing yourself, and leave larger time slots for the most creative and challenging part of your work.

Also, a general recommendation of multi-tasking is don’t do it,  trying to do multiple things at once can hurt your productivity, instead,  block times to carry specific tasks, were you focus on that one thing, and then move on to the next. Remember to prioritize and tackle the most important thing first.

Tip #4. Visualize long-term research goals and work backwards

Picture what you want to accomplish in a year or in a semester, and work your way backwards, till you refine the accomplishments of each month, each week and each day to hit your long-term target.  Setup to-do lists for the week and for the day.

Tip #3. Set aside time for new skills that you want to acquire

Even if you set aside one or two hours a week devoted to that skill that you want to develop, it will pay off, you’ll have come a long way at the end of the year.  Challenge yourself and continue to develop new skills.

Tip #5. Don’t leave e-mails sitting in your inbox

There are a couple of strategies for this, you can either allocate time each day specifically for replying to e-mails or you can tackle each e-mail as it comes.   If it’s something that will require you more time, move it to a special list, if it’s a meeting, put it in your calendar, if it’s for reference, save it. No matter what your strategy is,  take action on an e-mail as soon as you read it.

Collaborative work

Some tools for collaborative work :

Overleaf– for writing LaTeX files collaboratively and visualizing the changes live, the platform has several journal templates and can track changes easily.

Github– platform for collaborative code development and management.

Slack – organize conversations with teams, and organize your collaborative workflow

A final recommendation is to have a consistent and intuitive organization of your research.  Document everything, and have reproducible code.  If you get hit by a bus and your colleagues are able to continue research were you left off in less than a week, then you’re in good shape, organization-wise.

I hope this helps, let me know if there are some crucial topics that I missed, I can always come back and edit.

Special thanks to all of my grad/postdoc friends that participated in the brief research workflow interview.

 

 

 

 

Advertisements

Reading CSV files in C++

If you are an engineer used to coding in Python or Matlab who is transitioning to C++, you will soon find out that even the most innocent task will now require several lines of code. A previous post has already shown how to export data to a CSV file. In order to facilitate your transition to C++, see below for an example of how to read your new CSV file.

Utils.cpp

#include <string>
#include <vector>
#include <sstream> //istringstream
#include <iostream> // cout
#include <fstream> // ifstream

using namespace std;

/**
 * Reads csv file into table, exported as a vector of vector of doubles.
 * @param inputFileName input file name (full path).
 * @return data as vector of vector of doubles.
 */
vector<vector<double>> parse2DCsvFile(string inputFileName) {

    vector<vector<double> > data;
    ifstream inputFile(inputFileName);
    int l = 0;

    while (inputFile) {
        l++;
        string s;
        if (!getline(inputFile, s)) break;
        if (s[0] != '#') {
            istringstream ss(s);
            vector<double> record;

            while (ss) {
                string line;
                if (!getline(ss, line, ','))
                    break;
                try {
                    record.push_back(stof(line));
                }
                catch (const std::invalid_argument e) {
                    cout << "NaN found in file " << inputFileName << " line " << l
                         << endl;
                    e.what();
                }
            }

            data.push_back(record);
        }
    }

    if (!inputFile.eof()) {
        cerr << "Could not read file " << inputFileName << "\n";
        __throw_invalid_argument("File not found.");
    }

    return data;
}

int main()
{
    vector<vector<double>> data = parse2DCsvFile("test.csv");

    for (auto l : data) {
    	for (auto x : l)
    		cout << x << " ";
    	cout << endl;
    }

    return 0;
}

Enhance your (Windows) remote terminal experience with MobaXterm

Jazmin and Julie recently introduced me to a helpful program for Windows called “MobaXterm” that has significantly sped up my workflow when running remotely on the Cube (our cluster here at Cornell). MobaXterm bills itself as an “all in one” toolbox for remote computing. The program’s interface includes a terminal window as well as a graphical SFTP browser. You can link the terminal to the SFTP browser so that as you move through folders on the terminal the browser follows you. The SFTP browser allows you to view and edit files using your text editor of choice on your windows desktop, a feature that I find quite helpful for making quick edits to shell scripts or pieces of code as go.

mobaxtermsnip

A screenshot of the MobaXterm interface. The graphical SFTP browser is on the left, while the terminal is on the right (note the checked box in the center of the left panel that links the browser to the terminal window).

 

You can set up a remote Cube session using MobaXterm with the following steps:

  1. Download MobaXterm using this link
  2.  Follow the installation instructions
  3. Open MobaXterm and select the “Session” icon in the upper left corner.
  4. In the session popup window, select a new SSH session in the upper left, enter “thecube.cac@cornell.edu” as the name of the remote host and enter your username.
  5. When the session opens, check the box below the SFTP browser on the left to link the browser to your terminal
  6. Run your stuff!

Note that for a Linux system, you can simply link your file browser window to your terminal window and get the same functionality as MobaXterm. MobaXterm is not available for Mac, but Cyberduck and Filezilla are decent alternatives. An alternative graphical SFTP browser for Windows is WinSCP, though I prefer MobaXterm because of its linked terminal/SFTP interface.

For those new to remote computing, ssh or UNIX commands in general, I’d recommend checking out the following posts to get familiar with running on a remote cluster:

 

 

 

Developing parallelised code with MPI for dummies, in C (Part 2/2)

My last post introduced MPI and demonstrated a simple example for using it to parallelize a code across multiple nodes. In the previous example, we created an executable that could be run in parallel to complete the same task multiple times. But what if we want use MPI to on a code that has both parallel and serial sections, this is inevitable if we want everything to be self-contained.

As I tried to stress last time, MPI runs multiple versions of the same executable each with independent memory (please read this sentence three times, it is very different from how you learned to code). If you wish to share memory, you must explicitly send it. This allows no scope for a serial section!

We must, instead, imitate serial sections of code by designating a ‘root’ processor, conventionally the processor with rank = 0. We trap the ‘serial section’ inside an if-statement designating the root and send data to it from other processors when required.

Sending Data

I will build on the previous two examples by creating a loop that calculates the mean of a set of random numbers, we will parallelize the random number generation but leave the mean calculation in ‘serial’ (i.e. to be calculated by the root processor).

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(){
int size, rank,i;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
double randSum = 0;
srand(rank + 1);
double myRand = (double)rand()/(double)RAND_MAX;
printf("I evaluated rank = %d, myRand = %f\n",rank,myRand);
if (rank == 0){
   for (i=0;i<size;i++){
      if (i > 0){
         MPI_Recv(&myRand, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }
      randSum = randSum + myRand;
   }
   printf("Mean random number = %f\n",randSum/size);
}
else{
   MPI_Send(&myRand, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
}
MPI_Finalize();
return 0;
}

For contrast with a regular serial version:

#include <stdio.h>
#include <stdlib.h>
int main(){
int rank,size = 10;
MPI_Init(NULL,NULL);
double randSum = 0;
srand(rank + 1);
double myRand = (double)rand()/(double)RAND_MAX;
printf("I evaluated rank = %d, myRand = %f\n",rank,myRand);
if (rank == 0){
   for (rank = 0; rank < size; ++rank){
      srand(rank + 1);
      randSum = randSum + (double)rand()/(double)RAND_MAX;
      printf("I evaluated rank = %d, myRand = %f\n",rank,myRand);}
      printf("Mean random number = %f\n",randSum/size);
   }
}
return 0;
}

We introduce here two new MPI functions:

MPI_Send(data address, size of data, MPI type of data, processor destination (by rank), tag, communicator) sends the random number to the root (rank 0).

MPI_Recv(data address, size of data, MPI type of data, processor source (by rank), tag, communicator, status suppression) tells a processor, in our case the root, to receive data from a processor source.

Both MPI_Send and MPI_Recv prevent code from progressing further until the send-> receive is resolved. i.e. when rank = 5 reaches send, it will wait until rank = 0 has received data from ranks 1:4 before sending the data and progressing further.

Broadcasting data

Sending data between processors in MPI is moderately expensive, so we want to call send/recv as few times as possible. This means that vectors should be sent in one, rather than in a loop. It also means that when sending data from one processor to all (most commonly from the root), it is more efficient to use the built in ‘broadcast’ rather than sending to each processor individually (the reason for this is explained in: http://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/).

Below we will introduce an example where the root broadcasts how many random numbers each processor should create, these vectors of random numbers are then sent back to the root for mean calculation.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(){
int size, rank,i,j;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
double randSum = 0;
int numRands;
srand(rank+1);
if (rank == 0){
   numRands = 5;
   MPI_Bcast(&numRands,1,MPI_INT,0,MPI_COMM_WORLD);
}
else{
   MPI_Bcast(&numRands,1,MPI_INT,0,MPI_COMM_WORLD);
} 
double *myRand = calloc(numRands,sizeof(double));
for (i =0;i<numRands;++i){
   myRand[i] = (double)rand()/(double)RAND_MAX;
}
if (rank == 0){
   for (i=0;i<size;i++){
      printf("root received from rank %d the vector: ",i);
      if (i > 0){
         MPI_Recv(myRand, numRands, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }
      for (j=0;j<numRands;j++){
         printf("%f ",myRand[j]);
         randSum = randSum + myRand[j];
      }
      printf("\n");
   }
   printf("Mean random number = %f\n",randSum/(size*numRands));
}
else{
   MPI_Send(myRand, numRands, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
}
free(myRand);
MPI_Finalize();
return 0;
}

We have new used the new MPI function:

MPI_Bcast(data address, size of data, MPI data type, processor source, communicator) broadcasts from the processor source (in our case the root) to all other processors, readers should note the common mistake of using MPI_Recv instead of MPI_Bcast to receive the data; MPI_Bcast is the function to both send and receive data.

Another simple but common mistake that readers should note is the passing of dynamically sized data; note how myRand is sent without the & address operator (because the variable itself is an address) while numRands is sent with the & operator.

Concluding remarks

This tutorial should set you up to use much of the MPI functionality you need to parallelise your code. Some natural questions that may have arisen while reading this tutorial that we did not cover:

MPI_Barrier – while MPI_Send/Recv/Bcast require processors to ‘catch up’, if you are writing and reading data to files (particularly if a processor must read data written by another processor) then you need to force the processors to catch up; MPI_Barrier achieves this.

tags – you can contain metadata that can be described by integers (e.g. vector length or MPI data type) in the ‘tag’ option for MPI_Send/Recv.

MPI_Status – this structure can contain details about the data received (rank, tag and length of the message), although much of the time this will be known in advance. Since receiving the status can be expensive, MPI_STATUS_IGNORE is used to supress the status structure.

All of the MPI functions described in this tutorial are only a subset of those available that I have found useful in parallelizing my current applications. An exhaustive list can be found at: http://www.mpich.org/static/docs/latest/. If you want to go beyond the functions described in this post (or you require further detail) I would recommend: http://mpitutorial.com/tutorials/.

Part (1/2): https://waterprogramming.wordpress.com/2017/07/21/developing-parallelised-code-with-mpi-for-dummies-in-c-part-12/

Developing parallelised code with MPI for dummies, in C (Part 1/2)

Parallel computing allows for faster implementation of a code by enabling the simultaneous execution of multiple tasks. Before we dive in to how parallelisation of a code is achieved, let’s briefly review the components that make up a high performance computing (HPC) cluster (it should be noted that you can parallelise code on your own computer, but this post will focus on parallelisation on clusters).  High performance computing clusters are usually comprised of a network of individual computers known as nodes that function together as a single computing resource as shown in Figure 1. Each node has some number of processors (the chip within a node that actually executes instructions) and modern processors may contain multiple cores, each of which can execute operations independently. Processors performing tasks on the same node have access to shared memory, meaning they can write and reference the same memory locations as they execute tasks. Memory is not shared between nodes however, so operations that run on multiple nodes use what’s known as distributed-memory programming. In order to properly manage tasks using distributed memory, nodes must have a way to pass information to each other.

garbage

Figure 1: One possible configuration of a HPC cluster, based on the Cornell CAC presentation linked in the following paragraph.

Parallelization is commonly performed using OpenMP or MPI.  OpenMP (which stands for Open Multi-Processing) parallelises operations by multithreading, running tasks on multiple cores/units within a single node using shared memory.  MPI (which stands for Message Passing Interface) parallelises tasks by distributing them over multiple nodes (also possible over multiple processors) within a network, utilizing the distributed memory of each node. This has two practical implications; MPI is needed to scale to more nodes but communication between tasks is harder. The two parallelisation methods are not mutually exclusive; you could use OpenMP to parallelise operations on individual network nodes and MPI to communicate between nodes on the network (example: http://www.slac.stanford.edu/comp/unix/farm/mpi_and_openmp.html). Both OpenMP and MPI support your favourite languages (C, C++, FORTRAN, Python, but not Java – perfect!). The remainder of this post will focus on implementing MPI in your code, for references on using OpenMP, see this presentation by the Cornell Center for Advanced Computing: https://www.cac.cornell.edu/education/training/ParallelMay2012/ProgOpenMP.pdf.

So how does MPI work?

MPI creates multiple instances of your executable and runs one on each processor you have specified for use. These processors can communicate with each other using specific MPI functions. I will explain a few of the more basic functions in this post.

What does MPI code look like?

A parallel loop (compiles with: mpicc -O3 -o exampleParallel.exe exampleParallel.c -lm)

#include 
#include 
#include 
int main(){ 
int size, rank; 
MPI_Init(NULL,NULL);    
MPI_Comm_size(MPI_COMM_WORLD, &size); 
MPI_Comm_rank(MPI_COMM_WORLD, &rank); 
printf("I evaluated rank = %d\n",rank); 
MPI_Finalize(); 
return 0;
}

The parallel loop can be distributed over 10 processors (ppn) in one node and submitted to the Cube as a Unix script:

#!/bin/bash
#PBS -N exampleParallel
#PBS -l nodes=1:ppn=10
#PBS -l walltime=0:00:05
#PBS -j oe
#PBS -o output
cd $PBS_O_WORKDIR
mpirun ./exampleParallel.exe

This can be contrasted with a serial loop (compiles with: gcc -O3 -o exampleSerial.exe exampleSerial.c):

#include 
#include 
int main(){ 
int rank, size = 10; 
for (rank = 0; rank < size; rank++){ 
   printf("I evaluated rank = %d\n",rank);
} 
return 0;}

Let’s have a look at what each line of the parallel code is doing;

MPI_Init is the first MPI function that must be called in a piece of code in order to initialize the MPI environment. NOTE! MPI_Init does not signify the start of a ‘parallel section’, as I said earlier, MPI does not have parallel sections, it runs multiple instances of the same executable in parallel.

MPI_Comm_size populates an integer address with the number of processors in a group, in our example, 10 (i.e. num nodes * processors per node).

MPI_Comm_rank populates an integer address with the processor number for the current instance of the executable. This ‘rank’ variable is the main way to differentiate between different instances of your executable, it is equivalent to a loop counter.

MPI_Finalize is the last MPI function that must be called in a piece of code, it terminates the MPI environment. As far as I can tell it is only recommended to have return functions after MPI_Finalize.

This simple example highlights the difference between MPI and serial code; that each executable is evaluated separately in parallel. While this makes MPI hard to code, and sharing data between parallel processes expensive, it also makes it much easier to distribute across processors.

Next week we will present examples demonstrating how to send data between nodes and introduce serial sections of code.

Part (2/2): https://waterprogramming.wordpress.com/2017/07/28/developing-parallelised-code-with-mpi-for-dummies-in-c-part-22/

References (each of these are a useful link if you would like to learn more about parallel computing and HPC):

Parallel Programming Concepts and High-Performance Computing, a module in the Cornell Virtual Workshop

CAC Glossary of HPC Terms: https://cvw.cac.cornell.edu/main/glossary

Reed Group’s basic C++ code style conventions

It is always good practice for programmers to adopt some sort of style convention when developing new code. This helps keeping the code readable for both authors and collaborators, as well as for people that read your code on online repositories. Here I will set a precedent for a minimal C++ code style for Reed’s group encompassing C++ features we normally use based on the most common practices out there, so that we can more easily help each other with out codes and keep consistency when publishing them. This post may be updated if somebody sets precedents for C++ features didn’t account for (e.g. namespaces).

Naming conventions

  • Classes: Uppercase first letter. If the class name is comprised of more than one word, all words should be written together (no dashes, underscores, etc.) and the first letter of each word should be capitalized. E.g.: MyAwesomeClass.
  • Functions: Lowercase first letter. If the function name is comprised of more than one word, all words should be written together (no dashes, underscores, etc.) and the first letter of each word except the first should be capitalized. E.g.: myFantasticFunction. If you are creating a getter or a setter, be sure to follow the this standard. E.g. the getter for variable “thisVariable” would be “getThisVariable.”
  • Variables: Same as Functions. Acronyms should also follow this rule — e.g. a variable containing a short-term ROF (for risk-of-failure) value for a utility should be called something like “shortTermRof.”
  • Constants: All letters capitalized and words separated by underscores. E.g. MY_GREAT_CONSTANT.

Other naming rules

Besides naming conventions, there are other good practices when it comes to coming up with names in your code:

  • Do not assign one letter names, unless it is a temporary variable such as i, j, k used as indexes.
  • Assign informative names to your classes, functions, variables and constants. If you have a variable called “length,” another called “thisLength” and a third one called “realLength” your code will be really hard to follow.
  • Being concise is great (nobody reads code for its poetic variable names) but avoid shortening your names too much. Calling a variable “catchmentFallCreekIthaca” makes it much easier for someone else to know the information contained in that variable than calling it “catfacreith.”
  • We all get really frustrated with our codes at times, and want to curse it really bad. It’s fine to do it in your office when nobody is hearing, but be sure to not let that leak into your code and to keep some decency: e.g. avoid having in your code “this&%$*%DoesNot&$%*#@Work = true” or anything of the sort.

Other rules

  • Avoid magic numbers (hard coded numbers). Codes like the one below not only are hard to understand but also make the reader question if the results of the code are actually right:
    if (312 * evaporation + inflow / 52 - 7 * demand) {
        // Do something here
    }
    

    Now imagine if the value 212 is the value of an area and is used in 83 different parts of your code: that’s a problem. Instead, declaring those numbers as constants would be preferred:

    const double DRYVILLE_RESERVOIR_AREA = 312.0;
    const double NUMBER_OF_WEEKS_IN_YEAR = 52.14;
    const double NUMBER_OF_DAYS_IN_WEEK = 7.0;
    
    // Lots of code here, since constants are normally declared at the top of the code.
    
    if (DRYVILLE_RESERVOIR_AREA * evaporation + inflow / NUMBER_OF_WEEKS_IN_YEAR - NUMBER_OF_DAYS_IN_WEEK * demand) {
        // Do something here
    }
    
  • Keep your cpp files shorter than 500 lines. If you start approaching 500 lines, it may be the case that your class can be broken into parent and multiple children classes, or into two completely different classes.
  • Have only the main.cpp file in the root directory. All other files, if any, should be in directories so that the code is easy to navigate through.
  • If there is an issue or simplification to be fixed at some point in the future, use the “//FIXME:” comment to indicate it, as in the code below:
    //FIXME: replace constant area below by storage vs. area curve.
    if (DRYVILLE_RESERVOIR_AREA * evaporation + inflow / NUMBER_OF_WEEKS_IN_YEAR - NUMBER_OF_DAYS_IN_WEEK * demand) {
    // Do something here
    }
    

Note that different languages have different standards. If coding in Python or Matlab, for example, be sure to follow the best practices for these languages. Also, if developing code in collaboration with another research group, be sure to negotiate a convention.

Profiling C++ code with Callgrind

Often times, we have to write code to perform tasks whose complexity vary from mundane, such as retrieving and organizing data, to highly complex, such as simulations CFD simulations comprising the spine of a project. In either case, depending on the complexity of the task and amount of data to be processed, it may happen for the newborn code to leave us staring at an underscore marker blinking gracefully for hours on a command prompt during its execution until the results are ready, leading to project schedule delays and shortages of patience. Two standard and preferred approaches to the problem of time intensive codes are to simplify the algorithm and to make the code more efficient. In order to better select the parts of the code to work on, it is often useful to first find the parts of the code in which more time by profiling the code.

In this post, I will show how to use Callgrind, part of Valgrind, and KCachegrind to profile C/C++ codes on Linux — unfortunately, Valgrind is not available for Windows or Mac, although it can be ran on cluster from which results can be downloaded and visualized on Windows with QCachegrind. The first step is to install Valgrind and KCachegrind by typing the following commands in the terminal of a Debian based distribution, such as Ubuntu (equivalent yum commands area available for Red Hat based distributions):

$ sudo apt-get install valgrind
$ sudo apt-get install kcachegrind

Now that the required tools are installed, the next step is to compile your code with GCC/G++ (with a make file, cmake, IDE or by running the compiler directly from the terminal) and then run the following command in a terminal (type ctrl+shift+T to open the terminal):

$ valgrind --tool=callgrind path/to/your/compiled/program program_arguments

Callgrind will then run your program with some instrumentation added to its execution to measure time expenditures and cache use by each function in your code. Because of the instrumentation, Your code will take considerably longer to run under Callgrind than it typically would, so be sure to run a representative task that is as small as possible when profiling your code. During its execution, Callgrind will output a report similar to the one below on terminal itself:

==12345== Callgrind, a call-graph generating cache profiler
==12345== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==12345== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==12345== Command: path/to/your/compiled/program program_arguments
==12345==
==12345== For interactive control, run 'callgrind_control -h'.
IF YOUR CODE OUTPUTS TO THE TERMINAL, THE OUTPUT WILL BE SHOWN HERE.
==12345== 
==12345== Events    : Ir
==12345== Collected : 4171789731
==12345== 
==12345== I   refs:      4,171,789,731

The report above shows that it collected 4 billion events in order to generate the comprehensive report saved in the file callgrind.out.12345 — 12345 is here your process id, shown in the report above. Instead of submerging your soul into a sea of despair by trying to read the output file in a text editor, you should load the file into KCachegrind by typing:

$ kcachegrind calgrind.out.12345

You should now see a screen like the one below:

kcachegrind_initial.png

The screenshot above shows the profiling results for my code. The left panel shows the functions called by my code sorted by total time spent inside each function. Because functions call each other, callgrind shows two cost metrics as proxies for time spent in each function: Incl., showing the total cost of a function, and self, showing the time spent in each function itself discounting the callees. By clicking on “Self” to order to functions by the cost of the function itself, we sort the functions by the costs of their own codes, as shown below:

Untitled_sorted

Callgrind includes functions that are native to C/C++ in its analysis. If one of them appears in the highest positions of the left panel, it may be the case to try to use a different function or data structure that performs a similar task in a more efficient way. Most of the time, however, our functions are the ones in most of the top positions in the list. In the example above, we can see that a possible first step I can take to improve the time performance of my code is to make function “ContinuityModelROF::shiftStorage” more efficient. A few weeks ago, however, the function “ContinuityModel::continuityStep” was ranked first with over 30% of the cost, followed by a C++ map related function. I then replaced a map inside that function by a pointer vector, resulting in the drop of my function’s cost to less than 5% of the total cost of the code.

In case KCachegrind shows that a given function that is called from multiple places in the code is costly, you may want to know which function is the main culprit behind the costly calls. To do this, click on the function of interest (in this case, “_memcpy_sse2_unalight”) in the left panel, and then click on “Callers” in the right upper panel and on “Call Graph” in the lower right panel. This will show in list and graph forms the calls made to the function by other functions, and the asociated percent costs. Unfortunately, I have only the function “ContinuityModelROF::calculateROF” calling “_memcpy_sse2_unalight,” hence the simple graph, but the graph would be more complex if multiple functions made calls to “_memcpy_sse2_unalight.”

I hope this saves you at least the time spend reading this post!