Debugging MPI By Dave Hadka

Dave recently wrote the following instructions on how to debug MPI in an email, and I thought I’d post them here as a private post on the blog.

In case this isn’t already known, here are the instructions I came up with for running gdb and valgrind on MPI programs:

Debugging MPI with GDB

1) Run an interactive PBS job:

qsub -I -l walltime=16:00:00 -l nodes=1:ppn=4

The interactive job will start you in your home folder. Use cd to change to your working directory.

2) Load the OpenMPI module with GNU GCC support:

module load openmpi/gnu

3) Compile your code with the -ggdb flag to include GDB debugging info in the executable.

4) Create the GDB script, gdbscript.txt, to run when GDB is launched.
This is needed since the program will not start running until the
GDB ‘run’ command is called, and we need to automatically run all
jobs on remote nodes. This will also enable logging to gdb.txt.

set logging on
run

5) Run the MPI program with GDB:

mpirun gdb -x gdbscript.txt ./mpiprog.exe

6) When the program exits or an error is detected, you will be left in
GDB. You can now use any GDB commands, or quit by typing ‘quit’.

Memory Checking MPI Programs

First, follow steps 1-3 above.

4) When the interactive PBS job starts, run the MPI program with Valgrind:

mpirun valgrind --tool=memcheck --log-file=valgrind_%p.txt ./mpiprog.exe

5) Look at the valgrind_NNNN.txt files that were created, one for each process,
to determine if any memory leaks occurred. Valgrind often detects
uninitialized values in the Open MPI code, which should be ignored.

Using gdb, and notes from the book “Beginning Linux Programming”

I just started thumbing through Beginning Linux Programming by Matthew and Stones. It covers in great detail a lot of the issues we often talk about on this blog: how to debug code, how to program using the BASH shell, file input/output, different development environments, and making makefiles.

Using gdb

One tool I haven’t talked about much is the debugger, gdb.  It is a text-based debugging tool, but it gives you some powerful ways to step through a program.  You can set a breakpoint, and then make rules for what variables are being displayed in each iteration.

Let’s say you have a file, myCode.c that you compile into an executable, myCode.  Compile using the -g flag, and then start gdb on your code by entering “gdb ./myCode”.  If your code has command line arguments, you need to specify an argument to gdb like this:

gdb --args ./myCode -a myArgument1 -b myArgument2

The important phrase here is “--args”, two dashes followed by the word args, placed right after gdb. It tells gdb that everything after the program name is an argument for ./myCode rather than for gdb itself.

You can also set a breakpoint inside gdb (you need to do that before you actually run the code). To do this, say at line 10, simply type “break 10”. This will be breakpoint 1. To have gdb show data automatically every time the program stops, use “display”; for example, “display array[0]@5” shows 5 values of an array. To attach rules to a breakpoint, type “commands”. gdb will ask what commands you’d like to run each time that breakpoint is hit; enter them one per line, for example “cont” to continue, and then “end” to finish the list.

After setting up your breakpoints, simply type “run” to run the code.

If your program has a segmentation fault, it will let you know what line the segmentation fault occurred at, and using “backtrace” you can see what functions called that line.

If you have a segfault and the program is halted, the nice thing is that all the memory is still valid and you can inspect the value of any variable in scope. To see the value of a variable, type “print myVariableName”. It is quite informative. For example, if a variable prints as “nan”, you know there may be something wrong with that variable that could cause an error somewhere else.

Here’s one example of a possible problem in pseudocode:

levelA = 0;

levelB = 0;

myLevel = 0.5;

myFrac = myLevel / (levelA + levelB);

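The fourth line there looks innocuous enough, but it divides by zero given the levelA and levelB values. If the variables are integers, the program typically stops on that line with an arithmetic exception (SIGFPE) rather than a segfault. If they are floating-point, the program usually keeps running and myFrac silently becomes “inf” or “nan”, which can cause an error somewhere else. Either way, a simple “print levelA” and “print levelB” in gdb will help you solve the problem.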

Here’s a short link that explains the basics of gdb with more detail.

Other notes

Also interesting are several C preprocessor macros. __LINE__ and __FILE__ expand to the current line number and source file name, while __DATE__ and __TIME__ expand to the date and time at which the code was compiled (that’s two underscores on each side of the name).

I also like the bash scripting examples that are contained in the book.  They taught me about some Linux utilities like “cut” that are very helpful, and covered elsewhere on this blog.

Any additional tips and tricks are welcome in the comments!