My last post introduced MPI and demonstrated a simple example for using it to parallelize a code across multiple nodes. In the previous example, we created an executable that could be run in parallel to complete the same task multiple times. But what if we want use MPI to on a code that has both parallel and serial sections, this is inevitable if we want everything to be self-contained.
As I tried to stress last time, MPI runs multiple versions of the same executable each with independent memory (please read this sentence three times, it is very different from how you learned to code). If you wish to share memory, you must explicitly send it. This allows no scope for a serial section!
We must, instead, imitate serial sections of code by designating a ‘root’ processor, conventionally the processor with rank = 0. We trap the ‘serial section’ inside an if-statement designating the root and send data to it from other processors when required.
Sending Data
I will build on the previous two examples by creating a loop that calculates the mean of a set of random numbers, we will parallelize the random number generation but leave the mean calculation in ‘serial’ (i.e. to be calculated by the root processor).
#include <stdio.h> #include <stdlib.h> #include <mpi.h> int main(){ int size, rank,i; MPI_Init(NULL,NULL); MPI_Comm_size(MPI_COMM_WORLD, &size); MPI_Comm_rank(MPI_COMM_WORLD, &rank); double randSum = 0; srand(rank + 1); double myRand = (double)rand()/(double)RAND_MAX; printf("I evaluated rank = %d, myRand = %f\n",rank,myRand); if (rank == 0){ for (i=0;i<size;i++){ if (i > 0){ MPI_Recv(&myRand, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); } randSum = randSum + myRand; } printf("Mean random number = %f\n",randSum/size); } else{ MPI_Send(&myRand, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD); } MPI_Finalize(); return 0; }
For contrast with a regular serial version:
#include <stdio.h> #include <stdlib.h> int main(){ int rank,size = 10; MPI_Init(NULL,NULL); double randSum = 0; srand(rank + 1); double myRand = (double)rand()/(double)RAND_MAX; printf("I evaluated rank = %d, myRand = %f\n",rank,myRand); if (rank == 0){ for (rank = 0; rank < size; ++rank){ srand(rank + 1); randSum = randSum + (double)rand()/(double)RAND_MAX; printf("I evaluated rank = %d, myRand = %f\n",rank,myRand);} printf("Mean random number = %f\n",randSum/size); } } return 0; }
We introduce here two new MPI functions:
MPI_Send(data address, size of data, MPI type of data, processor destination (by rank), tag, communicator) sends the random number to the root (rank 0).
MPI_Recv(data address, size of data, MPI type of data, processor source (by rank), tag, communicator, status suppression) tells a processor, in our case the root, to receive data from a processor source.
Both MPI_Send and MPI_Recv prevent code from progressing further until the send-> receive is resolved. i.e. when rank = 5 reaches send, it will wait until rank = 0 has received data from ranks 1:4 before sending the data and progressing further.
Broadcasting data
Sending data between processors in MPI is moderately expensive, so we want to call send/recv as few times as possible. This means that vectors should be sent in one, rather than in a loop. It also means that when sending data from one processor to all (most commonly from the root), it is more efficient to use the built in ‘broadcast’ rather than sending to each processor individually (the reason for this is explained in: http://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/).
Below we will introduce an example where the root broadcasts how many random numbers each processor should create, these vectors of random numbers are then sent back to the root for mean calculation.
#include <stdio.h> #include <stdlib.h> #include <mpi.h> int main(){ int size, rank,i,j; MPI_Init(NULL,NULL); MPI_Comm_size(MPI_COMM_WORLD, &size); MPI_Comm_rank(MPI_COMM_WORLD, &rank); double randSum = 0; int numRands; srand(rank+1); if (rank == 0){ numRands = 5; MPI_Bcast(&numRands,1,MPI_INT,0,MPI_COMM_WORLD); } else{ MPI_Bcast(&numRands,1,MPI_INT,0,MPI_COMM_WORLD); } double *myRand = calloc(numRands,sizeof(double)); for (i =0;i<numRands;++i){ myRand[i] = (double)rand()/(double)RAND_MAX; } if (rank == 0){ for (i=0;i<size;i++){ printf("root received from rank %d the vector: ",i); if (i > 0){ MPI_Recv(myRand, numRands, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); } for (j=0;j<numRands;j++){ printf("%f ",myRand[j]); randSum = randSum + myRand[j]; } printf("\n"); } printf("Mean random number = %f\n",randSum/(size*numRands)); } else{ MPI_Send(myRand, numRands, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD); } free(myRand); MPI_Finalize(); return 0; }
We have new used the new MPI function:
MPI_Bcast(data address, size of data, MPI data type, processor source, communicator) broadcasts from the processor source (in our case the root) to all other processors, readers should note the common mistake of using MPI_Recv instead of MPI_Bcast to receive the data; MPI_Bcast is the function to both send and receive data.
Another simple but common mistake that readers should note is the passing of dynamically sized data; note how myRand is sent without the & address operator (because the variable itself is an address) while numRands is sent with the & operator.
Concluding remarks
This tutorial should set you up to use much of the MPI functionality you need to parallelise your code. Some natural questions that may have arisen while reading this tutorial that we did not cover:
MPI_Barrier – while MPI_Send/Recv/Bcast require processors to ‘catch up’, if you are writing and reading data to files (particularly if a processor must read data written by another processor) then you need to force the processors to catch up; MPI_Barrier achieves this.
tags – you can contain metadata that can be described by integers (e.g. vector length or MPI data type) in the ‘tag’ option for MPI_Send/Recv.
MPI_Status – this structure can contain details about the data received (rank, tag and length of the message), although much of the time this will be known in advance. Since receiving the status can be expensive, MPI_STATUS_IGNORE is used to supress the status structure.
All of the MPI functions described in this tutorial are only a subset of those available that I have found useful in parallelizing my current applications. An exhaustive list can be found at: http://www.mpich.org/static/docs/latest/. If you want to go beyond the functions described in this post (or you require further detail) I would recommend: http://mpitutorial.com/tutorials/.