Here is the result of some discussions I’ve had lately on a new project of mine. I tried to make the insights as generic as possible, so you can apply this to your own work!
Why will you need to parallelize your search?
Consider an engineering simulation that takes about 2.5 minutes on a brand new personal computer. You’ve done some timing on your code, and you realize about 80% of the time is in the model’s actual calculation. Sure, there may be some added benefit that you could gain by making the input and output to the model more efficient, etc. But you’re not going to get the time down that much (you could argue the lowest amount of time the simulation would take is 2 minutes, removing all i/o overhead). Let’s assume a 2.5 minute evaluation time and run some numbers:
In 60 minutes, we can do 24 evaluations. Therefore, in 24 hours we can do 576 evaluations.
In some prior work, we saw that you could get a very good answer to the problem in about 80,000 function evaluations. This new problem may actually be harder than the old one! So I am guessing we would like to do at least 80,000 function evaluations per trial here. You can see that it would take much longer than a day to do a search run (in fact, it would take 138 days, which is longer than we have to run a single job on the computer).
Let’s say we were able to cut down the computational time by 50% (which would be a huge undertaking), then we would only gain 2x speedup of time, and it would take 69 days.
Not to mention the fact that really long jobs that have a duration in the order of days, take a long time to queue on the system.
As a final final thought about this, we could concievably cut down the number of function evaluations. But my guess is we would have to (at the very least) run about 10,000 function evaluations, which would take 17 days. And, it would probably require a lot more justification than we’re prepared to give. Right now, our argument is that we look at the Pareto approximate set and cut the search off when it stops improving. If we cut down to 10,000 function evaluations, that argument would no longer be valid.
Running the numbers for parallelized search
Borg has been shown to to have scalable performance for 10’s of 1000’s of cores, in a forthcoming paper by Dave Hadka.
Scalable means two things. First, just thinking about scalability in parallel computing, perfect scalability means that with N processors, you can do the work N times faster. What takes 10 hours on a single pc, would take 1 hour on 10 nodes. Computer scientists rarely get perfect scalability, mainly because of the overhead of passing messages back and forth or coordinating processes (more on that later). In parallel MOEA work, scalability must also include the quality of the solutions you are getting out of the search. It is very possible MOEA search will fail, so if the search does poorly, having the former definition of scalability doesn’t make much sense. So Dave et al. have developed ways to measure scalability that also take into account solution quality.
But, to make a long story short, let’s assume perfect scalability and run the numbers again.
We can do 24 evaluations in one hour. But now, scaling to 512 processors, we can do 24*512 = 12,288 evaluations in one hour, and the calculations of run time become similar to our last project. We will get a halfway decent answer in 1 hour, and we can run the same amount of computation in 8 hours or so. In fact, using the max queue time of 24 h, we could get 294,912 evaluations done in one day.
Parallel Computing 101
Shared Memory: You are likely reading this on a shared memory machine as we speak. In shared memory, there are N processors, and each of them shares access to the same memory. There are some programming paradigms such as a system called ‘pthreads’ which are used for these systems. The challenge of shared memory programming is preventing so-called “race” conditions. That is, Processor 1 tries to edit a variable, and Processor 2 is working on the same variable.
Although there are multi-core chips on the processor, they are not used as such, and we are dealing with…
Distributed Memory: Each of the N processors in a distributed memory system has its own memory. Processor 1 never “sees” what Processor 2 is doing unless Processor 2 shares what she is doing with Processor 1. The paradigm of programming here is called Message Passing Interface (MPI). A processor only receives information from another processor if a ‘message’ is received from somewhere. The challenge is that coordinating the messages can be tricky, and all processors work at different rates, so you could have a situation where Processor 1 is waiting for slowpoke Processor 8 to do his job, and he never sends the message, and the program locks. This happens when all processors are homogeneous, but you can imagine it’s even worse when you have heterogeneous processors (i.e., one is faster than the other).
An additional challenge of MPI programming is that all processors see the same filesystem. Two processors can’t edit the same file at once, without massive, horrible errors. Keep this in mind in the next section.
Some pseudocode of the algorithm process
Parallel MOEAs can work in a number of different configurations, but the important one for us is called Master-Worker. Basically, there is one and only one Master or Coordinator, and then a whole bunch of workers. The Master’s job is to do the search calculations. The Workers’ job is to evaluate the simulation model — take decision variable values and calculate objectives. The Workers have no other job, and the workers cannot “see” what the others do.
Code to implement a Parallel MOEA is described in the following sections.
The main program calls several functions. First, you need code that initializes your own problem. For example, read in the input data that doesn’t change over the run (a function called something like, myProblemInit). The algorithm is set up, for example, to set the random seed, and define the algorithm parameters.
Finally, the algorithm is run, with a pointer to a function that contains your problem (see myProblem below)
myProblem(obj, vars, consts)
Ok, so now you need to write your own version of the “problem” function. There may be only one task to do, or multiple. For example, if you need to perform some kind of transformation on the decision variables you can do that here. But otherwise, all you’re doing is actually calling the real simulation model. Perhaps this simulation code comes from somewhere else… it’s for a version of the model that has never been attached to an optimization framework before. It looks like:
callSim(obj, vars, consts)
Here, you may very well have an executable call to a completely external program that runs your simulation. Maybe that external program even has a separate wrapper attached to it, in Python or MATLAB. Steps in such a function will probably look like this:
1. Note that parameters from initialization are already in memory. A lot of parameters don’t change from evaluation to evaluation.
2. Write input files, such as a file called inputData.dat
3. Call an external program that does the calculations
4. Interpret the output file, which is called something like outputData
5. Assign the objective values to the variable, obj which is passed by pointer into this function.
Hey, wait, there’s a problem with steps 2 and 4 above!
Ahh, good. You realized it. If we recall from our discussion of MPI on clusters… all processors share the same filesystem! So each worker is going to be creating a file called basicReaction.well, and all of them are going to be editing the same file at the same time, which cannot work.
The solution is to create a separate input and output file for each processor to ensure that this filesystem bug doesn’t happen to us. MPI allows this because of a built-in function called MPI_Comm_Rank which tells each processor who he is in the world (existential, isn’t it?)
You can use the processor’s rank directly in the myProblem function above, and that way the processor will always use his own unique file and not confuse data from the other processors.
I hope this has been helpful! Increased computing power with technologies like clusters and cloud computing are opening up more and more types of problems for many objective analysis. With a little bit of skill in writing system calls and wrappers, you can use tools like Borg to do all sorts of fun things. Just be careful to understand the unique properties of your parallel system when creating the code for your project. Feel free to add comments and questions in the comments below!