Simulation-optimization is an increasingly popular tool in many engineering fields. As the name implies, simulation-optimization requires two major components: a simulation model of the system we want to optimize (e.g., a water supply system) and an algorithm for iteratively tuning the decision levers within the system to optimize the simulated performance. For example, Multi-Objective Evolutionary Algorithms (MOEAs), such as the Borg MOEA, have been successfully employed across a variety of challenging stochastic multi-objective simulation-optimization problems in water resources and other fields. However, simulation-optimization can require millions of iterative simulation trials in order to adequately explore the decision and uncertainty spaces, especially under multi-objective framings. Given the ever-increasing size and complexity of simulation models, this can become a significant computational challenge.
Running the MOEA in parallel on high-performance computing clusters can greatly alleviate this challenge, but not eliminate it. For example, in my last blog post on hybrid parallelization of the Borg MOEA, I gave the example of optimizing infrastructure investments using a simulation model that takes 10 minutes to run a 30-year realization. Let’s assume we run 48 Monte Carlo realizations per function evaluation (FE) to account for natural hydrologic variability. Even using the maximum job size of 128 nodes on the Stampede2 supercomputer, with 48 CPUs each, we will only be able to complete ~20,000 FE per 24 hours of wall clock time. To complete 100,000 FE, a relatively low target compared to other similarly complex studies, the MOEA will need to run for ~5 days. This raises a major problem: many computing clusters have strict wall clock limits. On Stampede2, for example, each submission can run for a maximum of 48 hours. As a result, we would be limited to ~40,000 FE for a single Stampede2 run, which may be insufficient depending on the difficulty of the problem.
Fortunately, the Borg MOEA’s “checkpointing” and “restoring” functionalities allow us to circumvent this issue. Checkpointing is the process of saving “snapshots” of the MOEA’s internal state at regular intervals, and restoring is the act of reading a snapshot into memory at the beginning of a run and continuing the optimization from that starting point. This allows us to effectively split a 5-day optimization into three sequential submissions, in order to comply with a 2-day wall clock limit. This functionality can also be helpful for iteratively assessing convergence, since we don’t necessarily know a priori how many FE will need to be run.
In this blog post, I will demonstrate the use of checkpointing and restoring with the Borg MOEA using a less computationally-demanding example: the Python-based DPS Lake Problem. However, the methods and results described here should be applicable to much larger experiments such as the one described above. See my last blog post and references therein for more details on the Python implementation of the DPS Lake Problem.
For this post, I ran the following short proof of concept: first I ran five random seeds of the Borg MOEA for 5 minutes each. Each seed was assigned 2 nodes (16 CPUs each) on the Cube cluster at Cornell. Each seed completed approximately 6100 FE in this first round. Next, I ran a second round in which each seed began from its 6100-FE checkpoint and ran for another 5 minutes. All code for this post, along with instructions for downloading the correct branch of the Borg MOEA, can be found in this GitHub repository.
Setting up checkpointing takes only a few extra lines of code in the file “borg_lake_msmp.py”. First, we define the base file name used for the checkpoint files:
newCheckpointFileBase = 'results/checkpoints/seed' + str(seed)
During the run, checkpoints are written at regular intervals. The checkpoint file names automatically have the current FE count appended; for example, the checkpoint for seed 5 after 6100 FE is written to “results/checkpoints/seed5_nfe6100.checkpoint”. The write frequency for both checkpoints and runtime metrics is set by the “runtimeFrequency” parameter in “submit_msmp.sh”.
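As a quick sketch, the naming pattern above can be reproduced with a small helper (the function name here is illustrative, not part of the Borg wrapper):

```python
# Illustrative helper reproducing the checkpoint naming pattern described
# above: the base name plus "_nfe<N>.checkpoint". Not part of the Borg API.
def checkpoint_filename(base, nfe):
    return base + '_nfe' + str(nfe) + '.checkpoint'

# Seed 5 after 6100 FE:
# checkpoint_filename('results/checkpoints/seed5', 6100)
# -> 'results/checkpoints/seed5_nfe6100.checkpoint'
```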
We can also provide a previous checkpoint to read in at the beginning of the run:
oldCheckpointFile = None  # start fresh when no prior checkpoint is provided
if maxEvalsPrevious > 0:
    oldCheckpointFile = 'results/checkpoints/seed' + str(seed) + '_nfe' + str(maxEvalsPrevious) + '.checkpoint'
where “maxEvalsPrevious” is the number of FE for the prior checkpoint that we want to read in. This parameter is input in “submit_msmp.sh”. The previous checkpoint file must be placed in the “checkpoints” folder prior to runtime.
We provide both checkpoint file names to the Borg MOEA, along with other important parameters such as the maximum number of evaluations.
result = borg.solveMPI(maxEvaluations=maxEvals, runtime=runtimeFile, frequency=runtimeFreq, newCheckpointFileBase=newCheckpointFileBase, oldCheckpointFile=oldCheckpointFile, evaluationFile=evaluationFile)
The Python wrapper for the Borg MOEA will use this command to initialize and run the MOEA across all available nodes. For more details on the implementation of the experiment, see the GitHub repository above.
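To make the round-to-round bookkeeping concrete, here is a minimal sketch of how the checkpoint-related arguments could be assembled for each round. The helper name and dictionary keys mirror the parameters above, but the function itself is illustrative, and it assumes maxEvaluations is the cumulative FE target across rounds:

```python
def round_config(seed, max_evals_previous, evals_this_round):
    """Assemble checkpoint-related arguments for one round of optimization.

    Illustrative sketch only: assumes maxEvaluations is the cumulative FE
    target, so a round restoring a 6100-FE checkpoint and running another
    6100 FE would target 12200 total.
    """
    base = 'results/checkpoints/seed' + str(seed)
    old = (base + '_nfe' + str(max_evals_previous) + '.checkpoint'
           if max_evals_previous > 0 else None)
    return {
        'newCheckpointFileBase': base,
        'oldCheckpointFile': old,
        'maxEvaluations': max_evals_previous + evals_this_round,
    }

# Round 1: fresh start; Round 2: restore from the round-1 checkpoint.
round1 = round_config(5, 0, 6100)
round2 = round_config(5, 6100, 6100)
```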
The checkpoint file
So what information is saved in the checkpoint file? This file includes all internal state variables needed to initialize a new instance of the Borg MOEA that begins where the previous run left off. As an example, I will highlight select information from “seed5_nfe6100.checkpoint”, the last checkpoint file (at 6100 FE) for seed 5 after the first round.
First, the file lists basic problem formulation information such as the number of objectives, the decision variable bounds, and the epsilon dominance parameters:
Version: 108000
Number of Variables: 6
Number of Objectives: 4
Number of Constraints: 1
Lower Bounds: -2 0 0 -2 0 0
Upper Bounds: 2 2 1 2 2 1
Epsilons: 0.01 0.01 0.0001 0.0001
Number of Operators: 6
Next, it lists current values for several parameters which evolve over the course of the optimization, such as the selection probabilities for the six recombination operators, the number of times the algorithm has triggered a “restart” due to stalled progress, and the number of solutions in the population:
Operator Selection Probabilities: 0.93103448275862066 0.0086206896551724137 0.017241379310344827 0.025862068965517241 0.0086206896551724137 0.0086206896551724137
Number of Evaluations: 6100
Tournament Size: 3
Window Size: 200
...
Number of Restarts: 0
...
Population Size: 196
Population Capacity: 196
The file then lists each of the 196 solutions currently stored in the population:
Population:
0.52276605276517329 1.1277550312248441 0.57174902357263913 -0.053873968914707165 1.5491763609435476 0.68045375364276195 -0.2810269062318374 0.18479149938317199 -1 -1 0 0
...
0.27892075855585108 0.45130064043648516 0.77550656566943632 0.31799415686665755 1.0881848609123497 0.60273027430733062 -0.25193459430454329 0.16022123986158276 -0.98989898989898994 -1 0 0
Each line in this section represents a different solution; the numbers represent the 6 decision variables, 4 objectives, and 1 constraint for this problem, as well as the index of the operator used to create each solution.
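A quick way to sanity-check this layout is to split a population line back into its components. The parser below is a sketch assuming the field order just described; it is not part of the Borg toolkit:

```python
def parse_solution_line(line, n_vars=6, n_objs=4, n_constrs=1):
    """Split one checkpoint population line into (variables, objectives,
    constraints, operator index), assuming the field order described above."""
    fields = line.split()
    values = [float(x) for x in fields]
    variables = values[:n_vars]
    objectives = values[n_vars:n_vars + n_objs]
    constraints = values[n_vars + n_objs:n_vars + n_objs + n_constrs]
    operator = int(fields[-1])  # last field: index of the generating operator
    return variables, objectives, constraints, operator

# First population line from the checkpoint excerpt above:
line = ('0.52276605276517329 1.1277550312248441 0.57174902357263913 '
        '-0.053873968914707165 1.5491763609435476 0.68045375364276195 '
        '-0.2810269062318374 0.18479149938317199 -1 -1 0 0')
variables, objectives, constraints, operator = parse_solution_line(line)
# 6 variables, 4 objectives, 1 constraint, operator index 0
```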
Similarly, the file lists the current archive, which is the subset of 64 solutions from the larger population that are non-dominated:
Archive Size: 64
Archive:
0.65953806401594717 1.3878689423124047 0.5978484500419472 0.15387554544946802 1.1931240566680981 0.62731909408989983 -0.32254774259834024 0.21682247784367689 -1 -1 0 0
...
0.28214763078942073 0.44992540083980864 0.77495402511037126 0.1548097106329358 1.0881858398758053 0.6850068814635788 -0.2456560390454271 0.15760082516545762 -0.98989898989898994 -1 0 0
The checkpointing file then lists several parameters associated with the most recently used recombination operators:
Number of Improvements: 194
Recency List Size: 50
Recency List Position: 47
Recency List: 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
We see that after 6100 FE, the algorithm is almost exclusively using the operator with index 0; this is consistent with operator 0’s 93% selection probability listed earlier in the file.
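We can confirm that consistency by tallying the recency list, which records the operator index used to generate each of the last 50 solutions. A small sketch, with the list copied from the checkpoint excerpt above:

```python
from collections import Counter

# Recency list copied from the checkpoint excerpt above: operator indices
# for the 50 most recently generated solutions.
recency_list = [0] * 6 + [3] + [0] * 43
counts = Counter(recency_list)
fraction_operator_0 = counts[0] / len(recency_list)
# 49 of the last 50 solutions came from operator 0 (98%), consistent
# with its ~93% selection probability.
```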
Finally, the checkpoint file closes with the current state of the random number generator:
RNG State Length: 624
RNG State: 2103772999 2815404811 3893540187 ...
RNG Index: 200
Next Gaussian: -0.13978937466755278
Have Next Gaussian: 0
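Since the scalar fields throughout the checkpoint file follow a simple “Key: values” layout, a few lines of Python are enough to pull out whatever we want to inspect. This reader is an illustrative sketch, not part of the Borg toolkit:

```python
def read_checkpoint_fields(lines, keys):
    """Collect 'Key: value ...' fields from checkpoint file lines,
    returning each requested key's values as a list of strings."""
    fields = {}
    for line in lines:
        key, sep, rest = line.partition(':')
        if sep and key.strip() in keys:
            fields[key.strip()] = rest.split()
    return fields

# A few header lines from the excerpts above:
sample = [
    'Number of Variables: 6',
    'Number of Objectives: 4',
    'Epsilons: 0.01 0.01 0.0001 0.0001',
]
fields = read_checkpoint_fields(sample, {'Number of Objectives', 'Epsilons'})
```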
So does it work? Figure 1 shows the convergence of the five seeds according to six different measures of quality for the evolving archive. Performance for all seeds is continuous across the two rounds, suggesting that checkpointing and restoring are working properly. Additionally, Figure 2 demonstrates the continuity of the auto-adaptive selection probabilities for the Borg MOEA’s different recombination operators.
The checkpointing process thus allows for continuity of learning across rounds: not only learning about which solutions are most successful (i.e., continuity of the population and archive), but also learning about which strategies for generating new candidate solutions are most successful (i.e., continuity of the recombination operator histories and probabilities). This capability allows for seamlessly sequencing separate “blocks” of function evaluations in situations where wall clock limits are restrictive, or where the convergence behavior of the problem is not well understood a priori.