June 30, 2021 by Antonia Hadjimichael

How to schedule massively parallel jobs on clusters – some basic ways

Massively (or embarrassingly) parallel are processes that are either completely separate or can easily be made to be. This can be cases where tasks don’t need to pass information from one to another (they don’t share memory) and can be executed independently of another on whatever resources are available, for example, large Monte Carlo runs, each representing different sets of model parameters.

There isn’t any guidance on how to do this on the blog, besides an older post on how to do it using PBS, but most of our current resources use SLURM. So I am going to show two ways: a) using SLURM job arrays; and b) using the GNU parallel module. Both methods allow for tasks to be distributed across multiple cores and across multiple nodes. In terms of how it affects your workflow, the main difference between the two is that GNU parallel allows you to automatically resume/rerun a task that has failed, whereas using SLURM job arrays you have to resubmit the failed tasks manually.

Your first step using either method is to configure the function representing each task to be able to receive as arguments a task id. For example, if I would like to run my model over 100 parameter combinations, I would have to create my model function as function_that_executes_model(sample=i, [other_arguments]), where the sample number i would correspond to one of my parameter combinations and the respective task to be submitted.

For python, this function needs to be contained within a .py script which will be executing this function when called. Your .py script could look like this, using argparse to parse the function arguments but there are alternatives:

import argparse
import ...

other_arg1= 1
other_arg2= 'model'

def function_that_executes_model(sample=i, other_arg1, other_arg2):
    #do stuff pertaining to sample i
    return

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='This function executes the model with a sample number')
    parser.add_argument('i', type=int,
                        help='sample number')
    args = parser.parse_args()
    function_that_executes_model(args.i)

To submit this script (say you saved it as function_executor.py) using SLURM job arrays:

#!/bin/bash
#SBATCH --partition=compute   # change to your own cluster partition
#SBATCH --cpus-per-task       # number of cores for each task
#SBATCH -t 0:45:00            # max wallclock time
#SBATCH --array=1-100         # array of tasks to execute

module load python
srun python3 function_executor.py $SLURM_ARRAY_TASK_ID

This will submit 100 1-core jobs to the cluster’s scheduler, and in your queue they will be listed as JOB_ID-TASK_ID.

Alternatively, you can use your cluster’s GNU parallel module to submit this like so:

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --ntasks=100
#SBATCH --time=00:45:00

module load parallel
module load python

# This specifies the options used to run srun. The "-N1 -n1" options are
# used to allocates a single core to each task.
srun="srun --export=all --exclusive -N1 -n1"

# This specifies the options used to run GNU parallel:
#
#   -j is the number of tasks run simultaneously.

parallel="parallel -j $SLURM_NTASKS"

$parallel "$srun python3 function_executor.py" ::: {1..100}

This will instead submit 1 100-core job where each core executes one task. GNU parallel also allows for several additional options that I find useful, like the use of a log to track task execution (--joblog runtask.log) and --resume which will identify the last unfinished task and resume from there the next time you submit this script.

June 21, 2021 by Lillian Lau

The ABCs of MOEAs

We have recently begun introducing multi-objective evolutionary algorithms (MOEAs) to a few new additions to our research group, and it reminded me of when I began learning the relevant terminology myself barely a year ago. I recalled using Antonia’s glossary of commonly-used terms as I was getting acquainted with the group’s research in general, and figured that it might be helpful to do something similar for MOEAs in particular.

This glossary provides definitions and examples in plain language for terms commonly used to explain and describe MOEAs, and is intended for those who are just being introduced to this optimization tool. It also has a specific focus on the Borg MOEA, which is a class of algorithms used in our group. It is by no means exhaustive, and since the definitions are in plain language, please do leave a comment if I have left anything out or if the definitions and examples could be better written.

Greek symbols

ε-box

Divides up the objective space into n-dimensional boxes with side length ε. Used as a “filter” to prevent too many solutions from being “kept” by the archive. The smaller the size of the ε-box, the finer the “mesh” of the filter, and the more solutions are kept by the archive. Manipulating the value of ε affects convergence and diversity.

Each ε-box can only hold one solution at a time. If two solutions are found that reside in the same ε-box, the solution closest to the lower left corner of the box will be kept, while the other will be eliminated.

ε-dominance

A derivative of Pareto dominance. A solution x is said to ε-dominate solution y if it lies in the lower left corner of an ε-box for at least one objective, and is not ε-dominated by solution y for all other objectives.

ε-progress

ε-progress occurs when the current solution x lies in an different ε-box that dominates the previous solution. Enforces a minimum threshold ( ε ) over which an MOEA’s solution must exceed to avoid search stagnation.

ε-value

The “resolution” of the problem. Can also be interpreted a measure of the degree of error tolerance of the decision-maker. The ε-values can be set according to the discretion of the decision-maker.

A

A posteriori

Derived from Latin for “from the latter”. Typically used in multi-objective optimization to indicate that the search for solutions precedes the decision-making process. Exploration of the trade-offs resulting from different potential solutions generated by the MOEA is used to identify preferred solutions. Used when uncertainties and preferences are not known beforehand.

A priori

Derived from Latin for “from the former”. Typically used in multi-objective optimization to indicate that a set of potential solutions have already been decided beforehand, and that the function of a search is to identify the best solution(s). Used when uncertainties and preferences are known (well-characterized).

Additive ε-indicator

The distance that the known Pareto front must “move” to dominate the true Pareto front. In other words, the gap between the current set of solutions and the true (optimal) solutions. A performance measure of MOEAs that captures convergence. Further explanation can be found here.

Archive

A “secondary population” that stores the non-dominated solutions. Borg utilizes ε-values to bound the size of the archive (an ε-box dominance archive) . That is, solutions that are ε-dominated are eliminated. This helps to avoid deterioration.

C

Conditional dependencies

Decision variables are conditionally dependent on each other if the value of one decision variable affects one or more if its counterparts.

Control maps

Figures that show the hypervolume achieved in relation to the number of objective function evaluations (NFEs) against the population size for a particular problem. Demonstrates the ability of an MOEA to achieve convergence and maintain diversity for a given NFE and population size. An ideal MOEA will be able to achieve a high hypervolume for any given NFE and population size.

Controllability

An MOEA with a high degree of controllability is one that results in fast convergence rates, high levels of diversity, and a large hypervolume regardless of the parameterization of the MOEA itself. That is, a controllable MOEA is insensitive to its parameters.

Convergence

Two definitions:

An MOEA is said to have “converged” at a solution when the subsequent solutions are no better than the previous set of solutions. That is, you have arrived at the best set of solutions that can possibly be attained.
The known Pareto front of the MOEA is approximately the same as the true Pareto front. This definition requires that the true Pareto front be known.

D

Decision variables

Variables that can be adjusted and set by the decision-maker.

Deterioration

Occurs when elements of the current solution are dominated by a previous set of solutions within the archive. This indicates that the MOEA’s ability to find new solutions is diminishing.

Diversity

The “spread” of solutions throughout the objective space. An MOEA is said to be able to maintain diversity if it is able to find many solutions that are evenly spread throughout the objective space.

Dominance resistance

A phenomenon in which an MOEA struggles to produce offspring that dominate non-dominated members of the population. That is, the current set of solutions are no better than the worst-performing solutions of the previous set. An indicator of stagnation.

E

Elitist selection

Refers to the retention of a select number of ‘best’ solutions in the previous population, and filling in the slots of the current generation with a random selection of solutions from the archive. For example, the Borg MOEA utilizes elitist selection during the randomized restarts when the best k-solutions from the previous generation are maintained in the population.

Epistasis

Describes the interactions between the different operators used in Borg MOEA. Refers to the phenomenon in which the heavier applications of one operator suppresses the use of other operators, but does not entirely eliminate the use of the lesser-used operators. Helps with finding new solutions. Encourages diversity and prevents pre-convergence.

G

Generation

A set of solutions generated from one iteration of the MOEA. Consists of both dominated and non-dominated solutions.

Generational

Generational MOEAs apply the selection, crossover and mutation operators all at once to an entire generation of the population. The result is a complete replacement of the entire generation at the next time-step.

Generational distance

The average distance between the known Pareto front and the true Pareto front. The easiest performance metric to meet, and captures convergence of the solutions. Further explanation can be found here.

Genetic algorithm

An algorithm that uses the principles of evolution – selection, mutation and crossover – to search for solutions to a problem given a starting point, or “seed”.

H

Hypervolume

The n-dimensional “volume” covered by the known Pareto front with respect to the total n-dimensional volume of all the objectives of a problem, bounded by a reference point. Captures both convergence and diversity. One of the performance measures of an MOEA. Further explanation can be found here.

I

Injection

The act of “refilling” the population with members of the archive after a restart. Injection can also include filling the remaining slots in the current population with new, randomly-generated solutions or mutated solutions. This helps to maintain diversity and prevent pre-convergence.

L

Latin hypercube sampling (LHS)

A statistical method of sampling random numbers in a way that reflects the true underlying probability distribution of the data. Useful for high-dimensional problems such as those faced in many-objective optimization. More information on this topic can be found here.

M

Many-objective problem

An optimization problem that involves more than three objectives.

Mutation

One of the three operators used in MOEAs. Mutation occurs when a solution from the previous population is slightly perturbed before being injected into the next generation’s population. Helps with maintaining diversity of solutions.

Multi-objective

An optimization problem that traditionally involves two to three objectives.

N

NFE

Number of function evaluations. The maximum number of times an MOEA is applied to and used to update a multi (or many)-objective problem.

O

Objective space

The n-dimensional space defined by the number, n, of objectives as stated by the decision-maker. Can be thought of as the number of axes on an n-dimensional graph.

Offspring

The result of selection, mutation, or crossover in the current generation. The new solutions that, if non-dominated, will be used to replace existing members in the current generation’s population.

Operator

Genetic algorithms typically use the following operators – selection, crossover, and mutation operators. These operators introduce variation in the current generation to produce new, evolved offspring. These operators are what enable MOEAs to search for solutions using random initial solutions with little to no information.

P

Parameters

Initial conditions for a given MOEA. Examples of parameters include population-to-archive ratio, initial population size, and selection ratio.

Parameterization

An MOEA with a high degree of parameterization implies that it requires highly-specific parameter values to generate highly-diverse solutions at a fast convergence rate.

Parents

Members of the current generation’s population that will undergo selection, mutation, and/or crossover to generate offspring.

Pareto-dominance

A solution x is said to Pareto-dominate another solution y if x performs better than y in at least one objective, and performs at least as well as y in all other objectives.

Pareto-nondominance

Both solutions x and y are said to be non-dominating if neither Pareto-dominates the other. That is, there is at least one objective in which solution x that is dominated by solution y and vice versa.

Pareto front

A set of solutions (the Pareto-optimal set) that are non-dominant to each other, but dominate other solutions in the objective space. Also known as the tradeoff surface.

Pareto-optimality

A set of solutions is said to have achieved Pareto-optimality when all the solutions within the same set non-dominate each other, but are dominant to other solutions within the same objective space. Not to be confused with the final, “optimal” set of solutions.

Population

A current set of solutions generated by one evaluation of the problem by an MOEA. Populated by both inferior and Pareto-optimal solutions; can be static or adaptive. The Borg MOEA utilizes adaptive population sizing, of which the size of the population is adjusted to remain proportional to the size of the archive. This prevents search stagnation and the potential elimination of useful solutions.

Pre-convergence

The phenomenon in which an MOEA mistakenly converges to a local optima and stagnates. This may lead the decision-maker to falsely conclude that the “best” solution set has been found.

R

Recombination

One of the ways that a mutation operator acts upon a given solution. Can be thought of as ‘shuffling’ the current solution to produce a new solution.

Rotation

Applying a transformation to change the orientation of the matrix (or vector) of decision variables. Taking the transpose of a vector can be thought of as a form of rotation.

Rotationally invariant

MOEAs that utilize rotationally invariant operators are able to generate solutions for problems and do not require that the problem’s decision variables be independent.

S

Search stagnation

Search stagnation is said to have occurred if the set of current solutions do not ε-dominate the previous set of solutions. Detected by the ε-progress indicator (ref).

Selection

One of the three operators used in MOEAs. The selection operator chooses the ‘best’ solutions from the current generation of the population to be maintained and used in the next generation. Helps with convergence to a set of optimal solutions.

Selection pressure

A measure of how ‘competitive’ the current population is. The larger the population and the larger the tournament size, the higher the selection pressure.

Steady-state

A steady-state MOEA applies its operators to single members of its population at a time. That is, at each step, a single individual (solution) is selected as a parent to be mutated/crossed-over to generate an offspring. Each generation is changed one solution at each time-step.

T

Time continuation

A method in which the population is periodically ’emptied’ and repopulated with the best solutions retained in the archive. For example, Borg employs time continuation during its randomized restarts when it generates a new population with the best solutions stored in the archive and fills the remaining slots with randomly-generated or mutated solutions.

Tournament size

The number of solutions to be ‘pitted against each other’ for crossover or mutation. The higher the tournament size, the more solutions are forced to compete to be selected as parents to generate new offspring for the next generation.

References

Coello, C. C. A., Lamont, G. B., & Van, V. D. A. (2007). Evolutionary Algorithms for Solving Multi-Objective Problems Second Edition. Springer.

Hadjimichael, A. (2017, August 18). Glossary of commonly used terms. Water Programming: A Collaborative Research Blog. https://waterprogramming.wordpress.com/2017/08/11/glossary-of-commonly-used-terms/.

Hadka, D., & Reed, P. (2013). Borg: An Auto-Adaptive Many-Objective Evolutionary Computing Framework. Evolutionary Computation, 21(2), 231–259. https://doi.org/10.1162/evco_a_00075

Kasprzyk, J. R. (2013, June 25). MOEA Performance Metrics. Water Programming: A Collaborative Research Blog. https://waterprogramming.wordpress.com/2013/06/25/moea-performance-metrics/.

Li, M. (n.d.). Many-Objective Optimisation. https://www.cs.bham.ac.uk/~limx/MaOP.html.

What is Latin Hypercube Sampling? Statology. (2021, May 10). https://www.statology.org/latin-hypercube-sampling/.

June 14, 2021 by Rohini Gupta

CNNs for Time Series Applications

This post is meant to be an introduction to convolutional neural networks (CNNs) and how they can be applied to continuous prediction problems, such as time series predictions. CNNs have historically been utilized in image classification applications. At a high level, CNNs use small kernels (filters) that can slide over localized regions of an image and detect features from edges to faces, much in the same way as the visual cortex of a brain (Hubel and Wiesel, 1968). The basic concepts of a CNN were first introduced by Kunihiko Fukushima in 1980 and the first use of CNNs for image recognition were carried out by Yann LeCun in 1988. The major breakthrough for the algorithm didn’t happen until 2000 with the advent of GPUs and by 2015, CNNs were favored to win image recognition contests over other deep networks.

It is believed that recurrent style networks such as LSTMs are the most appropriate algorithms for time series prediction, but studies have been conducted that suggest that CNNs can perform equivalently (or better) and that appropriate filters can extract features that are coupled across variables and time while being computationally efficient to train (Bai et al., 2018, Rodrigues et al., 2021). Below, I’ll demonstrate some of the key characteristics of CNNs and how CNNs can be used for time series prediction problems.

Architecture

Everything You Need to Know About Convolutional Neural Networks

Figure 1: CNN schematic for image classification (Sharma, 2018)

Figure 1 shows a schematic of a CNN’s architecture. The architecture is primarily comprised of a series of convolution and pooling layers followed by a fully connected network. In each convolution layer are kernel matrices that are convolved with the input into the convolution layer. It is up to the user to define the number of kernels and size of the kernels, but the weights in the kernel are learned using backpropagation. A bias is added to the output of the convolution layer and then passed through an activation function, such as ReLU function to yield feature maps. The feature maps are stacked in a cuboid of a depth that equals the number of filters. If the convolution layer is followed by a pooling layer, the feature maps are down-sampled to produce a lower dimensional representation of the feature maps. The output from the final pooling or convolutional layer is flattened and fed to the fully connected layers.

We will now look at the components of the architecture in more detail. To demonstrate how the convolutional layer works, we will use a toy example shown in Figure 2.

Figure 2: Convolution of a 3×3 kernel with the original image

Let’s say that our input is an image is represented as a 5×5 array and the filter is a 3×3 kernel that will be convolved with the image. The result is the array termed Conv1 which is just another array where each cell is the dot product between the filter and the 3×3 subsections of the image. The numbers in color represent the values that the filter is centered on. Note that the convolution operation will result in an output that is smaller than the input and can result in a loss of information around the boundaries of the image. Zero padding, which constitutes adding border of zeros around the input array, can be used to preserve the input size. The kernel matrices are the mechanisms by which the CNN is able to identify underlying patterns. Figure 3 shows examples of what successive output from convolution layers, or feature maps, can look like.

Figure 3: Convolutional layer output for a CNN trained to distinguish between cats and dogs (Dertat, 2017)

The filters in the first convolutional layer of a CNN retain most of the information of the image, particularly edges. The brightest colors represent the most active pixels. The feature maps tend to become more abstract or focused on specific features as you move deeper into the network (Dertat, 2017). For example, Block 3 seems to be tailored to distinguish eyes.

The other key type of layer is a pooling layer. A pooling layer is added after convolution to reduce dimensionality, which can both reduce computational time to train by reducing parameters but can also reduce the chances of overfitting. The most common type of pooling is max pooling which returns the max value in a NxN matrix pooling filter. This type of pooling retains the most active pixels in the feature map. As demonstrated in Figure 4, max pooling, using a 2×2 filter with a stride (or shift) of 2 pixels, reduces our Conv1 layer into a 2×2 lower dimensional matrix. One can also do average pooling instead of max pooling which would take the average of the values in each 2×2 subsection of the Conv1 layer.

Figure 4: Max pooling example

Application to Regression

CNNs are easiest to understand and visualize for image applications which provide a basis for thinking about how we can use CNNs in a regression or prediction application for time series. Let’s use a very simple example of a rainfall-runoff problem that uses daily precipitation and temperature to predict outflow in an ephemeral sub-basin within the Tuolumne Basin. Because the sub-basin features a creek that is ephemeral, this means that the creek can dry up across the simulation period and there can be extended periods of zero flow. This can make predictions in the basin very difficult. Here, we also implement a lag which allows us to consider the residence time of the basin and that precipitation/temperature from days before likely will contribute to predicting the outflow today. We use a lag of 18, meaning that we use the previous 18 values of precipitation and temperature to predict outflow. The CNN model is implemented within Keras in the code below.

#import modules

import numpy as np
import pandas as pd
from keras.utils import to_categorical
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense
from keras.layers.convolutional import Conv1D, Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers import Dropout, Activation, Flatten
from keras.optimizers import SGD
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook
import seaborn as sns
import os

os.getcwd()
os.chdir("C:/Users/Rohini/Documents/")
df_ge = pd.read_csv("Sub_0_daily.csv", index_col=0) 
print(df_ge.head())

#Check for nulls
print("checking if any null values are present\n", df_ge.isna().sum())

#Specify the training columns by their names
train_cols = ["Precipitation","Temperature"]
label_cols = ["Outflow"]



# This function normalizes the input data
def Normalization_Transform(x):
    x_mean=np.mean(x, axis=0)
    x_std= np.std(x, axis=0)
    xn = (x-x_mean)/x_std
    return xn, x_mean,x_std




# This function reverses the normalization 
def inverse_Normalization_Transform(xn, x_mean,x_std):
    xd = (xn*x_std)+x_mean
    return xd



# building timeseries data with given timesteps (lags)
def timeseries(X, Y, Y_actual, time_steps, out_steps):
    input_size_0 = X.shape[0] - time_steps
    input_size_1 = X.shape[1]
    X_values = np.zeros((input_size_0, time_steps, input_size_1))
    Y_values = np.zeros((input_size_0,))
    Y_values_actual = np.zeros((input_size_0,))
    
    for i in tqdm_notebook(range(input_size_0)):
        X_values[i] = X[i:time_steps+i]
        Y_values[i] = Y[time_steps+i-1, 0]
        Y_values_actual[i] = Y_actual[time_steps+i-1, 0]
        
    print("length of time-series i/o",X_values.shape,Y_values.shape)
    return X_values, Y_values, Y_values_actual


df_train, df_test = train_test_split(df_ge, train_size=0.8, test_size=0.2, shuffle=False)
x_train = df_train.loc[:,train_cols].values
y_train = df_train.loc[:,label_cols].values
x_test = df_test.loc[:,train_cols].values
y_test = df_test.loc[:,label_cols].values    
   
#Normalizing training data
x_train_nor = xtrain_min_max_scaler.fit_transform(x_train)
y_train_nor = ytrain_min_max_scaler.fit_transform(y_train)

# Normalizing test data
x_test_nor = xtest_min_max_scaler.fit_transform(x_test)
y_test_nor = ytest_min_max_scaler.fit_transform(y_test)

# Saving actual train and test y_label to calculate mean square error later after training
y_train_actual = y_train
y_test_actual = y_test

#Building timeseries
X_Train, Y_Train, Y_train_actual = timeseries(x_train_nor, y_train_nor, y_train_actual, time_steps=18, out_steps=1)
X_Test, Y_Test, Y_test_actual = timeseries(x_test_nor, y_test_nor, y_test_actual, time_steps=18, out_steps=1)

#Define CNN model

def make_model(X_Train):
    input_layer = Input(shape=(X_Train.shape[1],X_Train.shape[2]))

    conv1 = Conv1D(filters=16, kernel_size=2, strides=1,
                    padding='same',activation='relu')(input_layer)
    conv2 = Conv1D(filters=32, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv1)
    conv3 = Conv1D(filters=64, kernel_size=3,strides = 1,
                          padding='same', activation='relu')(conv2)
    flatten = Flatten()(conv3)
    dense1 = Dense(1152, activation='relu')(flatten)
    dense2 = Dense(576, activation='relu')(dense1)
    output_layer = Dense(1, activation='linear')(dense2)
    
    return Model(inputs=input_layer, outputs=output_layer)

model = make_model(X_Train)
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(X_Train, Y_Train, epochs=10)


#Prediction and inverting results 
ypred = model.predict(X_Test)
predict =inverse_Normalization_Transform(ypred,y_mean_train, y_std_train)


#Plot results
plt.figure(figsize=(11, 7))

plt.plot(y_test)
plt.plot((predict))

plt.title('Outflow Prediction (Precipitation+Temperature,Epochs=10, Lag=18 hours)')
plt.ylabel('Outflow (cfs)')
plt.xlabel('Day')
plt.legend(['Actual Values','Predicted Values'], loc='upper right')
plt.show()

Just as with any algorithm, we normalize the input data and split it into testing and training sets. The CNN model is implemented in Keras and consists of three convolutional layers with kernel sizes that are explicitly defined to extract patterns that are coupled across variables and time. A schematic of the setup is shown in Figure 5.

Figure 5: Convolution layer setup for the Tuolumne case

Layer 1 uses a 1D convolutional layer with 16 filters of size 1×2 in order to extract features and interactions across the precipitation and temperature time series as demonstrated in the top left of Figure 5. The result of this is an output layer of 1x18x16. The second convolution layer uses 32, 3×1 filters which now will further capture temporal interactions down the output column vector. The third layer uses 64, 3×1 filters to capture more complex temporal trends which is convolved with the output from the Conv2 layer. Note that zero padding is added (padding =”same” in the code) to maintain the dimensions of the layers. The three convolutional layers are followed by a flattening layer and a three-layer dense network. The CNN was run 20 times and the results from the last iteration are shown in Figure 6. We also compare to an LSTM that has an equivalent 3-layer setup and that is also run 20 times. The actual outflow is shown in blue while predictions are shown in red.

Figure 6: CNN vs LSTM prediction

For all purposes, the visual comparison yields that CNNs and LSTMs work equivalently, though the CNN was considerably faster to train. Notably, the CNN does a better job of capturing the large extremes recorded on day 100 and day 900, while still capturing the dynamics of the lower flow regime. While these results are preliminary and largely un-optimized, the CNN shows the ability to outperform an LSTM for a style of problem that it is not technically designed for. Using the specialized kernels, the CNN learns the interactions (both across variables and temporally) without needing a mechanism specifically designed for memory, such as a cell state in an LSTM. Furthermore, CNNs can greatly take advantage of additional speedups from GPUs which doesn’t always produce large gain in efficiency for LSTM training. For now, we can at least conclude that CNNs are fast and promising alternatives to LSTMs that you may not have considered before. Future blog posts will dive more into the capabilities of CNNs in problems with more input variables and complex interactions, particularly if there seems to be a benefit from CNNs in resolving complex relationships that help to predict extremes.

References

Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology, 195(1), 215-243.

Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Rodrigues, N. M., Batista, J. E., Trujillo, L., Duarte, B., Giacobini, M., Vanneschi, L., & Silva, S. (2021). Plotting time: On the usage of CNNs for time series classification. arXiv preprint arXiv:2102.04179.

Sharma, V. (2018). https://vinodsblog.com/2018/10/15/everything-you-need-to-know-about-convolutional-neural-networks/

Dertat, A. (2017). https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2

June 7, 2021 by David Gold

Scaling experiments: how to measure the performance of parallel code on HPC systems

Parallel computing allows us to speed up code by performing multiple tasks simultaneously across a distributed set of processors. On high performance computing (HPC) systems, an efficient parallel code can accomplish in minutes what might take days or even years to perform on a single processor. Not all code will scale well on HPC systems however. Most code has inherently serial components that cannot be divided among processors. If the serial component is a large segment of a code, the speedup gained from parallelization will greatly diminish. Memory and communication bottlenecks also present challenges to parallelization, and their impact on performance may not be readily apparent.

To measure the parallel performance of a code, we perform scaling experiments. Scaling experiments are useful as 1) a diagnostic tool to analyze performance of your code and 2) a evidence of code performance that can be used when requesting allocations on HPC systems (for example, NSF’s XSEDE program requires scaling information when requesting allocations). In this post I’ll outline the basics for performing scaling analysis of your code and discuss how these results are often presented in allocation applications.

Amdahl’s law and strong scaling

One way to measure the performance a parallel code is through what is known as “speedup” which measures the ratio of computational time in serial to the time in parallel:

$speedup = \frac{t_s}{t_p}$

Where t_s is the serial time and t_p is the parallel time.

The maximum speedup of any code is limited the portion of code that is inherently serial. In the 1960’s programmer Gene Amdahl formalized this limitation by presenting what is now known as Amdahl’s law:

$Speedup = \frac{t_s}{t_p} = \frac{1}{s+(1-s)/p} < \frac{1}{s}$

Where p is the number of processors, and s is the fraction of work that is serial.

On it’s face, Amdahl’s law seems like a severe limitation for parallel performance. If just 10% of your code is inherently serial, then the maximum speedup you can achieve is a factor of 10 ( s= 0.10, 1/.1 = 10). This means that even if you run your code over 1,000 processors, the code will only run 10 times faster (so there is no reason to run across more than 10 processors). Luckily, in water resources applications the inherently serial fraction of many codes is very small (think ensemble model runs or MOEA function evaluations).

Experiments that measure speedup of parallel code are known as “strong scaling” experiments. To perform a strong scaling experiment, you fix the amount of work for the code to do (ie. run 10,000 MOEA function evaluations) and examine how long it takes to finish with varying processor counts. Ideally, your speedup will increase linearly with the number of processors. Agencies that grant HPC allocations (like NSF XSEDE) like to see the results of strong scaling experiments visually. Below, I’ve adapted a figure from an XSEDE training on how to assess performance and scaling:

Plots like this are easy for funding agencies to assess. Good scaling performance can be observed in the lower left corner of the plot, where the speedup increases linearly with the number of processors. When the speedup starts to decrease, but has not leveled off, the scaling is likely acceptable. The peak of the curve represents poor scaling. Note that this will actually be the fastest runtime, but does not represent an efficient use of the parallel system.

Gustafson’s law and weak scaling

Many codes will not show acceptable scaling performance when analyzed with strong scaling due to inherently serial sections of code. While this is obviously not a desirable attribute, it does not necessarily mean that parallelization is useless. An alternative measure of parallel performance is to measure the amount of additional work that can be completed when you increase the number of processors. For example, if you have a model that needs to read a large amount of input data, the code may perform poorly if you only run it for a short simulation, but much better if you run a long simulation.

In the 1980s, John Gustafson proposed a relationship that notes relates the parallel performance to the amount of work a code can accomplish. This relationship has since been termed Gustafson’s law:

$speedup = s+p*N$

Where s and p are once again the portions of the code that are serial and parallel respectively and N is the number of core.

Gustafson’s law removes the inherent limits from serial sections of code and allows for another type of scaling analysis, termed “weak scaling”. Weak scaling is often measured by “efficiency” rather than speedup. Efficiency is calculated by proportionally scaling the amount of work with the number of processors and measure the ratio of completion times:

$efficiency = \frac{t_1}{t_N}$

Ideally, efficiency will be close to one (the time it take one processor to do one unit of work is the same time it takes N processors to do N units of work). For resource allocations, it is again advisable to visualize the results of weak scaling experiments by creating plots like the one shown below (again adapted from the XSEDE training).

Final thoughts

Scaling experiments will help you understand how your code will scale and give you a realistic idea of computation requirements for large experiments. Unfortunately however, it will not diagnose the source of poor scaling. To improve scaling performance, it often helps to improve the serial version of your code as much as possible. A helpful first step is to profile your code. Other useful tips are to reduce the frequency of data input/output and (if using compiled code) to check the flags on your compiler (see some other tips here).

Water Programming: A Collaborative Research Blog

Tips and tricks on programming, evolutionary algorithms, and doing research

Month: June 2021

How to schedule massively parallel jobs on clusters – some basic ways

The ABCs of MOEAs

Greek symbols

A

C

D

E

G

H

I

L

M

N

O

P

R

S

T

References

CNNs for Time Series Applications

Architecture

Application to Regression

Scaling experiments: how to measure the performance of parallel code on HPC systems

Amdahl’s law and strong scaling

Gustafson’s law and weak scaling

Final thoughts