Parallel Programming Part 2: Demonstrating OpenMP constructs with a simple test case

In Parallel Programming Part 1, I introduced a number of basic OpenMP directives and constructs, as well as provided a quick guide to high-performance computing (HPC) terminology. In Part 2 (aka this post), we will apply some of the constructs previously learned to execute a few OpenMP examples. We will also view an example of incorrect use of OpenMP constructs and its implications.

TheCube Cluster Setup

Before we begin, there are a couple of steps to ensure that we will be able to run C++ code on the Cube. First, check the default version of the C++ compiler, which transforms your code into an executable file that the Cube can run. You can do this by entering

g++ --version

into your command line. You should see that the Cube has G++ version 8.3.0 installed, as shown below.

Now, create two files:

  • openmp-demo-print.cpp: the file where we will perform a print demo to show how tasks are distributed between threads
  • openmp-demo-access.cpp: the file where we will demonstrate the importance of synchronization and the implications if proper synchronization is not implemented

Example 1: Hello from Threads

Open the openmp-demo-print.cpp file, and copy and paste the following code into it:

#include <iostream>
#include <omp.h>
#include <stdio.h>

int main() {

    int num_threads = omp_get_max_threads(); // get max number of threads available on the Cube
    omp_set_num_threads(num_threads);

    printf("The CUBE has %d threads.\n", num_threads);


    #pragma omp parallel 
    {
        int thread_id = omp_get_thread_num();
        printf("Hello from thread num %d\n", thread_id);
    }

    printf("Thread numbers written to file successfully. \n");

    return 0;
}

Here is what the code is doing:

  • Lines 1 to 3 are the libraries that need to be included in this script to enable the use of OpenMP constructs and printing to the command line.
  • In main() on Lines 7 and 8, we first get the maximum number of threads that the Cube has available on one node and assign it to the num_threads variable. We then set the number of threads we want to use.
  • Next, we declare a parallel section using #pragma omp parallel beginning on Line 13. In this section, each of the num_threads threads that we set on Line 8 executes the printf() function independently of the others. Note that the threads are under no obligation to execute their tasks in ascending order.
  • Finally, we end main() by returning 0.

Be sure to first save this file. Once you have done this, enter the following into your command line:

g++ openmp-demo-print.cpp -o openmp-demo-print -fopenmp

This tells the system to compile openmp-demo-print.cpp and create an executable object using the same file name. Next, in the command line, enter:

./openmp-demo-print

You should see the following (or some version of it) appear in your command line:

Notice that the threads are all printed out of order. This demonstrates that each thread takes a different amount of time to complete its task. If you re-run the script, you will find that the order in which the threads print changes, showing that there is a degree of uncertainty in the amount of time each thread requires. This has implications for the order in which threads read, write, and access data, which will be demonstrated in the next example.

Example 2: Who’s (seg)Fault Is It?

In this example, we will execute a script in which each thread appends its thread ID to a shared vector. See the implementation below:

#include <iostream>
#include <omp.h>
#include <vector>

using namespace std;

void parallel_access_demo() {
    vector<int> data;
    int num_threads = omp_get_max_threads();

    // Parallel region for appending elements to the vector
    #pragma omp parallel for
    for (int i = 0; i < num_threads; ++i) {
        // Append the thread ID to the vector
        data.push_back(omp_get_thread_num()); // Potential race condition here
    }
}

int main() {
    // Execute parallel access demo
    parallel_access_demo();
    printf("Code ran successfully. \n");

    return 0;
}

As per the usual, here’s what’s going down:

  • After including the necessary libraries, we add using namespace std on Line 5 so that we can print text without having to use std:: every time we write an output statement
  • Starting on Line 7, we write a simple parallel_access_demo() function that does the following:
    • Initialize a vector of integers very creatively called data
    • Get the maximum number of threads available per node on the Cube
    • Declare a parallel for region where we iterate through all threads and push_back the thread ID number into the data vector
  • Finally, we execute this function in main() beginning on Line 19.

If you execute this using a similar process shown in Example 1, you should see the following output in your command line:

Congratulations – you just encountered your first segmentation fault! Not to worry – this is by design. Take a few minutes to review the code above to identify why this might be the case.

Notice Line 15, where each thread attempts to insert a new integer into the data vector. Since push_back() is being called in a parallel region, multiple threads are attempting to modify data simultaneously. This results in a race condition, where the size of data changes unpredictably. There is a pretty simple fix: add a #pragma omp critical before Line 15.

As mentioned in Part 1, #pragma omp critical ensures that each thread executes its task one at a time, and not all at once. It essentially serializes a portion of the parallel for loop and prevents race conditions by ensuring that the size of data changes in a predictable and controlled manner. An alternative would be to declare #pragma omp single in place of #pragma omp parallel for on Line 12. This essentially delegates the following for-loop to a single thread, forcing the loop to be completed in serial. Both of these options will result in the blissful outcome of an error-free run.

Summary

In this post, I introduced two examples to demonstrate the use of OpenMP constructs. In Example 1, I demonstrated a parallel printing task to show that threads do not all complete their tasks at the same time. Example 2 showed how the lack of synchronization can lead to segmentation faults due to race conditions, and two methods to remedy this. In Part 3, I will introduce GPU programming using the Python PyTorch library and Google Colaboratory cloud infrastructure. Until then, happy coding!

Geocoding Using Google API Key in Python

Introduction

In this post, I will delve into the transformative realm of geocoding, addressing the challenges posed by historical data laden with addresses. Throughout the discussion, I’ll guide you through the intricacies of leveraging the Google API key to seamlessly integrate location information.

Key points include:

  - Preparing your digital workspace for geocoding with essential tools.

  - Obtaining a Google API key to unlock the power of precise coordinates.

  - Applying practical steps to effortlessly transform addresses into valuable spatial insights using Pandas DataFrame.

As we study the realm of data-driven exploration, a significant challenge surfaces—one that personally resonated with me during my recent project. Historical data, often a treasure trove of insights, can be a double-edged sword. The challenge arises when this wealth of information arrives with addresses instead of coordinates, adding a layer of complexity to the analysis. In my own experience, this meant that a substantial portion of the received data was, unfortunately, unusable for my intended analysis without a transformative solution.

This is where geocoding steps in as a crucial ally, reshaping the landscape of historical datasets and making them analysis-ready. The inconsistency in reporting formats and the absence of standardized coordinates pose hurdles that can impede meaningful analysis. Geocoding becomes an indispensable tool, allowing me to bridge this gap, harmonize the data, and unlock its full potential for analysis.

This post focuses on the intricacies of geocoding—a transformative process that transcends mere addresses, providing the geographic insights necessary for a comprehensive understanding of your dataset. Equipped with the Google API key, we’ll delve into the practical steps needed to seamlessly integrate location information. The following sections outline the key stages of this geocoding journey, ensuring your dataset is prepared for advanced analysis.

Preparing your environment

Prepare your digital workspace for this impactful data transformation by ensuring you have all the necessary tools. Seamlessly install GoogleMaps and Pandas using the commands provided below in your terminal.

#Install necessary packages
pip install GoogleMaps
pip install pandas

Formatting your data

In the realm of geocoding, the importance of formatting data before initiating the geocoding process cannot be overstated, especially when leveraging the capabilities of the Google API key. The formatting step is a vital prerequisite for several reasons. Firstly, it ensures consistency and standardization in the structure of addresses, aligning them with the expectations of geocoding services. This consistency allows for more accurate interpretation and processing of addresses. Additionally, formatted addresses provide a clear separation of components such as city, state, and ZIP code, facilitating efficient matching during the geocoding algorithm’s execution. The optimization of API usage is another benefit, as well-formatted addresses contribute to accurate results, minimizing unnecessary costs associated with incorrect or ambiguous requests. By addressing these considerations in the formatting stage, one sets the stage for a more reliable and cost-effective geocoding journey, unlocking the full potential of spatial insights. Now, let’s delve into the practical implementation of this formatting process with the following code, which creates a new column containing the consolidated and formatted addresses within your dataset.

# Create a new column with the consolidated address
df['ADDRESS'] = df['CITY'] + ',' + \
                df['STATE'] + ' ' + \
                df['ZIP'].astype(str)

Obtain a Google API Key

  1. Navigate to the Google Cloud console
  2. Under the Project drop-down menu, select the project that you want to work on. If one does not already exist, create a new project 
  3. Under the Library tab on the side menu, search “geocoding api” and ensure that it is enabled
  4. Under the Credentials tab on the side menu, select “Create Credentials” and create a new API key to generate a new key
  5. Your Google API key will then appear on the screen; save this key for future use (one way to load it in your scripts is sketched below)
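
As a side note, it is good practice to avoid pasting the key directly into your scripts. Below is a minimal sketch of one way to do this, assuming you have stored the key in an environment variable named GOOGLE_MAPS_API_KEY (the variable name is arbitrary):

# Read the API key from an environment variable instead of hard-coding it
import os
import googlemaps

api_key = os.environ["GOOGLE_MAPS_API_KEY"]
gmaps_key = googlemaps.Client(key=api_key)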

Geocoding with Pandas DataFrame

With your environment set up, your data formatted, and your Google API key in hand, it’s time to wield this powerful tool. Dive into the practical realm of geocoding by seamlessly applying the sample code below to your dataset, effortlessly transforming addresses into precise coordinates. Follow the code through each practical step, and witness your location data evolve into a valuable asset, ready for advanced analysis. 

# Import necessary libraries
import googlemaps
import pandas as pd

# Define your API key
# Replace "YOUR_GOOGLE_API_KEY" with your actual Google Maps API key that you saved previously.
gmaps_key = googlemaps.Client(key="YOUR_GOOGLE_API_KEY")

# Extract the last column (ADDRESS) into a new dataframe (df_address); copy to avoid modifying a view of df
df_address = df.iloc[:, -1:].copy()

# Geocoding with Pandas DataFrame
# Define a function to geocode addresses and extract latitude and longitude
def geocode_and_extract(df_address):
    # Add empty columns for latitude and longitude
    df_address['LATITUDE'] = ""
    df_address['LONGITUDE'] = ""
    
    # Loop through each row in the df_address dataframe
    for i in range(len(df_address)):
        address = df_address.loc[i, 'ADDRESS']
        # Geocode the address using the Google Maps API
        geocode_result = gmaps_key.geocode(address)
        
        # Check if geocoding was successful
        if geocode_result:
            # Extract latitude and longitude from the geocoding result
            # (use .loc to avoid pandas chained-assignment issues)
            df_address.loc[i, 'LATITUDE'] = geocode_result[0]['geometry']['location']['lat']
            df_address.loc[i, 'LONGITUDE'] = geocode_result[0]['geometry']['location']['lng']

# Apply the geocoding function to your dataframe
geocode_and_extract(df_address)

# Print an example to verify correctness
example_address = df_address['ADDRESS'][0]
example_result = gmaps_key.geocode(example_address)
example_lat = example_result[0]["geometry"]["location"]["lat"]
example_long = example_result[0]["geometry"]["location"]["lng"]
print(f'Example Geocoded Address: {example_address}, Latitude: {example_lat}, Longitude: {example_long}')

# Let's join the coordinates with the original dataset
df['LATITUDE'] = df_address['LATITUDE']
df['LONGITUDE'] = df_address['LONGITUDE']

Replace "YOUR_GOOGLE_API_KEY" with your actual Google Maps API key that you saved previously.

Dealing with incomplete data

In the midst of my geocoding journey, I encountered a common challenge: incomplete data. Some addresses in my dataset were detailed with full street information, while others provided only city and state. The beauty of the Google API key revealed itself as it effortlessly transformed both types of data, allowing for a seamless integration of geographic insights.

This flexibility became especially crucial as I navigated through points where only city and state information existed. While it may seem like a potential hurdle, the Google API key’s ability to interpret varying address formats ensured that no data was left behind. For my specific analysis, the key only needed to fall within a designated radius of interest, and the results proved to be accurate enough for my purposes.

In handling diverse data formats, the Google API key stands as your ally, harmonizing and geocoding all pieces of the puzzle, ensuring that each contributes to the rich narrative your dataset is meant to tell.
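
In practice, one simple way to accommodate this mix is to build the address string from whichever components are actually present in each row. The snippet below is a sketch of that idea, assuming hypothetical STREET, CITY, STATE, and ZIP columns (adjust the names to match your dataset):

# Join only the address components that are present in each row
df['ADDRESS'] = (
    df[['STREET', 'CITY', 'STATE']]
    .fillna('')
    .apply(lambda row: ', '.join(part for part in row if part), axis=1)
    + ' ' + df['ZIP'].astype(str)
)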

Conclusion

Your geocoding journey has now equipped you to handle diverse datasets, turning a potential challenge into a transformative opportunity. As you venture forth, may your analyses unveil fresh perspectives and illuminate the hidden stories within your spatial data. The Google API key has been the linchpin throughout this exploration, ensuring that every piece of information contributes to the bigger picture. From handling varying addresses to pinpointing detailed locations, the key’s flexibility has proven invaluable, providing accurate insights.

Congratulations on mastering the art of geocoding—may your data-driven adventures persist in revealing the intricate layers of time, geography, and the untold stories that lie within.

An introduction to linear programming for reservoir operations (part 2) – Implementation with Pyomo

Introduction

Previously, in Part 1 I used a very simple reservoir operations scenario to demonstrate some linear programming (LP) concepts.

After some feedback on my initial formulation I went back and revised the formulation to make sure that (1) both reservoir releases and storage levels are optimized simultaneously and (2) the LP handles decisions over multiple timesteps (1,…,N) during optimization. Definitely look back at Part 1 for more context.

The current LP formulation is as follows:
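
In summary (matching the Pyomo implementation shown later in this post), the decision variables are the releases $R_t$ and storage levels $S_t$ over periods $t = 0, \dots, N-1$, with inflows $I_t$, demands $D_t$, and initial storage $S_{init}$:

$$
\begin{aligned}
\min_{R_t,\, S_t} \quad & -\sum_{t=0}^{N-1} \left( \eta_0 R_t + 100 \, \frac{S_t}{S_{max}} \right) \\
\text{s.t.} \quad & S_t + R_t = S_{t-1} + I_t - D_t, \quad t = 0, \dots, N-1 \quad (S_{-1} = S_{init}) \\
& S_{min} \le S_t \le S_{max}, \qquad R_{min} \le R_t \le R_{max} \\
& S_{N-1} = S_{init}
\end{aligned}
$$

(In the code, the mass-balance equality is enforced as a pair of inequalities.)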

In this post, I show a simple implementation of this LP using the Pyomo package for solving optimization problems in Python.

I have shared the code used in this demo in a repository here: Reservoir-LP-Demo


Constructing the LP model with Pyomo

While Pyomo can help us construct the LP model, you will need access to a separate solver software in order to actually run the optimization. I don’t get into the details here on how to set up these solvers (see their specific installation instructions), but generally you will need the solver to be accessible on your PATH.

Two solvers that I have had good experience with are:

As always, it’s best to consult the Pyomo documentation for any questions you might have. Here, I highlight a few things that are needed for our implementation.

We start by importing the pyomo.environ module:

import pyomo.environ as pyo

From this module we will need to use the following classes to help build our model:

  • pyo.ConcreteModel()
  • pyo.RangeSet()
  • pyo.Var()
  • pyo.Objective()
  • pyo.Constraint()
  • pyo.SolverFactory()

The nice thing about using pyomo rather than trying to manage the LP matrices yourself is that you can specify objectives and constraints as functions.

For example, the objective function is defined as:

# Objective Function
def objective_rule(m):
    return -sum((eta_0 * m.R[t]) + (m.S[t]/S_max*100) for t in m.T)

And a constraint used to enforce the lower limit of the storage mass balance can be defined as:

def S_balance_lower(m, t):
    if t == 0:
        return m.S[t] + m.R[t] <= initial_storage + I[t] - D[t]
    return m.S[t] + m.R[t] <= m.S[t-1] + I[t] - D[t]

Rather than picking the full implementation apart, I present the entire function below, and encourage you to compare the code implementation with the problem definition above.

import numpy as np  # used to package the optimized decision vectors

def pyomo_lp_reservoir(N, S_min, S_max, R_min, R_max, 
                       eta_0, I, D,  
                       initial_storage):

    # Model
    model = pyo.ConcreteModel()

    # Time range
    model.T = pyo.RangeSet(0, N-1)  

    # Decision Variables
    model.S = pyo.Var(model.T, bounds=(S_min, S_max))  # Storage
    model.R = pyo.Var(model.T, bounds=(R_min, R_max))  # Release

    # Objective Function
    def objective_rule(m):
        return -sum((eta_0 * m.R[t]) + (m.S[t]/S_max*100) for t in m.T)
    model.objective = pyo.Objective(rule=objective_rule, sense=pyo.minimize)

    # Constraints
    def S_balance_lower(m, t):
        if t == 0:
            return m.S[t] + m.R[t] <= initial_storage + I[t] - D[t]
        return m.S[t] + m.R[t] <= m.S[t-1] + I[t] - D[t]

    def S_balance_upper(m, t):
        if t == 0:
            return -(m.S[t] + m.R[t]) <= -(initial_storage + I[t] - D[t])
        return -(m.S[t] + m.R[t]) <= -(m.S[t-1] + I[t] - D[t])
    model.S_lower = pyo.Constraint(model.T, rule=S_balance_lower)
    model.S_upper = pyo.Constraint(model.T, rule=S_balance_upper)
    model.S_final = pyo.Constraint(expr=model.S[N-1] == initial_storage)

    # Solve
    solver = pyo.SolverFactory('scip')
    results = solver.solve(model)

    if results.solver.status == pyo.SolverStatus.ok:
        S_opt = np.array([pyo.value(model.S[t]) for t in model.T])
        R_opt = np.array([pyo.value(model.R[t]) for t in model.T])
        return S_opt, R_opt
    else:        
        raise ValueError('Not able to solve.')

Note that in this implementation, pyomo will optimize all of the reservoir release and storage decisions simultaneously, returning the vectors of length N which prescribe the next N days of operations.

Usage

Now we are ready to use our LP reservoir simulator. In the code block below, I set some specifications for our operational constraints, generate fake inflow and demand timeseries, run the LP solver, and plot the simulated results:

# specifications
N_t = 30
S_min = 2500
S_max = 5000
R_min = 10
R_max = 1000
eta = 1.2

# Generate simple inflow and demand data
I, D = generate_data(N_t, correlation_factor = -0.70,
                     inflow_mean=500, inflow_std=100,
                     lag_correlation=0.2)


# Run LP operation simulation
S_sim, R_sim = pyomo_lp_reservoir(N_t, S_min, S_max, R_min, R_max, eta, I, D, 
                                  initial_storage=S_max)

# Plot results
plot_simulated_reservoir(I, D,
                         R_sim, S_sim, 
                         S_max, eta=eta)

The resulting operations are shown below:

Under this LP formulation, with a perfect inflow forecast, the reservoir operates as a “run of river” with the release rates being very close to the inflow rate.

In practice, operators may need to limit the difference in release volume from day to day. I added an optional parameter (R_change_limit) which adds a constraint on the difference between releases on subsequent days.
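
Roughly, this can be expressed as a pair of one-sided bounds on the release difference (splitting the absolute value in two keeps the problem linear). The snippet below is a sketch of the idea, assuming R_change_limit is passed into pyomo_lp_reservoir, rather than the exact code in the repository:

# Limit the day-to-day change in releases to R_change_limit (sketch)
def R_change_upper(m, t):
    if t == 0:
        return pyo.Constraint.Skip
    return m.R[t] - m.R[t-1] <= R_change_limit

def R_change_lower(m, t):
    if t == 0:
        return pyo.Constraint.Skip
    return m.R[t-1] - m.R[t] <= R_change_limit

model.R_change_up = pyo.Constraint(model.T, rule=R_change_upper)
model.R_change_down = pyo.Constraint(model.T, rule=R_change_lower)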

The operations with the daily release change limited to 50 are shown below.

Conclusions

The way you formulate an LP problem dictates the “optimal” decisions that you will generate. The purpose of these past two posts was to give a practical demo of some basic LP concepts. I hope that for some it might be useful as a starting point.

Constructing interactive Ipywidgets: demonstration using the HYMOD model

Last week, Lillian and I published the first post in a series of training posts studying the “Fisheries Game”, which is a decision-making problem within a complex, non-linear, and uncertain ecological context.

In preparing for that post, I learned about the Ipywidgets python library for widget construction. It stood out to me as a tool for highlighting the influence of parametric uncertainty on model performance. More broadly, I think it has great potential as an educational or data-narrative device.

This blogpost is designed to highlight this potential, and provide a basic introduction to the library. A tutorial demonstration of how an interactive widget is constructed is provided, this time using the HYMOD rainfall-runoff model.

This post is intended to be viewed through a Jupyter Notebook for interaction, which can be accessed through a Binder at this link!

The Binder was built with an internal environment specification, so it should not be necessary to install any packages on your local machine! Because of this, it may take a minute to load the page.

Alternatively, you can pull the source code and run the Jupyter Notebook from your local machine. All of the source code is available in a GitHub repository: Ipywidget_Demo_Interactive_HYMOD.

If using your local machine, you will first need to install the Ipywidget library:

pip install ipywidgets

Let’s begin!

HYMOD Introduction

HYMOD is a conceptual rainfall-runoff model. Given some observed precipitation and evaporation, a parameterized HYMOD model simulates the resulting down-basin runoff.

This post does not focus on specific properties or performance of the HYMOD model, but rather uses the model as a demonstration of the utility of the Ipywidget library.

I chose to use the HYMOD model for this because it is commonly taught in introductory hydrologic modeling courses. This demonstration shows how an Ipywidget can be used in an educational context. The resulting widget allows students to interact in real time with the model behavior, adjusting parameter values and visualizing the changes in the resulting streamflow.

If you are interested in the technical details of implementing the HYMOD model, you can dig into the source code, available (and thoroughly commented/descriptive) in the repository for this post: Ipywidget_Demo_Interactive_HYMOD.

HYMOD represents surface flow as a series of several quick-flow reservoirs. Groundwater flow is represented as a single slow-flow reservoir. The reservoirs have constant flow rates, with the quick-flow reservoir rate, Kq, being greater than the slow-flow reservoir rate, Ks.

Image source: Sun, Wenchao & Ishidaira, Hiroshi & Bastola, Satish. (2010)

HYMOD Parameters:

Like any hydrologic model, the performance of HYMOD will be dependent upon the specified parameter values. There are several parameters that can be adjusted:

  • Cmax: Max soil moisture storage (mm) [10-2000]
  • B: Distribution of soil stores [0.0 – 7.0]
  • Alpha: Division between quick/slow routing [0.0 – 1.0]
  • Kq: Quick flow reservoir rate constant (day^-1) [0.15 – 1.0]
  • Ks: Slow flow reservoir rate constant. (day^-1) [0.0 – 0.15]
  • N: The number of quick-flow reservoirs.

Interactive widget demonstration

I’ve constructed an Ipywidgets object which allows a user to visualize the impact of the HYMOD model parameters on the resulting simulation timeseries. The user also has the option to select from three different error metrics, which are displayed in the plot, and to toggle the observed timeseries on and off.

Later in this post, I will give detail on how the widget was created.

Before providing the detail, I want to show the widget in action so that you know what to expect from the final product.

The gif below shows the widget in-use:

Demonstration of the HYMOD widget.

Ipywidgets Introduction

The Ipywidget library allows for highly customized widgets, like the one above. As with any new tool, I’d recommend you check out the documentation here.

Below, I walk through the process of generating the widget shown above.

Let’s begin!

Import the library

# Import the library
import ipywidgets as widgets

Basic widget components

Consider an Ipywidget as being an arrangement of modular components.

The tutorial walks through the construction of five key widget components:

  1. Variable slider
  2. Drop-down selectors
  3. Toggle buttons
  4. Label objects
  5. Interactive outputs (used to connect the plot to the other three components)

In the last section, I show how all of these components can be arranged together to construct the unified widget.

Sliders

Sliders are one of the most common ipywidget tools. They allow for manual manipulation of a variable value. The slider is an object that can be passed to the interactive widget (more on this further down).

For my HYMOD widget, I would like to be able to manipulate each of the model parameters listed above. I begin by constructing a slider object for each of the variables.

Here is an example, for the C_max variable:

# Construct the slider
Cmax_slider = widgets.FloatSlider(value = 500, min = 10, max = 2000, step = 1.0, description = "C_max",
                                  disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')


# Display the slider
display(Cmax_slider)

Notice that each slider receives a specified min, max, and step corresponding to the possible values. For the HYMOD demo, I am using the parameter ranges specified in Herman, J.D., P.M. Reed, and T. Wagener (2013), Time-varying sensitivity analysis clarifies the effects of watershed model formulation on model behavior.

I will construct the sliders for the remaining parameters below. Notice that I don’t assign the description parameter in any of these sliders… this is intentional. Later in this tutorial I will show how to arrange the sliders with Label() objects for a cleaner widget design.

# Construct remaining sliders
Cmax_slider = widgets.FloatSlider(value = 100, min = 10, max = 2000, step = 1.0, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
B_slider = widgets.FloatSlider(value = 2.0, min = 0.0, max = 7.0, step = 0.1, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Alpha_slider = widgets.FloatSlider(value = 0.30, min = 0.00, max = 1.00, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Kq_slider = widgets.FloatSlider(value = 0.33, min = 0.15, max = 1.00, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
Ks_slider = widgets.FloatSlider(value = 0.07, min = 0.00, max = 0.15, step = 0.01, disabled = False, continuous_update = False, readout = True, readout_format = '0.2f')
N_slider = widgets.IntSlider(value = 3, min = 2, max = 7, disabled = False, continuous_update = False, readout = True)

# Place all sliders in a list
list_of_sliders = [Kq_slider, Ks_slider, Cmax_slider, B_slider, Alpha_slider, N_slider]

Dropdown

The Dropdown() allows the user to select from a set of discrete variable options. Here, I want to give the user options on which error metric to use when comparing simulated and observed timeseries.

I provide three options:

  1. RMSE: Root mean square error
  2. NSE: Nash Sutcliffe efficiency
  3. ROCE: Runoff coefficient error

See the calculate_error_by_type function inside the HYMOD_components.py script to see how these are calculated.

To provide this functionality, I define the Dropdown() object, as below, with a list of options and the initial value:

# Construct the drop-down to select from different error metrics
drop_down = widgets.Dropdown(options=['RMSE','NSE','ROCE'], description='',
                                value = 'RMSE', disabled = False)

# Display the drop-down
display(drop_down)

ToggleButton

The ToggleButton() allows for a bool variable to be toggled between True and False. For my streamflow plot function, I have an option plot_observed = False which determines if the observed streamflow timeseries is shown in the figure.

# Construct the button to toggle observed data On/Off
plot_button = widgets.ToggleButton(value = False, description = 'Toggle', disabled=False, button_style='', tooltip='Description')

# Display the button
display(plot_button)

Labels

As mentioned above, I choose to not include the description argument within the slider, drop-down, or toggle objects. This is because it is common for these labels to get cut-off when displaying the widget object.

For example, take a look at this slider below, with a long description argument:

# Make a slider with a long label
long_title_slider = widgets.FloatSlider(value = 2.0, min = 0.0, max = 7.0, step = 0.1, description = 'This slider has a long label!', readout = True)

# Display: Notice how the label is cut-off!
display(long_title_slider)

The ipywidgets.Label() function provides a way of avoiding this while allowing for long descriptions. Using Label() will ultimately provide you with a lot more control over your widget layout (last section of the tutorial).

The Label() function generates a separate object. Below, I create a unique Label() object for each HYMOD parameter.

# Import the Label() function
from ipywidgets import Label

# Make a list of label strings
param_labs = ['Kq : Quick flow reservoir rate constant (1/day)',
            'Ks : Slow flow reservoir rate constant (1/day)',
            'C_max : Maximum soil moisture storage (mm)',
            'B : Distribution of soil stores',
            'Alpha : Division between quick/slow routing',
            'N : Number of quick-flow reservoirs']

# Make a list of Label() objects
list_of_labels = [Label(i) for i in param_labs]

# Display the first label, for example.
list_of_labels[0]

Interactive_output

Now that we have constructed the individual widget components, we can link them to the plotting function using interactive_output().

The interactive_output function takes two inputs, the function to interact with, and a dictionary of variable assignments:

interactive_output(function, {'variable_name' : variable_widget, ...})

I have created a custom function plot_HYMOD_results which:

  1. Loads 1-year of precipitation and evaporation data for the Leaf River catchment.
  2. Runs the HYMOD simulation using the provided parameter values.
  3. Calculates the error of the simulated vs. observed data.
  4. Plots the timeseries of runoff.

The source code for this function can be found in the GitHub repository for this post, or specifically here.

The function receives parameter values for each of the HYMOD parameters discussed above, a bool indicator if observed data should be plotted, and a specified error metric.

plot_HYMOD_results(C_max, B, Alpha, Ks, Kq, N_reservoirs, plot_observed = False, error_type = 'RMSE')

I have already generated widget components corresponding to each of these variables! (If you are on the Jupyter Notebook version of this post, make sure to have run every cell before this, or else the following code won’t work.)

I can now use the interactive_output function to link the widget components generated earlier with the function inputs:

# Import the interactive_output function
from ipywidgets import interactive_output

# Import my custom plotting function
from HYMOD_plots import plot_HYMOD_results

result_comparison_plot = interactive_output(plot_HYMOD_results, {'C_max' : Cmax_slider, 'B': B_slider, 'Alpha':Alpha_slider, 
                                                                 'Ks':Ks_slider, 'Kq':Kq_slider,'N_reservoirs':N_slider, 
                                                                 'plot_observed' : plot_button,'error_type': drop_down})

# Show the output
result_comparison_plot
Output generated by the interactive_output().

Displaying the interactive_output reveals only the plot, but does not include any of the widget functionality…

Despite this, the plot is still linked to the widget components generated earlier. If you don’t believe me (and are reading the Jupyter Notebook version of this post), scroll up and click the ToggleButton a few cells up, then come back and look at the plot again.

Using the interactive_output() function, rather than the other variations of the interact() functions available, allows for cleaner widgets to be produced, because the arrangement of the widget components is entirely customizable.

Keep reading for more detail on this!

Arranging widget components

Rather than using widget features one at a time, Ipywidgets allows for several widgets to be arranged in a unified layout. Think of everything that has been generated previously as being a cell within a gridded widget; the best part is that each cell is linked with the others.

Once the individual widget features (e.g., sliders, buttons, drop-downs, and output plots) are defined, they can be grouped using the VBox() (vertical box) and HBox() (horizontal box) functions.

I’ve constructed a visual representation of my intended widget layout, shown below. The dashed orange boxes show those components grouped by the HBox() function, and the blue boxes show those grouped by the VBox() function.

Visual representation of the final widget layout.

Before getting started, import some of the basic layout functions:

# Import the layout functions
from ipywidgets import HBox, VBox, Layout

Before constructing the entire widget, it is good to get familiar with the basic HBox() and VBox() functionality.

Remember the list of sliders and list of labels that we created earlier?

# Stack the list of label objects vertically:
VBox(list_of_labels)

# Try the same thing with the sliders (remove comment #):
#VBox(list_of_sliders)

In the final widget, I want the column of labels to be located on the left of the column of sliders. HBox() allows for these two columns to be arrange next to one another:

# Putting the columns side-by-side
HBox([VBox(list_of_labels), VBox(list_of_sliders)])

Generating the final widget

Using the basic HBox() and VBox() functions shown above, I arrange all of the widget components I’ve defined previously. I first define each row of the widget using HBox(), and finally stack the rows using VBox().

The script below will complete the arrangement, and call the final widget!

# Define specifications for the widgets: center and wrap
box_layout = widgets.Layout(display='flex', flex_flow = 'row', align_items ='center', justify_content = 'center')

# Create the rows of the widgets
title_row = Label('Select parameter values for the HYMOD model:')
slider_row = HBox([VBox(list_of_labels), VBox(list_of_sliders)], layout = box_layout)
error_menu_row = HBox([Label('Choose error metric:'), drop_down], layout = box_layout)
observed_toggle_row = HBox([Label('Click to show observed flow'), plot_button], layout = box_layout)
plot_row = HBox([result_comparison_plot], layout = box_layout)


# Combine label and slider box (row_one) with plot for the final widget
HYMOD_widget = VBox([title_row, slider_row, plot_row, error_menu_row, observed_toggle_row])


# Call the widget and have fun!
HYMOD_widget

Concluding remarks

If you’ve made it this far, thank you for reading!

I hope that you are able to find some fun/interesting/educational use for the Ipywidget skills learned in this post.

Clustering basics and a demonstration in clustering infrastructure pathways

Introduction

When approaching a new dataset, the complexity may be daunting. If patterns in the data are not readily apparent, one potential first-step would be to cluster the data and search for groupings. Clustering data, or cluster analysis, can reveal relationships between the observations and provide insight into the data.

Figure: Graphical representation of how clustering can be used to identify patterns in a data set.

Motivation

The outcome of a cluster analysis can be highly dependent upon the clustering algorithm being used, the clustering criteria, and the specified number of clusters to be found. Anticipating these influences and how to manipulate them will set you up for success in your data analysis.

In this post, I introduce some fundamental clustering algorithms, the hierarchical and K-Means clustering algorithms. Then I will share some important considerations, such as how to select a suitable number of clusters. Finally, I will demonstrate the power of clustering by applying the methods to an infrastructure pathways dataset, to cluster infrastructure pathways generated for 1,000 different states of the world into different sets of ‘light’, ‘moderate’, and ‘heavy’ infrastructure development.

Let’s begin!

Methods

Hierarchical Clustering

Hierarchical clustering can be performed “bottom-up”, through an agglomerative approach, which starts by placing each observation in its own cluster and then successively links the observations into increasingly larger clusters. Alternatively, the clustering can take a “top-down”, or divisive, approach which begins with all observations within a single cluster and partitions the set into smaller clusters until each observation exists within its own cluster.

Figure: A graphical representation of the hierarchical clustering process. Red circles denote clusters. If read from left to right, then the figure is representative of the agglomerative approach. If read from right to left, then the figure is representative of the divisive approach.

Specifying how clusters are preferentially assigned will influence the final shape and contents of the clusters. When performing hierarchical clustering, the user must specify the linkage criterion, which determines how the distance between clusters is measured, and consequently which clusters will be grouped during the next iteration. Three common linkage criteria are:

Linkage Type | Description | Formulation
Complete (max) | the maximum distance between observations in one group and observations in another group | max{ d(a, b) : a in A, b in B }
Single (min) | the minimum distance between observations in one group and observations in another group | min{ d(a, b) : a in A, b in B }
Average | the average distance between observations in one group and observations in another group | average{ d(a, b) : a in A, b in B }

Additionally, when calculating the linkage criteria, different methods of measuring the distance can be used. For example, when working with correlated data, it may be preferable to use the Mahalanobis distance over the regular Euclidean norm.

The results of the clustering analysis can be viewed through a dendrogram, which displays a tree-like representation of the cluster structure. For more detail on dendrograms, and a step-by-step tutorial for creating color-coded dendrograms in R, see Rohini’s (2018) post here.
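
If you are working in Python, a minimal sketch of building and plotting a dendrogram with SciPy looks something like the following (here X stands in for your observation-by-feature data matrix, and the linkage method and distance metric are the choices discussed above):

# Build the linkage matrix with a chosen linkage criterion and distance metric,
# then visualize the resulting cluster hierarchy as a dendrogram
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

Z = linkage(X, method='average', metric='euclidean')
dendrogram(Z)
plt.show()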

K-Means Clustering

The K-means algorithm follows three simple steps.

  1. Selection of k centroids within the sample space.

In the traditional naïve K-Means algorithm, centroids do not correspond to specific observations within the space. They can be randomly selected, or chosen through more informed methods (see, for example, K-Means++).

  2. Assignment of each observation to the cluster with the nearest centroid.

  3. Re-define each cluster centroid as the mean of the data points assigned to that centroid.

Steps 2 and 3 of the algorithm are repeated until the centroids begin to stabilize, such that subsequent iterations produce centroids located within some small, tolerable distance from one another.

Figure: A graphical representation of the k-means clustering process, with 3 clusters. X’s denote the centroids for that respective cluster. The dashed arrows denote the shift in the location of the centroid during the next iteration. Points are colored according to their assigned cluster during that iteration.

The centroids toward which the algorithm converges may be a locally optimal clustering solution, and can be highly dependent upon the initial centroid selection. Often, this sensitivity is handled by performing the clustering process several times and choosing the set of clusters which yields the smallest within-cluster sum of squares.
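
For reference, this repeated-initialization strategy is what the n_init argument controls in scikit-learn; a minimal sketch (with X standing in for your data matrix) might look like:

# Run K-means with several random initializations and keep the solution with the
# smallest within-cluster sum of squares (reported by the inertia_ attribute)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X)
print(kmeans.inertia_)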

Selecting the number of clusters

In some contexts, it is possible to know how many clusters are required prior to the analysis. For example, when working with Fisher’s famous iris flower data set, three clusters are needed because there are three types of flowers observed within the data.

In an exploratory context, however, the appropriate number of clusters may not be readily apparent. One method of determining the preferred number of clusters is to test multiple different values and evaluate the separation between the resulting clusters. If the number of clusters is well-chosen, then the clusters will be well spaced and the intra-cluster distances will be small.

The Silhouette Coefficient, or silhouette score, is one method of measuring the goodness of fit for a set of clusters, and takes into consideration both the spacing between clusters (inter-cluster distance) and the distance between points within a cluster (intra-cluster distance). For a single observation, the silhouette score is defined as:

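$$ s = \frac{b - a}{\max(a,\, b)} $$
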
Where:

a = average intra-cluster distance,

b = average inter-cluster distance.

The silhouette score ranges from [-1, 1]. A value near +1 suggests that the clusters are far from one another and that distances within each cluster are small, and thus the chosen number of clusters is well-suited for the data. A value near -1, however, suggests that the clusters are poorly describing the patterns in the data.

Example: Clustering Infrastructure Pathways

In water supply planning contexts, infrastructure pathways describe the sequencing of infrastructure investment decisions throughout time, in response to changes in the system (construction of new reservoirs, expansions to water treatment capacity, etc.). Adaptive infrastructure pathways use a risk-of-failure (ROF) based set of policy rules to trigger new infrastructure development. Adaptive infrastructure pathways have been shown to effectively coordinate water infrastructure decisions in complex multi-actor systems (Zeff et al., 2016). While the benefits of this adaptive strategy are clear, ROF-based pathways also bring new challenges for decision makers – rather than specifying a single sequence of infrastructure investments, an adaptive rule system will generate a unique sequence of investments for every future state of the world. To understand how a rule system behaves, decision makers must have tools to visualize an ensemble of infrastructure pathways. This is where cluster analysis comes in. Using clustering, we can group similar infrastructure sequences together, allowing decision makers to summarize how a candidate infrastructure investment policy will behave across many future states of the world.

Here, I analyze a set of infrastructure pathways generated for a hypothetical water utility with four infrastructure options:

  • small water treatment plant expansion
  • large water treatment plant expansion
  • construction of a small new run of river intake
  • construction of a large new run of river intake

The utility is examining a candidate policy and has generated pathways for 1,000 different states of the world, or realizations, which are each characterized by unique hydrologic and population dynamics. Given the 1,000 different realizations, each with a unique infrastructure pathway, I am interested in clustering the pathways according to the timing of the infrastructure construction decisions to understand how the policy behaves.

The dataset used for this demonstration is courtesy of David Gold, who is also responsible for providing significant portions of the following code and assistance along the way. Thanks David!

Follow along with this demonstration by downloading the .txt data file HERE!

Let’s begin by loading in the data.

import pandas as pd

# Load in data
pathways_df = pd.read_csv('./ClusterData.txt', sep = '\t', header = None)

# Convert to a numpy array for clustering (referred to as cluster_input below)
cluster_input = pathways_df.values

The data contains information describing the timing of infrastructure construction for 1,000 different simulated realizations. Each column represents a different infrastructure option, and each row is 1 of the 1,000 different simulated realizations. The contents of each cell denote a standardized value corresponding to the timing of the infrastructure construction within each realization, ranging from [0, 1]. Infrastructure options containing the value 1 were not built during the 45-year simulation period, in that specific realization.

Realization | Infra. Opt. #1 | Infra. Opt. #2 | … | Infra. Opt. #M
1 | W1,1 | W1,2 | … | W1,M
2 | W2,1 | W2,2 | … | W2,M
N | WN,1 | WN,2 | … | WN,M
Table: Example output data from the infrastructure pathways optimization process; within each realization (state of the world (SOW)) infrastructure construction decisions are made during different weeks within the simulation period (45-years), considering a subset of infrastructure options for each utility. Wi,j denotes a standardized value corresponding to the timing within realization i that infrastructure option j is built.

With the data available, we can begin clustering!

Here, I take advantage of the scikit-learn.cluster module, which has both hierarchical and K-Means clustering capabilities.

For the purposes of this demonstration, I am going to only show the process of producing the hierarchical clusters… the curious reader may choose to explore alternative clustering methods available in the scikit-learn toolbox.

### Perform hierarchical clustering with 3 clusters ###

# Import the necessary tools
from sklearn.cluster import AgglomerativeClustering

# Begin by assuming 3 clusters
num_clusters = 3

# Initialize the hierarchical (agglomerative) clustering
hierarchical = AgglomerativeClustering(n_clusters = num_clusters)

# Cluster the data and produce an array which specifies the cluster of each pathway
hierarchical_labels = hierarchical.fit_predict(cluster_input)

In order to interpret the significance of these clusters, we want to find a way to visualize differences in cluster behavior through the infrastructure pathways.

To do this, I am going to consider the median timing of each infrastructure option within each cluster, across all 1,000 realizations. Additionally, I want to sort the clusters according to a qualitative characteristic. In this case, I define pathways with early infrastructure investments as “heavy”, and pathways with infrastructure investment later in the period as “light”. The “moderate” infrastructure pathways are contained within the cluster between these two.

The following script calculates the median of the clustered pathways, and sorts them according to their infrastructure timing.

### Interpreting clusters via pathways ###

# Group realizations by pathway
# Calculate the median week each infra. opt. is constructed in each cluster
import numpy as np
cluster_pathways = []
cluster_medians = []

# Loop through clusters 
for i in range(0,num_clusters):
    
    # Select realizations contained within i-th cluster
    current_cluster = cluster_input[hierarchical_labels ==i, :]
    
    # Multiply by 2344 weeks to convert from [0,1] to [0,2344] weeks
    current_cluster = current_cluster*2344
    
    cluster_pathways.append(current_cluster)
    
    # Median build week of each infrastructure option across realizations in this cluster
    current_medians = np.zeros(len(current_cluster[0,:]))
    
    # Loop through infrastructure options and calculate medians
    for j in range(0, len(current_cluster[0,:])):
        current_medians[j] = np.median(current_cluster[:,j])
        
    cluster_medians.append(current_medians)

# Sort clusters by average of median build week 
# to get light, moderate, and heavy infrastructure clusters
cluster_means = np.mean(cluster_medians, axis = 1)

sorted_indices = np.argsort(cluster_means)

# Re-order cluster medians according to sorted mean build week
cluster_medians = np.vstack((cluster_medians[sorted_indices[2]], 
                            cluster_medians[sorted_indices[1]],
                            cluster_medians[sorted_indices[0]]))

With the median timing of each infrastructure option specified for each cluster, I can begin the process of visualizing the behavior. The function below plots a single infrastructure pathway:

import numpy as np
import matplotlib.pyplot as plt

def plot_single_pathway(cluster_medians, cluster_pathways, inf_options_idx,
                        c, inf_names, y_offset, ylabeling, xlabeling, 
                        plot_legend, ax):
    
    """
    Makes a plot of an infrastructure pathway.
    
    PARAMETERS:
        cluster_medians: an array with median weeks each option is built
        cluster_pathways: an array with every pathway in the cluster
        inf_options: an array with numbers representing each option (y-axis vals)
        should start at zero to represent the "baseline"
        c: color to plot pathway
        y_offset: distance to offset each pathway from 0 (easier to visualize stacked paths)
        ylabeling: labeling for infrastructure options
        xlabeling: labeling x axis
        plot_legend: (bool) to turn on or off
        ax: axes object to plot pathway
    """
    
    # Extract all non-baseline infrastructure options
    inf_options_idx_no_baseline=inf_options_idx[1:]
    sorted_inf = np.argsort(cluster_medians)
    
    # Plot Pathways
    # create arrays to plot the pathway lines. To ensure pathways have corners 
    # we need an array to have length 2*num_inf_options
    pathway_x = np.zeros(len(cluster_medians)*2+2)
    pathway_y = np.zeros(len(cluster_medians)*2+2)
    
    # To have corners, each inf. opt. must be located
    # in both the cell it is triggered, and adjacent cell
    
    cluster_medians = np.rint(cluster_medians/45)
    for i in range(0,len(cluster_medians)):
        for j in [1,2]:
            pathway_x[i*2+j] = cluster_medians[sorted_inf[i]]
            pathway_y[i*2+j+1] = inf_options_idx_no_baseline[sorted_inf[i]]

    # Set the end of the pathway at year 45
    pathway_x[-1] = 45

    # plot the pathway line    
    ax.plot(pathway_x, pathway_y + y_offset, color=c, linewidth=5, 
            alpha = .9, zorder=1)

    # Formatting plot elements 
    ax.set_xlim([0,44])
    inf_name_lables = inf_names
    inf_options_idx = np.arange(len(inf_name_lables))
    ax.set_yticks(inf_options_idx)
    ax.set_xlabel('Week')
    
    if ylabeling:
        ax.set_yticklabels(inf_name_lables)
    else:
        ax.set_yticklabels([])
    
    ax.set_ylim([-0.5,len(cluster_medians)+.5])

    plt.show()

The above script can be used to plot the median pathway for a single cluster, but it would be much more worthwhile to compare the median pathways of the different clusters against one another. To do this, we can define another function which will use the plot_single_pathway() iteratively to plot each of the median cluster pathways on the same figure.

import numpy as np
import matplotlib.pyplot as plt
from plot_single_pathway import plot_single_pathway

def plot_many_paths(clust_meds, clust_pathways, n_clusters, cluster_colors, fig, 
                    gspec, fig_col, ylabeling, plot_legend):
    
    
    y_offsets = [-.15, 0, .15]
    
    # Specific information for Utility
    inf_options_idx = np.array([0, 1,2, 3, 4])
    utility_inf = ['Baseline', 'Small intake expans.', 'Large intake expans.',
                    'Small WTP expans.', 
                    'Large WTP expans.']
    
    ax = fig.add_subplot(gspec[0,0])
    
    for i in np.arange(n_clusters):
        plot_single_pathway(clust_meds[i], clust_pathways[i], inf_options_idx, cluster_colors[i],
                            utility_inf, y_offsets[i], ylabeling, False, True, ax)
        plt.show()
    
    if plot_legend and (n_clusters == 3):
        ax.legend(['Light inf.', 'Moderate inf.', 'Heavy inf.'], 
                  loc='upper left')
    elif plot_legend and (n_clusters == 2):
        ax.legend(['Light inf.', 'Heavy inf.'], loc='upper left')
        
    ax.tick_params(axis = "y", which = "both", left = False, right = False)
    
    plt.show()
    return fig

Time for the reveal! Execution of this final portion of the script will plot each of the three clusters, and color-code them according to classifications as “heavy”, “moderate”, or “light” infrastructure.

### Plotting ###
from plot_many_paths import plot_many_paths

fig = plt.figure(figsize = (5,5))
gspec = fig.add_gridspec(ncols=1, nrows=1)

# Colorscale
heavy_inf_color = '#132a13'
mid_inf_color = '#4f772d'
light_inf_color = '#90a955'
cluster_colors = [light_inf_color, mid_inf_color, heavy_inf_color]

plot_many_paths(cluster_medians, cluster_pathways, 3, cluster_colors, 
                fig, gspec, 0, True, True)
Figure: Median infrastructure pathways for clustered pathways.

Voilà!

The figure above shows a clear contrast in the behavior of the light and heavy infrastructure pathways; the heavy infrastructure pathway is not only characterized by early infrastructure construction, but also by more frequent construction throughout the simulation period, in comparison to the median of the light infrastructure cluster which only requires a single infrastructure expansion within the same simulation period.

From the plot above, it is not apparent that three clusters are necessarily appropriate for this data set. Although the heavy and light infrastructure clusters are clearly demarcated from one another, the moderate infrastructure cluster has characteristics of both: a late first-expansion similar to the light infrastructure cluster, but a high frequency of construction throughout the period similar to the heavy infrastructure. Should the moderate infrastructure really be a separate cluster?

Let’s take advantage of the knowledge we developed earlier, and consider the silhouette scores corresponding to different numbers of clusters! Fortunately, scikit-learn has a suite of tools that make the silhouette analysis very easy.

In the following script, I evaluate the silhouette score associated with clustering for 2, 3, and 4 different clusters.

### Performing silhouette analysis ###
from sklearn.metrics import silhouette_score

num_clust = [2, 3, 4]

# Set up hierarchical clustering with different numbers of clusters
ac2 = AgglomerativeClustering(n_clusters = num_clust[0])
ac3 = AgglomerativeClustering(n_clusters = num_clust[1])
ac4 = AgglomerativeClustering(n_clusters = num_clust[2])

# Extract the silhouette score for each hierarchical grouping
silhouette_scores = []
silhouette_scores.append(silhouette_score(cluster_input, ac2.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac3.fit_predict(cluster_input)))
silhouette_scores.append(silhouette_score(cluster_input, ac4.fit_predict(cluster_input)))
                                                  
plt.bar(num_clust, silhouette_scores)
plt.xlabel('Number of clusters', fontsize = 14)
plt.ylabel('Silhouette Score', fontsize = 14)
plt.xticks(num_clust)
plt.title("Silhouette analysis")
plt.show()
Figure: Silhouette analysis for the hierarchical clustering of infrastructure pathways.

Interestingly, the silhouette score was highest when the clustering was performed using only two clusters, suggesting that the moderate infrastructure class may not have been sufficiently unique to warrant a separate cluster.

Another method of visualizing the cluster structure is through the use of dendrograms. Below, I show dendrograms color-coded with only two clusters (left) and using three clusters (right).

Figure: Dendrograms showing the cluster structure of infrastructure pathways for a utility, using 2 (left) and 3 (right) clusters.

The dendrogram containing only two clusters (left) suggests that the moderate and light clusters can be combined into a single cluster, because the moderate cluster is less different from the light infrastructure cluster than it is from the heavy infrastructure cluster. It is important to note, however, that dendrograms alone are not a reliable method for determining the number of clusters. In fact, given that the difference in silhouette scores between the 2-cluster and 3-cluster hierarchies is relatively modest (~0.05), there is not sufficiently strong justification for choosing 2 clusters over 3.
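
For readers who would like to reproduce a figure like this, here is a minimal sketch of my own (not part of the original post's code); it assumes the cluster_input matrix used earlier and Ward linkage, the AgglomerativeClustering default:

### Dendrogram sketch ###
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# build the hierarchical merge tree (Ward linkage)
Z = linkage(cluster_input, method='ward')

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, n_clust in zip(axes, [2, 3]):
    # setting the coloring threshold at the (n_clust - 1)-th largest merge distance leaves
    # the top (n_clust - 1) merges uncolored and colors the resulting n_clust subtrees
    threshold = Z[-(n_clust - 1), 2]
    dendrogram(Z, ax=ax, color_threshold=threshold, no_labels=True)
    ax.set_title('{} clusters'.format(n_clust))
    ax.set_ylabel('Merge distance')
plt.show()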

For the sake of comparison, here are the 2-cluster and 3-cluster median pathway plots:

Conclusion

I hope you’ve gained some insight into clustering, infrastructure pathways (or preferably both) in this post.

Analysis of infrastructure pathways via statistical clustering presents a unique application of a very common statistical tool. In this analysis I clustered the pathways with respect to the timing of the infrastructure expansions, however there is potential for a much more in-depth analysis of how the hydrologic or urban contexts of each realization shape these infrastructure pathways… possibly justifying another blog post at a later date. Stay tuned!

Thanks again to David Gold, for providing the data and the foundational python code.

References

Zeff, H. B., Herman, J. D., Reed, P. M., & Characklis, G. W. (2016). Cooperative drought adaptation: Integrating infrastructure development, conservation, and water transfers into adaptive policy pathways. Water Resources Research, 52(9), 7327-7346.

Easy batch parallelization of code in any language using mpi4py

The simplest form of parallel computing is what’s known as “embarrassingly” parallel processing. These processes involve fully independent runs of a model or script where little or no communication is needed across parallel processes. A common example is Monte Carlo evaluation, where we run a model over an ensemble of inputs. To parallelize an embarrassingly parallel application we simply need to send a set of commands to the cluster telling it to run each sample on a different core (or set of cores). For small applications, this can be done by submitting each run individually. For larger applications, SLURM Job Arrays (which are nicely detailed in Antonia’s post, here) can efficiently batch large numbers of function calls to independent computing cores. While this method is efficient and effective, I find it can sometimes be hard to keep track of, as you may be submitting tens or hundreds of jobs at a time. An alternative approach to submitting embarrassingly parallel tasks is to utilize MPI with Python to dispatch and organize jobs.

I like the MPI / Python combo because it consolidates all parallel applications into a single job, meaning you have one job to keep track of on a cluster at a time, and one output file generated by the batch set of model runs. I also find Python slightly easier to edit and debug than Bash scripts (which are used to create job arrays). Additionally, it’s very easy to assign each computing core a set of function evaluations to run (this can also be done with Job arrays, but again, I find Python easier to work with). Though Python is the language used to coordinate parallel tasks, we can use it to parallelize code in any language, as I’ll demonstrate below.

In this post I’ll first provide some background on MPI and its Python implementation, mpi4py. Next I’ll provide an example I’ve developed to demonstrate how to batch run a Matlab code on a cluster. The examples presented here are derived from some of Bernardo’s code in his post on Parallel programming in C/C++, which you can find here.

A very light introduction to MPI

MPI stands for “Message Passing Interface” and is the standard library for distributed memory parallelization (for background, see this post). To understand how MPI works, it’s helpful to define some of its basic components.

  1. Tasks: I’ll use the term task to define a processor (or group of processors) assigned to perform a specific set of instructions. These instructions may be a single evaluation of a function, or a set of function evaluations.
  2. Communicators: A communicator is a group of MPI task units that are permitted to communicate with each other. In advanced MPI applications you may have multiple communicators, but for embarrassingly parallel applications we’ll only use one. The default communicator is called “MPI_COMM_WORLD” (I don’t know why, if anyone does please feel free to share in the comments), and that’s what I’ll work with here.
  3. Ranks: Each MPI task is assigned a unique identifier within the communicator called a rank. The processors running each task can access their own rank number, which will play an important role in how we use MPI for embarrassingly parallel applications.

An example schematic of the MPI_COMM_WORLD communicator with six tasks and their associated ranks is shown below.

mpi4py

MPI is implemented in Python with the mpi4py library. When we run an MPI code on a cluster, MPI creates the communicator and assigns each task a rank; each task then independently loads the script. The processor(s) associated with each task can then access its own unique rank.

The following snip of code loads this library, accesses the communicator and stores the rank of the given process:

# load the mpi4py library
from mpi4py import MPI

# access the MPI COMM WORLD communicator and assign it to a variable
comm = MPI.COMM_WORLD

# get the rank of the current process (different for each process on the cluster)
rank = comm.Get_rank()
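
If you have mpi4py and an MPI implementation installed, you can try this snippet out directly by adding a print statement and launching it with mpirun (the file name mpi_hello.py below is just a placeholder I've chosen):

# append to the snippet above, saved for example as mpi_hello.py,
# then launch with: mpirun -np 4 python3 mpi_hello.py
# each of the 4 processes will report a different rank
print("Hello from rank {} of {}".format(rank, comm.Get_size()))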

Example of using mpi4py to batch parallel jobs

Here, I’ll parallelize the submission of a Matlab script called demoScript.m. This script reads an input file from a specific file location and prints out the contents of that file. For example purposes I’ve created 20 input files, each in its own folder. The folders are called “input_sample_0”, “input_sample_1”, etc. Each input_sample folder contains a file called “sample_data.txt”, which contains one line of text reading: “This is data for run <sample_number>”.

All code for this example can be found on Github, here: https://github.com/davidfgold/mpi4py_blog.git
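
If you would rather recreate the example inputs yourself instead of cloning the repository, a minimal Python sketch like the one below (my own addition, not part of the original repo) will generate them:

# generate_inputs.py -- my own sketch for creating the 20 example input folders
import os

for i in range(20):
    folder = "input_sample_{}".format(i)            # input_sample_0, input_sample_1, ...
    os.makedirs(folder, exist_ok=True)              # create the folder if it doesn't already exist
    with open(os.path.join(folder, "sample_data.txt"), "w") as f:
        f.write("This is data for run {}\n".format(i))   # one line of text per sample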

Batching runs of demoScript.m involves three components:

  1. Write demoScript.m so that it reads the sample number from the input.
  2. Write a Python script that will use mpi4py to distribute calls of demoScript.m. Here I’ll call this script “callDemoScript.py”
  3. Write a Bash script that sets up your MPI run and calls the Python function. Here I’ll call this script “submitDemoScript.sh”

1. demoScript.m

The demo Matlab script is found below. It reads in two arguments that are called from the command line. The first argument is the rank, which will vary for each task, and the second is the sample number, which will specify which input folder to read from.

%%%%%%%%%%%%%%%%%%%%
% demoScript.m
%
% reads an input file from a given sample number (specified via command line)
% prints output from the sample file associated with the sample number
% also prints the rank for demonstration purposes
%%%%%%%%%%%%%%%%%%%%

% read in command line input
arg_list = argv();
rank = arg_list{1,1}; % rank is the first argument
sample = arg_list{2, 1}; % sample number is the second argument

% Create a string that contains the location of the proper sample directory
sample_out = fileread(strcat("input_sample_", sample, "/sample_data.txt"));

% create a string to print the rank number
rank_call = strcat("This is rank_", rank, ", receiving the following input: \n");

% format the output and print
output = strcat(rank_call, sample_out);
fprintf(output)

2. callDemoScript.py

The second component is a Python script that uses mpi4py to call demoScript.m many times across different tasks. Each task will run a number of samples equal to a variable called “N_SAMPLES_PER_TASK” which will be fed to this script when it is called.

'''
callDemoScript.py

Called to batch demoScript.m across multiple MPI tasks

Reads in the total tasks and number of samples per task from command line.
'''
# load necessary libraries
from mpi4py import MPI
import numpy as np
import sys
import os
import time

# locate the COMM WORLD communicator
comm = MPI.COMM_WORLD

# get the number of the current rank
rank = comm.Get_rank()

# read in arguments from the submission script
TOTAL_TASKS = int(sys.argv[1]) # number of MPI processes
N_SAMPLES_PER_TASK = int(sys.argv[2]) # number of runs per task

# loop through samples assigned to current rank
for i in range(N_SAMPLES_PER_TASK):
	sample= rank + TOTAL_TASKS * i

	# build the command that will be sent to the terminal (rank and sample fill in the {} placeholders)
	terminal_command = "octave-cli ./demoScript.m {} {} ".format(rank, sample)

	# execute the command in a subshell
	os.system(terminal_command)

	# sleep before submitting the next command
	time.sleep(1) # optional, for memory intensive submissions

comm.Barrier()
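
Note that with 4 total tasks and 5 samples per task, rank 0 will run samples 0, 4, 8, 12 and 16, rank 1 will run samples 1, 5, 9, 13 and 17, and so on, so all 20 samples are covered exactly once. The final comm.Barrier() call simply makes each task wait until all tasks have finished their samples before the script exits.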

3. submitDemoScript.sh

The final component is a Bash script that will send this MPI job to the cluster. Here I’ll use SLURM to create 4 MPI tasks across 2 nodes (2 tasks per node), and each task will be assigned 5 samples to run.

I wrote this for a local cluster at Cornell; note that I had to load two modules to run Python and a third to run Octave (which is used to run Matlab scripts on Linux). I call the Python script with mpirun and specify the total number of MPI tasks before making the function call. The output of the script is printed to a text file called demoOutput.txt.

#!/bin/bash
#SBATCH -n 4 -N 2
#SBATCH --time=0:01:00
#SBATCH --job-name=demoMPI4py
#SBATCH --output=output/demo.out
#SBATCH --error=output/demo.err
#SBATCH --exclusive
# Note: SLURM reads the #SBATCH directives before the shell runs, so shell variables
# cannot be used in them; the task (-n 4) and node (-N 2) counts are written out explicitly.

# Set up your parallel runs
SAMPLES_PER_TASK=5 # number of runs for each MPI task
N_NODES=2 # number of nodes
TASKS_PER_NODE=2 # number of tasks per node

TOTAL_TASKS=$(($N_NODES*$TASKS_PER_NODE)) # total number of tasks

# Load modules for Python (mpi4py, numpy) and Octave
module load py3-mpi4py
module load py3-numpy
module load octave/6.3.0

# Submit the parallel job: call callDemoScript.py with mpirun and write its output to demoOutput.txt
mpirun -np $TOTAL_TASKS python3 callDemoScript.py $TOTAL_TASKS $SAMPLES_PER_TASK > demoOutput.txt

Additional resources

Putting some thought into how you design a set of parallel runs can save you a lot of time and headache. The example above has worked well for me when submitting sets of embarrassingly parallel tasks, but each application will be different, so take the time to find the procedure that works best for you. Our blog and the internet are full of resources that can help you parallelize your code, below are some suggestions:

Performing Experiments on HPC Systems

Scaling experiments: how to measure the performance of parallel code on HPC systems

Parallel processing with R on Windows

How to automate scripts on a cluster

Parallelization of C/C++ and Python on Clusters

Developing parallelised code with MPI for dummies, in C (Part 1/2)

Cornell CAC glossery on HPC terms: https://cvw.cac.cornell.edu/main/glossary

A great MPI tutorial I found online: https://mpitutorial.com/tutorials/

Writing sharable Python code part II: Unit Testing

When writing Python functions that may be used by others, it is important to demonstrate that your functions work as intended. In my last post I discussed how a proper function specification establishes a sort of contract between developers and users of functions that delineates errors in implementation (user error) from bugs in the code (developer error). While this contract provides an important chain of accountability, it does not actually contain any proof that the function works as advertised. This is where unit testing comes in. A unit test simply runs a function over a suite of test cases (sets of inputs that produce known results) to verify performance. As its name implies, a single unit test is meant to test one basic component of a code. Large or complex codes will have many sets of unit tests, each testing different elements (usually individual functions). Unit testing provides users with proof that your code works as intended, and serves as a powerful tool for finding and removing any errors in your code.

In this post I’ll provide guidance on the development of unit tests and demonstrate an example for a simple python function. Material in this post represents my interpretation of content taught by Professor Walker White at Cornell, in a Python Fundamentals course that I proctored in the summer of 2020.

An example script

To illustrate unit testing I’ll examine a function called check_satisficing, which tests whether a set of performance objectives meets a set of satisficing criteria (for background on satisficing, see this post). The function returns a boolean and has three parameters:

  • objs: a numpy array containing the two objective values
  • criteria: a numpy array containing the two satisficing criteria
  • directions: a list of strings that specify whether each criterion is meant to be a lower or upper bound

Note that the actual code for this function takes only eight lines; the rest is the function specification and precondition checks.

def check_satisficing(objs, criteria, directions):
    """
    Returns whether a set of objectives meets a set of satisficing criteria
    
    Value return has type bool
    
    Parameters:
        objs: objective values from a given solution
        criteria: satisficing criteria for each objective
        directions: a list of strings containing "ge" or "le" for each 
            criterion, where ge indicates satisfaction for values greater than
            or equal to the criteria and le indicates satisfaction for values 
            less than or equal to the criteria
    
    
    Examples:
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['ge', 'ge'] 
            returns True
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['le', 'le'] 
            returns False     
        objs = [.4, .4], criteria = [0.5, 0.5], directions = ['le', 'le'] 
            returns True
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['ge', 'le'] 
            returns False
        objs = [.5, .5], criteria = [0.4, 0.4], directions = ['le', 'ge'] 
            returns False
        objs = [.5, .5], criteria = [0.5, 0.5], directions = ['ge', 'ge'] 
            returns True
        objs = [.5, .5], criteria = [0.5, 0.5], directions = ['le', 'le'] 
            returns True
    
    Preconditions:
        objs is a numpy array of floats with length 2
        criteria is a numpy array of floats with length 2
        directions is a list of strings with length 2 containing either 
            "ge" or "le"
    """
    
    # check preconditions
    assert len(objs) == 2, 'objs has length ' + repr(len(objs)) + ', should be 2'
    assert len(criteria) == 2, 'criteria has length ' + repr(len(criteria)) + \
    ', should be 2'
    assert len(directions) == 2, 'directions has length ' + \
    repr(len(directions)) + ', should be 2'
    
    # check that each element has the expected type and value
    for i in range(2):
        assert type(objs[i])== np.float64, 'objs element ' + str(i) + ': ' + \
        repr(objs[i]) + ', is not a numpy array of floats'
        assert type(criteria[i])== np.float64, 'criteria element ' + str(i) + \
        ': ' + repr(criteria[i]) + ', is not a numpy array of floats'
        assert type(directions[i])== str, 'directions element ' + str(i) + \
        ': ' + repr(directions[i]) + ', is not a string'
        assert directions[i] == 'ge' or directions[i] == 'le', 'directions ' + \
        str(i) + ' is ' + repr(directions[i]) + ', should be either "ge" or "le"' 
    
    
    # loop through objectives and check if each meets the criteria
    meets_criteria = True
    for i in range(2):
        if directions[i] == 'ge':
            meets_criteria = meets_criteria and (objs[i] >= criteria[i])
        else:
            meets_criteria = meets_criteria and (objs[i] <= criteria[i])
    
    return meets_criteria

Developing test cases

If you read my last post, you may notice that this specification includes an extra section called “Examples”. These examples show the user how the function is supposed to perform. They also represent the suite of test cases used to validate the function. Creating test cases is more of an art than a science, and test cases will be unique to each function you write. However, there is a set of basic rules you can follow to guide your implementation of unit testing, which I’ll outline below.

  1. Pick the simplest cases first. In this example, the simplest cases are when both objectives are above the criteria or both are below it
  2. Move to more complex cases. In this example, a more complex case could be when one objective is above and the other below, or vice versa
  3. Think about “weird” possibilities. One “weird” case for this code could be when one or both objectives are equal to the criteria
  4. Never test a precondition violation. Precondition violations are outside the scope of the function and should not be included in unit tests

Test cases should be exhaustive, and even simple codes may have many test cases. In my example above I provide seven; can you think of any more that are applicable to this code?

Implementing a testing strategy

After you’ve developed your test cases, it’s time to implement your unit test. For demonstration purposes I’ve written my own unit test code which can be used to test the cases developed above. This function simply utilizes assert statements to check if each test input generates the correct output. A danger of writing your own testing function is that the test function itself may have errors. In practice, it’s easiest to use an established tool such as PyTest to perform unit testing (for in-depth coverage of PyTest see Bernardo’s post from 2019).

def test_check_satisficing():
    """
    Unit test for the function check_satisficing
    """
    import numpy as np
    from check_satisficing import check_satisficing
    
    print("Testing check_satisficing")
    
    
    # test case 1:
    objs1 = np.array([0.5, 0.5])
    criteria1 = np.array([0.4, 0.4])
    directions1 = ['ge','ge']
    result1 = True
    
    assert (check_satisficing(objs1, criteria1, directions1)) == result1, \
    'Test 1 failed ' + repr(objs1) + ', ' + repr(criteria1) + ', ' + \
    repr(directions1) + ' returned False, should be True'
    
    # test case 2:
    objs2 = np.array([0.5, 0.5])
    criteria2 = np.array([0.4, 0.4])
    directions2 = ['le','le']
    result2 = False
    
    assert (check_satisficing(objs2, criteria2, directions2)) == result2, \
    'Test 2 failed ' + repr(objs2) + ', ' + repr(criteria2) + ', ' + \
    repr(directions2) + ' returned True, should be False'
    
    
     # test case 3:
    objs3 = np.array([0.4, 0.4])
    criteria3 = np.array([0.5, 0.5])
    directions3 = ['le','le']
    result3= True
    
    assert (check_satisficing(objs3, criteria3, directions3)) == result3, \
    'Test 3 failed ' + repr(objs3) + ', ' + repr(criteria3) + ', ' + \
    repr(directions3) + ' returned False, should be True'
    
    
     # test case 4:    
    objs4 = np.array([0.5, 0.5])
    criteria4 = np.array([0.4, 0.4])
    directions4 = ['ge','le']
    result4 = False
    
    assert (check_satisficing(objs4, criteria4, directions4)) == result4, \
    'Test 4 failed ' + repr(objs4) + ', ' + repr(criteria4) + ', ' + \
    repr(directions4) + ' returned True, should be False'
    
    
    # test case 5    
    objs5 = np.array([0.5, 0.5])
    criteria5 = np.array([0.4, 0.4])
    directions5 = ['le','ge']
    result5 = False
    
    assert (check_satisficing(objs5, criteria5, directions5)) == result5, \
    'Test 5 failed ' + repr(objs5) + ', ' + repr(criteria5) + ', ' + \
    repr(directions5) + ' returned True, should be False'
    
    
    # test case 6: 
    objs6 = np.array([0.5, 0.5])
    criteria6 = np.array([0.5, 0.5])
    directions6 = ['ge','ge']
    result6 = True
    
    assert (check_satisficing(objs6, criteria6, directions6)) == result6, \
    'Test 6 failed ' + repr(objs6) + ', ' + repr(criteria6) + ', ' + \
    repr(directions6) + ' returned False, should be True'
    
    # test case 7: 
    objs7 = np.array([0.5, 0.5])
    criteria7 = np.array([0.5, 0.5])
    directions7 = ['le','le']
    result7 = True
    
    assert (check_satisficing(objs7, criteria7, directions7)) == result7, \
    'Test 7 failed ' + repr(objs7) + ', ' + repr(criteria7) + ', ' + \
    repr(directions7) + ' returned False, should be True'
    
    
    print("check_satisficing has passed all tests!")

    return    
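
If you prefer to lean on an established framework, the same seven documented examples can be expressed much more compactly with PyTest’s parametrize decorator. The sketch below is my own illustration (not part of the original course material) and assumes check_satisficing lives in a module named check_satisficing.py:

# test_check_satisficing.py -- a minimal PyTest sketch; run with: pytest test_check_satisficing.py
import numpy as np
import pytest

from check_satisficing import check_satisficing

# each tuple is one of the documented examples from the function specification
@pytest.mark.parametrize("objs, criteria, directions, expected", [
    ([0.5, 0.5], [0.4, 0.4], ['ge', 'ge'], True),
    ([0.5, 0.5], [0.4, 0.4], ['le', 'le'], False),
    ([0.4, 0.4], [0.5, 0.5], ['le', 'le'], True),
    ([0.5, 0.5], [0.4, 0.4], ['ge', 'le'], False),
    ([0.5, 0.5], [0.4, 0.4], ['le', 'ge'], False),
    ([0.5, 0.5], [0.5, 0.5], ['ge', 'ge'], True),
    ([0.5, 0.5], [0.5, 0.5], ['le', 'le'], True),
])
def test_check_satisficing(objs, criteria, directions, expected):
    assert check_satisficing(np.array(objs), np.array(criteria), directions) == expected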

Concluding thoughts

When developing a new Python function, it’s good practice to code the test cases while you’re writing the function and to test each component during the development cycle. Coding in this manner will allow you to identify and fix errors early and can save a lot of time and headache. Creating test cases during code development may also improve the quality of your code by helping you conceptualize and avoid potential bugs.

Writing sharable Python code

Most of us in academia do not have formal training in computer science, yet for many of us writing and sharing code is an essential part of our research. While the quality of code produced by many graduate students can be quite impressive, many of us never learned the basic CS principles that allow programs to be sharable and inheritable by other programmers. Last summer I proctored an online class in Python fundamentals. The course was for beginner programmers and, though it covered material simpler than the scripts I write for my PhD, I learned a lot about how to properly document and structure Python functions to make them usable by others. In this post I’ll discuss two important concepts I took away from this course: writing proper function specifications and enforcing preconditions using assert statements.

Function Specifications

One of the most important aspects of writing a quality Python function is proper specification. While the term may sound generic, a specification actually has a very precise definition and implementation for a Python function. In practice, a specification is a docstring, a “string literal” that occurs as the first statement in a function, module, class or method, enclosed in a set of triple quotes ("""). Here is an example of a simple function I wrote with a specification:


def radians_to_degrees(theta):
    """
    Returns: theta converted to degrees
    
    Value return has type float
    
    Parameter theta: the angle in radians
    Precondition: theta is a float
    """
    return theta * (180.0/3.14159)

The function specification is everything between the sets of """. When Python sees this docstring at the front of a function definition, it is automatically stored as the function’s __doc__ attribute. With this specification in place, any user who loads this function can access its __doc__ by typing help(radians_to_degrees), which will print the following to the terminal:

Help on function radians_to_degrees in module __main__:

radians_to_degrees(theta)
    Returns: theta converted to degrees
    
    Value return has type float
    
    Parameter theta: the angle in radians
    Precondition: theta is a float

The help function will print anything in the docstring at the start of a function, but it is good practice for the specification to have the following elements:

  1. A one-line summary of the function at the very beginning; if the function is a “fruitful function” (meaning it returns something), this line should state what it returns. In my example I note that my function returns theta converted to degrees.
  2. A more detailed description of the function. In my example this simply provides the type of the return value.
  3. A description of the function’s parameter(s)
  4. Any preconditions that are necessary for the code to run properly (more on this later)

I should note that officially the Python programming language has a more flexible set of requirements for function specifications, which can be found here, but the attributes above are a good starting point for writing clear specifications.

Properly specifying a Python function will clarify the function’s intended use and provide instructions for how new users can utilize it. It will also help you document your code for formal release if you ever publish it. Google any of your favorite Python functions and you’ll likely be brought to a page that has a fancy-looking version of the function’s specification. These pages can be automatically generated by tools such as Sphinx that create them directly from the function’s definition.

Aside from clarifying and providing instructions for your function, specifications provide a means of creating a chain of accountability for any problems with your code. This chain of accountability is created through precondition statements (element four above). A precondition statement dictates requirements for the function to run properly. Preconditions may specify the type of a parameter (e.g., x is a float) or a general statement about the parameter (e.g., x < 0).

For large teams of many developers and users of functions, precondition statements create a chain of accountability for code problems. If the preconditions are violated and the code crashes, then it is the responsibility of the user, who did not use the code properly. On the other hand, if the preconditions were met and the code crashes, it is the responsibility of the developer, whose code contains a bug.

Asserting preconditions to catch errors early

One way to improve the usability of a code is to catch precondition violations with statements that throw specific errors. In Python, this may be done using assert statements. Assert statements evaluate a boolean expression and can display a custom error message if the expression returns false. For example, I can modify my radians_to_degrees function with the following assert statement (line 10):

def radians_to_degrees(theta):
    """
    Returns: theta converted to degrees
    
    Value return has type float
    
    Parameter theta: the angle in radians
    Precondition: theta is a float
    """
    assert type(theta) == float, repr(theta) + ' is not float'
    return theta * (180.0/3.14159)

A helpful function in my assert statement above is “repr”, which will return a printable representation of a given object (i.e. for a string it will keep the quotation marks around it). Now if I were to enter ‘a’ into my function, I would get the following return:

AssertionError: 'a' is not float

This is trivial for my example, but can save a lot of time when running large and complex functions. Rather than seeing 30 lines of “traceback…”, the user will be immediately brought to the source of the error. In general, you should try to make assert statements as precise as possible to pinpoint the violation.

It’s important to note that not every conceivable precondition violation requires an assert statement. There is a tradeoff between code efficiency and number of assert statements (checking a precondition takes time). Determining which preconditions to strictly enforce with assert statements is a balancing act and will be different for each application. It’s important to remember though, that even if you do not have an assert statement for a precondition, you should still list all preconditions in the function specification to preserve the chain of accountability.

Further resources

This post concerns specifications for Python functions. However, the use of docstrings is also important when coding modules, classes and public methods. Guidance on how these docstrings may be formatted can be found here.

While this post focused on Python, informative function specifications are important for all programming languages. Some resources on documentation in other languages can be found below:

More on simple Bash Shell scripts (examples of “find” and “sed”)

When you conduct a large ensemble of computer simulations with several scenarios, you are going to deal with a lot of data, including inputs and outputs. You also need to create several directories and subdirectories where you can put or generate the inputs and outputs for your model. For example, you may want to run a cropping system model across a large region, for 3500 grid cells, and you need to feed your model with the input files for each grid cell. Each grid cell has its own weather, soil, crop and management input files. Or you may want to run your model 100,000 times, each time with a different set of crop parameters as input, to conduct a sensitivity analysis. Another common practice in large simulations is looking for any hidden error that happens during the runs. For example, your jobs might look normal, without any obvious crash, but you may still find some kind of “error” or “warning” in your log files. So, you need to find those runs, correct them, delete the wrong files and rerun them to obtain a full set of outputs. These tasks are basic but can be challenging and very time-consuming if you do not know how to complete them efficiently. The Linux environment provides facilities that make our lives easier, as Dave described in his blog post, and Bernardo also provided some examples for this type of task. Here are a few more instances of simple but useful commands with “find” and “sed.”

find

Sometimes, you want to know how many files with a specific pattern exist in all the subdirectories of a folder. You can run the command below from the upper-level folder. “-type f” means we are searching for files, and after “-name” we specify the name pattern, for example files that start with “218”. We can also look for all files that have a specific extension [i.e. *.csv] or a specific string in their names [i.e. *yield*].

find . -type f -name "218*"

Then we can pipe [|] the list of results to the word-count command [wc -l], which counts the matching lines:

find . -type f -name "218*" |  wc -l

You may want to find and delete all files with the specific pattern [i.e. 218_wheat.csv] in all directories in which they might exist. So, after we find the files, we can execute [exec] the remove command [rm]:

find . -type f -name "218_wheat*" -exec rm {} \;

If these files are located in different directories and we don’t want to delete them all, we can also filter the find results by specifying the pattern of path [i.e. output] and then delete them:

find . -type f -path "*/output/*" -name "218_wheat*" -exec rm {} \;

Sometimes, we need to find specific paths instead of files. For example, I have a text file, and I want to copy that into the “CO2” folder, which is located in the “Inputs” folders of several scenario folders:

find . -type d -path "*/Inputs/*" -name "CO2" -exec cp ../CO2_concentration_8.5.txt {} \;

“-type d” means directories, so we are searching for directories named “CO2” whose paths contain an “Inputs” folder. Then, we execute the copy command [cp], which copies the text file into each of the found directories.

If we are looking for a specific string inside some files, we can combine “find” and “grep.” For example, here I am looking for any error file [*.err] that starts with “218cs” and contains this specific warning: “unable to find”

find . -type f -name "218cs*.err" -exec grep -i "unable to find" {} \;

Or we can look for files that do not contain “success.”

find . -type f -name "218cs*.err" -exec grep -L "success" {} \;

sed

“sed” is a powerful text editor. Most of the time it is used to replace a specific string in a text file:

sed -i 's/1295/1360/' 218cs.txt

Here, the “-i” flag edits the file in place, and the substitute command [s] replaces the original string [1295] with the new string [1360]. There might be a few occurrences of “1295” in the file, and we may want to replace just one of them, for example the one located at line 5:

sed -i '5s/1295/1360/' 218cs.txt

There might be more modifications that have to be done, so we can add them in one line using “-e”:

sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/' 218cs.txt

find + sed

If you want to find specific text files (i.e., all the 218cs.txt files, inside RCP8.5 folders) and edit some lines in them by replacing them with new strings, this line will do it:

find . -type f -path "*/RCP8.5/*" -name "218*" -exec sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/'  {} \;

Sometimes, you want to replace an entire line in a text file with a long string, like a path, or keep some space in the new line. For example, I want to replace a line in a text file with the following line, which has the combination of space and a path:

“FORCING1         /home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_”

For this modification, I am going to assign the new line to a variable and then use it to replace the whole line located at line 119 in the text files [global_param_default.txt], which are located in specific directories [with this pattern “RCP4.5/451812118”]. Note that the sed command below uses “|” as its delimiter instead of “/”, because the replacement string itself contains slashes.

path_new="FORCING1	/home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_"
find . -type f -path "*RCP4.5/451812118/*" -name "global_param_*" -exec sed -i "119s|.*|$path_new|" {} +

Sometimes, you want to add a new line at the specific position (i.e., line 275) to some text files (i.e., global_param_default.txt).

find . -type f -name "global_param_*" -exec sed -i "275i\OUTVAR    OUT_CROP_EVAP  %.4f OUT_TYPE_FLOAT  1" {} \;

Now, all of the “global_param_default” files have a new line with this content: “OUTVAR    OUT_CROP_EVAP  %.4f OUT_TYPE_FLOAT  1”.

It is also possible that you want to use a specific section of a path as the name of a variable or a file. For example, I am searching for directories that contain an “output” folder. This path would be one of them: ./453911731_CCF/output/. Now, I want to extract “453911731” and use it as a new name for a file [output__46.34375_-119.90625] that is already inside that path:

for P in $(find . -type d -name "output"); do (new_name="$(echo "${P}"| sed -r 's/..(.{9}).*/\1/')" && cd "${P}" && mv output__46.34375_-119.90625 $new_name); done

With this command line, we repeat the whole process for each directory (matching the “output” pattern) by using “for,” “do,” and “done.” Each search result, which is a path string, is assigned to the variable “P” by wrapping the “find” command in $( ). Then, the result of “sed -r” is assigned to another variable [new_name]; “-r” in the sed command enables extended regular expressions.

With the “sed” command, we are extracting the 9 characters that follow “./” and removing everything after them. Each “.” matches any single character, the parentheses create a matching group, “{9}” means 9 occurrences of the preceding pattern (here “.”, i.e. any character), and “\1” refers to the first matched group.

“&&” is used to chain commands together. “cd” allows you to change into a specified path, which is stored in $P, and “mv” renames the file in this path from “output__46.34375_-119.90625” to “453911731,” which is stored in $new_name.

Profiling your Python script using cProfile

Profiling refers to performing dynamic analysis on a script to measure its execution time, the execution time of its subcomponents, and how many times each subcomponent is called. This produces data on where the program is spending the most time and can help with optimizing your script to minimize its execution time. This blog has had two past posts on profiling, one on C++ using Callgrind and one on Python using PyCharm. PyCharm is a Python IDE that’s very useful but unfortunately not free, so if you’re looking for some freeware profiling functionality in Python, this post is for you.

Python’s standard library includes a module called cProfile. Here is a simple example timing the multiplication of two matrices with cProfile:

import cProfile
import numpy as np

mat1 = ([1, 6, 3],[3, 6, 3],[2, 8, 3]) 
mat2 = ([2, 7, 6],[5, 4, 7],[6, 1, 9]) 
  
cProfile.run('np.dot(mat1,mat2)')

This should print out something like the following:

         4 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.dot}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

I have recently used cProfile on one of my own scripts, which I’ll use here to demonstrate how it can be applied in your own work. I have a function called fish_game, which contains my model and takes vars as input. This function also calls the function hrvSTR, which represents my action policy (what these functions do exactly is extraneous to this post: one represents the system and the other represents a policy we use to act on the system; you can see the full model here). The fish_game function is called by my optimization algorithm during optimization. Running cProfile on it produces this:

cProfile.run('fish_game(vars)',sort='cumulative')
         282814 function calls in 0.698 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.698    0.698 {built-in method builtins.exec}
        1    0.000    0.000    0.698    0.698 <string>:1(<module>)
        1    0.166    0.166    0.698    0.698 <ipython-input-4-df258d5a749f>:55(fish_game)
    20200    0.385    0.000    0.531    0.000 <ipython-input-4-df258d5a749f>:1(hrvSTR)
    20200    0.016    0.000    0.089    0.000 fromnumeric.py:1821(sum)
    20200    0.021    0.000    0.069    0.000 fromnumeric.py:64(_wrapreduction)
   121208    0.055    0.000    0.055    0.000 {built-in method numpy.core.multiarray.zeros}
    20200    0.046    0.000    0.046    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    20200    0.003    0.000    0.003    0.000 {built-in method builtins.isinstance}
    40400    0.003    0.000    0.003    0.000 {built-in method builtins.len}
    20200    0.002    0.000    0.002    0.000 {method 'items' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'normal' of 'mtrand.RandomState' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

This tells me that one run of my function takes 0.698 seconds in total (this might vary slightly every time, depending on your processor usage at the time and other factors), and that most of that time, 0.531 seconds, is consumed by the hrvSTR function. Even though 0.7 seconds might not seem long, in the context of optimization, where this function would need to be evaluated tens of thousands of times, an additional 0.1 seconds can add hours of processing time to your workflow. Trying to bring that down is probably a worthwhile investment of time that will result in savings later on. As a result, I figured there could be something I could do to reduce the time hrvSTR took. I was particularly intrigued by the fact that the numpy function numpy.core.multiarray.zeros was called 121208 times, an order of magnitude more than every other method in my script, which prompted me to think that I might be unnecessarily repeating a process.

Looking at my code more closely (this is a script I have been using for more than a year now), I realized that I was ordering, normalizing, and creating arrays for my action policy every single time it was called, which was unnecessary, as the same policy is used for all time steps. I could instead perform all those steps once, save the outputs, and use them for every time step instead of recalculating them each time. I spent some time adjusting my script to do that, and running cProfile again produced this:

 cProfile.run('fish_game(vars)',sort='cumulative')
         79282 function calls in 0.379 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.379    0.379 {built-in method builtins.exec}
        1    0.000    0.000    0.379    0.379 <string>:1(<module>)
        1    0.150    0.150    0.379    0.379 <ipython-input-16-3c528334eb55>:62(fish_game)
    19800    0.193    0.000    0.229    0.000 <ipython-input-16-3c528334eb55>:35(hrvSTR)
    59414    0.036    0.000    0.036    0.000 {built-in method numpy.core.multiarray.zeros}
        4    0.000    0.000    0.000    0.000 fromnumeric.py:2817(mean)
        4    0.000    0.000    0.000    0.000 _methods.py:58(_mean)
        2    0.000    0.000    0.000    0.000 <ipython-input-16-3c528334eb55>:4(generate_policy)
        6    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        2    0.000    0.000    0.000    0.000 {method 'normal' of 'mtrand.RandomState' objects}
        2    0.000    0.000    0.000    0.000 fromnumeric.py:1821(sum)
        2    0.000    0.000    0.000    0.000 fromnumeric.py:64(_wrapreduction)
        4    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        4    0.000    0.000    0.000    0.000 _methods.py:48(_count_reduce_items)
        2    0.000    0.000    0.000    0.000 {method 'clip' of 'numpy.generic' objects}
        4    0.000    0.000    0.000    0.000 numeric.py:504(asanyarray)
       10    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        8    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        4    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        2    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

I basically shaved 0.3 seconds off my function evaluation time by investing some time in looking more closely at my script. If optimizing with 30,000 function evaluations, this translates to over 2 hours of processing time saved (and I spent far less than that figuring it out).
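
As a toy illustration of this kind of fix (my own simplified example, not the actual fish_game code), the pattern is simply to move work that does not depend on the loop variable out of the loop:

# Toy example of hoisting loop-invariant work out of a time-step loop (my own illustration)
import numpy as np

def slow_policy_sim(params, n_steps=20000):
    outputs = np.zeros(n_steps)
    for t in range(n_steps):
        # the policy arrays are rebuilt and re-normalized every time step,
        # even though params never changes inside the loop
        policy = np.sort(np.asarray(params))
        policy = policy / policy.sum()
        outputs[t] = policy[0] * t
    return outputs

def fast_policy_sim(params, n_steps=20000):
    # build and normalize the policy arrays once, outside the loop
    policy = np.sort(np.asarray(params))
    policy = policy / policy.sum()
    outputs = np.zeros(n_steps)
    for t in range(n_steps):
        outputs[t] = policy[0] * t
    return outputs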

You can also run cProfile directly from the command line like so:

python -m cProfile -s cumtime fish_game.py
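
If you would rather keep the results around for later inspection, you can also write the profile to a file and explore it with the standard-library pstats module. Here is a minimal sketch (the .prof file name is just a placeholder, and fish_game and vars are assumed to be defined as above):

import cProfile
import pstats

# profile the function call and save the raw statistics to a file
# (equivalently, from the command line: python -m cProfile -o fish_game.prof fish_game.py)
cProfile.run('fish_game(vars)', filename='fish_game.prof')

# load the saved statistics, sort by cumulative time, and print the 10 slowest entries
stats = pstats.Stats('fish_game.prof')
stats.sort_stats('cumulative').print_stats(10)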