How ChatGPT Helped Me To Convert a MATLAB Toolbox to Python and Learn Python Coding

This post is a guest post by Veysel Yildiz from the University of Sheffield. Veysel is a third-year PhD student in Civil Engineering working with Dr. Charles Rouge (website). Veysel’s research contributes to scientific and technical advances in hydropower plant design. If you are interested in contributing a guest post to this blog, please contact Lillian Lau (lbl59@cornell.edu).

Like many engineers with a graduate education, I learned MATLAB first, before realising Python would be a better language for code I want to make freely available. But unlike most, I had already written a 2500-line piece of software in MATLAB by the time I realised this. My software, the HYPER toolbox, is the first to simultaneously optimise plant capacity and turbine design parameters for new run-of-river hydropower plants (for more information, please follow this link). In my PhD, I have been improving this toolbox to also evaluate the financial robustness of hydropower plant design in a changing world, i.e., is a plant still a sensible investment when climate and socio-economic conditions are different? This is a key question in countries such as Turkey, where I am from, that are getting drier. I have also been making my toolbox faster through algorithmic improvements, and I want to release this fast version in a license-free language.

While I was proficient in MATLAB, my knowledge of Python was limited. Fortunately, with the help of ChatGPT, I was able to successfully navigate the transition from MATLAB to Python, both for my toolbox and for myself. Indeed, I was not only able to convert the code but also to gain valuable insights into Python programming. In this blog post, I will share my ChatGPT experience and its benefits, and offer some practical tips for using ChatGPT effectively in Python coding.

The challenge of transitioning from MATLAB to Python

MATLAB is a sophisticated numerical computing environment known for its ease of use and extensive libraries for a wide range of scientific and engineering applications. Python, on the other hand, has grown in popularity due to its adaptability, open-source nature, and diverse set of libraries and frameworks. Given the changing environment of software development and user demands, it was vital to make the switch from MATLAB to Python. The main challenges I faced during this transition were:

  • Syntax Differences: MATLAB and Python have different syntax and conventions, making it challenging to convert code directly.
  • Lack of Proficiency: I was not proficient in Python, which added to the complexity of the task.
  • Code Size: With 2500 lines of code, a manual conversion could have taken several months.
  • Performance: Python can be slower than MATLAB for certain tasks such as matrix computation.

How ChatGPT accelerated the process

ChatGPT is a flexible large language model that can answer queries, generate code samples, and explain concepts in a variety of programming languages, including Python.

Here’s how ChatGPT helped me through this process:

Code Conversion Guidance

I started by asking about basic operations, such as how to load and read a dataset. Then, I asked for guidance on how to convert specific functions and syntax from MATLAB to Python. ChatGPT provided clear explanations and Python code examples, helping me understand the nuances of Python syntax. For example:
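One such exchange covered the Python equivalent of MATLAB's readtable. The snippet below is a reconstruction of that kind of answer (the file name is hypothetical):

# MATLAB:  data = readtable('streamflow.csv');
import pandas as pd

data = pd.read_csv('streamflow.csv')   # a pandas DataFrame, analogous to a MATLAB table
print(data.head())                     # inspect the first rows, like head(data) in MATLAB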

ChatGPT guided me through the conversion process step by step. It assisted me in identifying and addressing the syntactic differences, data structure changes, and library replacements needed to make the code run in Python. The conversion was not a one-time event; it involved continuous iteration and refinement. After implementing ChatGPT’s suggestions for one part of the code, I would return with additional questions as I progressed to the next steps. This iterative approach allowed me to build a solid foundation in Python while converting the MATLAB code. Here is an example:
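A typical conversion looked like the reconstruction below (the variable names are hypothetical): ChatGPT pointed out that MATLAB's 1-based indexing and element-wise operators map to 0-based NumPy slices and operators.

import numpy as np

# MATLAB:  Q = flow(1:365) .* conv_factor;  P = Q .^ 2;
flow = np.loadtxt('daily_flow.txt')   # hypothetical input file
conv_factor = 0.0283                  # e.g., conversion from cfs to m^3/s
Q = flow[:365] * conv_factor          # flow[:365] replaces MATLAB's flow(1:365)
P = Q ** 2                            # ** replaces MATLAB's .^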

Debugging and Error Handling

During the conversion process, I encountered errors and unexpected behaviour in the Python code. Whenever this happened, I fed the error messages to ChatGPT to identify their causes. ChatGPT explained Python’s error messages and traceback information, allowing me to handle difficulties on my own. In other words, it assisted in debugging by offering strategies for error detection and correction. Here is an example:

ValueError: Unable to parse string “EKİM” at position 0
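This error came from trying to coerce a column of Turkish month names ('EKİM' is October) to numbers. A minimal reconstruction of the problem and the kind of fix ChatGPT suggested might look like this (the actual dataset differs):

import pandas as pd

df = pd.DataFrame({'month': ['EKİM', 'KASIM'], 'flow': ['12.4', '8.9']})
# pd.to_numeric(df['month'])  # raises: ValueError: Unable to parse string "EKİM" at position 0
df['flow'] = pd.to_numeric(df['flow'])   # convert only the genuinely numeric column
months = {'EKİM': 10, 'KASIM': 11}       # map month names to numbers explicitly
df['month'] = df['month'].map(months)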

Code Optimization

Once I had a script that ran successfully, I would often ask ChatGPT to suggest ways to optimize the Python code for performance and readability. Here is an example:
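A common suggestion was to replace element-wise Python loops with vectorized NumPy operations. The sketch below uses hypothetical variables and dummy data to illustrate the pattern:

import numpy as np

head = np.random.uniform(50, 100, 10_000)   # hydraulic head (m), dummy data
flow = np.random.uniform(1, 20, 10_000)     # discharge (m^3/s), dummy data

# Loop version (slow):
power = np.empty_like(flow)
for i in range(len(flow)):
    power[i] = 9.81 * 0.9 * flow[i] * head[i]

# Vectorized version (much faster, and easier to read):
power = 9.81 * 0.9 * flow * head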

Learning Python Concepts

As I progressed in converting the code with the assistance of ChatGPT, I discovered that the journey was not just about code translation; it was an invaluable opportunity to learn essential Python concepts and best practices. ChatGPT was quite beneficial in guiding me through this learning process, providing explanations of Python’s data types and control structures, and of how to use libraries efficiently.
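List comprehensions, for example, were one of the idioms I picked up that replace many explicit MATLAB-style loops (an illustrative snippet, not from the toolbox):

flows = [3.2, 5.1, 0.0, 7.4]
wet_days = [q for q in flows if q > 0]   # keep only nonzero flows, no loop boilerplate needed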

A Word of Caution

ChatGPT is a powerful AI language model, but like any tool, it has limits. It may sometimes offer inaccurate or unsatisfactory code suggestions or explanations. As a user, you should exercise caution and systematically validate the information ChatGPT gives you. Here is an example:
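One subtle pitfall I ran into was with indexing: a literal translation of MATLAB's data(1:12) into Python silently drops the first element. The snippet below is a hypothetical illustration of the kind of suggestion that needs checking:

data = list(range(24))
monthly = data[1:12]   # WRONG: skips index 0 and returns only 11 elements
monthly = data[:12]    # correct 12-element slice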

Key Takeaway Messages from the Journey

  • Use clear instructions: When interacting with ChatGPT, provide clear and concise instructions. Describe the problem you’re trying to solve, the desired outcome, and any constraints or requirements. The more specific you are, the better the model’s response is likely to be.
  • Understand its limitations: ChatGPT may not always produce perfect or optimal solutions. It is important to evaluate and verify the code created to ensure it meets your requirements.
  • Review and validate generated code: Carefully review the code generated by ChatGPT. Check for accuracy, readability, and efficiency. Debug any issues that arise, and test the code with different inputs to ensure it works as expected.
  • Iterative refinement: If the initial code generated by ChatGPT doesn’t meet your requirements, provide feedback and iterate. Ask for specific improvements or clarifications, and guide the model towards a better solution.

Conclusion

Through the guidance and support of ChatGPT, I successfully converted 2500 lines of MATLAB code to Python in 10 days. This accomplishment was a significant milestone in my programming journey. Along the way, I gained proficiency in Python, a skill that has opened up new opportunities in data science and beyond. This experience demonstrates the power of AI-driven tools in helping programmers with challenging tasks while fostering continuous learning.

If you find yourself in a similar transition or coding challenge, try using AI-powered language models like ChatGPT to facilitate your journey and enrich your programming skills. It is not just about solving issues, but also about learning and growing as a programmer in a dynamic technological environment.

ChatGPT is a powerful tool to assist you with coding tasks, but it does not replace developing your own coding skills and actively engaging in the learning process.

Legacy Code Reborn: Wrapping C++ into MATLAB Simulink

Have you ever needed to include legacy code in your own projects without rewriting it in your new framework? In MATLAB Simulink, there is a way to include C++ code within your system. This post will teach you how to fit legacy C++ code into your Simulink projects.

Let’s assume you have been given C++ code in which neural networks are implemented and already trained (as an example; this guide extends to any generic class or function). Let’s also assume that your goal is to include them within a larger control system. Recreating them from scratch and trying to translate the original code into MATLAB could be an unwise choice, as implementation errors could be around the corner. The best solution is to include the neural networks directly as they are, wrapping them in a nice box. So how do we proceed?

Requirements
To fully appreciate the following post, you should have a basic understanding of object-oriented programming, C++, and Matlab Simulink.

Disclaimer
The approach I will show you is not the fastest solution you could think of. Indeed, there are easier ways to wrap your C++ code if it fulfills certain limitations declared in the C Function block documentation. Still, this solution is the most complete one you can think of, and it always works.
Below, I recall the Simulink limitations from the official documentation.
The C Function block cannot be used to directly access all C++ classes. These types of classes are not supported:

  • Template classes
  • Standard Template Library (STL) containers
  • Classes with private constructors and destructors

If your code falls into one of the above cases, you should be interested in reading this post; otherwise, you can consider following the more straightforward way.

1. Installing the C++ compiler
To execute C++ code, you need a C++ compiler. If you already have one, skip this section. Otherwise, you have two options; in both cases, you will end up with a functioning C++ compiler on your device:

  1. Installing the standard open-source C++ compiler.
    In this case, I recommend MinGW (even if it is pretty old). You can install it simply by following the software installation guidelines.
  2. Installing MinGW as a MATLAB add-on.
    Another (even easier) possibility is installing MinGW directly from MATLAB as a standard MATLAB toolbox.

To let MATLAB know that you will need to compile C++ code, you need to run (each time you execute the code) the following command in the MATLAB terminal:

mex -setup C++

If the compiler has been correctly installed (in my case, through Microsoft Visual Studio 2022), this should print:

MEX configured to use 'Microsoft Visual C++ 2022' for C++ language compilation.

or, if you installed it through the MATLAB add-on:

MEX configured to use 'MinGW64 Compiler (C++)' for C++ language compilation. 

2. The C Function Block

Here is where things start to get more challenging.
The starting point is understanding which Simulink block we are going to use to wrap our C++ code. In this case, it is the C Function block.

The block has the following structure:

  • Output code: here is where you generate your output. The following is a candidate example.
    output = predict(myNN,input1,input2);
    Here we invoke the predict method, passing two possible inputs as arguments and returning the output (our prediction, hopefully). The special guest is myNN. This is the key to the solution: it represents an instance of the class of the object we are going to wrap.
  • Start code: here, you instantiate the myNN object. It is the equivalent of the constructor in the C++ class.
    myNN = createNN();
  • Terminate code: here is where you destroy your object (the equivalent of the destructor).
    deleteNN(myNN);
  • Symbols: here, you declare your inputs, outputs, and the myNN object. You can specify names (input1, input2, myNN) and data types (int32, double, …). Remark: it is fundamental that the type of the myNN object is VoidPointer; this follows from the fact that we are building a wrapper.

To conclude the inspection of the C Function block, we must look at the settings. Here we must specify all the source files (.cpp) and header files (.h) in the corresponding sections. The files must be in the same folder as the Simulink file, unless you specify the directory in the corresponding field.

An important remark: the settings are common to all the C Function blocks inside the project. Thus, you can’t structure your application assuming that each C Function block can have separate and independent source files. For this reason, even if you need to instantiate multiple classes of different natures inside your scheme, in the end you will need one unique wrapper (with corresponding .cpp and .h files).

3. Writing your C++ Wrapper

After exploring the C Function block, we are ready to start writing our wrapper. Let’s assume you have legacy code that implements your neural network in source/header files. What you must do is: 

  1. Write your NN class, whose methods will be invoked by the wrapper. 
  2. Write the wrapper.

3a. Writing the class

We will call the files uniqueNN.cpp/uniqueNN.h.

#include "uniqueNN.h"
#include "ann.h"
// include here all additional dependences

myNN::myNN() {
}

myNN::~myNN() {
    delete ANN;
}

double myNN::predict(double input1, double input2) {
// copy-paste here your legacy code
}

In this class, you can see the constructor, the destructor, and the predict method. Inside that method, you would copy-paste your legacy code; in there, you can do whatever you want. Remark: you don’t need to fulfill the limitations of the C Function block, because this class will be invoked through a “protective layer” represented by our wrapper!

Suppose your system has multiple components, each represented by a different network with a different structure. You can still use the very same class and implement a corresponding method for each of your original networks to produce that network’s prediction.

Clearly, you must list in the block settings, under source and header files, all the files needed to execute your legacy code.
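For reference, here is a minimal sketch of what uniqueNN.h could look like. The member name net and the exact declarations are my assumptions for illustration, not taken from the original post; adapt them to your legacy code:

#ifndef UNIQUE_NN_H
#define UNIQUE_NN_H

class ANN;  // forward declaration of the legacy network class from ann.h

class myNN {
public:
    myNN();
    ~myNN();
    double predict(double input1, double input2);
private:
    ANN* net = nullptr;  // hypothetical handle to the legacy network
};

// Free functions invoked by the C Function block (see Section 3b)
void* createNN();
void deleteNN(void* obj);
double predict(void* obj, double input1, double input2);

#endif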

3b. Implementing the wrapper

#include "uniqueNN.h"
// include here all additional standard libraries 

// Neural Networks

void* createNN()
{
    return new myNN();
}

void deleteNN(void* obj)
{
    delete static_cast<myNN*>(obj);
}

double predict(void* obj, double input1, double input2)
{
    myNN* nn = static_cast<myNN*>(obj);
    return nn->predict(input1, input2);
}

Here we have three functions. These are the ones we saw in the C Function block! They will be invoked by Simulink.

As you can see, we find again the constructor, the destructor, and the predict method. For the sake of conciseness, I suggest using the same name for the method implemented in the class and the function implemented in the wrapper, but it is not mandatory. Notice that the wrapper’s predict function must take the void pointer as an argument (the class object you are instantiating, i.e., the one you declared in the C Function block!).

Remark: as you might have imagined, for each method implemented in the class (which in our example represents a different legacy neural network), you should create one corresponding function in the wrapper.

A final remark: the wrapper needs only the header of your class (for us, uniqueNN.h), not all the headers from the legacy code, so do not include them; that would be redundant.

Below you can see the quality of the reconstruction of the legacy code through the C++ wrapper. The small differences that you might experience are due to internal data type differences between the legacy code simulator and the new one, and are absolutely negligible.

The figure above shows that the predictions from the neural network are nearly identical for both the wrapper-based Simulink and the original C++ implementations.

That’s it! Thank you for reading, and if you have further questions, please feel free to contact me; here is my LinkedIn profile!

Weather Regime-Based Stochastic Weather Generation (Part 2/2)

In this post on the Water Programming Blog, we continue to explore the application of the stochastic weather generator (available on GitHub) for climate-change scenario developments. This is the second installment of a two-part series of blog posts, and readers are strongly encouraged to familiarize themselves with different components of the stochastic weather generator, as explained in Part 1 by Rohini Gupta (The Reed Research Group).

Here, we will begin by offering a concise overview of developing climate change scenarios and how these scenarios are integrated into the weather generation model. Following this, we will interpret the impact of these climate change conditions on key statistics concerning the occurrence of floods and droughts. Through these examples, the implications of these climatic shifts for water resources management and flood risk analysis will become evident.

Climate Change Perturbations

In this stochastic weather generator, we specifically focus on two aspects of climate change scenario development: 1) thermodynamic and 2) dynamic perturbations.

1) Thermodynamic Change

Thermodynamic climate change, often referred to as temperature-driven change, is primarily driven by changes in the Earth’s energy balance and temperature distribution. This warming affects various aspects of the climate system, such as intensifying precipitation extremes, melting snowpacks and ice sheets, rising sea levels, altered weather patterns, and shifts in ecosystems. The primary driver of temperature-driven climate change is the increase in regional-to-global average temperatures due to the enhanced greenhouse effect. As temperatures rise due to natural and anthropogenic forcings, they trigger a cascade of interconnected impacts throughout the climate system.

In the stochastic weather generator, scenarios of temperature change are treated simply by adding trends to simulated temperature data for each location across the spatial domain. Scenarios of thermodynamic precipitation intensification, however, are modeled using a quantile mapping technique that scales the distribution of daily precipitation in a way that replicates the effects of warming temperatures on precipitation as the moisture-holding capacity of the atmosphere increases. In the context of California, previous studies have demonstrated that as temperatures rise, the most severe precipitation events (often associated with landfalling Atmospheric Rivers) are projected to increase in frequency, while the intensity of smaller precipitation events is expected to decrease (Gershunov et al., 2019). This alteration effectively stretches the distribution of daily precipitation, causing extreme events to become more pronounced while reducing the occurrence and strength of lighter precipitation events. We replicate this phenomenon by making adjustments to the statistical characteristics and distribution percentiles of precipitation (e.g., Pendergrass and Hartmann, 2014). To further elaborate, we select a scaling factor for the 99th percentile of nonzero precipitation and then modify the gamma-GPD mixed distribution to enforce this chosen scaling factor. For instance, in a scenario with a 3°C temperature warming and a 7% increase in extreme precipitation per °C (matching the theoretical Clausius-Clapeyron rate of increase in atmospheric water holding capacity due to warming; Najibi and Steinschneider (2023)), the most extreme precipitation events are projected to increase by approximately 22.5% (1.07³ ≈ 1.225). We adjust the gamma-GPD models fitted to all locations to ensure this percentage increase in the upper tail of the distribution. Assuming that mean precipitation remains constant at baseline levels, this adjustment will cause smaller precipitation values in the gamma-GPD model to decrease, compensating for the increase in extreme events by stretching the distribution of nonzero precipitation.
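As a quick back-of-the-envelope check of that 22.5% figure, here is a sketch in Python (the variable names are mine, not from the generator's R code):

warming_c = 3.0                          # temperature increase (°C)
cc_rate = 0.07                           # fractional increase in extreme precipitation per °C
scaling = (1 + cc_rate) ** warming_c     # compounded scaling of the 99th percentile
print(round(scaling, 3))                 # 1.225, i.e., a ~22.5% increase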

Lines 33-40 of ‘config.simulations.R’ show the user-defined changes that implement the thermodynamic scenarios, based on inputs for temperature change in Celsius (tc: e.g., 1 °C), percent change in the extreme precipitation quantile (pccc: e.g., 7% per °C), and percent change in average precipitation (pmuc: e.g., a 12.5% decrease). Needless to say, the stochastic weather generator runs in baseline mode if tc=0 and pmuc=0.

    ##-------------Define perturbations-------------##
    ##climate changes and jitter to apply:
    change.list <- data.frame("tc"=  c(0), # {e.g., 0, 1, 2, ...} (changes in temperature)
                              "jitter"= c(TRUE),
                              "pccc"= c( 0.00), # {e.g., 0, 0.07, 0.14, ...} (changes for precipitation extreme quantile -- CC)
                              "pmuc"= c( 0.00)# {e.g., 0, -.125, .125, ...} (changes in precipitation mean)
    )
    ##----------------------------------------------##

2) Dynamic Change

Dynamic climate change, also known as circulation-driven change, is driven by shifts in atmospheric and oceanic circulation patterns. These circulation patterns are influenced by a variety of factors, including temperature gradients, differences in air pressure, and Earth’s rotation. Changes in these patterns can lead to alterations in weather patterns, precipitation distribution, and regional climate characteristics. One well-known example of dynamic climate change is the phenomenon of El Niño and La Niña, which involve changes in ocean temperatures and atmospheric pressure in the Pacific Ocean. These events can significantly impact local-to-global weather patterns, causing droughts, heavy rainfall, and other extreme weather events (Pfahl et al., 2017).

Dynamic changes impact the evolution of weather patterns and can modify the occurrence frequency of these patterns. This influence can occur through direct adjustments to the transition probabilities between different weather regimes, or indirectly by modifying the covariates that govern the progression of these weather regimes. In Steinschneider et al. (2019), a Niño 3.4 index is used to force weather regime evolution and is systematically adjusted to create more frequent El Niño and La Niña events. In Gupta et al. (in review), a 600-year-long sequence of tree-ring reconstructed principal components of weather regime occurrence is used as an alternative covariate to better capture the natural variability inherent in the weather regimes.

In the most recent version of the stochastic weather generator, we developed a novel non-parametric approach to simulating weather regimes, allowing for future dynamic change scenarios with altered (customized) weather regime probabilities. Suppose the historical time series of weather regimes is divided into distinct, consecutive, non-overlapping segments, each with a duration of D years, for a total of ND segments (here, D=4 and ND=18). In the non-parametric method, each segment (indexed as n=1 to ND) is assigned a sampling probability denoted as pn. To generate a new sequence of daily weather regimes spanning any desired number of years, the procedure involves resampling (with replacement) the nth D-year segment of daily weather regimes using the corresponding probability pn. This process is repeated until the required number of years of simulated weather regimes has been attained. If needed, the last segment can be trimmed to ensure the precise desired duration of simulated weather regimes.
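The resampling procedure can be sketched in a few lines of Python (an illustration of the logic described above, not the generator's actual R implementation; names and shapes are assumptions):

import numpy as np

rng = np.random.default_rng(42)
D, ND = 4, 18                        # segment length (years) and number of segments
p = np.full(ND, 1.0 / ND)            # sampling probabilities p_n (equal at baseline)

target_years = 1008
n_draws = int(np.ceil(target_years / D))
drawn = rng.choice(ND, size=n_draws, replace=True, p=p)   # indices of resampled segments
# Concatenate the daily weather regimes of the drawn segments in order,
# then trim the last segment so the simulation spans exactly target_years.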

In the baseline scenario for the weather generator with no dynamic climate change (only thermodynamic change), each segment is considered equally likely (i.e., no changes to large-scale circulation patterns).

However, the probabilities pn can be adjusted to alter the frequencies of each of the identified weather regimes in the final simulation, enabling the generation of dynamic climate change scenarios (i.e., scenarios in which the frequencies of different atmospheric flow patterns change compared to their historical frequencies). This is achieved using a linear program. The goal of the linear model (not shown) is to identify new sampling probabilities pn that, when used in the non-parametric simulation approach above, create a sequence of weather regimes with long-term average frequencies that approach some vector of target probabilities for those identified weather regimes.
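Since the post does not show the linear program itself, the conceptual sketch below (in Python, with made-up segment frequencies) illustrates one way such a problem can be posed: choose probabilities p_n that minimize the absolute deviation between the implied long-run weather regime frequencies and the targets.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
ND, K = 18, 10                                 # segments and weather regimes
F = rng.dirichlet(np.ones(K), size=ND)         # placeholder: regime frequencies per segment
target = F.mean(axis=0)
target[2] *= 1.3                               # e.g., a 30% increase in WR3
target /= target.sum()

# Variables: [p_1..p_ND, t_1..t_K]; minimize sum of deviations t_k with |F'p - target| <= t
c = np.concatenate([np.zeros(ND), np.ones(K)])
A_ub = np.block([[F.T, -np.eye(K)], [-F.T, -np.eye(K)]])
b_ub = np.concatenate([target, -target])
A_eq = np.concatenate([np.ones(ND), np.zeros(K)]).reshape(1, -1)  # probabilities sum to 1
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, 1)] * ND + [(0, None)] * K)
p_new = res.x[:ND]                             # adjusted sampling probabilities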

Lines 91-126 of ‘config.simulations.R’ show the user-defined changes that implement a non-parametric scenario with equal probabilities across the ten weather regimes (0: no change to the historical sequence of weather regimes), i.e., dynamic scenario #0, and a 30% increase in weather regime number three (a dry weather condition in California), i.e., dynamic scenario #1.

##Choose below whether through parametric or non-parametric way to create the simulated WRs ##
    use.non_param.WRs <- TRUE #{TRUE, FALSE}: TRUE for non-parametric, FALSE for parametric simulated WRs

    dynamic.scenario  <- 0 # {0, 1, 2}: 0: no dynamic change; 1: dynamic scenario #1 (30% increase in WR3); or 2: dynamic scenario #2 (linear trend)

    if (use.non_param.WRs){      #----------- 1+2 dynamic scenarios ----------#
      if (dynamic.scenario==0){
        ##===> Attempt #0 (thermodynamic only; no change to freq of WRs) ===##
        # #specify target change (as a percent) for WR probabilities
        WR_prob_change <- c(0,0,0,0,0,0,0,0,0,0) # between 0 and 1
        # #how close (in % points) do the WR frequencies (probabilities) need to be to the target
        lp.threshold <- 0.00001
        # #how much change do we allow in a sub-period sampling probability before incurring a larger penalty in the optimization
        piecewise_limit <- .02
        
      }else if(dynamic.scenario==1){
        ##===> Attempt #1 (dynamic scenario #1) ===##
        # #specify target change (as a percent) for WR probabilities (if, increasing WR3 in future)
        WR_prob_change <- c(0,0,.3,0,0,0,0,0,0,0) # between 0 and 1
        # #how close (in % points) do the WR frequencies (probabilities) need to be to the target
        lp.threshold <- 0.007
        # #how much change do we allow in a sub-period sampling probability before incurring a larger penalty in the optimization
        piecewise_limit <- .02
        
      }else if(dynamic.scenario==2){
        ##===> Attempt #2 (dynamic scenario #2) ===##
        # specify target change (as a percent) for WR probabilities (if, continuing their current trends in future)
        WR_prob_change <- c(-0.09969436,  0.27467048,  0.33848792,
                            -0.28431861, -0.23549986,  0.03889970,
                            -0.05628958, 0.38059153, -0.16636739, -0.17995965) # between 0 and 1
        # how close (in % points) do the WR frequencies (probabilities) need to be to the target
        lp.threshold <- 0.008
        # how much change do we allow in a sub-period sampling probability before incurring a larger penalty in the optimization
        piecewise_limit <- .02
      }
    }

Stochastic Weather Generation for Climate Change Scenarios

The stochastic weather generator is utilized to generate two climate change scenarios as defined above. The description of specific configurations for each scenario is listed as follows:

  • Thermodynamic Scenario: 3°C increase in temperature, 7% per °C increase in precipitation extremes, no change in average precipitation.
  • Dynamic Scenario: 30% increase in occurrence frequency of one weather regime only, labeled as ‘WR3’, which exhibits a ridge directly over the northwest US, i.e., blocking moisture flow over California, and resulting in dry conditions during the cold season there.

Thus, we generated 1008 years of simulated precipitation and temperature for each of the 12 sites in the Tuolumne River Basin in California (similar to Part 1) following these two scenarios. Below is a series of figures to better understand the impact of each scenario on the statistical distributions of precipitation and temperature and on important climate extremes at the basin scale.

The two figures below present the cumulative distribution function (CDF) of the generated scenario for precipitation (left) and temperature (right) based on the thermodynamic and dynamic change, respectively. The observed time series of precipitation and temperature across these 12 sites is also illustrated.

As seen above, although the 3°C warming is clearly manifested in the alteration of the simulated temperature’s CDF, it is hard to notice any drastic shifts in the overall CDF of the precipitation time series, as the bulk of the distribution has not been altered (remember that average precipitation remained constant, although its extreme quantile scaled up by ~22.5%).

Similarly, while the CDF of precipitation demonstrates a slight shift towards drier conditions, we notice a large shift in the tail of the temperature distribution.

Now, we go ahead and examine a set of important indices for climate risk assessment of water systems. The figure below presents the 1-day precipitation maxima derived from the generated scenario for precipitation based on the thermodynamic (left) and dynamic (right) change.

In the plot above for thermodynamic climate change, the median 1-day precipitation extreme at the basin scale across the entire synthetic weather generation shows a 25.5% increase in magnitude relative to the historical period, which is consistent with the event-based precipitation scaling of 22.5% at each site. In the dynamic climate change scenario, however, this metric remains almost unchanged.

Finally, the figure below shows the 5-year drought derived from the generated scenario for water-year precipitation totals, based on the thermodynamic (left) and dynamic (right) change.

The boxplots presented above for the thermodynamic scenario reveal a consistent median 5-year drought magnitude, as anticipated (there is no shift in the bulk of the average precipitation distribution). In contrast, the dynamic climate change scenario exhibits a substantial exacerbation, with the 5-year drought magnitude worsening by around 9% compared to the historical records.

There is plenty more to explore! The stochastic weather generator is suitable for quickly simulating long sequences of weather variables that reflect any climate change of interest. Keep an eye out for upcoming additions to the repository in the coming months, and do not hesitate to contact us or create a GitHub issue if you need clarification.

References

Gershunov, A., Shulgina, T., Clemesha, R.E.S. et al. (2019). Precipitation regime change in Western North America: The role of Atmospheric Rivers. Scientific Reports, 9, 9944. https://doi.org/10.1038/s41598-019-46169-w.

Gupta, R.S., Steinschneider S., Reed, P.M. Understanding Contributions of Paleo-Informed Natural Variability and Climate Changes on Hydroclimate Extremes in the Central Valley Region of California. Authorea. March 13, 2023. https://doi.org/10.22541/essoar.167870424.46495295/v1

Najibi, N., and Steinschneider, S. (2023). Extreme precipitation-temperature scaling in California: The role of Atmospheric Rivers, Geophysical Research Letters, 50(14), 1–11, e2023GL104606. https://doi.org/10.1029/2023GL104606.

Pendergrass, A.G., and Hartmann, D.L. (2014). Two modes of change of the distribution of rain, Journal of Climate, 27(22), 8357-8371. https://doi.org/10.1175/JCLI-D-14-00182.1

Pfahl, S., O’Gorman, P.A., Fischer, E.M. (2017). Understanding the regional pattern of projected future changes in extreme precipitation, Nature Climate Change, 7 (6), 423-427. http://dx.doi.org/10.1038/nclimate3287

Steinschneider, S., Ray, P., Rahat, S. H., & Kucharski, J. (2019). A weather‐regime‐based stochastic weather generator for climate vulnerability assessments of water systems in the western United States. Water Resources Research, 55(8), 6923-6945. https://doi.org/10.1029/2018WR024446

The Reed Group Lab Manual for efficient training & open science

About a year ago, the Reed Group began to develop a group Lab Manual. The motivation of the Lab Manual is to facilitate efficient training of new students, postdocs, and collaborators on everything from basic programming and high performance computing to synthetic streamflow generation and multi-objective reservoir operations. The Lab Manual is also meant to facilitate the sharing of open science tools more broadly with other researchers and practitioners. The Lab Manual is designed to be synergistic and interactive with the Water Programming Blog, which has been advancing similar goals for 12 years and counting.

In this blog post, I will give an introductory tour of the structure and content of the Reed Group Lab Manual. This project is fully open source – you can find the lab manual hosted on GitHub Pages here, as well as the underlying code repository on GitHub here. Previous blog posts have described the technical details behind creating the lab manual with Jupyter Books and automating build and deployment with GitHub Actions, as well as detailed code snippets for two different sections of the Figure Library (post 1, post 2). The present post, on the other hand, will provide a higher level tour of the Lab Manual contents without going into code-level detail.

The Lab Manual currently contains 8 sections as laid out in the left side navigation bar: (1) Graduate Student Resources, (2) Computational Resources, (3) Software, (4) Training, (5) Water Programming Blog Post Catalog, (6) List of Paper Repositories, (7) Figure Library, and (8) Contributing to this Manual. In the remainder of this post I will give a brief overview of each section.

Graduate Student Resources

The first section is meant to provide helpful resources for new students and scholars joining the Reed Group. This includes subsections on finding housing in Ithaca and a checklist for visiting scholars. It also includes links on important courses at Cornell and conferences to consider attending. Lastly, it includes tips, resources, and templates for designing presentations, writing papers, and staying organized.

Note that a few of the resources in this section (e.g., course list, presentations) link to private resources that are only available to Reed Group members and collaborators. The rest of the site is generally designed to be fully open access.

Computational Resources

The next section is designed to bring new researchers up to speed on topics in programming, computing, and other related topics. There are pages introducing researchers to basic Python, Linux, and Git/GitHub. There are also pages on more advanced topics of machine learning, citation/reference management, high performance computing, and Python-based website deployment (such as this Lab Manual website). These pages provide some original content but also rely heavily on links to previous Water Programming Blog posts and other online resources. This section also provides access information for The Cube and Hopper, two private clusters at Cornell University that Reed Group members have access to.

Software

Next, the Software section provides information on various software tools that past and present Reed Group members have contributed to. This includes software for multi-objective evolutionary computation and exploratory modeling (Borg MOEA, Rhodium, MOEA Framework), water resources management modeling (WaterPaths, Pywr-DRB, CALFEWS), sensitivity analysis (SALib), and high-dimensional visualization (J3). Each page contains links to the software itself, along with tutorials, Water Programming Blog posts, academic papers, and other code repositories that rely on the software.

Training

The Training section satisfies a core purpose of the Lab Manual: training new students, postdocs, and collaborators on core competencies needed by Reed Group members. Topics include key methodologies that underlie much of our research (e.g., MOEAs, MORDM, sensitivity analysis, scenario discovery, synthetic generation) as well as important test problems (e.g., the shallow lake problem, the fisheries game) and software (e.g., Borg MOEA, WaterPaths).

These pages build on many years of formal and informal training exercises developed by past and present Reed Group members. Each page uses a common template structure that introduces the motivation and learning objectives, the necessary software installations, the prerequisite training prior to starting the current training, and the detailed sequence of training activities ranging from literature review to model execution and analysis.

Water Programming Blog Post Catalog

To complement the training exercises in the previous section, we have also embedded a searchable table of Water Programming Blog posts within the lab manual. This allows trainees to quickly find additional helpful material that may not be directly linked to their training exercises. Note that the main page of the Blog also has its own search functionality. The search tools in the lab manual vs the blog can return different results for the same search term, so it can be helpful to try both if you are looking for posts on a particular topic.

List of Paper Repositories

The Lab Manual also contains a list of important Reed Group papers and their relevant software repositories. This can be helpful for students or other researchers who are looking to replicate or build on the workflows in those papers.

Figure Library

Next, we have a Figure Library. The purpose of this section is to facilitate collaborative development and sharing of common types of figures that Reed Group members tend to use often. As discussed in my recent blog post, there are many tools available for creating figures such as parallel coordinates plots, but they often lack the level of flexibility and detail required to create high-quality figures for presentations and papers. Thus many Reed Group members and other researchers have built their own custom visualization tools from scratch. The purpose of the Figure Library is to create a bank of such tools so that we can build off of each other’s work rather than reinventing the wheel.

Because the Lab Manual is built with Jupyter Books, each page can be built using either Markdown or Python-based Jupyter Notebooks. The latter is an excellent option for the Figure Library as it allows for seamless integration of text, code blocks, and code output such as figures.

Contributing to this Lab Manual

The last section gives detailed instructions for those who want to contribute to the Lab Manual itself. This process is fairly straightforward for those who are comfortable with basic Git-based version control. Much of the complexity of building and deploying the site is done behind the scenes using GitHub Actions and GitHub Pages each time a new commit is pushed to the Main branch of the repository. This section also provides general examples for how to build pages in Markdown and Jupyter Notebooks, as well as a template that is used to ensure consistency across Training pages. Lastly, there is an FAQ page where contributors can describe helpful fixes to problems that they have experienced when working on the Lab Manual.

Conclusion

Overall, we hope that this Lab Manual will be a valuable resource for current and future Reed Group members and collaborators. This should improve the efficiency of training new researchers while easing the burden on both trainers and trainees. In addition, we hope that it is helpful for other researchers who want to learn about our suite of methodologies and software, as well as to other research groups looking to develop their own open source lab manuals.

It is worth noting that the Reed Group Lab Manual is meant to be a living document that will continue to evolve as current and future group members contribute new content or update old content to reflect the changing state-of-the-art. What I have described in this blog post simply reflects the current iteration of this Lab Manual roughly one year from when we began this project.