Welcome to our blog!

Welcome to Water Programming! This blog is by Pat Reed’s group at Cornell, who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software:  Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email dfg42 “at” cornell.edu. You’ll need a WordPress.com account.

Structuring a Python Project: Recommendations and a Template Example

Motivation:

The start of a new year is a good (albeit, relatively arbitrary) time to reassess aspects of your workflow.

I, like many people, taught myself Python by jumping into different projects. The consequence of this ad-hoc learning was that I did not learn some of the fundamentals until much later in my project development.

At the end of the Fall ’23 semester, I found myself spending a lot of time cleaning up repositories and modules that I had constructed in the preceding months. I ended up restructuring and reorganizing significant portions of the code base, implementing organizational practices that I had learned after its conception.

This post is intended to save the reader from making the same mistake. Here, I present a recommended structure for new Python projects, and discuss the main components. This is largely targeted at Python users who have not had a formal Python training, or who are working with their first significantly sized project.

I provide an example_python_project, hosted on GitHub, as a demonstration and to serve as a template for beginners.

Content:

1. Recommended structure
2. Example project repository
3. Project components (with example project)
    1. Modules, packages, __init__.py
    2. Executable
    3. README
    4. Documentation
    5. Tests
    6. Requirements
    7. LICENSE
4. Conclusion

Recommended project structure

I’ll begin by presenting a recommended project structure. Further down, I provide some explanation and justification for this structure.

My example_python_project follows recommendations from Kenneth Reitz, co-author of The Hitchhiker’s Guide to Python (1), while also drawing from a similar demo-project, samplemod, by Navdeep Gill (2).

The project folder follows this structure:

example_python_project/
├── sample_package/
│   ├── subpackage/
│   │   ├── __init__.py
│   │   └── subpackage_module.py
│   ├── __init__.py
│   ├── helpers.py
│   └── module.py
├── docs/
│   └── documentation.md
├── tests/
│   ├── __init__.py
│   └── basic_test.py
├── main.py
├── README.md
├── requirements.txt
└── LICENSE


If you are just starting a project, it may seem unnecessarily complex to begin with so much modularity; it may seem easier to open a .py file and start freewheeling. However, there are several reasons why it is important to take care when initially constructing a Python project. Some of these reasons include:

1. Maintenance: A well-structured project makes it easier to understand the code, fix bugs, and add new features. This is especially important as the project grows in size and complexity.
2. Collaboration: When working on a project with multiple developers, a clear structure makes it easier for everyone to understand how the code is organized and how different components interact with each other.
3. Scalability: A well-structured project allows you to scale up the codebase, adding new features and sub-components, without making it hard to understand or maintain.
4. Testing: A well-structured project makes it easier to write automated tests for the code. This helps to ensure that changes to the code do not break existing functionality.
5. Distribution: A well-structured project makes it easier to distribute the code as a package. This allows others to easily install and use the code in their own projects.

Overall, taking the time to structure a Python project when starting can save a lot of time and heartache in the long run, by making the project easier to understand, maintain, and expand.

Example Project Repository

The repository containing this example project is available on GitHub here: example_python_project.

The project follows the recommended project structure above, and is designed to use modular functions from the module, helpers, and subpackage_module. It is intended to be a skeleton upon which you can build up your own project.

If you would like to experiment with your own copy of the code, you can fork a copy of the repository, or Download a ZIP version.

Project overview

The project is a silly riddle program with no real purpose other than illustrating the structure of the project. The bulk of the work is done in the main_module_function(), which first prints a riddle on the screen, then iteratively uses the helper_function() and subpackage_function() to try to “solve” the riddle. Both of these functions simply return a random True/False, and are repeatedly called until the riddle is solved (when status == True).

Below is a visual representation of how the different functions are interacting. The green-box functions are contained within the main sample_package, while the blue-box function is stored in the subpackage.
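The interaction described above can be sketched in a few lines. To be clear, this is my own minimal reconstruction, not the code from the repository: the function names match the post, but the internal logic (random coin-flip “solvers” called in a loop) is an assumption.

```python
import random

def helper_function():
    # Stand-in for the helper in sample_package/helpers.py:
    # randomly "solves" or fails to solve the riddle.
    return random.choice([True, False])

def subpackage_function():
    # Stand-in for the solver in the subpackage.
    return random.choice([True, False])

def main_module_function():
    print("What runs but has no feet, roars but has no mouth?")
    status = False
    while not status:
        # Keep trying the two solvers until one returns True.
        status = helper_function() or subpackage_function()
    return status

solved = main_module_function()
```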

The program can then be executed from a command line using the main.py executable:

C:\<your-local-directory>\example_python_project> python main.py


The output will first print out the riddle, then print statements indicating which functions are being used to “solve” the riddle. This is simply a means of demonstrating how the different functions are being activated, and not necessarily a recommended “Best Practice”.

A normal output should resemble something similar to the below, although there may be more or less print statements depending upon how many times it takes the random generator to produce a “True” solution:

Here is a riddle, maybe sample_package can help solve it:

What runs but has no feet, roars but has no mouth?

Lets see if the helper can solve the riddle.
The helper_function is helping!
The helper could not solve it.
Maybe the subpackage_module can help.
The subpackage_function is being used now.
The subpackage solved it, the answer is "A River"!


Project components

Modules, packages, __init__.py, oh my!

Before going any further, I want to take time to clarify some vocabulary which is helpful for understanding component interactions.

Module:
A module is simply a file ending in .py, which contains functions and or variables.

Package:
A package is a collection of modules (.py files) which relate to one another, and which contains an __init__.py file.

__init__.py
Inclusion of an __init__.py file (pronounced “dunder init”) within a folder indicates to Python that the folder is a package. Often, the __init__.py module is empty; however, it can be used to import other modules or functions, which are then stored in the package namespace and made available for later use.

For example, in my sample_package/__init__.py, I import all contents of the module.py and subpackage_module.py:

# Import the all functions from main and sub modules
from .module import *
from .subpackage.subpackage_module import *


This allows all of the functions stored within module to be called from sample_package directly, rather than specifying the nested sub-structure needed to reach them. For example, by including from .subpackage.subpackage_module import *, I am able to run:

# IF __init__ imports all content from main and sub modules then you can do this:
import sample_package
sample_package.subpackage_module_function()


Rather than requiring the following fully-nested call, which is necessary when the __init__.py is empty:

# IF __init__ is EMPTY, then you need to do this:
import sample_package
sample_package.subpackage.subpackage_module.subpackage_module_function()


Notably, an __init__.py is not strictly necessary to use modules and functions within a folder; however, customizing the imports in the package’s __init__.py gives you more control over how your project is used. As the project increases in complexity, strategic use of imports within __init__.py can keep your main executable functions cleaner.
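To see the effect end-to-end, here is a small self-contained sketch. It writes a throwaway package to a temporary directory (the package and function names here are illustrative, not from the example project) and shows that a re-export in __init__.py makes the nested function reachable at the top level:

```python
import os
import sys
import tempfile

# Build a tiny demo package on disk: demo_pkg/subpkg/mod.py
pkg_root = tempfile.mkdtemp()
os.makedirs(os.path.join(pkg_root, "demo_pkg", "subpkg"))

with open(os.path.join(pkg_root, "demo_pkg", "subpkg", "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(pkg_root, "demo_pkg", "subpkg", "mod.py"), "w") as f:
    f.write("def sub_function():\n    return 'hello from subpackage'\n")

# The top-level __init__.py re-exports the nested function
with open(os.path.join(pkg_root, "demo_pkg", "__init__.py"), "w") as f:
    f.write("from .subpkg.mod import *\n")

sys.path.insert(0, pkg_root)
import demo_pkg

# Thanks to the re-export, no nested path is needed:
print(demo_pkg.sub_function())
```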

Executables

So, you’ve crafted a Python project with a sleek, modular package design. The next step is to set up a single file which will execute the package.

Inclusion of a single executable has the benefit of providing a single-entry point for other users who want to run the program without getting lost in the project.

In the example_python_project, this is done with main.py:

# Import the main package
import sample_package

def run():
    solved = sample_package.main_module_function()
    return solved

# Run the function if this is the main file executed
if __name__ == "__main__":
    run()


The program can then be executed from a command line:

C:\<your-local-directory>\example_python_project> python main.py


README

The README.md file is typically someone’s first encounter with your project. This is particularly true if the project is hosted on GitHub, where the README.md is used as the home-page of a repository.

A README.md file should include, at minimum, a brief description of the project, its purpose, and clear instructions on how to use the code.

Often, README files are written in Markdown, which includes simple text-formatting options (you can find a basic Markdown Cheat Sheet here). reStructuredText is also common, and even plain .txt files may be suitable.
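For reference, a minimal README.md skeleton might look like the following (all contents here are illustrative placeholders, not taken from the example project):

```markdown
# example_python_project

A brief, one-sentence description of what the project does and why.

## Installation

    pip install -r requirements.txt

## Usage

    python main.py

## License

Distributed under an open-source license; see `LICENSE` for details.
```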

Documentation

Great code requires great documentation. Initializing a new project with a dedicated docs/ folder may help hold you accountable for documenting the code along the way.

For information on how to use Sphinx and reStructuredText to create clean webpage-based documentation, you can see Rohini Gupta’s post on Using Python, Sphinx, and reStructuredText to Create a Book (and Introducing our eBook: Addressing Uncertainty in Multisector Dynamics Research!).

Tests

Bugs aren’t fun. They are even less fun when code that was bug-free yesterday contains bugs today. Implementing automated tests in your project can help verify functionality throughout the development process and catch bugs as they arise.

It is recommended to implement unit tests which verify individual components of the project. These tests should assert that function outputs align with expectations. As you develop your project in a modular way, you can progressively add new tests, then run the full test suite before sharing or pushing the project to others.

A standard Python installation comes with the unittest package, which provides a framework for these tests. I provide an example test below, but a deeper dive into the unittest framework may deserve a dedicated future post.

In the example_python_project, I include the basic_test.py to verify that the solution generated by main_module_function() is True using the unittest package:

import sample_package
import unittest

# Define a test suite targeting specific functionality
# Define a test suite targeting specific functionality
class BasicTestSuite(unittest.TestCase):
    """Basic test cases."""

    def test_that_riddle_is_solved(self):
        solved = sample_package.module.main_module_function()
        self.assertTrue(solved)

if __name__ == '__main__':
    unittest.main()


Running the basic_test module from the command line produces an “OK” if everything runs smoothly; otherwise, it will report which tests are failing.

----------------------------------------------------------------------
Ran 1 test in 0.004s
OK


Currently, the example_python_project requires the basic_test module to be executed manually. To learn more about automating this process, you can see Andrew Dirck’s 2020 post: Automate unit testing with Github Actions for research codes.

Requirements

The requirements.txt is a simple text file which lists the dependencies, i.e., the packages that are required to run the code.

This can be particularly important if your code requires a specific version of a package, since the package version can be specified in the requirements.txt. Specifying a particular package version (e.g., numpy==1.24.1) can improve the reliability of your code, since different versions of these packages may operate in different ways in the future.

Here is an example of what might be inside a requirements.txt if the numpy and pandas packages are necessary (note that built-in modules such as random ship with Python itself and do not belong in requirements.txt):

numpy==1.24.1
pandas==1.5.3


Users can easily install all the packages listed in requirements.txt using the command:

pip install -r requirements.txt
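
As an aside, if your working environment already contains everything your project needs, pip can also generate a pinned requirements.txt for you. Note that pip freeze records every package installed in the current environment, so this works best inside a dedicated virtual environment:

```shell
# Record the exact version of every installed package into requirements.txt
pip freeze > requirements.txt
```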


License

I’ll keep this section brief, since I am far from legally qualified to comment much on licensing. However, general advice suggests that if you are sharing code publicly, it is safest to include a license of some sort.

Inclusion of an open-source license allows other users to comfortably use and modify your code for their own purposes, allowing you to contribute to and benefit the broader community, while also protecting the original author from future liabilities associated with its use by others.

The GNU General Public License is one of the most common open-source licenses; if you would like to know more about the different options, you can find some guidance here: https://choosealicense.com/

Conclusions

If you are an experienced Python user, there may not be anything new for you here, but at the least I hope it serves as a reminder to take care in your project design this year.

Additionally, this is likely to be one part in a multi-part Introduction to Python series that I will be writing for future members of our research group. With that in mind, check back here later this spring for the subsequent parts if that interests you.

Best of luck!

References

(1) Reitz, K., & Schlusser, T. (2016). The Hitchhiker’s guide to Python: best practices for development. ” O’Reilly Media, Inc.”.
(2) Navdeep Gill. 2019. samplemod. https://github.com/navdeep-G/samplemod. (2023).

Ringing in the year with a lil’ bit of Tkinter

Happy 2023! For the first post of the new year, we will be learning how to make a simple graphic user interface (GUI) using a nifty Python package called Tkinter, the default interface framework built into the Python standard library. Today, I will demonstrate a few Tkinter functions that I recently found useful to achieve the following tasks:

1. Setting up the GUI window
2. Partitioning the window into frames
3. Creating and updating user entry spaces
4. Creating buttons
5. Creating and launching popup windows
6. Creating dropdown lists

At the end of this post, I will use the tools from all six tasks to create a GUI that enables the user to generate a universal input file that, in turn, is used as input to a Python script that runs WaterPaths (Trindade et al., 2020). WaterPaths is an open-source water portfolio and infrastructure investment pathways management and planning simulation tool. The GitHub repository that includes both WaterPaths and the GUI Python script can be found here. Documentation for Tkinter can be found here if you are interested in further details on GUI-making.

Before beginning, make sure Tkinter is available on your machine. Tkinter ships with most standard Python installations, so no pip install is needed; if it is missing (as on some minimal Linux distributions), it can be installed through your system package manager, for example:

sudo apt-get install python3-tk


Now, let’s get started!

Setting up the GUI window

As per the Python tradition, using any library requires that we import it into our main Python script:

from tkinter import *
from tkinter import font


Next, we will initialize the main window:

# create the main window
root = Tk()
root.title("WaterPaths")
# load an icon to use
p1 = PhotoImage(file = 'wpaths.png')
root.iconphoto(False, p1)
# set the initial size of the window, allow it to be resizeable
root.geometry("850x780")
root.resizable(True, True)


The Tk() function initializes the main GUI window, and the title() function names it. This is the window within which we will create two frames. The GUI also has an icon, loaded with PhotoImage() and attached with iconphoto(), that will appear in the taskbar when the GUI is launched. We set the initial size of the window with the geometry() function, and allow the user to resize the window to better suit their preferences by enabling both its width and height to vary with the resizable() function.

Partitioning the window into frames and understanding the grid system

Before exploring more functions, let’s take a look at how the Tkinter grid system works:

The system is fairly intuitive: a window consists of frames, and each frame is structured using rows and columns into which widgets (entry spaces, labels, dropdown lists, buttons, etc.) can be placed. Columns and rows do not have to be explicitly added to a frame, and a frame’s number of rows and columns does not have to be pre-allocated. However, a widget’s position should be specified when adding it to a frame.

Now, let’s use the LabelFrame() function to create a frame within the window. This function has a couple of useful parameters as shown in the code snippet below:

frame_setup_model = LabelFrame(root, text="1-Setup WaterPaths",
                               bg='lightcyan', font=('HelvLight', 20), width=800, height=400)


Here, we create a frame within the main root window and title it ‘1-Setup WaterPaths’. We set the frame’s background color to light cyan, specify the font type and size of its title, and set its width and height.

Next, we specify the position of the frame:

frame_setup_model.grid(row=0,column=0,padx=10,pady=10, sticky="ew")
frame_setup_model.grid_propagate(True)


Using the grid() function, we place the frame in the first row and column of the main window, and set it to stretch across the available width using the sticky parameter. We then use the grid_propagate() function to ensure that the frame changes size when the window is resized.

We now have a functioning frame – let’s populate it with some widgets!

Creating widgets

We will be covering three types of widgets today: user entry spaces, buttons, and dropdown lists. Each widget should be labeled using the Label() function:

data_dir_label = Label(frame_setup_model, text="Data directory", justify=LEFT,
                       bg='lightcyan').grid(sticky=W, row=1, column=0)


In the code snippet above, we create a label prompting the user to enter the folder’s full filepath in the adjacent entry space (demonstrated immediately below). We name this label ‘Data directory’, left-justify it, and ensure that the grid cell containing it has the same background color as the frame that hosts it. Using the grid() function, we place it in the second row of the first column.

User entry spaces

The first widget we’ll use is the user entry widget, initialized using the Entry() function, shown below:

# Create and place the user entry widget
data_dir = Entry(frame_setup_model, width=60)
data_dir.grid(row=1, column=1, sticky=W)
# Insert a default value in the entry space
data_dir.insert(0, '/home/fs02/pmr82_0001/lbl59/Implementation_Uncertainty/WaterPaths_duReeval/')


We place the widget in the second row and second column of the frame we first created. The insert() function then provides the option to enter a default value in the entry space, which can later be changed by the user. If this function is not used, the entry space will appear blank when the GUI is launched.

Buttons

Next, the button widget can be setup using the Button() function. This function requires that it be associated with another function written in the same script, as it launches that associated function. An example is shown below:

# os is used to issue shell commands (imported at the top of the script)
import os

def install_req():
    '''
    Installs all Python libraries required to run the GUI
    and runs the WaterPaths makefile.
    '''
    os.system("pip install -r requirements.txt")
    os.system("make gcc")
    text_out = "Program requirements installed. Do not run again."

    open_popup(text_out)

reqs_button = Button(frame_setup_model, text="Install libraries", padx=10, pady=5, command=install_req, fg='darkslategrey',
                     bg='lightblue', font=['HelvLight', 12, 'bold']).grid(row=13, column=0, sticky='we')


Here, we create a button widget called ‘Install libraries’ that, when clicked, runs the install_req() function defined immediately above it. There are two new parameters here that we have not seen previously: the fg parameter allows the user to specify the font color, and the command parameter links the install_req() function to the button we are creating.

Popup windows

In the install_req() function we see above, notice the internal function called open_popup() that takes a string parameter. This function is defined as follows:

def open_popup(text_out):
    '''
    Opens a popup window
    '''
    top = Toplevel(root)
    top.geometry("600x120")
    top.title("WaterPaths")
    Label(top, text=text_out, justify=CENTER).place(x=10, y=10)


This function uses Tkinter’s Toplevel() function to open a popup window that is 600-by-120 pixels in size and titled ‘WaterPaths’. The message contained within the popup window is specified by the user, and is placed at the (10,10) position within the popup box.

Dropdown lists

The last widget that we will learn to make before putting everything together is the dropdown widget using the OptionMenu() function. The implementation is shown below:

# Define a list of options
mode_dropdown = ["Optimize", "Re-evaluate", "Reduced"]
mode = StringVar()
# Set the default value
mode.set("Reduced")

# Create the dropdown label
mode_select_label = Label(frame_setup_model, text="Select WaterPaths run mode", anchor="w", justify=LEFT,
                          bg='lightcyan').grid(sticky=W, row=9, column=0)
# Create and position the dropdown widget
mode_select = OptionMenu(frame_setup_model, mode, *mode_dropdown)
mode_select.grid(row=9, column=1, columnspan=1, sticky='W')
mode_select.config(width=10, font=['HelvLight', '10', 'normal'], bg='lightcyan')


We first create a list of options called mode_dropdown. In this case, the list contains options for the modes in which WaterPaths can be run. We then create a string variable using the StringVar() function to indicate that the value of the widget is variable and contingent upon the user’s choice. We also set a default value for the widget.

Next, we define a label for the dropdown widget called “Select WaterPaths run mode”. Directly to the right of the label, we use the OptionMenu() function to create the dropdown widget and associate it with the mode string variable, whose value is drawn from the mode_dropdown list. We finalize the creation of the dropdown list by positioning it within its grid cell.

Now that we’ve covered some of the fundamental Tkinter tools, let’s create a simple GUI!

The WaterPaths Input File GUI

Before continuing, note that the WaterPaths GUI is meant to be used with an HPC resource running a Linux interface. While you will be able to view the GUI if it is run on a personal machine, it is only fully functional when paired with WaterPaths on an HPC resource.

Putting together the tools shown above, we can structure a GUI that looks like this:

You can view this GUI by entering the following into the command line:

python WaterPaths_GUI.py


In the first frame (1-Setup WaterPaths), you can enter the path of the main WaterPaths directory and the location where the solutions are stored. You can also specify the number of deeply uncertain states of the world (DU SOWs) you would like to explore, the number of hydroclimatic realizations you would like to pair with these SOWs, and the number of solutions you would like to run. If you would like to run one specific solution, enter its number in the ‘First solution to run’ space and that number plus one in the ‘Last solution to run’ space.

The dropdown lists enable you to choose between generating (export) or using existing (import) risk of failure (ROF) tables. You can also choose to generate and use ROF tables as you run the WaterPaths simulation (do nothing); this option is not recommended for large experiments of more than 10 DU SOWs, as ROF table generation is computationally expensive.

In addition, you can select whether to run WaterPaths in optimization mode using the Borg MOEA (Optimize) or to perform DU Re-evaluation (Re-evaluate). You can also choose to run a short demonstrative WaterPaths simulation (Reduced) that should take approximately 2 minutes. Note that the ‘Optimize’ option should only be used if Borg is installed; if you would like to download Borg to explore optimization, please submit a download request here.

If this is the first time you are running WaterPaths, please click on ‘Install Libraries’, which will install all libraries required to run WaterPaths and run its Makefile. You only have to click this button once, immediately after downloading WaterPaths. The ‘Setup WaterPaths’ button writes all your input in the entry spaces and dropdown lists into a text file called input_parser_file.txt, which is read by the run_waterpaths.py script found in the same repository.

In the second frame (2-Optimization/Re-evaluation), you can enter the number of threads, nodes, and tasks that your WaterPaths simulation requires. Further explanation of these terms can be found in Cornell CAC’s glossary of HPC terms. You can also modify the output frequency of your function evaluations and the number of seeds that you would like to run. This step is optional and should only be completed if you selected the ‘Optimize’ option in the first frame. Clicking on the ‘Make input file’ button will append these new entries to the input_parser_file.txt file.

As previously hinted at, this GUI links to two external scripts:

1. run_waterpaths.py: This Python script takes and interprets the input from the input_parser_file.txt file and generates the command to be submitted to the Simple Linux Utility for Resource Management (SLURM) job scheduler on the HPC resource that you are using. A more in-depth explanation of how SLURM works can be found in Dave Gold’s post here.
2. run_waterpaths.sh: This Bash script submits the command from run_waterpaths.py to SLURM. The parameters on Line 2 should match the parameters entered under ‘Enter HPC submission requirements’ in the second frame.

Both of these scripts can be found in the linked GitHub repository. If you have successfully run the GUI, you should generate an input_parser_file.txt that looks a little like this:

Conclusion

In this post, we walked through tools to initialize, structure, and implement a simple Python GUI. We then applied these tools to generate a GUI for WaterPaths. Hope this was a useful tidbit for the new year, and congratulations on learning a new skill!

Animating figures with Manim: A citation network example

Grant Sanderson’s 3blue1brown channel has amassed almost 5 million YouTube followers by explaining complicated mathematical topics (e.g., linear algebra, neural networks, disease modeling) with clear and compelling visual storytelling. The channel’s success is due in no small part to Manim, the open-source Python-based “Mathematical Animation Engine” developed by Sanderson to create his stunning videos.

This blog post will demonstrate how to use the Manim engine to turn your figures into slick, animated GIFs. As an example, I will visualize co-authorship clusters within a network of literature.

Setup

Two versions of Manim are available: the original repository developed and maintained by Grant Sanderson, and an expanded and more stable Community Edition maintained by a larger group of users. For the rest of this post, I will refer exclusively to the latter. Manim is simple to install using basic Python package managers such as conda and pip; see this page in the docs for details. I also highly recommend looking through the detailed tutorials and example gallery in the Manim Community docs, as well as this blogpost by Khuyen Tran on Towards Data Science. These resources provide more details on the basics of Manim and a broader view of what is possible – for this post, I will focus more narrowly on the example of visualizing a citation network.

All code for this blogpost can be found in this GitHub repository.

Co-authorship network data

In a previous blogpost, Rohini Gupta demonstrated how to generate a database of literature using the Dimensions tool and visualize co-authorship networks using the VOSviewer tool. The resulting interactive VOSviewer visualization can be found on the Reed Group website. This co-authorship network can be downloaded as a JSON file by clicking “save” in the top right hand corner.

Opening the JSON file, we see a nested set of key:value pairs, similar to a dictionary in Python. The “network” consists of two lists called “items” and “links”. Each element of “items” is an author, with details on how many publications they have authored that cite the Borg MOEA, and how many citations those publications have. Each element of “links” is a connection between two authors, signifying that they have co-authored at least one publication together that cites the Borg MOEA.

{
    "network": {
        "items": [
            {
                "id": 2,
                "label": "abdo, gamal m.",
                "x": 0.9606,
                "y": 0.2704,
                "cluster": 5,
                "weights": {
                    "Documents": 1,
                    "Citations": 38,
                    "Norm. citations": 1.5776
                },
                "scores": {
                    "Avg. pub. year": 2018,
                    "Avg. citations": 38,
                    "Avg. norm. citations": 1.5776
                }
            },
            ...
        ],
        "links": [
            {
                "source_id": 2,
                "target_id": 194,
                "strength": 1
            },
            ...
        ]
    }
}
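
To confirm the structure before plotting, we can navigate it with Python’s built-in json module. The snippet below uses an inline, trimmed stand-in for the exported file (one author and one link, paraphrased from the listing above) and shows the id-lookup pattern used later when drawing links:

```python
import json

# Trimmed stand-in for the VOSviewer export (illustrative, not the full file)
raw = """
{"network": {
    "items": [
        {"id": 2, "label": "abdo, gamal m.", "x": 0.9606, "y": 0.2704,
         "cluster": 5, "weights": {"Documents": 1, "Citations": 38}}
    ],
    "links": [
        {"source_id": 2, "target_id": 194, "strength": 1}
    ]
}}
"""

network = json.loads(raw)["network"]
nodes = network["items"]
links = network["links"]

# Match a link back to its author record, as done when drawing each line:
source = next(node for node in nodes if node["id"] == links[0]["source_id"])
print(source["label"], "-", source["weights"]["Citations"], "citations")
```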


Static visualization with Manim

In order to get a feel for the data and for how Manim’s geometric objects work, let’s create a static visualization (i.e., no animation) of the co-authorship network. First, we import libraries, load a colormap, and import the JSON file into lists of nodes and links.

import json
from manim import *
from matplotlib import cm
from matplotlib.colors import rgb2hex
import numpy as np

cmap = cm.get_cmap('viridis')

### open JSON file of VOSviewer citation network
with open('VOSviewer-network.json', 'r') as f:
    vos = json.load(f)['network']
nodes = vos['items']
links = vos['links']


Now we can create a new class defining the static citation network visualization, inheriting from Manim’s basic Scene class.

### create animation class for static citation network, inheriting from basic Scene Manim class
class CitationNetworkStatic(Scene):
    def construct(self):
        ax = Axes(
            x_range=[-1.1, 1.2], y_range=[-1.1, 1.1], axis_config={"include_tip": False}
        )
)


Next, we can draw each link in the dataset using Manim’s Line class. The x and y locations are taken directly from VOSviewer’s visualization of the network.

        ### create link for each citation, add line to animation
        for link in links:
            source = [node for node in nodes if node['id'] == link['source_id']][0]
            target = [node for node in nodes if node['id'] == link['target_id']][0]
            line = Line(ax.coords_to_point(source['x'], source['y']), ax.coords_to_point(target['x'], target['y']), color=GREY)
            line.set_opacity(0.6)
            self.add(line)


Once we have created the Line object (line 5) and modified its opacity (line 6), we add it to the figure (line 7).

The next step is to loop over the “clusters” assigned by VOSviewer and plot each author within each cluster as a Manim Circle object. The authors in each cluster can be grouped together using the Manim Group class. I chose to color each circle based on its cluster and size each circle based on how many times the authors’ Borg-citing papers have been cited. I also keep track of the most-cited author within each cluster for labeling purposes. Note that the clustering and labeling methodology here were chosen purely for convenience; there are many other ways to define “influence” in citation networks, but the purpose of this exercise is simply to demonstrate the visualization technique.

        ### create circle for each node, colored by cluster and sized by citations.
        ### create group for each cluster, labeled by largest node. add all circles to animation.
        nodegroups = []
        nodegrouplabels = []
        nodegroupweights = []
        for cluster in range(1, 50):
            nodelist = [node for node in nodes if node['cluster'] == cluster]
            color = rgb2hex(cmap(cluster/25))
            nodeweights = [node['weights']['Citations'] for node in nodelist]
            largestlabel = [node['label'] for node in nodelist if node['weights']['Citations'] == max(nodeweights)]
            largestweight = [node['weights']['Citations'] for node in nodelist if node['weights']['Citations'] == max(nodeweights)]
            nodegrouplabels.append(largestlabel)
            nodegroupweights.append(largestweight)
            nodegrouplist = []
            for node in nodelist:
                circle = Circle(color = color)
                circle.set_fill(color, opacity = 0.8)
                circle.move_to(ax.coords_to_point(node['x'], node['y']))
                circle.scale(np.log10(node['weights']['Citations']+2)/15)
                self.add(circle)
                nodegrouplist.append(circle)
            nodegroup = Group(*nodegrouplist)
            nodegroups.append(nodegroup)


Lastly, we can loop over the clusters and label each with the most “influential” author within that cluster, using Manim’s Text class. The size of each author’s name, like the size of the markers, is set based on total citations within each author’s Borg-citing publications.

        ### add text for central author in each cluster
        order = np.argsort(nodegroupweights)[::-1]
        for i in order:
            nodegroup = nodegroups[i]
            if len(nodegroup) > 0:
                text = Text(nodegrouplabels[i][0]).set_color(WHITE).move_to(nodegroup).scale(np.log10(nodegroupweights[i][0]+10) / 8)
                self.add(text)


With the CitationNetworkStatic class now fully defined in the scene_vosviewer.py file, we can create the Manim visualization by running the following simple command from the command prompt.

manim -qh scene_vosviewer.py CitationNetworkStatic


This command tells the Manim engine to run the scene_vosviewer.py script and create an image from the CitationNetworkStatic class. The -qh flag means “high quality”. This command creates the following image in the media/ directory:

Adding animation

As you can see, Manim provides some nice functionality for visualizing geometric objects such as circles, lines, text, etc. (see docs for many more options). However, the real advantage of Manim over other plotting libraries is its streamlined functionality for creating powerful, precise animations. Geometric objects in Manim (e.g., Line, Circle, Text, and even Group) have built-in operators for appearing, disappearing, moving, rotating, transforming, changing colors, etc.

As a simple demonstration, we will create an animated GIF that does the following:

1. Begins with the links already drawn, and pauses 1 second
2. Adds the co-authorship clusters one at a time, pausing 0.25 seconds between each
3. Pauses 1 second
4. Loops over the clusters. For each, zoom in and pan the camera to that cluster, rotate all circles around the Group’s centroid, display the group’s label, pause for 1 second, then remove the label and zoom back out to the entire figure.

To do this, we only need to make a few rather small changes to our code. First, the new class CitationNetworkAnimated should inherit from Manim’s MovingCameraScene class, which adds functionality for camera zooming and panning on top of the basic Scene class.

class CitationNetworkAnimated(MovingCameraScene):


The simplest thing we can do to transform our static image into an animation is to add pauses in between different steps. To pause one second, we simply add

        self.wait(1)


For the application described above, I add a 1 second pause after adding the links to the figure, a 0.25 second pause within the first cluster loop where each cluster of circles is added to the figure, a 1 second pause after that loop completes, and a 1 second pause within the second cluster loop where each group is rotated and labeled.

The rest of the animations in Step 4 above are added with the following snippet, which replaces the final snippet in the previous section:

        ### now animate zooming & labeling clusters 1 by 1
        self.camera.frame.save_state()
        self.wait(1)
        order = np.argsort(nodegroupweights)[::-1]
        for i in order:
            nodegroup = nodegroups[i]
            if len(nodegroup) > 0:
                text = Text(nodegrouplabels[i][0]).set_color(WHITE).move_to(nodegroup).scale(np.log10(nodegroupweights[i][0]+10) / 8)
                self.play(self.camera.frame.animate.move_to(nodegroup).set(width = max(nodegroup.width * 2, text.width * 2)))
                self.play(Rotate(nodegroup, angle=2*PI))
                self.add(text)
                self.wait(1)
                self.remove(text)
        self.play(Restore(self.camera.frame))


Lines 11-13 are used to add the text to the figure, wait 1 second, then remove the text from the figure. Line 10 uses Manim’s play and Rotate methods to rotate each cluster by a full 360 degrees around its centroid. Finally, the animated zooming and panning are orchestrated with lines 2, 9, and 14. Line 2 saves the initial camera frame (location and extent) when viewing the entire figure. Line 9 zooms and pans to a particular cluster. Line 14 restores the camera to the original frame. Each of these actions is automatically done in a smooth, aesthetically pleasing way without having to manually create and stitch together the intermediate still images.

Finally, we can create the GIF using the following command:

manim -ql -i scene_vosviewer.py CitationNetworkAnimated


The -i flag tells Manim to create a GIF rather than a movie file. I’ve used the low quality flag -ql to save time and space (Note: WordPress seems to be downgrading the image quality somewhat, it looks better when I view the file directly). Here is the result:

This is, of course, a rather silly example which probably won’t win me any Oscars. But I hope it demonstrates the ease and power of creating animated figures with the Manim engine in Python. If you’re intrigued and looking for more inspiration, I definitely recommend checking out the detailed tutorials and example gallery in the Manim Community docs, the blogpost by Khuyen Tran on Towards Data Science, and of course Grant Sanderson’s 3blue1brown channel.

The Building Blocks of Time Series Analysis (plus a demo!)

Recently I’ve been learning to view the world in frequencies and amplitudes instead of time and magnitudes, also known as Fourier-domain time series analysis. It’s become a rather popular method in the field of water supply planning, with applications in forecasting demand and water quality control. While this post will not delve into advanced applications, we will walk through some fundamental time series concepts to ease you into the Fourier Domain. I will then demonstrate how time series analysis can be used to identify long- and short-term trends in inflow data.

Important frequencies

When reviewing time series analysis, there are three fundamental frequencies to remember:

1. The sampling frequency, $f_{s}$ is the frequency at which data for a time series is collected. It is equivalent to the inverse of the time difference between two consecutive data points, $\frac{1}{\delta t}$. It has units of Hertz ($Hz$). Higher values of $f_{s}$ indicate a higher number of data points recorded per second, which means better data resolution.
2. The Nyquist frequency, $f_{Ny}$, is equal to $\frac{1}{2} f_{s}$. As a rule of thumb, your signal should not contain frequency components above $f_{Ny}$: any components at frequencies $f > f_{Ny}$ cannot be resolved by the sampled points and are instead incorrectly observed at lower frequencies. This phenomenon is called aliasing, where the sampled points do not sufficiently represent the input signal (Ellis, 2012).
3. The cutoff frequency, $f_{cut}$. The value of $f_{cut}$ defines the signal frequency that passes through or is stopped by a filter. This is a nice segue into the next section:
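As a quick sanity check of these definitions, the sampling and Nyquist frequencies of a weekly time series (the sampling interval used in the demo below) can be computed in a few lines of Python; the 3-day cycle is a hypothetical example of a component that weekly sampling cannot resolve:

```python
# Sampling frequency of a weekly time series, in Hz
seconds_per_week = 7 * 24 * 3600      # delta_t, the time between samples
f_s = 1 / seconds_per_week            # sampling frequency, ~1.65e-6 Hz
f_ny = f_s / 2                        # Nyquist frequency, ~8.27e-7 Hz

# A hypothetical 3-day cycle lies above the Nyquist frequency,
# so it would be aliased to a lower frequency under weekly sampling.
f_3day = 1 / (3 * 24 * 3600)
print(f_3day > f_ny)  # → True
```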

Filtering

Filtering does exactly what its name implies: it allows certain signal frequencies to pass through while filtering out the rest. There are three main types of filters:

1. Low-pass filter: A low-pass filter eliminates all frequencies above $f_{cut}$, allowing lower frequencies to pass through. It behaves like a smoothing operator and works well for signal noise reduction.
2. High-pass filter: Unlike the low-pass filter, the high-pass filter eliminates all frequencies below $f_{cut}$. This is a useful filter when you have discontinuous data, as high-pass filtering a signal maintains its discontinuities and does not artificially smooth it over.
3. Band-pass filter: This filter has both the properties of a low- and high-pass filter. It eliminates all frequencies that do not lie within a range of frequencies.
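To make the low-pass idea concrete before the MATLAB demo, here is a minimal NumPy sketch of an idealized (“brick-wall”) low-pass filter that simply zeroes all FFT coefficients above $f_{cut}$ — a cruder tool than the Butterworth filter used later, but the same concept:

```python
import numpy as np

def ideal_lowpass(x, fs, fcut):
    """Zero out all frequency components above fcut (a 'brick-wall' low-pass)."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1/fs)
    X[freqs > fcut] = 0.0
    return np.fft.irfft(X, n=len(x))

# Toy signal: slow 1 Hz component plus fast 20 Hz 'noise', sampled at 100 Hz
fs = 100.0
t = np.arange(0, 2, 1/fs)
x = np.sin(2*np.pi*1*t) + 0.5*np.sin(2*np.pi*20*t)

# Keeps the 1 Hz cycle and removes the 20 Hz component
smooth = ideal_lowpass(x, fs, fcut=5.0)
```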

Windowing

The core idea behind windowing is that it is possible to obtain more detailed insight into your signal’s properties by dividing it into ‘windows’ of time. Windowing is useful for reducing artifacts caused by the periodic extension of a finite signal, which can blur the distinction between signal and noise. For particularly noisy signals, windowing can also amplify the signal-to-noise ratio, making it easier to distinguish the actual signal from white noise. This is possible because the large ‘main lobe’ of the window (in both the time and frequency domains) amplifies the lower-frequency, higher-amplitude signal while dampening the higher-frequency, lower-amplitude noise.
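A small NumPy sketch of the idea, using a Hann window (np.hanning) as a stand-in for the Blackman-Harris window used later in the MATLAB demo:

```python
import numpy as np

n = 256
window = np.hanning(n)   # main lobe in the middle, tapering to zero at the edges

rng = np.random.default_rng(0)
segment = rng.normal(size=n)   # one 'window' of a noisy signal
tapered = segment * window

# Because the tapered segment starts and ends at (near) zero, its implicit
# periodic extension has no sharp jumps, which reduces spectral leakage.
```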

Spectrograms

There is a key limitation when analyzing a time series in the Fourier domain: there is no way to see when a specific frequency occurs, or which frequencies dominate at any specific time. A spectrogram (built on the Short-Time Fourier Transform, STFT) of a signal addresses this issue by consolidating three types of information into one visualization:

1. The frequencies inherent to the signal
2. How the frequencies change over time
3. The relative strength of the frequencies (the frequency content or “power” of the frequency).

It allows us to view changes in the signal across both time and frequency, overcoming the limitation of only being applicable to stationary time series. However, it still applies a fixed window size, as we will see in a bit, and so cannot fully capture all the different frequencies that characterize a signal. This is where wavelets become useful; please refer to Rohini’s post on applying wavelets to ENSO El Niño data here if you’re interested! On that note, let’s bring it all together with a quick demonstration.
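Before switching to MATLAB, here is a bare-bones STFT in NumPy (Hann window, 50% overlap; all signal values are hypothetical) that shows what a spectrogram computes under the hood — the dominant frequency can then be tracked per time window:

```python
import numpy as np

def stft(x, nperseg, fs):
    """Minimal short-time Fourier transform: Hann-windowed, 50% overlap."""
    hop = nperseg // 2
    win = np.hanning(nperseg)
    starts = range(0, len(x) - nperseg + 1, hop)
    frames = np.array([np.fft.rfft(x[s:s+nperseg] * win) for s in starts])
    freqs = np.fft.rfftfreq(nperseg, d=1/fs)
    return freqs, np.abs(frames)   # |STFT|: rows = time frames, cols = frequencies

# Signal whose dominant frequency changes halfway through: 5 Hz, then 20 Hz
fs = 100.0
t = np.arange(0, 4, 1/fs)
x = np.where(t < 2, np.sin(2*np.pi*5*t), np.sin(2*np.pi*20*t))

freqs, mag = stft(x, nperseg=100, fs=fs)
dominant = freqs[np.argmax(mag, axis=1)]  # dominant frequency in each time window
```

The fixed `nperseg` is exactly the “fixed window size” limitation mentioned above: one window length must trade off time resolution against frequency resolution.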

Demonstrating basic time series analysis methods on an inflow timeseries

This demonstration will use MATLAB, and we begin by first loading and plotting a time series of weekly inflow data. Feel free to replace the inflow file with any time series you have on hand!

%%
% Read in and plot inflows of Jordan Lake in the Research Triangle

inflows_jl = readmatrix('jordan_lake_inflows.csv');
inflows_jl = inflows_jl(1,:);

% define sampling rate, Nyquist frequency, number of samples, sampling period, and frequency series
fs_inflow = 1/(7*24*3600);
N = length(inflows_jl);
delta_t = 1/fs_inflow;
freqseries_inflow = (1/(N*delta_t))*(0:(N-1));
fNy = fs_inflow/2;

%% Plot the inflow time series
figure(1)
plot(inflows_jl, 'Color','#104E8B');
grid on
xlabel('Time (weeks)');
ylabel('Inflow (MGD)');
xlim([0 5200]); ylim([0 15e4]);
title('Jordan Lake inflow timeseries');


The output figure should look similar to this:

Next, we Fourier Transform the inflow time series using the MATLAB fft() function, then convert its amplitude to decibels by applying the abs() function, followed by the mag2db() function, to the output of the Fourier Transform.

%% Perform and plot the FFT of inflow
fft_jl = abs(fft(inflows_jl));

figure(2)
loglog(freqseries_inflow, mag2db(fft_jl), '-ko', 'Color','#104E8B', 'MarkerSize',2);
grid on
xlabel('Frequency (Hz)');
ylabel('Amplitude (dB)');
title("FFT of Jordan Lake");


This returns the following plot:

Note the large spikes in Figure 3; these indicate frequencies that correspond to large inflow events. In this blog post, we will focus on the two leftmost spikes: the four-year cycle and the annual cycle, shown below. These cycles were identified by converting their frequencies into periods in seconds, and then into weeks and years.
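For reference, this conversion can be sketched in a few lines of Python (a stand-in for the conv2weeks MATLAB helper, whose implementation is not shown; the cutoff values are taken from the MATLAB snippet below):

```python
# Convert a frequency in Hz to a period in weeks and years.
# (A stand-in for the conv2weeks MATLAB helper, whose code is not shown.)
SECONDS_PER_WEEK = 7 * 24 * 3600
WEEKS_PER_YEAR = 52.1429

def period_in_years(f_hz):
    weeks = (1 / f_hz) / SECONDS_PER_WEEK
    return weeks / WEEKS_PER_YEAR

# Cutoff frequencies used in the MATLAB snippet below
four_year = period_in_years(8.40951e-9)   # ~3.77 years -> the "four-year" cycle
annual = period_in_years(3.16974e-8)      # ~1.00 years -> the annual cycle
```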

We also want to confirm that these frequencies characterize the time series. To verify this, we apply a low-pass Butterworth filter:

%% Number of weeks in a year
num_weeks = 52.1429;

% Record the low-pass cutoff frequencies
fcut1_jl = 8.40951e-9;
years1_jl = conv2weeks(1/fcut1_jl)/num_weeks;

fcut2_jl = 3.16974e-8;
years2_jl = conv2weeks(1/fcut2_jl)/num_weeks;

% Set up the filters and apply to the inflow time series
[b1_jl, a1_jl] = butter(6, fcut1_jl/(fs_inflow/2));
inflows_filtered_jl1 = filter(b1_jl, a1_jl, inflows_jl);

[b2_jl, a2_jl] = butter(6, fcut2_jl/(fs_inflow/2));
inflows_filtered_jl2 = filter(b2_jl, a2_jl, inflows_jl);

%% Plot the filtered inflow time series
figure(3)
plot(inflows_jl, 'Color','#104E8B',  'DisplayName', 'Original', 'LineWidth',2);
xlim([0 5200]);
ylim([0 15e4]);
xlabel('Time (Years)');
ylabel('Inflows (MGD)');
title('Jordan Lake 4-year cycle')
legend();
hold on
plot(inflows_filtered_jl1,'Color','#D88037', 'MarkerSize',2, 'DisplayName', '4-year cycle', 'LineWidth',2);
hold off
ticklocations = 0:520:5112;
xticks(ticklocations);
xticklabels({'0', '10', '20', '30', '40', '50', '60', '70', '80', '90'});

% Plot their Fourier counterparts
fft_win1_jl = abs(fft(inflows_filtered_jl1));

figure(4)
loglog(freqseries_inflow(1:length(fft_jl)), fft_jl, 'Color','#104E8B', 'DisplayName','Original');
legend({'Original','4-year cycle'},'Location','southeast');
title(strcat("Low-pass filter, f_{cut} = ", string(fcut1_jl)));
xlabel("Frequency (Hz)");
ylabel("Amplitude");
hold on
loglog(freqseries_inflow(1:length(fft_win1_jl)), fft_win1_jl, 'Color','#D88037','DisplayName','4-year cycle');
hold off


Replacing inflows_filtered_jl1 and fft_win1_jl (the 4-year cycle) with that of the annual cycle will result in Figure 5 below.

As previously mentioned, the low-pass filter effectively attenuated all frequencies higher than the respective $f_{cut}$ values. The frequencies identified in Figures 3 and 4 can also be said to characterize this inflow time series, as the peaks of the filtered series align with those of the original time series.

Next, we construct a spectrogram for this inflow time series. Here, we will use the Blackman-Harris window, which is a window that has a relatively wide main lobe and small side lobes. This enables it to amplify frequencies that more significantly characterize inflow, while suppressing less-significant ones. This window can be generated using the MATLAB blackmanharris() function, as shown below:

%% Define window sizes
% (N1_jl and N2_jl are the window lengths, in samples, for the
% 4-year and annual cycles respectively)

% Generate window
win1_jl = blackmanharris(N1_jl, 'periodic');  % 4-year cycle
win2_jl = blackmanharris(N2_jl, 'periodic');  % Annual cycle


Finally, we are ready to apply the spectrogram() function to generate the spectrogram for this inflow time series! In the code below, we set up the necessary values and generate the figure:

%% Define overlap size
Noverlap1_jl = N1_jl/2;  % 4-year cycle
Noverlap2_jl = N2_jl/2;  % Annual cycle

%% Plot inflow spectrogram for annual sampling (Jordan Lake)
spectrogram(inflows_jl, win1_jl, Noverlap1_jl, N1_jl, fs_inflow, 'yaxis');
colormap(flipud(parula));
clim([100,150]);
title('Jordan Lake 4-year cycle (Blackman-Harris)');


By replacing the values associated with the 4-year cycle with that of the annual cycle in the above code, the following figures can be generated:

In Figure 6, darker blue indicates more powerful fluctuations (high-inflow events). Their low frequency, however, implies that such events are fairly rare in this time series. On the other hand, lighter blue to yellow indicates inflow events of higher frequency but lower power: while they occur more often, they do not significantly influence the signal. We can also see from these two figures that, in the long and medium term, both low- and high-frequency events become less powerful, implying a decrease in inflow and even rarer occurrences of the rainfall events that historically resulted in large inflows.

Conclusion

In this post, I first introduced a few basic concepts and methods related to time series analysis and the Fourier domain. Next, we applied these concepts to a simple analysis of an inflow time series using MATLAB. That said, this post barely scratches the surface of Fourier Transforms and time series analysis as a whole. I hope this gave you a flavor of what these tools enable you to do, and piqued your interest in exploring the topic further!

QPPQ Method for Streamflow Prediction at Ungauged Sites – Python Demo

Streamflow Predictions in Ungauged Basins

Predicting streamflow at ungauged locations is a classic problem in hydrology which has motivated significant research over the last several decades (Hrachowitz, Markus, et al., 2013).

There are numerous different methods for performing predictions in ungauged basins, but here I focus on the common QPPQ method.

Below, I describe the method and further down I provide a walkthrough demonstration of QPPQ streamflow prediction in Python.

The supporting code can be found on my GitHub here: QPPQ_Streamflow_Prediction_Tutorial.

QPPQ-Method for Streamflow Prediction

Fennessey (1994) introduced the QPPQ method for streamflow estimation at ungauged locations.

The QPPQ method is commonly used and encouraged by the USGS, and is described at length in their publication Estimation of Daily Mean Streamflow for Ungaged Stream locations… (2016).

QPPQ consists of four key steps:

1. Estimate an FDC for the target catchment of interest, $\hat{FDC}_{pred}$.
2. Identify $K$ donor locations nearest to the target point.
3. Transfer the timeseries of nonexceedance probabilities ($\mathbf{P}$) from the donor site(s) to the target.
4. Use the estimated FDC for the target to map the donated nonexceedance timeseries, $\mathbf{P}$, back to streamflow.

To limit the scope of this tutorial, let’s assume that an estimate of the FDC at the target site, $\hat{FDC}_{pred}$, has already been determined through some other statistical or observational study.

Then the QPPQ method can be described more formally. Given an ungauged location with an estimated FDC, $\hat{FDC}_{pred}$, and a set of observed streamflow timeseries $\mathbf{q_i}$ at $K$ neighboring sites, such that:

$Q_{obs} = [\mathbf{q_1}, \mathbf{q_2}, ..., \mathbf{q_k}]$

With corresponding $K$ FDCs at the observation locations:

$FDC = [FDC_1, FDC_2, ..., FDC_k]$

The FDCs are used to convert the observed streamflow timeseries, $\mathbf{q_{i}}$, to non-exceedance probability timeseries, $\mathbf{p_{i}}$.

$\displaystyle FDC_i : \mathbf{q_{i}} \to \mathbf{p_i}$

We can then perform a weighted-aggregation of the non-exceedance probability timeseries to estimate the non-exceedance timeseries at the ungauged location. It is most common to apply an inverse-squared-distance weight to each observed timeseries such that:

$\mathbf{p_{pred}} = \sum_{i=1}^{k} w_i \mathbf{p_i}$

where $w_i \propto 1 / d_i^2$, with $d_i$ the distance from observation site $i$ to the ungauged location, and the weights normalized so that $\sum_{i=1}^{k} w_i = 1$.
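As a small NumPy sketch of this weighting step (the distances and NEP values here are hypothetical):

```python
import numpy as np

# Hypothetical distances (km) from the ungauged site to K = 3 donor gages
d = np.array([4.0, 10.0, 25.0])

w = 1 / d**2          # raw inverse-squared-distance weights
w = w / w.sum()       # normalize so the weights sum to 1

# Hypothetical non-exceedance probability of each donor at one time step
p_obs = np.array([0.60, 0.55, 0.70])
p_pred = np.sum(p_obs * w)   # weighted-average NEP at the ungauged site
```

Note how the nearest gage dominates: squaring the distances makes the weights fall off quickly.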

Finally, the estimated FDC at the ungauged location, $\hat{FDC}_{pred}$ is used to convert the non-exceedance timeseries to streamflow timeseries:

$\hat{FDC}_{pred} : \mathbf{p}_{pred} \to \mathbf{q}_{pred}$

Looking at this formulation, and the sequence of transformations that take place, I hope it is clear why the method is rightfully called the QPPQ method.
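To see all four steps at once, here is a toy NumPy sketch of the Q→P→P→Q chain with a single donor gage and a synthetic “known” target FDC (all numbers are made up; the tutorial below implements the full K-neighbor version):

```python
import numpy as np

rng = np.random.default_rng(1)
q_donor = rng.lognormal(mean=3.0, sigma=1.0, size=1000)   # synthetic donor flows

neps = np.linspace(0, 1, 200)              # discrete non-exceedance probabilities
fdc_donor = np.quantile(q_donor, neps)     # donor FDC
fdc_target = 2.0 * fdc_donor               # pretend the target FDC is known

# Q -> P: convert each donor flow to its non-exceedance probability
p = np.interp(q_donor, fdc_donor, neps)

# P -> Q: map the donated NEP series through the target FDC
q_pred = np.interp(p, neps, fdc_target)
```

Because the synthetic target FDC is just the donor FDC scaled by two, the predicted flows come out as twice the donor flows, which is a handy way to convince yourself the transformation chain is working.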

This method is summarized well by the following graphic, taken from the USGS Report on the topic:

In the following section, I step through an implementation of this method in Python.

Tutorial

All Python scripts used in this tutorial can be found in my GitHub repository: QPPQ_Streamflow_Prediction_Tutorial.

To run the scripts in this tutorial yourself, you will need to install a few Python libraries, listed in requirements.txt. Running pip install -r requirements.txt from the command line, while inside a local copy of the repository, will install all of these packages.

Data retrieval

I collected USGS streamflow data from $N$ gages using the HyRiver suite for Python.

If you would like to learn more about hydro-environmental data acquisition in Python, check out my old post on Efficient hydroclimatic data accessing with HyRiver for Python.

The script used to retrieve the data is available here. If you would like to experiment with this method in other regions, you can change the region variable on line 21, which specifies the corners of a bounding-box within which gage data will be retrieved:


# Specify time-range and region of interest
dates = ("2000-01-01", "2010-12-31")
region = (-108.0, 38.0, -105.0, 40.0)



Above, I specify a region west of the Rocky Mountains in Colorado. Running generate_streamflow_data.py, I found 73 USGS gage locations (blue circles).

QPPQ Model

The file QPPQ.py contains the method outlined above, defined as the StreamflowGenerator class object.

The StreamflowGenerator has four key methods (or functions):

class StreamflowGenerator():
    def __init__(self, args):
        ...
    def get_knn(self):
        ...
    def calculate_nep(self):
        ...
    def interpolate_fdc(self):
        ...
    def predict_streamflow(self):
        ...
        return predicted_flow



The method get_knn finds the $K$ gage locations nearest to the prediction location, and stores the distances to these observation locations (self.knn_distances) and their indices (self.knn_indices).

    def get_knn(self):
        """Find distances and indices of the K nearest gage locations."""
        distances = np.zeros((self.n_observations))
        for i in range(self.n_observations):
            distances[i] = geodesic(self.prediction_location, self.observation_locations[i,:]).kilometers
        self.knn_indices = np.argsort(distances, axis = 0)[0:self.K].flatten()
        self.knn_distances = np.sort(distances, axis = 0)[0:self.K].flatten()
        return


The next method, calculate_nep, calculates the NEP of a flow at an observation location at time t, or $P(Q \leq q_t)_{i}$.

    def calculate_nep(self, KNN_Q, Q_t):
        """Find the NEP at time t based on historic observations."""
        # Get historic FDC
        fdc = np.quantile(KNN_Q, self.fdc_neps, axis = 1).T
        # Find nearest FDC value
        nearest_quantile = np.argsort(abs(fdc - Q_t), axis = 1)[:,0]
        nep = self.fdc_neps[nearest_quantile]
        return nep


The interpolate_fdc method performs a linear interpolation of the discrete FDC to estimate the flow for a given NEP.

    def interpolate_fdc(self, nep, fdc):
        """Perform linear interpolation of the discrete FDC at a NEP."""
        tol = 0.0000001
        if nep == 0:
            nep = np.array(tol)
        sq_diff = (self.fdc_neps - nep)**2

        # Index of nearest discrete NEP
        ind = np.argmin(sq_diff)

        # Handle edge-cases
        if nep <= self.fdc_neps[0]:
            return fdc[0]
        elif nep >= self.fdc_neps[-1]:
            return fdc[-1]

        if self.fdc_neps[ind] <= nep:
            flow_range = fdc[ind:ind+2]
            nep_range = self.fdc_neps[ind:ind+2]
        else:
            flow_range = fdc[ind-1:ind+1]
            nep_range = self.fdc_neps[ind-1:ind+1]

        slope = (flow_range[1] - flow_range[0])/(nep_range[1] - nep_range[0])
        flow = flow_range[0] + slope*(nep - nep_range[0])
        return flow
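Because self.fdc_neps is a monotonically increasing grid, this interpolation can also be cross-checked in one line with numpy.interp; a quick sketch with synthetic values (not part of the tutorial code):

```python
import numpy as np

fdc_neps = np.linspace(0, 1, 200)     # discrete NEP grid
# A synthetic, monotonically increasing FDC
fdc = np.sort(np.random.default_rng(2).lognormal(size=200))

nep = 0.731
flow = np.interp(nep, fdc_neps, fdc)  # linear interpolation of the discrete FDC
```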


Finally, predict_streamflow(self, *args) combines these other methods and performs the full QPPQ prediction.

    def predict_streamflow(self, args):
        """Run the QPPQ prediction method for a single location."""
        self.prediction_location = args['prediction_location']
        self.prediction_fdc = args['prediction_fdc']
        self.fdc_quantiles = args['fdc_quantiles']
        self.n_predictions = self.prediction_location.shape[0]

        ### Find nearest K observations
        self.get_knn()
        knn_flows = self.historic_flows[self.knn_indices, :]

        ### Calculate weights as inverse square distance
        self.wts = 1/self.knn_distances**2

        # Normalize weights
        self.norm_wts = self.wts/np.sum(self.wts)

        ### Create timeseries of NEP at observation locations
        self.observed_neps = np.zeros_like(knn_flows)
        for t in range(self.T):
            self.observed_neps[:,t] = self.calculate_nep(knn_flows, knn_flows[:,t:t+1])

        ### Calculate predicted NEP timeseries using weights
        self.predicted_nep = np.zeros((self.n_predictions, self.T))
        for t in range(self.T):
            self.predicted_nep[:,t] = np.sum(self.observed_neps[:,t:t+1].T * self.norm_wts)

        ### Convert NEP timeseries to flow timeseries
        self.predicted_flow = np.zeros_like(self.predicted_nep)
        for t in range(self.T):
            nep_t = self.predicted_nep[0,:][t]
            self.predicted_flow[0,t] = self.interpolate_fdc(nep_t, self.prediction_fdc)

        return self.predicted_flow


The predict_streamflow method is the only function called directly by the user; get_knn, calculate_nep, and interpolate_fdc are all used internally by predict_streamflow.

Generate streamflow predictions

The script run_QPPQ_predictions.py runs the model and produces predictions at a test site. First, the data generated by generate_streamflow_data.py is loaded:

import numpy as np
from QPPQ import StreamflowGenerator

### Load Data
gage_locations = np.loadtxt('./data/observed_gage_locations.csv', delimiter = ',')
observed_flows = np.loadtxt('./data/observed_streamflow.csv', delimiter = ',')


The FDCs at each site are estimated at 200 discrete quantiles:

fdc_quantiles = np.linspace(0,1,200)
observed_fdc = np.quantile(observed_flows, fdc_quantiles, axis =1).T



A random test site is selected, and removed from the observation data:

# Select a test site and remove it from observations
test_site = np.random.randint(0, gage_locations.shape[0])

# Store test site
test_location = gage_locations[test_site,:]
test_flow = observed_flows[test_site, :]
test_site_fdc = observed_fdc[test_site, :]

# Remove test site from observations
gage_locations = np.delete(gage_locations, test_site, axis = 0)
observed_flows = np.delete(observed_flows, test_site, axis = 0)


When initializing the StreamflowGenerator, we must provide an array of gage location data (longitude, latitude), historic streamflow data at each gage, and the number of nearest neighbors, K, to include in the timeseries aggregation.

I’ve set-up the StreamflowGenerator class to receive these inputs as a dictionary, such as:

# Specify model prediction_inputs
QPPQ_args = {'observed_locations' : gage_locations,
             'historic_flows' : observed_flows,
             'K' : 20}

# Initialize the model
QPPQ_model = StreamflowGenerator(QPPQ_args)


Similarly, the prediction arguments are provided as a dictionary to the predict_streamflow function:

# Specify the prediction arguments
prediction_args = {'prediction_location': test_location,
                   'prediction_fdc' : test_site_fdc,
                   'fdc_quantiles' : fdc_quantiles}

# Run the prediction
predicted_flow = QPPQ_model.predict_streamflow(prediction_args)


I made a function, plot_predicted_and_observed, which allows for a quick visual check of the predicted timeseries compared to the observed timeseries:

from plot_functions import plot_predicted_and_observed
plot_predicted_and_observed(predicted_flow, test_flow)


Which shows some nice quality predictions!

One benefit of working with the StreamflowGenerator as a Python class object is that we can retrieve the internal variables for further inspection.

For example, I can call QPPQ_model.knn_distances to retrieve the distances to the K nearest neighbors used to predict the flow at the ungauged location. In this case, the gages used to make the prediction above were located 4.44, 13.23, 18.38, … kilometers away.

Caveat and Conclusion

It is worth highlighting one major caveat to this example: the FDC used for the prediction site was perfectly known from the historic record. In most cases, the FDC will not be known when making predictions in ungauged basins. Rather, estimates of the FDC will need to be used, and thus the prediction quality shown above is somewhat of an ideal case when performing a QPPQ in ungauged basins.

There are numerous methods for estimating FDCs at the ungauged site, including the Generalized Pareto distribution approximation proposed by Fennessey (1994) or, more recently, through the use of Neural Networks, as highlighted in Worland, et al. (2019).

Hopefully this tutorial helped to get you familiar with a foundational streamflow prediction method.

References

Fennessey, Neil Merrick. "A Hydro-Climatological Model of Daily Stream Flow for the Northeast United States." Order No. 9510802 Tufts University, 1994. Ann Arbor: ProQuest. Web. 21 Nov. 2022.

Hrachowitz, Markus, et al. "A decade of Predictions in Ungauged Basins (PUB)—a review." Hydrological sciences journal 58.6 (2013): 1198-1255.

Razavi, Tara, and Paulin Coulibaly. "Streamflow prediction in ungauged basins: review of regionalization methods." Journal of hydrologic engineering 18.8 (2013): 958-975.

Stuckey, M.H., 2016, Estimation of daily mean streamflow for ungaged stream locations in the Delaware River Basin, water years 1960–2010: U.S. Geological Survey Scientific Investigations Report 2015–5157, 42 p., http://dx.doi.org/10.3133/sir20155157.

Worland, Scott C., et al. "Prediction and inference of flow duration curves using multioutput neural networks." Water Resources Research 55.8 (2019): 6850-6868.

Creating a collaborative lab manual pt. 2: Automated build & deploy with GitHub Actions

In my last blog post, I introduced the collaborative lab manual that the Reed Group is beginning to develop. Although the content of this site is still very much a work in progress, the goal is for this lab manual to help streamline the process of onboarding new students, postdocs, and collaborators by acting as a centralized location for readings, training materials, and coding resources relevant to our group’s research. As described in the previous post, the site itself is built using the Jupyter Books package for Python, which allows for Markdown documents and executable Jupyter Notebooks to be translated into static HTML websites.

However, the distributed nature of this project, with multiple collaborators developing material for the site from their own computers, initially led to some software dependency challenges. As the project has evolved, we have worked to overcome these challenges using GitHub Actions. This post will describe our GitHub Actions workflow, which automatically rebuilds the site in a consistent way after each new update, without inheriting specific software dependencies from individual developers, and then automatically deploys the updated site using GitHub Pages. All code used to build and deploy the lab manual can be found in this GitHub repository.

GitHub Actions workflow

This post will build on Travis Thurber’s previous blogpost on Continuous Integration/Continuous Delivery (CI/CD), GitHub Actions, and GitHub Pages, so I would recommend checking that out first if you have not already. As discussed in that post, GitHub Actions are automated processes that can be triggered by particular events within a GitHub repository (e.g., updates to the code base). The instructions for executing these processes are written in a YAML script, .github/workflows/deploy.yml. GitHub automatically looks for scripts in this directory to trigger automated GitHub Actions.

### github actions to build & deploy book, following https://github.com/executablebooks/cookiecutter-jupyter-book/blob/main/.github/workflows/deploy.yml

name: deploy

on:
  # Trigger the deploy on push to main branch
  push:
    branches:
      - main
  schedule:
    # jupyter-book is updated regularly, let's run this deployment every month in case something fails
    # <minute [0,59]> <hour [0,23]> <day of the month [1,31]> <month of the year [1,12]> <day of the week [0,6]>
    # https://pubs.opengroup.org/onlinepubs/9699919799/utilities/crontab.html#tag_20_25_07
    # https://crontab.guru/every-month
    # Run cron job every month
    - cron: '0 0 1 * *'

jobs:
  # This job deploys the example book
  deploy-example-book:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest]
        python-version: [3.8]
    steps:
      - uses: actions/checkout@v2

      # Install python
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v1
        with:
          python-version: ${{ matrix.python-version }}

      # install virtual environment with caching, so only updates when requirements.txt changes,
      # based on https://github.com/marketplace/actions/restore-or-create-a-python-virtualenv#custom_virtualenv_dir
      # Note: virtual environment by default will be created under ~/.venv
      - uses: syphar/restore-virtualenv@v1
        id: cache-virtualenv
        with:
          requirement_files: docs/requirements_py.txt

      - uses: syphar/restore-pip-download-cache@v1
        if: steps.cache-virtualenv.outputs.cache-hit != 'true'

      # install python dependencies
      - name: Install python dependencies
        run: pip install -r docs/requirements_py.txt
        if: steps.cache-virtualenv.outputs.cache-hit != 'true'

      # update kernel of all jupyter notebooks to .venv to match GH action environment
      - name: Update Jupyter Notebook kernels
        run: python docs/update_jupyter_kernels.py .venv | python -m ipykernel install --user --name=.venv

      # install R
      - name: Install R
        uses: r-lib/actions/setup-r@v2
        with:
          use-public-rspm: true

      # install R dependencies
      - name: Install R dependencies
        run: sh docs/install_R_dependencies.sh

      # Build the example book
      - name: Build book
        run: jupyter-book build --all docs/

      # Deploy html to gh-pages
      - name: GitHub Pages action
        uses: peaceiris/actions-gh-pages@v3.6.1
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: docs/_build/html
          publish_branch: gh-pages


I initially borrowed from the example Jupyter Book deployment that can be found here. This Action (Line 3) has a number of steps, which will be outlined sequentially. First, in Lines 5-16, we establish the set of conditions that trigger automated execution of the Action: (1) the Action will be triggered each time an update is pushed to the main branch of the repository; and (2) the Action will be triggered on the first of each month, regardless of whether any new updates have been pushed to the repository, which should help ensure that software dependencies remain up to date over time.

Next, in Lines 18-27, we specify that the Action should be executed on an Ubuntu virtual machine (latest available version). In Lines 29-48, we instruct the Action to install Python 3.8 on the virtual machine, and then install all Python packages listed in the file docs/requirements_py.txt. Standardizing the operating system and software dependencies using a virtual machine, as opposed to building and deploying locally on individual developers’ machines, helps to avoid situations wherein one developer accidentally breaks functionality introduced by another developer due to differences in their local software environments.

One downside to rebuilding a new virtual machine each time the Action is run is that this can be rather slow, especially for complicated software environments, e.g., those with many Python package installations. However, one way to make this process more efficient is to create a virtual environment which can be cached and reloaded into memory. This process is implemented in Lines 35-48, using the restore_virtualenv “task” in the GitHub Action Marketplace. Now the Python environment will only need to be rebuilt when the requirements_py.txt file has changed since the last commit; otherwise, the cached virtual environment can be loaded from memory in order to save time.

Another issue we encountered relates to the naming of virtual environments within Jupyter Notebooks. When you create a new Notebook on a personal computer, you will generally need to specify the Python “Kernel” manually using the graphical user interface. The Kernel can either be the generic Python interpreter you have installed (e.g., Python 3.8), or else a specific virtual environment that you have installed for a particular purpose (e.g., .venv_spatial). Once you have selected a Kernel, the Notebook will expect the same Kernel to be available as the default whenever the Notebook is run. This can cause errors when working on teams where different contributors have different naming conventions for their environments. For example, consider a situation where Contributor A develops a Jupyter Notebook-based page on spatial analysis, using the local virtual environment .venv_spatial. Contributor B then pulls Contributor A’s changes, including the new Notebook on spatial analysis, and then adds her own new Notebook on machine learning using the environment .venv_ML. When Contributor B tries to build the Jupyter Book (more on this process later), the spatial analysis notebook will exit with an error saying that the .venv_spatial environment cannot be found. Contributor B could manually change the Notebook Kernel to .venv_ML or another local environment, but this is not an ideal solution for long-term collaborative interoperability. To circumvent this issue, Lines 50-53 execute the script docs/update_jupyter_kernels.py:

### script to update the kernel for all Jupyter notebooks to user-specified name, given as argv
### this will be run by GitHub Action to set Kernel to .venv before deploying website. Can also be run by user if their env has different name.

import glob
import json
import sys

venv_name = sys.argv[1]
nbpaths = glob.glob('*.ipynb') + glob.glob('*/*.ipynb') + glob.glob('*/*/*.ipynb')

for nbpath in nbpaths:
    ### load jupyter notebook into dict based on its json format
    with open(nbpath) as f:
        nbdict = json.load(f)
    ### update kernel metadata
    nbdict['metadata']['kernelspec']['display_name'] = venv_name
    nbdict['metadata']['kernelspec']['name'] = venv_name
    ### write updated dict to json/ipynb
    with open(nbpath, 'w') as f:
        json.dump(nbdict, f, indent=2)


The purpose of this Python script is to reassign all Jupyter Notebooks within the repository to use the same Kernel name, .venv (supplied as a command line argument). The key insight is that Jupyter Notebooks are actually stored on disk in structured JSON format (try opening one in a text editor to verify this). The script above leverages this fact to programmatically replace the existing Kernel name with our new name, .venv. By running this command in GitHub Actions, we can automatically standardize the Kernel for all Notebooks to use the virtual environment that we created in Lines 35-48 (.venv is the default environment name used by the restore_virtualenv task). Note that this should not impact the Kernel names in the repository itself, since this script is only being run on the GitHub Actions virtual machine. However, if individual contributors experience local errors from incompatible Kernel names, they can either change the Kernel manually in the Notebook GUI (the old-fashioned way), or else run the update_jupyter_kernels.py script on their own machine with their unique Python virtual environment name given as a command line argument.
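To see concretely what the script is manipulating, here is a minimal sketch of the JSON structure of a .ipynb file and the kernelspec substitution described above. The notebook dict below is a stripped-down illustration (a real nbformat file contains additional fields), and the environment names are hypothetical:

```python
# Minimal sketch of a .ipynb file's JSON structure, showing where the
# kernelspec lives. Stripped-down illustration, not a complete nbformat file.
import json

nb = {
    "cells": [
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["print('hello')"]}
    ],
    "metadata": {
        "kernelspec": {
            "display_name": ".venv_spatial",  # hypothetical contributor env name
            "language": "python",
            "name": ".venv_spatial",
        }
    },
    "nbformat": 4,
    "nbformat_minor": 5,
}

# The same substitution performed by update_jupyter_kernels.py:
new_kernel = ".venv"
nb["metadata"]["kernelspec"]["display_name"] = new_kernel
nb["metadata"]["kernelspec"]["name"] = new_kernel

print(json.dumps(nb["metadata"]["kernelspec"], indent=2))
```

Because the on-disk format is plain JSON, this kind of bulk rewrite is safe as long as you only touch the metadata fields and re-serialize the rest of the document unchanged.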

Returning to the deploy.yml file, the next step is to install the R programming language (Lines 55-59) and any necessary R libraries (Lines 60-62). The latter step is accomplished by running the script docs/install_R_dependencies.sh.

#!/bin/bash
### script for installing R dependencies from requirements file
### simplified version of code here: https://stackoverflow.com/questions/54534153/install-r-packages-from-requirements-txt-files
while IFS=" " read -r package;
do
  Rscript -e "install.packages('"$package"')";
done < "docs/requirements_r.txt"

This Bash script reads a list of dependencies from the file docs/requirements_r.txt and installs each library on top of the base R installation on our virtual machine. Unfortunately the R installations are quite slow, but I am not aware of any virtual environment caching task for R on GitHub Actions.

Returning again to the deploy.yml file, the next step in Lines 64-66 is to build the static HTML site using the Jupyter Book functionality. The command in Line 66 (jupyter-book build --all docs/) is also the command used to build a site locally when individual contributors want to test their work before pushing changes to the main repository. In this case, this command will create a separate folder of HTML files which can be opened in a browser. However, when running on the virtual machine through GitHub Actions, the new HTML files are not yet being hosted anywhere.

The final set of commands in Lines 68-74 accomplish this by restructuring and organizing the file contents on the virtual machine (including the new HTML files) to match the structure that is expected by GitHub Pages. The updated files are then committed and pushed to a separate branch of the GitHub repository called gh-pages. This gh-pages branch should only be altered by GitHub Actions, as opposed to being developed directly by humans like the main branch. Finally, because gh-pages is set as the source branch for our GitHub Pages site, the updates pushed to this branch by the GitHub Action will automatically deploy the updated HTML files to our lab manual webpage.

To verify that all of these steps on the virtual machine completed successfully, we can check the Action log on GitHub. This shows that each step completed as expected without errors, along with the time to complete each step. If errors do occur, each of these steps can be expanded to show the errors and other terminal outputs in more detail.
Overall, this workflow based on GitHub Actions is helping us to streamline our processes for collaborative development of a group lab manual. Our hope is that this workflow will continue to adapt over time based on our evolving needs, much like the living document of the lab manual itself.

Copula Coding Exercise

This Halloween, you may be scared by many things, but don’t let one of them be copulas. Dave Gold wrote an excellent post on the theory behind copulas, and I’ll build off of that post with a simple coding exercise. When I first heard about copulas, they sounded really relevant to capturing joint behavior across variables (which I then wanted to use to capture joint hydrologic behavior across watersheds). However, because I hadn’t learned about them formally in a class, I had a hard time translating what I was reading in dense statistics textbooks into practice, and found myself blindly trying R packages without making much progress. It was very helpful for me to start to wrap my head around copulas by coding a simple trivariate Gaussian copula from scratch. If copulas are new to you, hopefully this little exercise will be helpful to you as well.

Let’s pose a scenario that will help orient the reason why a copula is helpful. Imagine that we have historical daily full natural flow for three gauged locations in the Central Valley region of California: Don Pedro Reservoir, Millerton Lake, and New Melones Lake. Because these three locations are really important for water supply for the region, it’s helpful to identify the likelihood of joint flooding or drought tendencies across these basins. For this post, let’s focus on joint flooding.

First, let’s fit distributions for each basin individually. I calculate the water year, store it in a new column, and then use the daily flows to determine the peak three-day flow for each year and each basin.
Given the different residence times in each basin, a three-day aggregate flow can better capture concurrent events than a one-day flow. Let’s term the flows in each basin [x_{t,1}…x_{t,n}], the 3-day peak annual flows in each of the n basins in year t.

library(evd)
library(mvtnorm)
library(extRemes)
library(mnormt)

#Read in annual observed flows
TLG_flows=read.table("C:/Users/rg727/Downloads/FNF_TLG_mm.txt")
MIL_flows=read.table("C:/Users/rg727/Downloads/FNF_MIL_mm.txt")
NML_flows=read.table("C:/Users/rg727/Downloads/FNF_NML_mm.txt")

#Calculate water year column
TLG_flows$water_year=0

for (i in 1:dim(TLG_flows)[1]){
  if (TLG_flows$V2[i]==10|TLG_flows$V2[i]==11|TLG_flows$V2[i]==12){
    TLG_flows$water_year[i]=TLG_flows$V1[i]+1
  } else{
    TLG_flows$water_year[i]=TLG_flows$V1[i]
  }
}

MIL_flows$water_year=0

for (i in 1:dim(MIL_flows)[1]){
  if (MIL_flows$V2[i]==10|MIL_flows$V2[i]==11|MIL_flows$V2[i]==12){
    MIL_flows$water_year[i]=MIL_flows$V1[i]+1
  } else{
    MIL_flows$water_year[i]=MIL_flows$V1[i]
  }
}

NML_flows$water_year=0

for (i in 1:dim(NML_flows)[1]){
  if (NML_flows$V2[i]==10|NML_flows$V2[i]==11|NML_flows$V2[i]==12){
    NML_flows$water_year[i]=NML_flows$V1[i]+1
  } else{
    NML_flows$water_year[i]=NML_flows$V1[i]
  }
}

#Calculate 3-day total flow
TLG_flows$three_day=0

for (i in 1:length(TLG_flows$V4)){
  if (i == 1){
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i],TLG_flows$V4[i+1])
  }
  else
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i-1],TLG_flows$V4[i],TLG_flows$V4[i+1])
}

MIL_flows$three_day=0

for (i in 1:length(MIL_flows$V4)){
  if (i == 1){
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i],MIL_flows$V4[i+1])
  }
  else
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i-1],MIL_flows$V4[i],MIL_flows$V4[i+1])
}

NML_flows$three_day=0

for (i in 1:length(NML_flows$V4)){
  if (i == 1){
    NML_flows$three_day[i]=sum(NML_flows$V4[i],NML_flows$V4[i+1])
  }
  else
    NML_flows$three_day[i]=sum(NML_flows$V4[i-1],NML_flows$V4[i],NML_flows$V4[i+1])
}

#Calculate peak annual flow
TLG_flow_peak_annual=aggregate(TLG_flows,by=list(TLG_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)
MIL_flow_peak_annual=aggregate(MIL_flows,by=list(MIL_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)
NML_flow_peak_annual=aggregate(NML_flows,by=list(NML_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)

#Remove extra year (1986) from TLG
TLG_flow_peak_annual=TLG_flow_peak_annual[2:33,]

Next, we need to fit individual marginal distributions for each location. Since we are interested in capturing flood risk across the basins, a common marginal distribution to use is the Generalized Extreme Value (GEV) distribution. We fit a different GEV distribution for each basin, and from here we create a bit of code to determine the peak three-day flows associated with a historical 10-year return period event.

#Determine level of a 10-year return period
getrlpoints <- function(fit){
  xp2 <- ppoints(fit$n, a = 0)
  ytmp <- datagrabber(fit)
  y <- c(ytmp[, 1])
  sdat <- sort(y)
  npy <- fit$npy
  u <- fit$threshold
  rlpoints.x <- -1/log(xp2)[sdat > u]/npy
  rlpoints.y <- sdat[sdat > u]
  rlpoints <- data.frame(rlpoints.x, rlpoints.y)

  return(rlpoints)
}

getcidf <- function(fit){
  rperiods = c(2,5,10,20,50,100,500,1000)
  bds <- ci(fit, return.period = rperiods)
  c1 <- as.numeric(bds[,1])
  c2 <- as.numeric(bds[,2])
  c3 <- as.numeric(bds[,3])
  ci_df <- data.frame(c1, c2, c3, rperiods)

  return(ci_df)
}

gevfit_MIL <- fevd(MIL_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_MIL)
ci_df <- getcidf(gevfit_MIL)

gevfit_NML <- fevd(NML_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_NML)
ci_df <- getcidf(gevfit_NML)

gevfit_TLG <- fevd(TLG_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_TLG)
ci_df <- getcidf(gevfit_TLG)

#These are the historical thresholds defined from the GEV fit to the three-day sum of the peak flows
ten_yr_event_TLG=58.53738
ten_yr_event_MIL=40.77821
ten_yr_event_NML=58.95369

Now that we have our marginals fit, we need to use them to fit a Gaussian copula. By definition, a copula is a multivariate cumulative distribution function for which the marginal probability distribution of each variable is uniform on the interval [0, 1]. So we need to transform our observations to be pseudo-uniform. To do so, we push the values of [x_{t,1}…x_{t,n}] through the CDF of the respective fitted GEV distribution (the probability integral transform). The transformed marginals are now uniformly distributed between 0 and 1. Let’s term these values [u_{t,1}…u_{t,n}].

#Now we want to fit the copula on our historical flows. First create u's
u_TLG = pgev(TLG_flow_peak_annual$three_day, loc=gevfit_TLG$results$par[1], scale=gevfit_TLG$results$par[2], shape=gevfit_TLG$results$par[3], lower.tail=TRUE)
u_MIL = pgev(MIL_flow_peak_annual$three_day, loc=gevfit_MIL$results$par[1], scale=gevfit_MIL$results$par[2], shape=gevfit_MIL$results$par[3], lower.tail=TRUE)
u_NML = pgev(NML_flow_peak_annual$three_day, loc=gevfit_NML$results$par[1], scale=gevfit_NML$results$par[2], shape=gevfit_NML$results$par[3], lower.tail=TRUE)

#Let's also convert the thresholds to u's for each basin
u_TLG_event = pgev(ten_yr_event_TLG, loc=gevfit_TLG$results$par[1], scale=gevfit_TLG$results$par[2], shape=gevfit_TLG$results$par[3], lower.tail = TRUE)
u_MIL_event = pgev(ten_yr_event_MIL, loc=gevfit_MIL$results$par[1], scale=gevfit_MIL$results$par[2], shape=gevfit_MIL$results$par[3],lower.tail=TRUE)
u_NML_event = pgev(ten_yr_event_NML, loc=gevfit_NML$results$par[1], scale=gevfit_NML$results$par[2], shape=gevfit_NML$results$par[3],lower.tail=TRUE)



Now we address the Gaussian part of the copula. Recall the definition of the Gaussian copula from Dave’s post:

C(u_1, …, u_n) = Φ_R(Φ^{-1}(u_1), …, Φ^{-1}(u_n))

where Φ_R is the joint standard normal CDF with correlation matrix R, whose entries ρ_(i,j) give the correlation between random variables X_i and X_j, and Φ^{-1} is the inverse standard normal CDF.

So we have our u vectors that we now need to push through an inverse standard normal distribution to get Z scores.

#Now takes these u's and push them through a standard normal to get Z scores

Z_TLG=qnorm(u_TLG)
Z_MIL=qnorm(u_MIL)
Z_NML=qnorm(u_NML)

#Let's do the same for the historical events
Z_TLG_event=qnorm(u_TLG_event)
Z_MIL_event=qnorm(u_MIL_event)
Z_NML_event=qnorm(u_NML_event)



The last part that we need for our copula is a Spearman rank correlation matrix that captures the dependence structure of the peak flows across the three basins.

cor_matrix=cor(cbind(Z_TLG,Z_MIL,Z_NML),method = "spearman")


Finally, we have all the components of our copula. What question can we ask now that we have it? How about “What is the likelihood of exceeding the historical 10-year event in all three basins?” We code up the following formula, which expresses this exact likelihood. More details can be found in this very helpful book chapter:
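The formula referenced here appeared as an image in the original post; it is the standard inclusion-exclusion expansion for a joint exceedance probability, reconstructed below to be term-by-term consistent with the R calculation that follows (z_i are the transformed thresholds, Φ is the univariate standard normal CDF, and Φ_{R_{ij}}, Φ_R are the bivariate and trivariate normal CDFs with the corresponding correlation submatrices):

```latex
P(Z_1 > z_1,\; Z_2 > z_2,\; Z_3 > z_3)
  = 1 - \sum_{i=1}^{3} \Phi(z_i)
    + \sum_{1 \le i < j \le 3} \Phi_{R_{ij}}(z_i, z_j)
    - \Phi_{R}(z_1, z_2, z_3)
```

Each pnorm term subtracts a univariate exceedance, each bivariate pmnorm term adds back a pairwise overlap, and the final trivariate pmnorm term removes the triple-counted region.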

We calculate this with the following line and we find an exceedance probability of 0.0564.

#Now calculate the probability of exceeding the historical threshold in all basins

exceedance = 1 - pnorm(Z_TLG_event) - pnorm(Z_MIL_event) - pnorm(Z_NML_event) +
  pmnorm(c(Z_TLG_event,Z_MIL_event), mean=rep(0,2), varcov=cor(cbind(Z_TLG,Z_MIL))) +
  pmnorm(c(Z_TLG_event,Z_NML_event), mean=rep(0,2), varcov=cor(cbind(Z_TLG,Z_NML))) +
  pmnorm(c(Z_MIL_event,Z_NML_event), mean=rep(0,2), varcov=cor(cbind(Z_MIL,Z_NML))) -
  pmnorm(c(Z_TLG_event,Z_MIL_event,Z_NML_event), mean=rep(0,3), varcov=cor(cbind(Z_TLG,Z_MIL,Z_NML)))



Intuitively this makes sense: a 10-year event has a 10% (0.1) probability of being exceeded in any given year, and the joint exceedance of this event across multiple basins must be even less likely, so our lower probability of 0.0564 tracks. Note that you can create synthetic realizations of correlated peak flows by simulating from a multivariate normal distribution with a mean of 0 and the Spearman rank correlation matrix defined above.

Hopefully this was a simple but helpful example to demonstrate the key components of the Gaussian copula. The script and corresponding datasets can be found in this repo.
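As a footnote on the simulation comment above: the post’s code is in R, but the same sampling recipe is easy to sketch in Python with NumPy/SciPy. Everything below uses illustrative placeholder values (correlation matrix and GEV parameters), not the fitted values from the post, and note that SciPy’s genextreme shape parameter c has the opposite sign convention to the evd package’s shape:

```python
# Hedged sketch: sample synthetic correlated 3-day peak flows from a
# Gaussian copula with GEV marginals. All parameter values are placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative correlation structure across the three basins (TLG, MIL, NML)
R = np.array([[1.00, 0.80, 0.70],
              [0.80, 1.00, 0.75],
              [0.70, 0.75, 1.00]])

# 1. Draw correlated standard normal Z-scores, Z ~ N(0, R)
Z = rng.multivariate_normal(mean=np.zeros(3), cov=R, size=1000)

# 2. Push through the standard normal CDF to get pseudo-uniform marginals
U = stats.norm.cdf(Z)

# 3. Invert illustrative GEV marginals to recover flows in physical units
gev_params = [(-0.1, 30.0, 10.0),   # (c, loc, scale), placeholder values
              (-0.1, 25.0, 8.0),
              (-0.1, 32.0, 11.0)]
flows = np.column_stack([stats.genextreme.ppf(U[:, i], c, loc=loc, scale=scale)
                         for i, (c, loc, scale) in enumerate(gev_params)])

print(flows.shape)  # (1000, 3): 1000 synthetic years for the three basins
```

Steps 2-3 are simply the reverse of the fitting workflow above: where fitting pushed observed flows through pgev and qnorm to get Z-scores, simulation pushes sampled Z-scores back through the normal CDF and the inverse GEV CDF.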

Motivation

Over the last year, being in the Reed Research Group, I have been exposed to new ideas more rapidly than I can manage. During this period, I was filling my laptop storage with countless .docx, .pdf, .txt, .html files semi-sporadically stored across different cloud and local storages.

Finding a good method for organizing my notes and documentation has been very beneficial for my productivity, processing of new ideas, and ultimately staying sane. For me, this has been done through Obsidian.

Obsidian is an open-source markdown (.md) based note taking and organization tool, with strengths being:

• Connected and relational note organization
• Rendering code and mathematical (LaTeX) syntax
• Community-developed plugins
• Light-weight
• A nice aesthetic

The fact that all notes are simply .md files is very appealing to me. I don’t need to waste time searching through different local and cloud directories for my notes, and can avoid having OneNote, Notepad, and 6 Word windows open at the same time.

Also, I like having localized storage of the notes and documentation that I am accumulating. I store my Obsidian directory on GitHub, and push-pull copies between my laptop and desktop, giving me the security of having three copies of my precious notes at all times.

I’ve been using Obsidian as my primary notetaking app for a few months now, and it has become central to my workflow. In this post, I show how Obsidian extends Markdown feature utility, helps to organize topical .md libraries, and generally makes the documentation process more enjoyable. I also try to explain why I believe Obsidian is so effective for researchers or code developers.

Note Taking in Markdown

Markdown (.md) text is an efficient and effective way of not just documenting code (e.g., README.md files), but also writing tutorials (e.g., in Jupyter Notebooks), taking notes, and even designing a website (as demonstrated in Andrew’s blog post last week).

In my opinion, the beauty is that only a few syntax tricks are necessary for producing a very clean .md document. This is particularly relevant as a programmer, when notes often require mathematical and code syntax.

I am not going to delve into Markdown syntax. For those who have not written in Markdown, I suggest you see the Markdown Documentation here. Instead, I focus on how Obsidian enables a pleasant environment in which to write .md documentation.

Obsidian Key Features

Below, I hit on what I view as the key strengths of Obsidian:

• Connected and relational note organization
• Rendering code and mathematical (LaTeX) syntax
• Community-developed plugins

Relational Note Organization

If you take a quick look at any of the Obsidian forums, you will come across this word: Zettelkasten.

Zettelkasten, meaning "slip-box" in German and also referred to as a card file, is a note-taking strategy that encourages the use of many highly specific notes connected to one another via references.

Obsidian is designed to facilitate this style of note taking, helping to build what its developers refer to as a "second brain". The goal of this relational note taking is to store not just information about a single topic, but rather a network of connected information.

To this end, Obsidian helps to easily navigate these connections and visualize relationships stored within a Vault (which is just a fancy word for one large folder directory). Below is a screen cap of my note network developed over the last ~4 months. On the left is a visualization of my entire note directory, with a zoomed-in view on the right.

Notice the scenario discovery node, which has direct connections to methodological notes on Logistic Regression, Boosted Trees, PRIM, and literature review notes on Bryant & Lempert 2010, a paper which was influential in advocating for participatory, computer-assisted scenario discovery.

Each of these nodes is then connected to the broader network of notes and documentation.

These relations are easily constructed by linking other pages within the directory, or subheadings within those pages. When writing the notes, you can then click through the links to flip between files in the directory.

Links to Internal Pages and Subheadings

Referencing other pages (.md files) in your library is done with a double square bracket on either side: [[Markdown cheatsheet]]

You can get down to finer resolution and specifically reference various levels of sub-headings within a page by adding a hashtag to the internal link, such as: [[Markdown cheatsheet#Basics#Bold]]

Tags and Tag Searches

Another tool that helps facilitate relational note taking is the use of #tags. Attach a # to any word within any of your notes, and that word becomes connected to other instances of the word throughout the directory.

Once tags have been created, they can be searched. The following Obsidian syntax generates a live list of occurrences of that tag across your entire vault:

query
tag: #scenarios


Which produces the window:

Rendering code and math syntax

Language-Specific Code Snippets

Obsidian will beautifully stylize code snippets using language-specific formatting, and if you don’t agree with the result then you can change your style settings.

A block of code is specified, for some specific language, using the triple-backtick syntax:

language
<Enter code here>


The classic HelloWorld() function can be stylistically rendered in Python or C++:

LaTeX

As per usual, the $ characters are used to render LaTeX equations. Use of single-$ characters will result in in-line equations ($<Enter LaTeX>$), with double-$$ used for centered equations. Obsidian is not limited to short LaTeX equations, and has plugins designed to allow for inclusion of other LaTeX packages or HTML syntax.

$$ \phi_{t,i} = \left\{ \begin{array}{ll} \phi_{t-1,i} + 1 & \text{if } z_t < \text{limit}\\ 0 & \text{otherwise.} \end{array} \right. $$

will produce:

Plugins

Obsidian boasts an impressive 667 community-developed plugins with a wide variety of functionality. A glimpse at the webpage shows plugins that give more control over the visual interface, allow for alternative LaTeX environments, or allow for pages to be exported to various file formats.

Realistically, I have not spent a lot of time working with the plugins. But, if you are someone who likes the idea of a continuously evolving and modifiable documentation environment then you may want to check them out in greater depth.

Conclusion: Why did I write this?

This is not a sponsored post in any way; I just like the app.

When diving into a new project or starting a research program, it is critical to find a way of taking notes, documenting resources, and storing ideas that works for you. Obsidian is one tool which has proven to be effective for me. It may help you.

Best of luck with your learning.

Enhancing Jupyter Notebooks for Teaching

In our research group, we often use Jupyter Notebooks as teaching tools in the classroom and to train new students. Jupyter Notebooks are useful because they integrate code and its output into a single document that also supports visualizations, narrative text, mathematical equations, and other forms of media that often allows students to better contextualize what they are learning.

There are some aspects of Jupyter Notebooks that students sometimes struggle with as they shift from IDEs to this new environment. Many have noted difficulty tracking the value of variables through the notebook, and getting stuck in a mindset of just pressing “Shift+Enter”. Since the advent of Project Jupyter, the community has put out a variety of “unofficial” extensions and widgets that I have found can help enhance the learning experience and make Jupyter Notebooks less frustrating. It’s just not apparent that these extensions exist until you go searching for them. I will demonstrate some below that have been particularly helpful when working with our undergrads, and hopefully they will be helpful for you too!

Variable Inspector

One very easy way to remedy the situation of not having a variable display is to enable the Variable Inspector Extension which allows you to see the value of all of the variables in an external floating window. In your command prompt type:

#Installation
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

#Enable extension
jupyter nbextension enable varInspector/main


Here we are installing a massive library of extensions called Nbextensions, which we will make use of throughout this post. When you open your notebooks, you will now see a small icon at the top that you can click to create a variable inspector window that will reside in the margins of the notebook.

As you step through the notebook, your variable inspector will become populated.

Snippets of Code

The Snippets menu is pretty fantastic for students who are new to coding or for those of us who go back and forth between coding languages and keep having to quickly search how to make arrays or need code for example plots.

We can just enable this extension as follows right in Jupyter Notebook under the Nbextensions tab:

This gif provides an example of some of the options that are available for NumPy, Matplotlib, and Pandas.

If for instance, you want to create a histogram but can’t remember the form of the code, a snippet for this exists under the Matplotlib submenu.

Rubberband (Highlight Multiple Cells)

If you enable the Rubberband extension in the Nbextensions tab, you can highlight and execute multiple cells, which just gives a bit more precision in selecting a subset of cells that you want to run at once. You hold down “Ctrl+Shift” while dragging the mouse. Then you can execute the highlighted cells using “Shift+Enter”.

Execution Time

You can wrap your code in a variety of timing functions or just enable the Execution Time extension in the Nbextensions tab. For every cell that you execute, the elapsed time will now be recorded under the cell. For me, this is particularly useful when you have some cells that take a longer time to execute. If you are trying to create training, it’s helpful for a user to get a breakdown of exactly how long the code should be taking to run (so that they know if the timing is normal or if there is an issue with the code).
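For comparison with the extension's automatic reporting, here is a minimal sketch of timing a cell body by hand with the standard library (the computation being timed is just an arbitrary stand-in):

```python
# Hand-rolled alternative to the Execution Time extension: time a cell body
# explicitly using the standard library.
import time

start = time.perf_counter()
total = sum(i * i for i in range(1_000_000))   # stand-in for a slow computation
elapsed = time.perf_counter() - start

print(f"elapsed: {elapsed:.3f} s")
```

In a notebook you could instead put the IPython magic %%time at the top of a cell, but the extension's advantage is that it records timings for every cell without any per-cell changes.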

Creating a Jupyter Notebook Slideshow with RISE (a reveal.js extension)

The last teaching tool that I want to demonstrate is how to turn your Jupyter Notebook into an interactive slideshow, which has been infinitely more useful to use as a teaching tool than just scrolling through the Jupyter Notebook at a meeting.

First, installation is easy:

pip install RISE


Now, we have to prepare our notebook for a slideshow. Go to View -> Cell Toolbar -> Slideshow. We need to select the Slide Type for each of our cells.

At the top of the screen, we now have a new icon that allows you to enter into slideshow mode when you are ready. We can now step through the slideshow that we have made and interactively execute the coding boxes as well.

I hope these tips help you to have a more pleasant experience while using Jupyter Notebooks. If there are other extensions that you enjoy using regularly, please post below!

Creating a collaborative research group lab manual with Jupyter Books

Motivation

Onboarding new students, staff, or collaborators can be a challenge in highly technical fields. Often, the knowledge needed to run complex experiments or computing workflows is spread across multiple individuals or teams based on their unique experiences, and training of new team members tends to occur in an ad hoc and inefficient manner. These challenges are compounded by the inevitable turnover of students, postdocs, and external collaborators in academic research settings.

Over the years, the Reed Group has developed a large bank of training materials to help streamline the onboarding process for new team members. These materials introduce concepts and tools related to water systems modeling, multi-objective evolutionary algorithms, global sensitivity analysis, synthetic streamflow generation, etc. However, these materials are still spread across a variety of sources (academic papers, GitHub repositories, blog posts, etc.) and team members, and there has been growing recognition of the need to catalogue and compile relevant resources and trainings in a more structured way.

For this reason, we have begun to create a lab manual for the Reed group. This will include a wide variety of information relevant to new students and researchers – everything from training exercises and code snippets, to reading lists and coursework suggestions, to a code of conduct outlining our values and expectations. The goal is for this to be a collaborative, living document created and maintained by students and postdocs. Ideally this will continue to evolve for years along with the evolving state-of-the-art in methodology, software, and literature.

After considering a number of different platforms for constructing websites, we settled on the Jupyter Book package for Python. You can find our lab manual here, and the source code used to create it here – note that this is still very much in development, a skeleton waiting to be fleshed out. In the remainder of this blog post, I will highlight the major elements of a Jupyter Book website, using our skeleton lab manual as an example. Then in a future blog post, I will outline the Continuous Integration and Continuous Delivery (CI/CD) strategy we are using to manage versioning and platform dependency issues across multiple developers.

Intro to Jupyter Book

Jupyter Book is a Python package for creating static websites. The package is built on the popular Sphinx engine used to create documentation for many of your favorite Python packages. Sphinx was also used to create the ebook for “Addressing Uncertainty in MultiSector Dynamics Research”, as described in two recent blog posts by Rohini Gupta and Travis Thurber. The ebook was a source of inspiration for our lab manual and the reason we initially considered Sphinx-based workflows. However, Jupyter Book layers several additional functionalities on top of Sphinx. First, it supports the MyST Markdown language, which is more familiar and intuitive to most researchers than the reStructuredText format favored by Sphinx. Second, it allows pages to be built from executable Jupyter Notebooks, a powerful tool for combining text and equations with formatted code blocks, program output, and generated figures.

The Jupyter Book documentation contains tutorials, examples, and references, and is an excellent resource for anyone looking to build their own site. The documentation itself is, of course, created using the Jupyter Book package, and interested readers can check out the source code here.
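If you want to experiment before committing to a full site, the `jupyter-book` command-line tool can scaffold and build a working demo in a couple of commands. A minimal sketch (the folder name `mydemobook` is just a placeholder):

```shell
# Install Jupyter Book from PyPI
pip install -U jupyter-book

# Scaffold a template book with example _config.yml, _toc.yml, and content pages
jupyter-book create mydemobook/

# Build the static site; the HTML output lands in mydemobook/_build/html
jupyter-book build mydemobook/
```

Poking around the generated template files is a quick way to see how the pieces described below fit together.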

Designing the website structure

The hierarchical structure of a Jupyter Book is defined in a simple YAML-style Table of Contents file, which should be named _toc.yml. Here is the TOC for our lab manual at present:

format: jb-book
root: intro.md
parts:
  - chapters:
      - file: ExamplePages/ExamplePages.md
        sections:
          - file: ExamplePages/mdExample.md
          - file: ExamplePages/nbExample.ipynb
      - file: Resources/Resources.md
        sections:
          - file: Resources/ClusterBasics.md
          - file: Resources/Computing.md
          - file: Resources/Courses.md
          - file: Resources/DataVisualization.md
          - file: Resources/LifeAtCornell.md
          - file: Resources/ReedGroupTools.md
          - file: Resources/WritingProcess.md
          - file: Resources/CitationNetworkandDiscovery.md
      - file: Training/Training.md
        sections:
          - file: Training/Schedule.md
          - file: Training/Reading.md
          - file: Training/LakeProblem.md
          - file: Training/Training_Fisheries_Part1.md
          - file: Training/Linux_MOEAs_HPC.md


The “root” defines the landing page, in this case the intro.md markdown file. That landing page links to three “chapters” called ExamplePages, Resources, and Training. Each chapter has its own landing page as well as multiple child “sections.” Each page can be written either as a Markdown file (.md) or a Jupyter Notebook (.ipynb).

The other important YAML file for all Jupyter Books is _config.yml:

title: Reed group lab manual
author: The Reed Group at Cornell CEE
logo: logo.png

# Force re-execution of notebooks on each build.
# See https://jupyterbook.org/content/execute.html
execute:
  execute_notebooks: force

# Define the name of the latex output file for PDF builds
latex:
  latex_documents:
    targetname: book.tex

# Add a bibtex file so that we can create citations
bibtex_bibfiles:
  - references.bib

# Information about where the book exists on the web
repository:
  url: https://github.com/reedgroup/reedgroup.github.io  # Online location of your book
  path_to_book: docs  # Optional path to your book, relative to the repository root

# Add GitHub buttons to your book
# See https://jupyterbook.org/customize/config.html#add-a-link-to-your-repository
html:
  use_issues_button: true
  use_repository_button: true


We first define our website’s title and author, as well as an image logo to display. The line “execute_notebooks: force” means that we want to reexecute all Jupyter Notebooks each time the site is built (see docs for other options). The url gives the web address where we want to host our site – in this case the GitHub Pages address associated with the GitHub repository for the site. The path_to_book defines “docs” as the folder in the repository where all source code is to be held. Finally, the last two options are used to create buttons at the top of our site that link to the GitHub repository in case readers want to browse the source code or report an issue. For now, we are using the default vanilla style, but there are many ways to customize the structural and aesthetic style of the site. You would need to point to custom style files from this configuration file – see the Jupyter Book gallery for inspiration.
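For reference, `force` is not the only option here: per the Jupyter Book execution docs, `execute_notebooks` also accepts `auto`, `cache`, and `off`. The `cache` setting can speed up local builds considerably, since only notebooks whose content has changed are re-run. A one-line config fragment illustrating the alternative:

```
execute:
  execute_notebooks: cache  # re-run only notebooks whose source has changed since the last build
```

We use `force` because we want every build to prove that all notebooks still execute cleanly.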

Building pages with Markdown and Jupyter Notebooks

Jupyter Book makes it very easy to write new pages using either Markdown or Jupyter Notebooks. For context, here is a screenshot of the site’s homepage:

The main content section for this page is built from the “root” file, intro.md:

# Welcome to our lab manual!

```{warning}
This site is still under construction
```


The purpose of this site is to help new students and collaborators get up to speed on the research methods/tools used by the Reed Group. This page is designed and maintained by other graduate students and post docs, and is intended to serve as a living document.

This manual was created using the Jupyter Books Python package, and is hosted with GitHub Pages. You can find our source code at https://github.com/reedgroup/reedgroup.github.io.

```{tableofcontents}
```


As you can see, this uses a very human-readable and intuitive Markdown-based file structure. Jupyter Book provides simple functionality for warning labels and other emphasis boxes, as well as a Table of Contents that is automatically rendered from the _toc.yml file. The tableofcontents command can be used from anywhere in the hierarchical page tree and will automatically filter to include only children of the current page. The separate sidebar TOC will also expand to show “sections” as you navigate into different “chapters.” For example, here is the Markdown and rendered webpage for the “ExamplePages” chapter:

# Example Pages with JupyterBooks
```{tableofcontents}
```


For more detailed pages, you can also apply standard Markdown syntax to add section headers, bold/italic font, code blocks, lists, Latex equations, images, etc. For example, here is ExamplePages/mdExample.md:

# Markdown example
This is an example page using just markdown

### Subsection 1
Here is a subsection

### Subsection 2
Here is another subsection.

:::{note}
Here is a note!
:::

And here is a code block:


```
e = mc^2
```


And here comes a cute image!

![capybara and friends](capybaraFriends.jpg "Capybara and friends")


Lastly, and most importantly for purposes of building a training manual, we can create pages using Jupyter Notebooks. For example, here are two screenshots of the webpage rendered from ExamplePages/nbExample.ipynb:

As you can see, the Notebook functionality allows us to combine text and equations with rendered Python code. We can also execute Bash, R, or other programs using Jupyter Notebook’s “magic” commands. Note that the Jupyter-based website is not interactive – for that you’ll need Binder, as demonstrated in this blog post by David Gold.

Nevertheless, the Notebook is reexecuted each time we rebuild the website, which should really streamline collaborative lab manual development. For example, once we have developed a code bank of visualization examples (stay tuned!), it will be straightforward to edit the existing examples and/or add new examples, with the rendered visualizations being automatically updated rather than needing to manually upload the new images. Additionally, reexecuting the Notebooks each time we rebuild the site will force us to maintain the functionality of our existing code bank rather than letting portions become obsolete due to package dependencies or other issues.

Next steps

You now have the basic building blocks to create your own lab manual or ebook using a collection of YAML files, Markdown files, and Jupyter Notebooks. The last two critical steps are to actually build the static site (i.e., the HTML files) using Jupyter Book, and then host the site using GitHub Pages. I will demonstrate these steps, as well as our CI/CD strategy based on GitHub Actions, in my next blog post.
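As a rough preview of those two steps, here is one common local build-and-publish recipe. It assumes the book source lives in a docs/ folder (matching the _config.yml above) and uses the third-party ghp-import utility to push the built HTML to a gh-pages branch; treat it as a sketch rather than our exact workflow, which the next post will cover:

```shell
# Build the static site; HTML output goes to docs/_build/html
jupyter-book build docs/

# Install the ghp-import publishing utility
pip install ghp-import

# Copy docs/_build/html onto the gh-pages branch and push it:
#   -n adds a .nojekyll file (so GitHub Pages serves the raw HTML),
#   -p pushes to the remote, -f force-overwrites the branch
ghp-import -n -p -f docs/_build/html
```

GitHub Pages can then be pointed at the gh-pages branch in the repository settings.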