The Building Blocks of Time Series Analysis (plus a demo!)

Recently I’ve been learning to view the world in frequencies and amplitudes instead of time and magnitudes, also known as Fourier-domain time series analysis. It’s become a rather popular method in the field of water supply planning, with applications in forecasting demand and water quality control. While this post will not delve into advanced applications, we will walk through some fundamental time series concepts to ease you into the Fourier Domain. I will then demonstrate how time series analysis can be used to identify long- and short-term trends in inflow data.

Important frequencies

When reviewing time series analysis, there are three fundamental frequencies to remember:

  1. The sampling frequency, f_{s}, is the frequency at which data for a time series is collected. It is equivalent to the inverse of the time difference between two consecutive data points, \frac{1}{\delta t}, and has units of Hertz (Hz). Higher values of f_{s} mean more data points are recorded per second, and therefore better temporal resolution.
  2. The Nyquist frequency, f_{Ny}, is equal to \frac{1}{2} f_{s}. As a rule of thumb, your signal should not contain frequencies above f_{Ny}; equivalently, you should sample at least twice as fast as the highest frequency you want to resolve. Frequencies above f_{Ny} are not sampled densely enough to be represented correctly, and are instead incorrectly observed at ("folded onto") lower frequencies. This phenomenon, where the sampled points do not sufficiently represent the input signal, is called aliasing (Ellis, 2012). A short sketch of aliasing follows this list.
  3. The cutoff frequency, f_{cut}, defines the boundary between the signal frequencies that a filter allows to pass and those it stops. This is a nice segue into the next section:
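To make the first two frequencies concrete, here is a minimal NumPy sketch (with illustrative values, assuming the weekly sampling used in the demo later in this post) showing how a frequency just above f_{Ny} is aliased onto a lower frequency once sampled:

import numpy as np

# Weekly sampling, matching the inflow demo later in this post
delta_t = 7 * 24 * 3600            # seconds between samples
fs = 1 / delta_t                   # sampling frequency (Hz)
f_Ny = fs / 2                      # Nyquist frequency (Hz)

# Once sampled, a signal at f_Ny + df is indistinguishable from one at f_Ny - df
t = np.arange(520) * delta_t       # ten years of weekly sample times
df = 0.2 * f_Ny
above_nyquist = np.cos(2 * np.pi * (f_Ny + df) * t)
its_alias = np.cos(2 * np.pi * (f_Ny - df) * t)
print(np.allclose(above_nyquist, its_alias))   # True: the two sampled series coincide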

Filtering

Filtering does exactly what its name implies: it allows certain signal frequencies to pass through while removing the rest. There are three main types of filters (a short sketch follows this list):

  1. Low-pass filter: A low-pass filter attenuates all frequencies above f_{cut}, allowing lower frequencies to pass through. It behaves like a smoothing operator and works well for reducing signal noise.
  2. High-pass filter: Unlike the low-pass filter, the high-pass filter attenuates all frequencies below f_{cut}. This is a useful filter when you have discontinuous data, as high-pass filtering a signal preserves its discontinuities rather than artificially smoothing them over.
  3. Band-pass filter: This filter combines the properties of the low- and high-pass filters, passing only the frequencies that lie within a specified band and eliminating everything outside it.
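As a rough illustration of all three filter types, here is a hedged SciPy sketch; the cutoff value, placeholder signal, and filter order are illustrative assumptions rather than values from the demo below (which uses MATLAB's equivalent butter() and filter() functions):

import numpy as np
from scipy.signal import butter, filtfilt

# Hypothetical weekly series; fs and f_cut are illustrative values only
fs = 1 / (7 * 24 * 3600)                         # sampling frequency (Hz)
f_cut = fs / 52                                  # pass only cycles slower than ~1 year
x = np.random.default_rng(0).normal(size=520)    # placeholder signal

# Low-pass Butterworth filter: Wn is the cutoff normalized by the Nyquist frequency
b, a = butter(4, f_cut / (fs / 2), btype='low')
x_smooth = filtfilt(b, a, x)                     # smoothed, low-frequency component

# btype='high' or btype='band' give the other two filter types
b_hi, a_hi = butter(4, f_cut / (fs / 2), btype='high')
b_bp, a_bp = butter(4, [f_cut / (fs / 2), 4 * f_cut / (fs / 2)], btype='band')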

Windowing

The core idea behind windowing is that it is possible to obtain more detailed insight into your signal’s properties by dividing it into ‘windows’ of time. Windowing is useful for reducing artifacts caused by the implicit periodic extension of a finite signal, which can blur the distinction between signal and noise. For particularly noisy signals, windowing can also improve the signal-to-noise ratio, making it easier to distinguish the actual signal from white noise. This is possible because the large ‘main lobe’ of the window (in both the time and frequency domain) amplifies the lower-frequency, higher-amplitude signal while dampening the higher-frequency, lower-amplitude noise. A brief example of generating and applying a window follows Figure 1.

Figure 1: The Blackman-Harris windowing function.
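Below is a minimal SciPy sketch of generating a Blackman-Harris window (the same window family used in the MATLAB demo later in this post) and tapering a data segment with it; the segment length and placeholder data are assumptions for illustration:

import numpy as np
from scipy.signal.windows import blackmanharris

# Taper one ~4-year chunk of a weekly series before transforming it
N_win = 209                                   # ~4 years of weekly samples (illustrative)
window = blackmanharris(N_win, sym=False)     # periodic form, suited to spectral analysis

segment = np.random.default_rng(1).normal(size=N_win)   # placeholder data segment
tapered = segment * window                               # windowed segment, ready for an FFT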

Spectrograms

There is a key limitation when analyzing a time series purely in the Fourier domain (and, conversely, purely in the time domain): there is no way to see when or where a specific frequency occurs, or which frequencies dominate at any specific time. A spectrogram (computed via the Short-Time Fourier Transform, STFT) of a signal addresses this issue by consolidating three types of information into one visualization:

  1. The frequencies inherent to the signal
  2. How the frequencies change over time
  3. The relative strength of the frequencies (the frequency content or “power” of the frequency).

It allows us to view changes in the signal across both time and frequency, overcoming the Fourier transform’s limitation of only being applicable to stationary time series. However, it still applies a fixed window size, as we will see in a little bit, and so cannot fully capture all the different frequencies that characterize a signal. This is where wavelets become useful; please refer to Rohini’s post on using wavelets on ENSO El Niño data here if you’re interested! On that note, let’s see how we can bring it all together with a quick demonstration!
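For reference, here is a hedged SciPy sketch of computing a spectrogram with a Blackman-Harris window, analogous to the MATLAB spectrogram() call used later in this post; the window length, overlap, and placeholder data are illustrative assumptions:

import numpy as np
from scipy.signal import spectrogram
from scipy.signal.windows import blackmanharris

fs = 1 / (7 * 24 * 3600)                          # weekly sampling frequency (Hz)
x = np.random.default_rng(2).normal(size=5200)    # placeholder for an inflow series

N_win = 209                                       # ~4-year window of weekly samples
f, t, Sxx = spectrogram(x, fs=fs,
                        window=blackmanharris(N_win, sym=False),
                        noverlap=N_win // 2)
# Sxx[i, j] is the power at frequency f[i] within the window centered at time t[j]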

Demonstrating basic time series analysis methods on an inflow timeseries

This demonstration will use MATLAB, and we begin by first loading and plotting a time series of weekly inflow data. Feel free to replace the inflow file with any time series you have on hand!

%%
% Read in and plot inflows of Jordan Lake in the Research Triangle

inflows_jl = readmatrix('jordan_lake_inflows.csv');
inflows_jl = inflows_jl(1,:);

% define sampling rate, Nyquist frequency, number of samples, sampling period, and frequency series
fs_inflow = 1/(7*24*3600);  
N = length(inflows_jl);
delta_t = 1/fs_inflow;
freqseries_inflow = (1/(N*delta_t))*(0:(N-1));
fNy = fs_inflow/2;

%% Plot the inflow time series
figure(1)
plot(inflows_jl, 'Color','#104E8B');
grid on
xlabel('Time (weeks)');
ylabel('Inflow (MGD)');
xlim([0 5200]); ylim([0 15e4]);
title('Jordan Lake inflow timeseries');

The output figure should look similar to this:

Figure 2: Jordan Lake weekly inflow time series over 95 years.

Next, we Fourier transform the inflow time series using MATLAB’s fft() function, take the magnitude of the result with abs(), and convert the amplitude to decibels with mag2db() when plotting.

%% Perform and plot the FFT of inflow
fft_jl = abs(fft(inflows_jl));

figure(2)
loglog(freqseries_inflow, mag2db(fft_jl), '-ko', 'Color','#104E8B', 'MarkerSize',2);
grid on
xlabel('Frequency (Hz)');
ylabel('Amplitude (dB)');
title("FFT of Jordan Lake");

This returns the following plot:

Figure 3: The inflow time series in the Fourier domain.

Note the large spikes in Figure 3. These indicate frequencies that contribute strongly to the inflow signal and are associated with large inflow events. In this blog post, we will focus on the two leftmost spikes: the four-year cycle and the annual cycle, shown below. These cycles were identified by converting their peak frequencies into periods (in seconds), and then into weeks and years; a quick worked conversion follows Figure 4.

Figure 4: The inflow time series in the Fourier domain with the 4-year and annual cycle indicated.
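The conversion itself is simple arithmetic; here is a short Python sketch using one of the cutoff frequencies recorded in the code below:

# Convert a spectral peak's frequency (Hz) into a period in weeks and years
f_peak = 8.40951e-9                          # one of the low-pass cutoffs used below
period_seconds = 1 / f_peak
period_weeks = period_seconds / (7 * 24 * 3600)
period_years = period_weeks / 52.1429
print(period_weeks, period_years)            # ~196.6 weeks, ~3.8 years: the "4-year" cycle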

We also want to confirm that these frequencies characterize the time series. To verify this, we apply a low-pass Butterworth filter:

%% Number of weeks in a year
num_weeks = 52.1429;

% Record the low-pass cutoff frequencies identified from the FFT peaks
fcut1_jl = 8.40951e-9;                          % ~4-year cycle
years1_jl = conv2weeks(1/fcut1_jl)/num_weeks;   % conv2weeks() converts a period in seconds to weeks

fcut2_jl = 3.16974e-8;                          % ~annual cycle
years2_jl = conv2weeks(1/fcut2_jl)/num_weeks;

% Set up the filters and apply them to the inflow time series
[b1_jl, a1_jl] = butter(6, fcut1_jl/(fs_inflow/2));
inflows_filtered_jl1 = filter(b1_jl, a1_jl, inflows_jl);

[b2_jl, a2_jl] = butter(6, fcut2_jl/(fs_inflow/2));
inflows_filtered_jl2 = filter(b2_jl, a2_jl, inflows_jl);

%% Plot the filtered inflow time series 
figure(3)
plot(inflows_jl, 'Color','#104E8B',  'DisplayName', 'Original', 'LineWidth',2);
xlim([0 5200]);
ylim([0 15e4]);
xlabel('Time (Years)');
ylabel('Inflows (MGD)');
title('Jordan Lake 4-year cycle')
legend();
hold on
plot(inflows_filtered_jl1,'Color','#D88037', 'MarkerSize',2, 'DisplayName', '4-year cycle', 'LineWidth',2);
hold off
ticklocations = 0:520:5112;
xticks(ticklocations);
xticklabels({'0', '10', '20', '30', '40', '50', '60', '70', '80', '90'});

% Plot their Fourier counterparts
fft_win1_jl = abs(fft(inflows_filtered_jl1));

figure(4)
loglog(freqseries_inflow(1:length(fft_jl)), fft_jl, 'Color','#104E8B', 'DisplayName','Original');
legend({'Original','4-year cycle'},'Location','southeast');
title(strcat("Low-pass filter (f_{cut} = ", string(fcut1_jl), " Hz)"));
xlabel("Frequency (Hz)");
ylabel("Amplitude");
hold on
loglog(freqseries_inflow(1:length(fft_win1_jl)), fft_win1_jl, 'Color','#D88037', 'DisplayName','4-year cycle');
hold off

Replacing inflows_filtered_jl1 and fft_win1_jl (the 4-year cycle) with their annual-cycle counterparts will result in Figure 5 below.

Figure 5: The filtered inflow time series in both the time and Fourier domains.

As expected, each low-pass filter effectively attenuates all frequencies above its respective f_{cut} value. The frequencies identified in Figures 3 and 4 can also be said to characterize this inflow time series, as the peaks of the filtered series align with those of the original time series.

Next, we construct a spectrogram for this inflow time series. Here, we will use the Blackman-Harris window, which has a relatively wide main lobe and small side lobes. This enables it to amplify the frequencies that most significantly characterize inflow, while suppressing less-significant ones. This window can be generated using the MATLAB blackmanharris() function, as shown below:

%% Define window sizes (number of weekly samples per window)
N1_jl = round(4*num_weeks);   % 4-year window
N2_jl = round(num_weeks);     % Annual window

% Generate the Blackman-Harris windows
win1_jl = blackmanharris(N1_jl, 'periodic');  % 4-year cycle
win2_jl = blackmanharris(N2_jl, 'periodic');  % Annual cycle

Finally, we are ready to apply the spectrogram() function to generate the spectrogram for this inflow time series! In the code below, we set up the necessary values and generate the figure:

%% Define overlap size
Noverlap1_jl = N1_jl/2;  % 4-year cycle
Noverlap2_jl = N2_jl/2;  % Annual cycle

%% Plot the inflow spectrogram using the 4-year window (Jordan Lake)
spectrogram(inflows_jl, win1_jl, Noverlap1_jl, N1_jl, fs_inflow, 'yaxis');
colormap(flipud(parula));  
clim([100,150]);
title('Jordan Lake 4-year cycle (Blackman-Harris)');

By replacing the values associated with the 4-year cycle with those of the annual cycle in the above code, the following figures can be generated:

Figure 6: Spectrograms generated using the 4-year and annual window respectively.

In Figure 6, darker blue indicates more powerful fluctuations (high-inflow events). However, their low frequency implies that, in this time series, such events are fairly rare. On the other hand, lighter blue to yellow indicates inflow events that occur at higher frequencies but are less powerful; while they occur more often, they do not influence the time series as strongly. We can also see from these two figures that, over the mid- and long term, both low- and high-frequency events become less powerful, implying a decrease in inflow and increasingly rare large-rainfall events of the kind that historically produced large inflows.

Conclusion

In this post, I first introduced a few basic concepts and methods related to time series analysis and the Fourier domain. Next, we applied these concepts to a simple analysis of an inflow time series using MATLAB. Nevertheless, this post barely scratches the surface of Fourier transforms and time series analysis as a whole. I hope it gave you a flavor of what these tools enable you to do, and that it piqued your interest in exploring the topic further!

QPPQ Method for Streamflow Prediction at Ungauged Sites – Python Demo

Streamflow Predictions in Ungauged Basins

Predicting streamflow at ungauged locations is a classic problem in hydrology which has motivated significant research over the last several decades (Hrachowitz, Markus, et al., 2013).

There are numerous different methods for performing predictions in ungauged basins, but here I focus on the common QPPQ method.

Below, I describe the method and further down I provide a walkthrough demonstration of QPPQ streamflow prediction in Python.

The supporting code can be found on my GitHub here: QPPQ_Streamflow_Prediction_Tutorial.

QPPQ-Method for Streamflow Prediction

Fennessey (1994) introduced the QPPQ method for streamflow estimation at ungauged locations.

The QPPQ method is commonly used and encouraged by the USGS, and is described at length in their publication Estimation of Daily Mean Streamflow for Ungaged Stream Locations… (2016).

QPPQ consists of four key steps:

  1. Estimate an FDC for the target catchment of interest, \hat{FDC}_{pred}.
  2. Identify the K donor gage locations nearest to the target point.
  3. Transfer the timeseries of nonexceedance probabilities (\mathbf{P}) from the donor site(s) to the target.
  4. Use the estimated FDC for the target to map the donated nonexceedance timeseries, \mathbf{P}, back to streamflow.

To limit the scope of this tutorial, let’s assume that an estimate of the FDC at the target site, \hat{FDC}_{pred}, has already been determined through some other statistical or observational study.

Then the QPPQ method can be described more formally. Given an ungauged location with an estimated FDC, \hat{FDC}_{pred}, and a set of observed streamflow timeseries \mathbf{q_i} at K neighboring sites, such that:

Q_{obs} = [\mathbf{q_1}, \mathbf{q_2}, ..., \mathbf{q_k}]

With corresponding K FDCs at the observation locations:

FDC = [FDC_1, FDC_2, ..., FDC_k]

The FDCs are used to convert each observed streamflow timeseries, \mathbf{q_i}, to a non-exceedance probability (NEP) timeseries, \mathbf{p_i}.

\displaystyle FDC_i : \mathbf{q_{i}} \to \mathbf{p_i}

We can then perform a weighted-aggregation of the non-exceedance probability timeseries to estimate the non-exceedance timeseries at the ungauged location. It is most common to apply an inverse-squared-distance weight to each observed timeseries such that:

\mathbf{p_{pred}} = \sum_{i=1}^{k} w_i \mathbf{p_i}

where w_i \propto 1/d_i^2, d_i is the distance from observation site i to the ungauged location, and the weights are normalized such that \sum_{i=1}^{k} w_i = 1. A small numerical sketch of this weighting step follows.
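Here is a minimal NumPy sketch of the inverse-squared-distance aggregation; the distances and NEP values are hypothetical placeholders, not taken from the tutorial data:

import numpy as np

# Hypothetical distances (km) from the ungauged site to K = 3 donor gages
d = np.array([4.4, 13.2, 18.4])
w = 1.0 / d**2
w = w / w.sum()                      # normalize so the weights sum to 1

# p_obs: K x T array of donor non-exceedance probability timeseries (placeholder values)
p_obs = np.array([[0.20, 0.80, 0.55],
                  [0.30, 0.70, 0.45],
                  [0.10, 0.90, 0.60]])
p_pred = w @ p_obs                   # weighted aggregation -> predicted NEP timeseries
print(p_pred)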

Finally, the estimated FDC at the ungauged location, \hat{FDC}_{pred} is used to convert the non-exceedance timeseries to streamflow timeseries:

\hat{FDC}_{pred} : \mathbf{p}_{pred} \to \mathbf{q}_{pred}


Looking at this formulation, and the sequence of transformations that take place, I hope it is clear why the method is rightfully called the QPPQ method.

This method is summarized well by the following graphic, taken from the USGS Report on the topic:

In the following section, I step through an implementation of this method in Python.

Tutorial

All Python scripts used in this tutorial can be found in my GitHub repository: QPPQ_Streamflow_Prediction_Tutorial.

In order to run the scripts in this tutorial yourself, you will need to install a few Python libraries, listed in requirements.txt. Running pip install -r requirements.txt from the command line, while inside a local copy of the repository, will install all of these packages.

Data retrieval

I collected USGS streamflow data from N gages using the HyRiver suite for Python.

If you would like to learn more about hydro-environmental data acquisition in Python, check out my old post on Efficient hydroclimatic data accessing with HyRiver for Python.

The script used to retrieve the data is available here. If you would like to experiment with this method in other regions, you can change the region variable on line 21, which specifies the corners of a bounding-box within which gage data will be retrieved:


# Specify time-range and region of interest
dates = ("2000-01-01", "2010-12-31")
region = (-108.0, 38.0, -105.0, 40.0)

Above, I specify a region west of the Rocky Mountains in Colorado. Running generate_streamflow_data.py, I found 73 USGS gage locations (blue circles in the figure below).

Fig: Locations of USGS gages used in this demo.

QPPQ Model

The file QPPQ.py contains the method outlined above, defined as the StreamflowGenerator class object.

The StreamflowGenerator has four key methods (or functions):

class StreamflowGenerator():
    def __init__(self, args):  
	    ...
	def get_knn(self):
		...
	def calculate_nep(self):
		...
	def interpolate_fdc(self):
		...
	def predict_streamflow(self):
		...
		return predicted_flow

The method get_knn finds the K gage locations nearest to the prediction location, and stores both the distances to these observation locations (self.knn_distances) and their indices (self.knn_indices).

    def get_knn(self):
        """Find distances and indices of the K nearest gage locations."""
        distances = np.zeros((self.n_observations))
        for i in range(self.n_observations):
            distances[i] = geodesic(self.prediction_location, self.observation_locations[i,:]).kilometers
        self.knn_indices = np.argsort(distances, axis = 0)[0:self.K].flatten()
        self.knn_distances = np.sort(distances, axis = 0)[0:self.K].flatten()
        return

The next method, calculate_nep, calculates the NEP of a flow at an observation location at time t, or P(Q \leq q_t)_{i}.

    def calculate_nep(self, KNN_Q, Q_t):
        "Finds the NEP at time t based on historic observatons."
        # Get historic FDC
        fdc = np.quantile(KNN_Q, self.fdc_neps, axis = 1).T
        # Find nearest FDC value
        nearest_quantile = np.argsort(abs(fdc - Q_t), axis = 1)[:,0]
        nep = self.fdc_neps[nearest_quantile]
        return nep	

The interpolate_fdc method performs a linear interpolation of the discrete FDC to estimate the flow for a given NEP.

    def interpolate_fdc(self, nep, fdc):
        "Performs linear interpolation of discrete FDC at a NEP."
        tol = 0.0000001
        if nep == 0:
            nep = np.array(tol)
        sq_diff = (self.fdc_neps - nep)**2

        # Index of nearest discrete NEP
        ind = np.argmin(sq_diff)

        # Handle edge-cases
        if nep <= self.fdc_neps[0]:
            return fdc[0]
        elif nep >= self.fdc_neps[-1]:
            return fdc[-1]

        if self.fdc_neps[ind] <= nep:
            flow_range = fdc[ind:ind+2]
            nep_range = self.fdc_neps[ind:ind+2]
        else:
            flow_range = fdc[ind-1:ind+1]
            nep_range = self.fdc_neps[ind-1:ind+1]

        slope = (flow_range[1] - flow_range[0])/(nep_range[1] - nep_range[0])
        flow = flow_range[0] + slope*(nep - nep_range[0])   # interpolate between the bracketing FDC points
        return flow

Finally, predict_streamflow(self, *args) combines these other methods and performs the full QPPQ prediction.

    def predict_streamflow(self, args):
        "Run the QPPQ prediction method for a single locations."
        self.prediction_location = args['prediction_location']
        self.prediction_fdc = args['prediction_fdc']
        self.fdc_quantiles = args['fdc_quantiles']
        self.n_predictions = self.prediction_location.shape[0]

        ### Find nearest K observations
        self.get_knn()
        knn_flows = self.historic_flows[self.knn_indices, :]

        ### Calculate weights as inverse square distance
        self.wts = 1/self.knn_distances**2

        # Normalize weights
        self.norm_wts = self.wts/np.sum(self.wts)

        ### Create timeseries of NEP at observation locations
        self.observed_neps = np.zeros_like(knn_flows)
        for t in range(self.T):
            self.observed_neps[:,t] = self.calculate_nep(knn_flows, knn_flows[:,t:t+1])

        ### Calculate predicted NEP timeseries using weights
        self.predicted_nep = np.zeros((self.n_predictions, self.T))
        for t in range(self.T):
            self.predicted_nep[:,t] = np.sum(self.observed_neps[:,t:t+1].T * self.norm_wts)

        ### Convert NEP timeseries to flow timeseries
        self.predicted_flow = np.zeros_like(self.predicted_nep)
        for t in range(self.T):
            nep_t = self.predicted_nep[0,:][t]
            self.predicted_flow[0,t] = self.interpolate_fdc(nep_t, self.prediction_fdc)

        return self.predicted_flow

The predict_streamflow method is the only function called directly by the user, while get_knn, calculate_nep, and interpolate_fdc are all used internally by predict_streamflow.

Generate streamflow predictions

The script run_QPPQ_predictions.py runs the model and produces predictions at a test site. First, the data generated by generate_streamflow_data.py is loaded:

import numpy as np
from QPPQ import StreamflowGenerator

### Load Data
gage_locations = np.loadtxt('./data/observed_gage_locations.csv', delimiter = ',')
observed_flows = np.loadtxt('./data/observed_streamflow.csv', delimiter = ',')

The FDCs at each site are estimated at 200 discrete quantiles:

fdc_quantiles = np.linspace(0,1,200)
observed_fdc = np.quantile(observed_flows, fdc_quantiles, axis =1).T

A random test site is selected, and removed from the observation data:

# Select a test site and remove it from observations
test_site = np.random.randint(0, gage_locations.shape[0])

# Store test site
test_location = gage_locations[test_site,:]
test_flow = observed_flows[test_site, :]
test_site_fdc = observed_fdc[test_site, :]

# Remove test site from observations
gage_locations = np.delete(gage_locations, test_site, axis = 0)
observed_flows = np.delete(observed_flows, test_site, axis = 0)

When initializing the StreamflowGenerator, we must provide an array of gage location data (longitude, latitude), historic streamflow data at each gage, and K, the number of nearest neighbors to include in the timeseries aggregation.

I’ve set up the StreamflowGenerator class to receive these inputs as a dictionary, such as:

# Specify model prediction_inputs
QPPQ_args = {'observed_locations' : gage_locations,
            'historic_flows' : observed_flows,
            'K' : 20}

# Initialize the model
QPPQ_model = StreamflowGenerator(QPPQ_args)

Similarly, the prediction arguments are provided as a dictionary to the predict_streamflow function:

# Specify the prediction arguments
prediction_args = {'prediction_location': test_location,
                    'prediction_fdc' : test_site_fdc,
                    'fdc_quantiles' : fdc_quantiles}
                    
# Run the prediction
predicted_flow = QPPQ_model.predict_streamflow(prediction_args)

I made a function, plot_predicted_and_observed, which allows for a quick visual check of the predicted timeseries compared to the observed timeseries:

from plot_functions import plot_predicted_and_observed
plot_predicted_and_observed(predicted_flow, test_flow)

Which shows some nice quality predictions!

Fig: Comparison of observed streamflow and streamflow generated through QPPQ.

One benefit of working with the StreamflowGenerator as a Python class object is that we can retrieve the internal variables for further inspection.

For example, I can call QPPQ_model.knn_distances to retrieve the distances to the K nearest neighbors used to predict the flow at the ungauged location. In this case, the gages used to make the prediction above were located 4.44, 13.23, 18.38, … kilometers away.

Caveat and Conclusion

It is worth highlighting one major caveat to this example: the FDC used for the prediction site was perfectly known from the historic record. In most cases, the FDC will not be known when making predictions in ungauged basins. Rather, estimates of the FDC will need to be used, and thus the prediction quality shown above represents somewhat of an ideal case when performing a QPPQ in ungauged basins.

There are numerous methods for estimating FDCs at ungauged sites, including the Generalized Pareto distribution approximation proposed by Fennessey (1994) or, more recently, the use of neural networks, as highlighted in Worland et al. (2019).

Hopefully this tutorial helped to get you familiar with a foundational streamflow prediction method.

References

Fennessey, Neil Merrick. "A Hydro-Climatological Model of Daily Stream Flow for the Northeast United States." Order No. 9510802, Tufts University, 1994. Ann Arbor: ProQuest. Web. 21 Nov. 2022.

Hrachowitz, Markus, et al. "A decade of Predictions in Ungauged Basins (PUB) – a review." Hydrological Sciences Journal 58.6 (2013): 1198-1255.

Razavi, Tara, and Paulin Coulibaly. "Streamflow prediction in ungauged basins: review of regionalization methods." Journal of Hydrologic Engineering 18.8 (2013): 958-975.

Stuckey, M.H., 2016, Estimation of daily mean streamflow for ungaged stream locations in the Delaware River Basin, water years 1960–2010: U.S. Geological Survey Scientific Investigations Report 2015–5157, 42 p., http://dx.doi.org/10.3133/sir20155157.

Worland, Scott C., et al. "Prediction and inference of flow duration curves using multioutput neural networks." Water Resources Research 55.8 (2019): 6850-6868.

Creating a collaborative lab manual pt. 2: Automated build & deploy with GitHub Actions

In my last blog post, I introduced the collaborative lab manual that the Reed Group is beginning to develop. Although the content of this site is still very much a work in progress, the goal is for this lab manual to help streamline the process of onboarding new students, postdocs, and collaborators by acting as a centralized location for readings, training materials, and coding resources relevant to our group’s research. As described in the previous post, the site itself is built using the Jupyter Books package for Python, which allows for Markdown documents and executable Jupyter Notebooks to be translated into static HTML websites.

However, the distributed nature of this project, with multiple collaborators developing material for the site from their own computers, initially led to some software dependency challenges. As the project has evolved, we have worked to overcome these challenges using GitHub Actions. This post will describe our GitHub Actions workflow, which automatically rebuilds the site in a consistent way after each new update, without inheriting specific software dependencies from individual developers, and then automatically deploys the updated site using GitHub Pages. All code used to build and deploy the lab manual can be found in this GitHub repository.

GitHub Actions workflow

This post will build on Travis Thurber’s previous blogpost on Continuous Integration/Continuous Delivery (CI/CD), GitHub Actions, and GitHub Pages, so I would recommend checking that out first if you have not already. As discussed in that post, GitHub Actions are automated processes that can be triggered by particular events within a GitHub repository (e.g., updates to the code base). The instructions for executing these processes are written in a YAML script, .github/workflows/deploy.yml. GitHub automatically looks for scripts in this directory to trigger automated GitHub Actions.

### github actions to build & deploy book, following https://github.com/executablebooks/cookiecutter-jupyter-book/blob/main/.github/workflows/deploy.yml

name: deploy

on:
  # Trigger the deploy on push to main branch
  push:
    branches:
      - main
  schedule:
    # jupyter-book is updated regularly, let's run this deployment every month in case something fails
    # <minute [0,59]> <hour [0,23]> <day of the month [1,31]> <month of the year [1,12]> <day of the week [0,6]>
    # https://pubs.opengroup.org/onlinepubs/9699919799/utilities/crontab.html#tag_20_25_07
    # https://crontab.guru/every-month
    # Run cron job every month
    - cron: '0 0 1 * *'

jobs: 
  # This job deploys the example book
  deploy-example-book:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest]
        python-version: [3.8]
    steps:
    - uses: actions/checkout@v2

    # Install python
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v1
      with:
        python-version: ${{ matrix.python-version }}

    # install virtual environment with caching, so only updates when requirements.txt changes,
    # based on https://github.com/marketplace/actions/restore-or-create-a-python-virtualenv#custom_virtualenv_dir
    # Note: virtual environment by default will be created under ~/.venv
    - uses: syphar/restore-virtualenv@v1
      id: cache-virtualenv
      with:
        requirement_files: docs/requirements_py.txt
    - uses: syphar/restore-pip-download-cache@v1
      if: steps.cache-virtualenv.outputs.cache-hit != 'true'

    # install python dependencies    
    - name: Install python dependencies
      run: pip install -r docs/requirements_py.txt
      if: steps.cache-virtualenv.outputs.cache-hit != 'true'

    # update kernel of all jupyter notebooks to .venv to match GH action environment
    - name: Update Jupyter Notebook kernels 
      run: python docs/update_jupyter_kernels.py .venv |
           python -m ipykernel install --user --name=.venv

    # install R
    - name: Install R
      uses: r-lib/actions/setup-r@v2
      with:
        use-public-rspm: true
    # install R dependencies
    - name: Install R dependencies
      run: sh docs/install_R_dependencies.sh

    # Build the example book
    - name: Build book
      run: jupyter-book build --all docs/

    # Deploy html to gh-pages
    - name: GitHub Pages action
      uses: peaceiris/actions-gh-pages@v3.6.1
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}
        publish_dir: docs/_build/html
        publish_branch: gh-pages

I initially borrowed from the example Jupyter Books deployment workflow that can be found here. This Action (named on Line 3) has a number of steps, which will be outlined sequentially. First, in Lines 5-16 (the on: block), we establish the set of conditions that trigger automated execution of the Action: (1) the Action will be triggered each time an update is pushed to the main branch of the repository; and (2) the Action will be triggered on the first of each month, regardless of whether any new updates have been pushed to the repository, which should help ensure that software dependencies remain up to date over time.

Next, in Lines 18-27, we specify that the Action should be executed on an Ubuntu virtual machine (latest available version). In Lines 29-48, we instruct the Action to install Python 3.8 on the virtual machine, and then install all Python packages listed in the file docs/requirements_py.txt. Standardizing the operating system and software dependencies using a virtual machine, as opposed to building and deploying locally on individual developers’ machines, helps to avoid situations wherein one developer accidentally breaks functionality introduced by another developer due to differences in their local software environments.

One downside to rebuilding a new virtual machine each time the Action runs is that this can be rather slow, especially for complicated software environments, e.g., those with many Python package installations. However, one way to make this process more efficient is to create a virtual environment that can be cached and restored. This is implemented in Lines 35-48, using the restore-virtualenv action from the GitHub Actions Marketplace. Now the Python environment only needs to be rebuilt when the requirements_py.txt file has changed since the last commit; otherwise, the cached virtual environment is restored in order to save time.

Another issue we encountered relates to the naming of virtual environments within Jupyter Notebooks. When you create a new Notebook on a personal computer, you will generally need to specify the Python "Kernel" manually using the graphical user interface. The Kernel can either be the generic Python interpreter you have installed (e.g., Python 3.8), or else a specific virtual environment that you have installed for a particular purpose (e.g., .venv_spatial). Once you have selected a Kernel, the Notebook will expect the same Kernel to be available as the default whenever the Notebook is run. This can cause errors when working on teams where different contributors have different naming conventions for their environments. For example, consider a situation where Contributor A develops a Jupyter Notebook-based page on spatial analysis, using the local virtual environment .venv_spatial. Contributor B then pulls Contributor A’s changes, including the new Notebook on spatial analysis, and then adds her own new Notebook on machine learning using the environment .venv_ML. When Contributor B tries to build the Jupyter Book (more on this process later), the spatial analysis notebook will exit with an error saying that the .venv_spatial environment cannot be found. Contributor B could manually change the Notebook Kernel to .venv_ML or another local environment, but this is not an ideal solution for long-term collaborative interoperability. To circumvent this issue, Lines 50-53 execute the script docs/update_jupyter_kernels.py:

### script to update the kernel for all Jupyter notebooks to user-specified name, given as argv
### this will be run by GitHub Action to set Kernel to .venv before deploying website. Can also be run by user if their env has different name.

import glob
import json
import sys

venv_name = sys.argv[1]
nbpaths = glob.glob('*.ipynb') + glob.glob('*/*.ipynb') + glob.glob('*/*/*.ipynb')

for nbpath in nbpaths:
    ### load jupyter notebook into dict based on its json format
    with open(nbpath) as f:
        nbdict = json.load(f)
    ### update kernel metadata
    nbdict['metadata']['kernelspec']['display_name'] = venv_name
    nbdict['metadata']['kernelspec']['name'] = venv_name
    ### write updated dict to json/ipynb
    with open(nbpath, 'w') as f:
        json.dump(nbdict, f, indent=2)

The purpose of this Python script is to reassign all Jupyter Notebooks within the repository to use the same Kernel name, .venv (supplied as a command line argument). The key insight is that Jupyter Notebooks are actually stored on disk in structured JSON format (try opening one in a text editor to verify this). The script above leverages this fact to programmatically replace the existing Kernel name with our new name, .venv. By running this command in GitHub Actions, we can automatically standardize the Kernel for all Notebooks to use the virtual environment that we created in Lines 35-48 (.venv is the default environment name used by the restore-virtualenv action). Note that this should not impact the Kernel names in the repository itself, since the script is only being run on the GitHub Actions virtual machine. However, if individual contributors experience local errors from incompatible Kernel names, they can either change the Kernel manually in the Notebook GUI (the old-fashioned way), or else run the update_jupyter_kernels.py script on their own machine with their unique Python virtual environment name given as a command line argument.

Returning to the deploy.yml file, the next step is to install the R programming language (Lines 55-59) and any necessary R libraries (Lines 60-62). The latter step is accomplished by running the script docs/install_R_dependencies.sh.

#!/bin/bash
### script for installing R dependencies from requirements file
### simplified version of code here: https://stackoverflow.com/questions/54534153/install-r-packages-from-requirements-txt-files
while IFS=" " read -r package;
do
        Rscript -e "install.packages('"$package"')";
done < "docs/requirements_r.txt"

This Bash script reads a list of dependencies from the file docs/requirements_r.txt and installs each library on top of the base R installation on our virtual machine. Unfortunately, the R installations are quite slow, and I am not aware of any virtual environment caching action for R on GitHub Actions.

Returning again to the deploy.yml file, the next step in Lines 64-66 is to build the static HTML site using the Jupyter Books functionality. The command in Line 66 ( jupyter-book build --all docs/ ) is also the command used to build a site locally when individual contributors want to test their work before pushing changes to the main repository. In this case, this command will create a separate folder of HTML files which can be opened in a browser.

However, when running on the virtual machine through GitHub Actions, the new HTML files are not yet being hosted anywhere. The final set of commands in Lines 68-74 accomplish this by restructuring and organizing the file contents on the virtual machine (including the new HTML files) to match the structure that is expected by GitHub Pages. The updated files are then committed and pushed to a separate branch of the GitHub repository called gh-pages. This gh-pages branch should only be altered by GitHub Actions, as opposed to being developed directly by humans like the main branch.

Finally, because gh-pages is set as the source branch for our GitHub Pages site, the updates pushed to this branch by the GitHub Action will automatically deploy the updated HTML files to our lab manual webpage.

To verify that all of these steps on the virtual machine completed successfully, we can check the Action log on GitHub. This shows that each step completed as expected without errors, along with the time to complete each step. If errors do occur, each of these steps can be expanded to show the errors and other terminal outputs in more detail.

Overall, this workflow based on GitHub Actions is helping us to streamline our processes for collaborative development of a group lab manual. Our hope is that this workflow will continue to adapt over time based on our evolving needs, much like the living document of the lab manual itself.