More on subplots with Matplotlib

I got a little ahead of myself with the title of my last post, “Everything you want to know about subplots in Python’s Matplotlib”. While the information in that post lets you do quite a lot, there is in fact more you might want to know. In this post I’ll go over two more subplot tools that are helpful for designing informative and attractive subplots. First, I’ll discuss working with colorbars, a seemingly minor task that can be surprisingly time consuming. Then I’ll discuss a brand-new feature of Matplotlib, the subplot_mosaic interface. This feature is still in its testing phase, but it will likely become the standard way of making subplots in the future.

Formatting colorbars with constrained_layout

Colorbars play an important role in data visualization. You may often use a common color scale to link multiple views of the same data set, or to contrast two data sets. Making a colorbar in Matplotlib is fairly easy, but unless you use the right tools, making the colorbar fit into the overall graphic can be unexpectedly difficult. Here’s an example of creating a single colorbar for four different subplots. The fig.colorbar() function allows you to easily add a colorbar to the set of subplots. Unfortunately, with the default settings, this code will shrink the two subplots next to the colorbar disproportionately.

from matplotlib import pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import numpy as np

# create some sample data
x = np.linspace(-10.0, 10.0, 1000)
y = np.linspace(-10.0, 10.0, 1000)
X, Y = np.meshgrid(x, y)
Z1 = X + Y
Z2 = X - Y
Z3 = X*Y
Z4 = X**2+Y**2

# create a figure object
fig, axes = plt.subplots(2,2, figsize=(8,8))
im1 = axes[0,0].imshow(Z1, cmap='BrBG')
im2 = axes[0,1].imshow(Z2, cmap='BrBG')
im3 = axes[1,0].imshow(Z3, cmap='BrBG')
im4 = axes[1,1].imshow(Z4, cmap='BrBG')

cbar = fig.colorbar(im1, ax=axes[:, 1], shrink=0.8)

One way you might try to fix this is with the tight_layout() function, which can help align subplots. However, this makes things worse, because the colorbar confuses the algorithm that tight_layout uses to arrange the axes objects. The result will look like this (and produce a warning):
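
For reference, the ill-fated attempt would be appended to the code above like this (a minimal sketch; fig is the figure created earlier):

# appending this to the colorbar example above triggers the warning:
# tight_layout cannot account for the colorbar axes and distorts the grid
fig.tight_layout()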

Luckily, there is an alternative to tight_layout called constrained_layout, which uses a constraint solver to optimize subplot placement. Constrained layout should be enabled when the figure object is created, as demonstrated below. Applying constrained layout to this set of subplots fixes the problem and creates a nice-looking set of plots.

# create a figure object
fig, axes = plt.subplots(2,2, figsize=(8,8), constrained_layout=True)
im1 = axes[0,0].imshow(Z1, cmap='BrBG')
im2 = axes[0,1].imshow(Z2, cmap='BrBG')
im3 = axes[1,0].imshow(Z3, cmap='BrBG')
im4 = axes[1,1].imshow(Z4, cmap='BrBG')

cbar = fig.colorbar(im1, ax=axes[:, 1], shrink=0.8)

I should note that we can edit the colorbar axes object just like any of the other axes objects: adding labels, adjusting the tick marks, etc. For more on colorbars, see the documentation here.
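
For instance, a few common colorbar adjustments might look like the following (a minimal sketch; the label text and tick values are placeholders chosen for the Z1 data plotted above):

cbar.set_label('Z value')               # add a label along the colorbar
cbar.set_ticks([-20, -10, 0, 10, 20])   # place ticks at chosen values
cbar.ax.tick_params(labelsize=8)        # shrink the tick labels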

The subplot_mosaic interface

In my last post I discussed the Gridspec interface, which allows you to manually configure a custom grid of subplots. While Gridspec can create complex configurations of subplots, manually adjusting the grid can become complicated for intricate layouts. Matplotlib recently introduced a new feature, subplot_mosaic, which provides a more intuitive interface for configuring subplots. subplot_mosaic has a simple, streamlined interface that lets you lay out subplots from a list (or string) of labels, and it returns those subplots as a dictionary keyed by label. It’s important to note that this feature is still in testing and (at the time of this posting) is not yet supported by many distributions, such as Anaconda. For more examples and full documentation, see the official release page.
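
As a quick illustration, here is a minimal sketch of the interface, reusing the arrays from the colorbar example above (this assumes a Matplotlib version recent enough to include subplot_mosaic; the panel names are arbitrary labels that become the dictionary keys):

fig, ax_dict = plt.subplot_mosaic([['left', 'upper right'],
                                   ['left', 'lower right']],
                                  constrained_layout=True)
ax_dict['left'].imshow(Z1, cmap='BrBG')           # tall panel spanning both rows
ax_dict['upper right'].imshow(Z2, cmap='BrBG')
ax_dict['lower right'].imshow(Z3, cmap='BrBG')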

Spatial statistics (Part 3): Geographically Weighted (GW) models

Geographically weighted (GW) models are useful when there is non-stationarity across a spatial region, i.e., when a single global model cannot represent the local variations across the region. Instead, locally weighted regression coefficients, estimated from data within a specified distance of each location, can be used to adjust the global values. In this blog post, I am going to introduce the R package “GWmodel”, which handles this procedure and offers additional functionality, such as a geographically weighted principal components analysis that can be used as an exploratory tool for evaluating spatial heterogeneity in the data. It also provides some GW summary statistics, which I will cover in this post.

The spatial weighting function is the most important part of GW modeling, as it defines the spatial dependency between observations. We can define a weighting matrix, with one entry for each pair of data point and target location, to indicate the geographical weight of each data point at each location. Users have to specify the type of distance, the kernel function, and the bandwidth to build this matrix. We can consider different methods of distance calculation (Euclidean, Manhattan, great-circle, or generalized Minkowski distance) and commonly used kernel functions (Gaussian, exponential, box-car, bi-square, tri-cube).

Gaussian and exponential kernels are continuous functions of the distance between two observation points, while the box-car, bi-square, and tri-cube kernels are discontinuous: observations farther away than the specified distance (bandwidth) receive zero weight and are effectively excluded. For both types of kernel, the bandwidth can be specified either as a fixed distance or as a fixed (adaptive) number of local data points; for the continuous kernels, however, every observation receives some weight, so the effective local sample size equals the full sample size.
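
To give a sense of what the weighting matrix looks like, here is a small base-R sketch (illustrative only, not GWmodel code) that builds fixed-bandwidth Gaussian and bi-square weights from a Euclidean distance matrix; the coordinates and bandwidth are made up:

set.seed(1)
coords <- cbind(runif(100, -120, -110), runif(100, 45, 50))  # hypothetical lon/lat points
d  <- as.matrix(dist(coords))                                # Euclidean distance matrix
bw <- 2                                                      # example fixed bandwidth

w_gaussian <- exp(-0.5 * (d / bw)^2)                 # continuous: every point gets some weight
w_bisquare <- ifelse(d < bw, (1 - (d / bw)^2)^2, 0)  # discontinuous: zero weight beyond bw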

We can examine the potential local relationships between the variables by applying the summary statistics function gwss(), which computes a GW mean, standard deviation, measure of skew, and Pearson’s correlation coefficient at each location. In addition to these basic summaries, we can compute robust statistics, in which the effects of outliers on the local statistics are reduced: GW medians, inter-quartile ranges, and quantile imbalances. Local bivariate summary statistics are also available in basic and robust forms (Pearson’s and Spearman’s correlations, respectively). I am going to use this function to explore some of these statistics. The sample data, similar to that used in my previous blog post, is available here. First we need to convert it to a spatial dataset with coordinates by specifying the latitude and longitude columns:

library(sp)    # provides SpatialPointsDataFrame
data <- read.table("---your path---/data_new.csv", header = T)
sample_dataset <- SpatialPointsDataFrame(data[, 1:2], data)

Now, I am going to calculate the summary statistics for a few variables using three different kernels. For this function we need to specify the bandwidth; ideally, it would be estimated by applying cross-validation across a range of bandwidths to reach the most accurate predictions.

library("GWmodel")
sample_dataset_bx <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"), kernel = "boxcar", adaptive = TRUE, bw = 300, quantile = TRUE)
sample_dataset_bs <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"),  kernel = "bisquare", adaptive = TRUE, bw = 300, quantile = TRUE)
sample_dataset_gu <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"),  kernel = "gaussian", adaptive = TRUE, bw = 300, quantile = TRUE)

As an example, we can compare the basic measures of local variability in yield based on the three kernels.

library("RColorBrewer")
spplot(sample_dataset_bx$SDF, "WW_Yield_LSD", key.space = "right",col.regions = brewer.pal(8, "Set1") ,cuts = c(345,525,705,885,1065,1245,1425,1605,1770),  main = "GW Standard Deviations for Yield (boxcar)")

Or we can plot the basic local correlation between yield and soil water profile using a box-car kernel. The graphs below present the same concept for all three kernels.

mypalette= c("#FFFFCC","#C7E9B4","#7FCDBB","#41B6C4","#1D91C0","#225EA8","#0C2C84")
spplot(sample_dataset_bx$SDF, "Corr_WW_Yield.Soil_water", key.space = "right",col.regions = mypalette, cuts=c(-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),
        main = "GW correlations: WW Yield and Soil-water Profile (boxcar)")

As we see in the standard deviation graphs, yield appears highly variable. It looks like the bi-square kernel with an adaptive bandwidth of 300 points (~26% of the data) is more efficient than the other two kernels, and the relationship between yield and soil-water profile is non-stationary.

Next, I am going to compare the robust GW correlations between yield and soil-water profile, using a bi-square kernel, with the basic ones we just created, this time with a different color scheme for better visualization:

spplot(sample_dataset_bs$SDF, "Corr_WW_Yield.Soil_water", key.space = "right", cuts=c(-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),
       main = "GW correlations: WW Yield and Soil-water Profile (bisquare_basic)")
spplot(sample_dataset_bs$SDF, "Spearman_rho_WW_Yield.Soil_water",key.space = "right",cuts=c(-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),
        main = "GW correlations: WW Yield and Soil-water Profile (bisquare_robust)")

Principal component analysis (PCA) is another type of analysis we can apply to our multivariate data to find linear combinations of variables that reveal the main sources of variation. The “GWmodel” package provides the functionality to account for spatial heterogeneity in the PCA. Here is an example of the commands for basic and robust PCA. Before that, we need to standardize our independent variables by re-scaling them so they have similar magnitudes and therefore equal importance in the analysis.

scaled_dataset <- scale(as.matrix(sample_dataset@data[,3:10]))
#basic
pca_basic <- princomp(scaled_dataset, cor = F)
(pca_basic$sdev^2 / sum(pca_basic$sdev^2))*100    # percentage of total variance (PTV)
Comp.1       Comp.2       Comp.3       Comp.4       Comp.5       Comp.6 
5.956296e+01 1.891630e+01 1.202148e+01 6.965907e+00 2.219094e+00 3.114074e-01 
Comp.7       Comp.8 
2.586703e-03 2.653528e-04 

#robust (covMcd comes from the "robustbase" package)
library(robustbase)
R.COV <- covMcd(scaled_dataset, cor = F)
pca.robust <- princomp(scaled_dataset, covmat = R.COV, cor = F)
(pca.robust$sdev^2 / sum(pca.robust$sdev^2))*100   # percentage of total variance (PTV)
pca.robust$loadings

   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8 
42.758379 32.184879 11.967754  5.399829  4.168437  1.731579  1.190092  0.599050 

With bw.gwpca() we can automatically select the bandwidth for the GW PCA. The function uses a cross-validation approach to find the optimal bandwidth. However, we need to decide the number of components (k) to include in that procedure. We also need to convert our scaled dataset to a spatial data frame.

Coords <- as.matrix(cbind(sample_dataset$LON, sample_dataset$LAT))
scaled_dataset.spdf <- SpatialPointsDataFrame(Coords,as.data.frame(scaled_dataset))
bw.gwpca.basic <- bw.gwpca(scaled_dataset.spdf,vars = colnames(scaled_dataset.spdf@data), k = 4, robust = FALSE,adaptive = TRUE)
#bw.gwpca.basic = 986
bw.gwpca.robust <- bw.gwpca(scaled_dataset.spdf,vars=colnames(scaled_dataset.spdf@data), k = 4, robust = TRUE, adaptive = TRUE)
#bw.gwpca.robust = 767

Once the bandwidth is estimated, we can use gwpca()  to calibrate the basic and robust GW PCA fits. It should be noted that we use all of the components in the fitted model at this step.

gwpca.basic <- gwpca(scaled_dataset.spdf,vars = colnames(scaled_dataset.spdf@data), bw = bw.gwpca.basic, k = 8,robust = FALSE, adaptive = TRUE)
gwpca.robust <- gwpca(scaled_dataset.spdf,vars = colnames(scaled_dataset.spdf@data), bw = bw.gwpca.robust, k = 8,robust = TRUE, adaptive = TRUE)

Now, as an example, we can visualize how the data dimensionality varies spatially for the first two components by extracting the percentage of total variance (PTV) they explain at each location.

var_pca_basic <- (rowSums(gwpca.basic$var[, 1:2])/rowSums(gwpca.basic$var))*100
sample_dataset$var_pca_basic <- var_pca_basic
var_pca_robust <- (rowSums(gwpca.robust$var[, 1:2])/rowSums(gwpca.robust$var))*100
sample_dataset$var_pca_robust <- var_pca_robust
spplot(sample_dataset, "var_pca_basic", key.space = "right",col.regions = brewer.pal(8, "YlGnBu"), cuts=8, main = "PTV for local components 1 to 2 (basic)")
spplot(sample_dataset, "var_pca_robust", key.space = "right",col.regions = brewer.pal(8, "YlGnBu"), cuts=8, main = "PTV for local components 1 to 2 (robust)")

The differences between these two plots show the effect of local multivariate outliers.

References:

Gollini, I., Lu, B., Charlton, M., Brunsdon, C., Harris, P., 2015. GWmodel: An R Package for Exploring Spatial Heterogeneity Using Geographically Weighted Models. Journal of Statistical Software 63, 1–50. https://doi.org/10.18637/jss.v063.i17

Lu, B., Harris, P., Charlton, M., Brunsdon, C., Nakaya, T., Murakami, D., Gollini, I., 2020. GWmodel: Geographically-Weighted Models.

Introducing the R Package “biascorrection”

For a variety of reasons, we need hydrological models for our short- and long-term predictions and planning. However, it is no secret that these models always suffer from some degree of bias. This bias can stem from many different, often interacting sources: biases in the underlying model assumptions, missing processes, model and calibration parameters, and imperfections in the input data (Beven and Binley, 1992).

The question of how to use models, given all these uncertainties, has been an active area of research for at least 50 years and will probably remain so for the foreseeable future, but going through that is not the focus of this blog post.

In this post, I explain a technique called bias correction that is frequently used in an attempt to improve model predictions, and I introduce an R package for bias correction that I recently developed, called “biascorrection.” Although most of the examples in this post involve hydrological models, the arguments and the R package might be useful in other disciplines, for example with atmospheric models, which have been one of the hotspots of bias-correction applications (for example, here, here, and here). The reason is that the algorithm follows a series of simple mathematical procedures that can be applied to other questions and research areas.

What Do I Mean by Bias Correction?

Bias correction in the context of hydrological modeling usually means “manual” improvement of model simulations to ensure that they match the observed data. Here’s a simple example: Let’s say we know that our model tends to underestimate summer flow by 30%. Then we might reason that to improve our predictions, we could add 30% to our simulated summer flows. In this blog post, I use a more sophisticated method, but this example should give you an idea of what I’m talking about.

Bias Correction using Quantile Mapping

Quantile mapping is a popular bias-correction method that has been used in a variety of applications. In this method, we first construct the quantiles of the observed and simulated data. Then, for each simulated value, we find its simulated quantile and replace it with the value of the closest observed quantile.

Figure 1 – A simplified workflow of the quantile mapping technique
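
To make the mapping step concrete, here is a small, self-contained R sketch of empirical quantile mapping for a single month; the flow values are made up, and this only illustrates the idea, not the implementation inside the package:

set.seed(1)
obs <- rgamma(200, shape = 2, scale = 50)   # hypothetical observed monthly flows
sim <- rgamma(200, shape = 2, scale = 35)   # hypothetical simulated flows (biased low)

probs <- seq(0, 1, by = 0.01)
q_obs <- quantile(obs, probs)               # observed quantiles
q_sim <- quantile(sim, probs)               # simulated quantiles

# replace each simulated value with the observed value at the same quantile
sim_bc <- approx(x = q_sim, y = q_obs, xout = sim, rule = 2)$y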

We generally follow these steps to do the bias correction (see here):

  • Monthly Quantiles

We need to have two monthly time series of observed and simulated streamflow. If both series use daily time steps, we must aggregate the daily values to monthly values first, to create the monthly quantiles. We then sort the observed and simulated streamflows and assign each value to a quantile. This process generates streamflow quantiles for each month (January, February, etc.).

  • Monthly Bias Correction

When the quantiles are ready, we can start from the first month of the simulated results, determine which quantile its values belong to, and then go to the same quantile of the observed values and use that instead of the simulated one. This creates a monthly bias-corrected time series.

  • Annual Adjustment

Hamlet and Lettenmaier (1999; also here) argue that monthly bias correction can dramatically change the magnitude of annual streamflow predictions. Although hydrologic models usually perform poorly at monthly time steps, they are pretty good at capturing annual variations, so we tend to trust their annual totals. We therefore calculate the annual difference between the bias-corrected and simulated flows and apply it to each individual month. This way, the monthly variations stay consistent with the recorded streamflow while the model still determines what the average annual flow looks like.

  • Disaggregation to Daily

After we apply the annual adjustment, we can use simulated or historical observed values to disaggregate the monthly time series to a daily one. The “biascorrection” package provides two methods for doing this: (1) rescaling the raw simulated daily time series to match the monthly bias-corrected values; for example, if the total simulated streamflow for a month is half of the bias-corrected value for that month, the disaggregation module multiplies the raw daily simulated values for that month by two. (2) Sampling from the daily observed historical time series; in this case, the model uses KNN to find some of the closest months in the historical period (in terms of average monthly values) and randomly selects one of them. A numerical sketch of the annual adjustment and the scaling option is shown below.
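
The sketch uses made-up values and a simple ratio-based formulation, purely for illustration; it is not the internal code of the biascorrection package:

monthly_sim <- c(30, 28, 45, 80, 150, 220, 180, 120, 70, 50, 40, 33)  # simulated, one year
monthly_bc  <- c(35, 30, 50, 95, 170, 260, 210, 140, 80, 55, 42, 35)  # after quantile mapping

# annual adjustment: rescale the bias-corrected months so the annual volume
# matches what the model originally simulated
monthly_bc_adj <- monthly_bc * sum(monthly_sim) / sum(monthly_bc)

# disaggregation option (1): scale the raw simulated daily flows of a month so
# they sum to that month's adjusted bias-corrected value (June shown here)
daily_sim_june <- rep(monthly_sim[6] / 30, 30)   # flat daily pattern, for illustration
daily_bc_june  <- daily_sim_june * monthly_bc_adj[6] / sum(daily_sim_june)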

In some situations in large river basins, several upstream stations contribute to a downstream station. Bias correction can add or remove water from the system, which can cause spatial inconsistencies and water-budget problems. To ensure that the basin-wide water balance isn’t violated in these cases, one starts from the furthest downstream station and moves upstream, making sure that the total water generated upstream of each station plus the incremental flow between stations sums to the total downstream flow. This adjustment is not included in the package.

In climate change studies, if the historical, simulated, and observed quantiles do not differ widely from the projected future scenarios, we can use the historical quantiles. However, if the future values are fundamentally different from the historical time series, this might not be justifiable. In such a case, synthetic generation of future data may be the way to go. The R package doesn’t include this either.

There are just two quick disclaimers:

  1. The R package has been tested and seems to perform well, but its complete soundness is not guaranteed.
  2. This blog post and the R package only introduce bias correction as a common practice; I do not endorse it as a remedy for all the problems of hydrological models. Readers should keep in mind that there are serious criticisms of bias correction (for example, here); I will discuss some of them in the following section.

Arguments Against Bias Correction

One of the main advantages of hydrological models is that they can simultaneously and, arguably, consistently simulate different water and energy balance processes. However, bias correction manually perturbs streamflow without taking into account its effects on other components of the system. Therefore, it takes away one of the main advantages of hydrological models. In some cases, this can even distort climate change signals.

The other problem is that bias correction only tries to match the overall, aggregate statistics of the simulated flow to the observed flow, even though streamflow has many more attributes. For example, current bias-correction algorithms can systematically ignore extremes that occur at daily to weekly time steps.

The “biascorrection” Package

I recently developed an R package that can be used for bias correction in simulated streamflow. The package has four main functions. Its workflow is consistent with the four bias-correction steps described above: (1) quantile creation, (2) monthly quantile mapping, (3) annual adjustment, (4) disaggregation to daily. The package doesn’t have a prescribed unit, and it can be used in different applications that require bias correction.

How to Install the “biascorrection” Package

The package is available on GitHub (here) and it can be installed using the following command:

devtools::install_github("keyvan-malek/biascorrection")

You can also go to my “blog” folder on GitHub to download simulated and observed streamflow datasets at the inlet of the Arrow dam in British Columbia (a.k.a. Keenleyside Dam). The simulated flow was generated using the Variable Infiltration Capacity (VIC) model, and the observed flow comes from the Bonneville Power Administration’s No Regulation-No Irrigation streamflow datasets.

observed_input<-read.table("sample_data/Arrow_observed.txt", header = T)
simulated_input<-read.table("sample_data/Arrow_simulated.txt", header = T)

Note that the two datasets have different starting and ending dates; I used them intentionally to show how the biascorrection package handles such datasets.

I plotted the overlapping period of the two datasets to demonstrate the difference between them: the simulated dataset tends to underestimate streamflow during low-flow periods and overestimate it during high-flow periods.

Figure 2 – Observed vs simulated inflow to Arrow dam

Monthly Bias-Corrections

To run the monthly bias-correction function, you first need to define the starting and ending dates of your observed and simulated time series:

start_date_observed<-"1929-01-01"
end_date_observed<-"2007-12-31"
start_date_simulated<-"1979-01-01"
end_date_simulated<-"2015-12-31"

You also need to define the following two conditions:

time_step<-"day"
date_type<-"JY" ## Water year (WY) or Julian Year (JY)

Finally, we can use the following to calculate the monthly bias-corrected flow:

observed_flow <- observed_input$ARROW_obs
simulated_flow <- simulated_input$ARROW_sim

df_bc_month <- bias_correct_monthly(observed_flow, simulated_flow, start_date_observed, end_date_observed, start_date_simulated, end_date_simulated, time_step, date_type)

Note that the observed and simulated inputs to the bias_correct_monthly function are vectors, not data frames. Here is how bias correction changes the simulated streamflow. As you can see in the following figure, the bias-corrected flow no longer seems to have the underestimation problem during low-flow periods.

Figure 3- Monthly VIC simulated flow vs. bias-corrected flow

Annual Adjustment

You can use the following command to return the average annual flow back to what the model originally simulated. Note that the inputs to this function are the outputs of the monthly function.

df_bc_annual<-bias_correct_annual(df_bc_month$simulated, df_bc_month$bias_corrected, start_date_simulated, end_date_simulated)

Daily Disaggregation

The following function first uses the monthly function, then applies the annual adjustment, and finally disaggregates the monthly streamflow to daily. The package has two options for converting monthly to daily flow: (1) it can multiply the simulated streamflow of each month by its bias-correction coefficient, or (2) it can use the k-nearest-neighbors (KNN) method to find the closest month in the observed record.

disaggregation_method<-"scaling_coeff" # "scaling_coeff" or "knn"

df_bc_daily <- bias_correct_daily(observed_flow, simulated_flow, start_date_observed, end_date_observed, start_date_simulated, end_date_simulated, time_step, date_type, disaggregation_method)

Here is how the entire bias-correction procedure affects the simulated streamflow:

Figure 4 – Simulated vs. bias-corrected flow after annual adjustment and daily disaggregation

Known Issues and Future Plans

  • Currently the biascorrection package only accepts complete years. For example, if your year starts in January it has to end in December; it cannot run into, say, the following February.
  • I am thinking about adding more functionality to the biascorrection package in a few months. Some built-in post-processing options, built-in example datasets, and at least one more bias-correction technique are among the candidates.
  • For the next version, I am also thinking about releasing the package through CRAN.

How to automate scripts on a cluster

There are several reasons why you might need to schedule or automate your scripts on a personal machine or a cluster:

  • You’re waiting for a job to finish before submitting another
  • You’d like to automate regular backups or cleanups of your data (e.g., move new data to another location or remove unnecessary output files)
  • You need to submit jobs to get around node limitations (e.g., you’d like to spread out the submissions over several days)
  • You need to retrieve regularly updated data (e.g., you have a model that uses daily precipitation data and you’d like to automatically collect them every day)

Cron is a utility program on Unix operating systems that lets you schedule or repeat such tasks. There’s a crontab file associated with every user on a cluster, where you input all the information needed to schedule and automate your tasks. Note that not all clusters automatically allow their users to run cron jobs[1]; for example, I can use it on the Reed Group’s Cube cluster, but not on XSEDE’s Comet.

To edit the crontab file associated with your user, type the following in your command line:

crontab -e

This will open your crontab file in a text editor (such as Vim), where you can make changes. To simply view your current crontab without editing it, run:

crontab -l

Crontab syntax is made up of two parts: the timer, indicating when to run, and the command to run.


The timer accepts five fields, indicating the time and day for the command to run:

  • Minute — minute of the hour, from 0 to 59
  • Hour — hour of the day, from 0 to 23
  • Day of the month — day of the month, from 1 to 31
  • Month — month of the year, from 1 to 12
  • Day of the week — day of the week, from 0 to 7 (0 and 7 both mean Sunday)

For example, the following would execute script.sh on January 2nd at 9:00 AM:

0 9 2 1 * /home/user/scripts/script.sh

Special characters are naturally very useful here, as they allow multiple execution times or ranges:

Asterisk (*) — to match all values of a field, for example, run the script every day at midnight:

0 0 * * * /home/user/scripts/script.sh

Comma (,) — to use more than one scheduling parameter in a field, for example, run the script every day at midnight and 12PM:

0 0,12 * * * /home/user/scripts/script.sh

Slash (/) — to create predetermined time intervals, for example, run the script every four hours:

0 */4 * * * /home/user/scripts/script.sh

Hyphen (-) — to specify a range of values in a field, for example, run the script every minute during the first 10 minutes of every hour, every day:

0-10 * * * * /home/user/scripts/script.sh

Hyphens and slashes can be combined, for example, to run a script every 5 minutes during the first 30 minutes of every hour, every day:

0-30/5 * * * * /home/user/scripts/script.sh

Last (L) — this character can only be used in the day-of-the-month and day-of-the-week fields to specify the last occurrence of something, for example the last day of the month (whose date varies):

0 9 L * * /home/user/scripts/script.sh

or, to specify constructs such as “the last Friday” of every month:

0 9 * * 5L /home/user/scripts/script.sh

Weekday (W) — this character is only allowed in the day-of-the-month field and is used to pick the weekday closest to a given day of the month. For instance, “15W” tells cron to run the script on the weekday nearest to the 15th of the month: if the 15th is a Saturday, the script will be executed on Friday the 14th; if the 15th is a Sunday, on Monday the 16th; and if the 15th is a weekday, on that same day:

0 0 15W * * /home/user/scripts/script.sh

Hash (#) — this character is only allowed in the day-of-week field and is used to specify constructs such as the second Friday of every month:

0 0 * * 5#2 /home/user/scripts/script.sh

Lastly, if you’d like to be notified whenever a script is executed, you can set the MAILTO parameter to your email address.

The important thing to remember when running cron on a cluster (as opposed to your own machine) is that it launches a shell with a new, clean environment (i.e., without the environment variables that are automatically set when you log in to an interactive shell), so it will likely not recognize some commands or know where your modules are. This can easily be addressed by sourcing your .bashrc or .bash_profile from your home directory before running anything. You also need to remember that cron starts in your home directory, so you should either specify the absolute paths of the scripts to be executed or change directories before executing them.

For example my crontab file on the Reed Group cluster looks like this:

#!/bin/bash
MAILTO=myemail@cornell.edu
00 10 * * * . $HOME/.bashrc; cd /directory/where/my/project/is; git pull; sbatch ./script.sh
30 10 * * * . $HOME/.bashrc; cd /directory/where/my/project/is; git add . ; git commit -m 'fetched data'; git push

This does the following:
Every day at 10 am, it sources my .bashrc so the shell knows all my environment variables, changes to my project directory, and pulls any new updates to that project from git. It then submits a script using sbatch. I get an email at the same time containing the text that would have appeared in my command line had I executed these commands in an interactive node (i.e., the git information and a line saying Submitted batch job xxxxx).
Then, every day at 10:30 am, it commits and pushes the new data back to git.


[1] If you’re just a regular user on a cluster, you might need to request access. If you have root privileges (say, on a personal machine), you need to edit your cron allow and deny files:

/etc/cron.allow
/etc/cron.deny