Nonstationary stochastic watershed modeling

In this post, I will describe the motivation for and implementation of a nonstationary stochastic watershed modeling (SWM) approach that we developed in the Steinschneider group during the course of my PhD. This work is in final revision and should be published in the next month or so. This post will attempt to distill key components of the model and their motivation, saving the full methodological development for those who’d like to read the forthcoming paper.

SWMs vs SSM/SSG

Before diving into the construction of the model, some preliminaries are necessary. First, what are SWMs, what do they do, and why use them? An SWM is a framework that combines a deterministic, process-based watershed model (think HYMOD, SAC-SMA, etc.; we’ll refer to these as DWMs from here forward) with a stochastic model that captures its uncertainty. The stochastic part of this framework can be used to generate ensembles of SWM simulations that both represent the hydrologic uncertainty and are less biased estimators of the streamflow observations (Vogel, 2017).

Figure 1: SWM conceptual diagram

SWMs were developed to address challenges to earlier stochastic streamflow modeling/generation techniques (SSM/SSG; see for instance Trevor’s post on the Thomas-Fiering SSG, and Julie’s post and Lillian’s post on other SSG techniques), arguably the most important of which is how to formulate them under non-stationarity. Since SSMs are statistical models fitted directly to historical data, any attempt to implement them in a non-stationary setting requires strong assumptions about what the streamflow response might look like under an alternate forcing scenario. This is not to say that such an approach is not useful or valid for exploratory analyses (see for instance Rohini’s post on synthetic streamflow generation to explore extreme droughts). SWMs attempt to address this issue of non-stationarity by using DWMs in their stochastic formulation, which lends some ‘physics-based’ cred to their response under alternate meteorological forcings.

Construction of an SWM

Over the years, there have been many SWM or SWM-esque approaches devised, ranging from simple autoregressive models to complex Bayesian approaches. In this work, we focus on a relatively straightforward SWM approach that models the hydrologic predictive uncertainty directly and simply adds random samples of it to the DWM simulations. The assumption here is that predictive uncertainty integrates all of the traditional component modeling uncertainties (input, parameter, model/structural), so adding it back in injects all of these uncertainties into the SWM simulations at once (Shabestanipour et al., 2023).

Figure 2: Uncertainty components

In this straightforward approach, the fitting and parameter estimation of the DWM is accomplished first (and separately) via ‘standard’ fitting procedures; for instance, parameter optimization to maximize Nash-Sutcliffe Efficiency (NSE). Subsequently, we develop the stochastic part of the model on the predictive uncertainty that remains, which in this case is defined simply as the difference between the target observations and the DWM predictions. This distribution of differenced errors is the ‘predictive uncertainty distribution’ or ‘predictive errors’ that forms the target of our stochastic model.

Challenges in modeling predictive uncertainty

Easy, right? Not so fast. There is a rather dense and somewhat unpalatable literature (except for the masochists out there) on the subject of hydrologic uncertainty that details the challenges in modeling these sorts of errors. Suffice it to say that they aren’t well behaved: they are typically heteroscedastic, skewed, heavy-tailed, and autocorrelated. Any model we devise for these errors must be able to manage these bad behaviors.

So, what if we decide that we want to try to use this SWM thing for planning under future climates? Certainly the DWM part can hack it. We all know that lumped, conceptual DWMs are top-notch predictors of natural streamflow… At the least, they can produce physically plausible simulations under alternate forcings (we think). What of the hydrologic predictive uncertainty then? Is it fair or sensible to presume that some model we constructed to emulate historical uncertainty is appropriate for future hydrologic scenarios with drastically different forcings? My line of rhetorical questioning should clue you in on my feelings on the subject. YES, of course! ‘Stationarity is immortal!’ (Montanari & Koutsoyiannis, 2014).

Towards a hybrid, state-variable dependent SWM

No, actually, there are a number of good reasons why this statement might not hold for hydrologic predictive uncertainty under non-stationarity. You can read the paper for the laundry list. In short, hydrologic predictive uncertainty of a DWM is largely a reflection of its structural misrepresentation of the true process. Thus, the historical predictive uncertainty that we fit our model to is a reflection of that structural uncertainty propagated through historical model states under historical, ‘stationary’ forcings. If we fundamentally alter those forcings, we should expect to see model states that do not exist under historical conditions. The predictive errors that result from these fundamentally new model states are thus likely to not fall neatly into the box carved out by the historical scenarios.

Figure 3: Structural uncertainty

To bring this back to the proposition of a nonstationary SWM approach: the intrinsic link between a model’s structure and its predictive uncertainty raises an interesting prospect. Could there be a way to leverage a DWM’s structure to understand its predictive uncertainty? Well, I hope so, because that’s the premise of this work! What I’ll describe in the ensuing sections is the implementation of a hybrid, state-variable dependent SWM approach. ‘Hybrid’ because it couples machine learning (ML) and traditional statistical techniques. ‘State-variable dependent’ because it uses the timeseries of model states (described later) as the means to infer the hydrologic predictive uncertainty. I’ll refer to this as the ‘hybrid SWM’ for brevity.

Implementation of the hybrid SWM

So, with backstory in hand, let’s talk details. The remainder of this post will describe the implementation of this hybrid SWM. This high-level discussion of the approach supports a practical training exercise I put together for the Steinschneider group at the following public GitHub repo: https://github.com/zpb4/hybrid-SWM_training. This training also introduces a standard implementation of a GRRIEN repository (see Rohini’s post). Details of implementing the code are contained in the ‘README.md’ and ‘training_exercise.md’ files in the repository. My intent in this post is to describe the model implementation at a conceptual level.

Model-as-truth experimental design

First, in order to address the problem of non-stationary hydrologic predictive uncertainty, we need an experimental design that can produce it. There is a very real challenge here of not having observational data from significantly altered climates to compare our hydrologic model against. We address this problem by using a ‘model-as-truth’ experimental design, where we fit one hydrologic model (‘truth’ model) to observations, and a second hydrologic model (‘process’ model) to the first truth model. The truth model becomes a proxy for the true, target flow of the SWM modeling procedure, while the process model serves as our proposed model, or hypothesis, about that true process. Under this design, we can force both models with any plausible forcing scenario to try to understand how the predictive uncertainty between ‘truth’ and ‘process’ models might change.

Figure 4: Conceptual diagram of ‘model-as-truth’ experimental design

For the actual work, we consider a very simple non-stationary scenario in which we apply a 4°C shift to the temperature forcing data, which we refer to as the ‘Test+4C’ scenario. We choose this simple approach to confine non-stationarity to a high-confidence result of anthropogenic climate change, namely, thermodynamic warming. We compare this Test+4C scenario to a ‘Test’ scenario, which is the same out-of-sample temporal period (WY2005-2018) of meteorological inputs under historical values. SAC-SMA and HYMOD are the truth model and process model for this experiment, respectively. Other models could have been chosen; we chose these because they are conceptually similar and commonly used.

Figure 5: Errors between truth and process models in 5 wettest years of Test/Test+4C scenarios.

Hybrid SWM construction

The core feature of the hybrid SWM is a model for the predictive errors (truth model – process model) that uses the hydrologic model state-variables as predictors. We implement this model in two steps that have differing objectives but use the same state-variable predictor information. An implicit assumption in using state-variable dependence in both steps is that such dependence can persist even after the first step; in other words, we do not expect the error-correction step to produce independent and identically distributed residuals. We call the first step an ‘error-correction model’ and the second step a ‘dynamic residual model’. Since we use HYMOD as our process model, we use its state-variables (Table 1) as the predictors for these two steps.

Table 1: HYMOD state variables

Short Name | Long Name | Description
sim | Simulation | HYMOD predicted streamflow in mm
runoff | Runoff | Upper reservoir flow of HYMOD in mm
baseflow | Baseflow | Lower reservoir flow of HYMOD in mm
precip | Precipitation | Basin averaged precipitation in mm
tavg | Average temperature | Basin averaged temperature in °C
et | Evapotranspiration | Modeled evapotranspiration (Hamon approach) in mm
upr_sm | Upper soil moisture | Basin averaged soil moisture content (mm) in upper reservoir
lwr_sm | Lower soil moisture | Basin averaged soil moisture (mm) in lower reservoir
swe | Snow water equivalent | Basin averaged snow water equivalent simulated by degree day snow module (mm)

Hybrid SWM: Error correction

The error-correction model is simply a predictive model between the hydrologic model (HYMOD) state-variables and the raw predictive errors. The error-correction model also uses the lag-1 to lag-3 errors as covariates to account for autocorrelation. The objective of this step is to infer state-dependent biases in the errors, which are the result of the predictive errors subsuming the structural deficiencies of the hydrologic model. This ‘deterministic’ behavior in the predictive errors can also be conceived as the ‘predictive errors doing what the model should be doing’ (Vogel, 2017). Once this error-correction model is fit to its training data, it can be applied to any new timeseries of state-variables to predict and debias the errors. We use a Random Forest (RF) algorithm for this step because it is robust to overfitting, even with limited training data. That is certainly the case here, as we consider only individual basins and a ~15 year training period (WY1989-2004). Moreover, we partition the training period into a calibration and validation subset and fit the RF error-correction model only to the calibration data (WY1989-1998), reducing the available RF training data to 9 years. A minimal sketch of this step is given below.
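As a rough illustration (not the code from the paper or the training repo), the error-correction step could be fit with the randomForest package; the data.frame ‘train’ and its column names are hypothetical stand-ins for the Table 1 state-variables, the lagged errors, and the raw predictive errors.

#Hedged sketch of the RF error-correction step. 'train' is assumed to hold the HYMOD
#state-variables (Table 1), lagged errors (err_lag1:err_lag3), and the raw errors ('err')
#for the calibration subset.
library(randomForest)

rf_fit <- randomForest(
  err ~ sim + runoff + baseflow + precip + tavg + et + upr_sm + lwr_sm + swe +
    err_lag1 + err_lag2 + err_lag3,
  data = train, ntree = 500
)

#On the validation subset, the residuals passed to the dynamic residual model are the
#empirical errors minus the RF-predicted errors:
#resid_val <- valid$err - predict(rf_fit, newdata = valid)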

Hybrid SWM: Dynamic residual model

The dynamic residual model (DRM) is fit to the residuals of the error correction result in the validation subset. We predict the hydrologic model errors for the validation subset from the fitted RF model and subtract them from the empirical errors to yield the residual timeseries. By fitting the DRM to this separate validation subset (which the RF error-correction model has not seen), we ensure that the residuals adequately represent the out-of-sample uncertainty of the error-correction model.

A full mathematical treatment of the DRM is outside the scope of this post. In high-level terms, the DRM is built around a flexible distributional form particularly suited to hydrologic errors, called the skew exponential power (SEP) distribution. This distribution has 4 parameters (mean: μ, stdev: σ, kurtosis: β, skew: ξ) and we assume a mean of zero (due to error-correction debiasing), while setting the other 3 parameters as time-varying predictands of the DRM model (i.e. σ_t, β_t, ξ_t). We also include a lag-1 autocorrelation term (φ_t) to account for any leftover autocorrelation from the error-correction procedure. We formulate a linear model for each of these parameters with the state-variables as predictors. These linear models are embedded in a log-likelihood function that is maximized (i.e. MLE) against the residuals to yield the optimal set of coefficients for each of the linear models.

With a fitted model, the generation of a new residual at each timestep t is therefore a random draw from the SEP with parameters (μ = 0, σ_t, β_t, ξ_t), modified by the residual at t-1 (ε_{t-1}) via the lag-1 coefficient (φ_t).
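In pseudo-code, one plausible reading of that generation step is sketched below; rsep() is a hypothetical SEP random-number generator (the paper’s exact formulation may differ), and sigma_t, beta_t, xi_t, phi_t stand for the time-varying parameter vectors produced by the fitted linear models.

#Conceptual sketch of DRM residual generation (not the paper's code). rsep() is a
#hypothetical SEP sampler; sigma_t, beta_t, xi_t, phi_t are assumed vectors of the
#time-varying parameters predicted from the state-variables.
eps <- numeric(n_steps)
eps[1] <- 0                                #initialize the lag-1 term
for (t in 2:n_steps) {
  innov  <- rsep(1, mu = 0, sigma = sigma_t[t], beta = beta_t[t], xi = xi_t[t])
  eps[t] <- phi_t[t] * eps[t - 1] + innov  #carry lag-1 autocorrelation forward
}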

Figure 6: Conceptual diagram of hybrid SWM construction.

Hybrid SWM: Simulation

The DRM is the core uncertainty modeling component of the hybrid SWM. Given a timeseries of state-variables from the hydrologic model for any scenario, the DRM simulation is implemented first, as described in the previous section. Subsequently, the error-correction model is implemented in ‘predict’ mode with the timeseries of random residuals from the DRM step. Because the error-correction model includes lag-1 to lag-3 terms, it must be implemented sequentially using the errors generated at the previous 3 timesteps. The conclusion of these two simulation steps yields a timeseries of randomly generated, state-variable dependent errors that can be added to the hydrologic model simulation to produce a single SWM simulation. Repeating this procedure many times produces an ensemble of SWM simulations. A sketch of this sequential procedure follows.
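Continuing the hypothetical sketch from above (same assumed objects: rf_fit, eps, a data.frame ‘states’ of HYMOD state-variables for the scenario, and the deterministic HYMOD streamflow ‘hymod_sim’), one SWM realization might be assembled like this:

#Hedged sketch of one SWM realization: sequential error-correction prediction using the
#previously generated errors as lagged covariates, plus the DRM residuals 'eps'.
err_sim <- numeric(n_steps)                 #first 3 steps left at zero for simplicity
for (t in 4:n_steps) {
  newdata <- data.frame(states[t, ],
                        err_lag1 = err_sim[t - 1],
                        err_lag2 = err_sim[t - 2],
                        err_lag3 = err_sim[t - 3])
  err_sim[t] <- predict(rf_fit, newdata = newdata) + eps[t]
}
swm_sim <- hymod_sim + err_sim              #repeat many times to build an ensemble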

Final thoughts

Hopefully this discussion of the hybrid SWM approach has given you some appreciation for the nuanced differences between SWMs and SSM/SSGs, the challenges in constructing an adequate uncertainty model for an SWM, and the novel approach developed here in utilizing state-variable information to infer properties of the predictive uncertainty. The hybrid SWM approach shows a lot of potential for extracting key attributes of the predictive errors, even under unprecedented forcing scenarios. It decouples the task of inferring predictive uncertainty from features of the data like temporal seasonality (e.g. day of year) that may be poor predictors under climate change. When linked with stochastic weather generation (see Rohini’s post and Nasser’s post), SWMs can be part of a powerful bottom-up framework to understand the implications of climate change on water resources systems. Keep an eye out for the forthcoming paper and check out the training noted above on implementation of the model.

References:

Brodeur, Z., Wi, S., Shabestanipour, G., Lamontagne, J., & Steinschneider, S. (2024). A Hybrid, Non‐Stationary Stochastic Watershed Model (SWM) for Uncertain Hydrologic Simulations Under Climate Change. Water Resources Research, 60(5), e2023WR035042. https://doi.org/10.1029/2023WR035042

Montanari, A., & Koutsoyiannis, D. (2014). Modeling and mitigating natural hazards: Stationarity is immortal! Water Resources Research, 50, 9748–9756. https://doi.org/10.1002/2014WR016092

Shabestanipour, G., Brodeur, Z., Farmer, W. H., Steinschneider, S., Vogel, R. M., & Lamontagne, J. R. (2023). Stochastic Watershed Model Ensembles for Long-Range Planning: Verification and Validation. Water Resources Research, 59. https://doi.org/10.1029/2022WR032201

Vogel, R. M. (2017). Stochastic watershed models for hydrologic risk management. Water Security, 1, 28–35. https://doi.org/10.1016/j.wasec.2017.06.001

Weather Regime-Based Stochastic Weather Generation (Part 1/2)

*This post is written in collaboration with Nasser Najibi from Steinschneider Research Group*

In this blog post, we debut the weather-regime-based stochastic weather generator on the Water Programming Blog! This weather generator was first published in the following article:

Steinschneider, S., Ray, P., Rahat, S. H., & Kucharski, J. (2019). A weather‐regime‐based stochastic weather generator for climate vulnerability assessments of water systems in the western United States. Water Resources Research, 55(8), 6923-6945.

Since its inception, the generator has been applied in studies led by researchers primarily affiliated with Cornell or the Army Corps of Engineers. In this blog post, we’ll step through the key theory and components of this weather generator and show how to run a simple simulation and diagnostics with code that is now provided in an official GitHub repo! Future blog posts in this series will step through more sophisticated examples and additions to the generator. Readers can refer to the paper above for more details on the specific components or methods used in this generator.

Why use this generator?

If you are looking to develop alternative local weather scenarios informed by climate changes, you can go down two avenues: (1) using projections from General Circulation Models (GCMs) or (2) using statistical weather generation approaches. GCMs have a rich physical representation of ocean and atmospheric processes and are useful to generate internally consistent scenarios that can ultimately be downscaled to the region of interest. However, due to coarse model resolution, GCMs exhibit significant variability and biases in processes that are important for regional planning and vulnerability assessments like precipitation, atmospheric river flux and droughts. These models are also computationally expensive to run and so you are limited to a few storylines of warming scenarios to explore. 

Stochastic weather generators, however, are a complementary way to develop local weather scenarios. They are directly conditioned on the weather of the region of interest and therefore can better capture regional processes and be more applicable to regional water systems planning. This generator can be used to create many scenarios that preserve regional and temporal statistics but are not bounded by the historical record; notably these scenarios can also be generated in a computationally efficient manner. The output from these generators can also be systematically adjusted to create basic climate change scenarios. What is often missing from these statistical models, however, is process information, and this is where the weather-regime-based stochastic generator can be useful.

What are the key components of this generator?

Figure 1: Generator flow chart (Steinschneider et al., 2019)

Figure 1 shows the main components of the first version of the generator, some of which have been extended or altered in newer studies. There are three primary levels to this generator listed below.

Simulation of Weather Regimes: The first step of using this generator is identifying the weather regimes that bring regional weather to the application area. Weather regimes are large-scale atmospheric flow patterns that tend to persist and organize local weather systems, and they can effectively be modeled by a Markovian process. Weather regimes are often identified by either clustering or fitting a Hidden Markov Model (HMM) to a selected number of principal components of geopotential height (GPH) anomalies. In many of the studies that use this generator in the Western U.S. (Steinschneider et al., 2019; Najibi et al., 2021; Rahat et al., 2022; Gupta et al., in review), 500 hPa GPH anomalies are used to identify cool season weather regimes because this level effectively captures the weather systems organized by the jet stream. The same GPH anomalies have also been used to identify North Atlantic weather regimes that bring weather systems to Europe (Charlton-Perez et al., 2018), and a combination of 500 and 250 hPa GPH anomalies has been used to identify weather regimes in Chile (Meseguer-Ruiz et al., 2020), for example. The choice of the number of weather regimes is also region specific. For the Western U.S., prior work has developed a framework to identify the number of weather regimes that best reproduces regional weather statistics over a large geographic area in California (Najibi et al., 2021); there, 4-5 metastable regimes are identified (both in Guirguis et al., 2020 and Najibi et al., 2021), but other regions might require more. Once weather regimes are identified, they can be simulated as Markov chains, effectively dictating the large-scale atmospheric flow processes taking place on a given day (a minimal sketch of this simulation step is shown below).
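As a toy illustration of that last point (not the repo’s code), a first-order Markov chain of regimes can be simulated from an estimated transition probability matrix:

#Minimal sketch: simulate a daily weather-regime sequence as a first-order Markov chain.
#'P' is assumed to be a K x K transition probability matrix estimated from the historical
#regime assignments (rows sum to 1).
simulate_wr_chain <- function(P, n_days, init_state = 1) {
  K  <- nrow(P)
  wr <- integer(n_days)
  wr[1] <- init_state
  for (t in 2:n_days) {
    wr[t] <- sample.int(K, size = 1, prob = P[wr[t - 1], ])  #transition from yesterday's regime
  }
  wr
}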

Simulation of Local Weather: The next step is a bootstrapping procedure to create local weather. Each day of the historical record is classified to be in a specific weather regime and has a certain value of precipitation and temperature associated with each grid cell in the basin of interest. Thus, when chains of weather regimes are simulated, precipitation and temperature can be bootstrapped from the historical record from time periods where simulated blocks of weather regimes match with historical blocks. The block bootstrapping procedure helps maintain the joint distribution of the weather variables, but an additional copula-based jittering technique helps to broaden the range of simulated precipitation outside of the historical distribution while still maintaining the shape of the underlying distribution.
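For intuition only, here is a deliberately simplified, day-level version of that resampling idea; the actual generator matches multi-day blocks of regimes, restricts sampling to a window around the calendar day, and applies the copula-based jittering afterwards. The object names (wr_sim, wr_hist, prcp_hist, temp_hist) are hypothetical.

#Simplified illustration of regime-conditioned resampling (not the repo's block bootstrap).
#wr_sim/wr_hist are simulated/historical regime sequences; prcp_hist/temp_hist are
#day-by-grid-cell matrices of historical weather.
resample_local_weather <- function(wr_sim, wr_hist, prcp_hist, temp_hist) {
  idx <- sapply(wr_sim, function(s) {
    cand <- which(wr_hist == s)              #historical days in the same regime
    cand[sample.int(length(cand), 1)]        #pick one of them at random
  })
  list(prcp = prcp_hist[idx, , drop = FALSE],
       temp = temp_hist[idx, , drop = FALSE])
}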

Dynamic and Thermodynamic Perturbations: Dynamic and thermodynamic climate changes are imposed separately in this generator. This allows a way to systematically understand how specific climate changes lead to changes to local climate. Dynamical changes influence how weather regimes evolve and can change the frequency of weather regimes. This can be done by directly changing the transition probabilities of the weather regimes or indirectly by imposing covariates that dictate the evolution of the weather regimes. In Steinschneider et al. (2019), a Niño 3.4 index is used to force weather regime evolution and is systematically adjusted to create more frequent El Niño and La Niña events. In Gupta et al. (in review), a 600-year long sequence of tree-ring reconstructed principal components of weather regime occurrence are used as an alternative covariate to better capture natural variability inherent in the weather regimes. Thermodynamic trends are applied post generation of local weather and consist of manipulations to temperature or precipitation. Stepwise temperature trends can be added to each grid cell, or elevation-dependent warming can be imposed. Based on an imposed temperature, the mean or an upper quantile of the precipitation distribution can be scaled to reflect an increased capacity for the atmosphere to hold moisture due to increased temperature.
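To make the thermodynamic piece concrete, the demo script below applies a Clausius-Clapeyron-style multiplicative factor to upper-tail precipitation; with the example values given in the script’s comments (pccc = 0.07 per degree, tc = 2°C of warming), the factors work out as follows.

#Numerical example of the thermodynamic perturbation factors (example values from the
#demo script's comments; the scaling itself is applied inside perturb.climate).
tc      <- 2                    #imposed warming (degrees C)
pccc    <- 0.07                 #per-degree scaling of upper-tail precipitation
pmuc    <- 0                    #change to mean precipitation (none here)
perc.q  <- (1 + pccc)^tc        #upper-tail factor: 1.07^2 ~ 1.145
perc.mu <- (1 + pmuc)           #mean factor: 1 (unchanged)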

Where can I use this generator?

A weather-regime based stochastic weather generator is not applicable everywhere. It can be utilized for any region in which weather regimes dictate regional weather, which are the midlatitude regions of the Earth. Examples of locations where weather regimes are active and have been identified in studies include the Western U.S., the Great Lakes region of the U.S., parts of Western Europe, Morocco, Chile, and New Zealand, among others. If a midlatitude location is not applicable to your case study, consider using other synthetic weather generation techniques extensively covered in this series of posts by Julie Quinn. It is worth noting that the quality of the generator also depends on having a sufficiently long historical precipitation and temperature record to bootstrap from.

Where can I find the code?

Now, for the first time ever, demo code is publicly available in an official GitHub repo. In this demo, created by Nasser, the user can simulate precipitation and temperature for 12 randomly selected grid cells in the Tuolumne River Basin in California. In its most complex form, the generator can be used to generate weather across 12 basins simultaneously in the Central Valley region of California. Download or clone the GitHub repository, then download the necessary data from the Zenodo archive here and place them in the Data folder of the repo. The user can then proceed with stepping through the rest of the demo.

  1. You will first simulate weather regimes for a certain number of years using either a parametric (config.WRs.param.NHMM.R) or non-parametric method (config.WRs.non_param.R).
    • Parametric Method: In this method, NHMMs are fitted using the following covariates: the first four PCs of the SPI index over California and two harmonics for the cold season. The two harmonics are the only covariates used during the warm season. From this fitted NHMM, one can simulate an arbitrary sequence of WRs. The parametric method is particularly useful if the user is interested in using specific covariates to force the evolution of the weather regime sequence.
    • Non-parametric Method: Randomly resample blocks of m years (here, m = 4 years) of observed WRs to create a sequence of 1000-yr simulated WRs. Because the non-parametric method doesn’t require covariates, it is particularly effective for flexibly generating very long sequences of weather regimes that can more appropriately capture natural variability, for example.
  2. Once the WRs are simulated, you can simulate local precipitation and temperature. This is done by running the config.simulations.R script.
#Demo code- simulates precipitation and temperature for 12 grid cells in the Tuolumne Basin
rm(list=ls())

library(MASS)
library(fExtremes)
library(evmix)
library(zoo)
library(abind)
library(parallel)
library(mvtnorm)
library(tictoc)
library(lubridate)

#adjust main directory and directory for simulation files
mainDir <- "D:/Projects/Tuolumne_River_Basin/GitHub_WGENv2.0"
setwd(mainDir)

dir.to.sim.files <- "./Data/simulated.data.files/WGEN.out"
dir.create(file.path(mainDir, dir.to.sim.files), showWarnings = FALSE)

num.iter <- 1 # A single long trace (e.g., thousand years) is sufficient although we create like 5 ensembles in the simulated WRs
basin.cnt <- 'myPilot' # for a set of 12 randomly selected Livneh grids in the Tuolumne River Basin 
number.years.long <- 3050 # e.g., 500, 1000, 2000, 3000, 5000 years, etc [note: current NHMM output (parametric) is for 1036 years; current non-parametric is for 3050 years]

##############################Define perturbations#######################################
## Climate changes and jitter to apply:
change.list <- data.frame("tc"=  c(0), # e.g., 1, 2, ...
                          "jitter"=  c(TRUE),
                          "pccc"=c( 0), # e.g., 0.07, 0.14, ...
                          "pmuc"=c( 0)# e.g., -.125
)

# For simulating the WRs (i.e., 'config.WRs.non_param.R'; 'config.WRs.param.NHMM.R'), do you use non-parametric or parametric method
use.non_param.WRs <- TRUE #TRUE for non-parametric, FALSE for parametric simulated WRs
####### Choose A (FALSE) or B (TRUE) below for the simulated WRs #################
#-- A) load in NHMM with WRs (parametric)
tmp.list <- readRDS(paste0("./Data/simulated.data.files/WRs.out/final.NHMM.output.rds"))
weather.state.assignments <- tmp.list$WR.historical # this is for BOTH the parametric and non-parametric approach 
sim.weather.state.assignments <- tmp.list$WR.simulation[,1:num.iter] # this is only for the parametric approach of simulating WRs
num.states <- length(unique(as.vector(weather.state.assignments)))    #number of WRs in the model
rm(tmp.list) # for memory

#-- B) load in non-parametric simulation with WRs
np.list.sim.weather.state.assignments <- readRDS(paste0("./Data/simulated.data.files/WRs.out/final.non_param.WRs.output.rds")) #this is only for the non-parametric approach of simulating WRs
num.states <- length(unique(np.list.sim.weather.state.assignments$markov.chain.sim[[num.iter]]))    #number of WRs in the model

# load in supporting functions
files.sources = list.files("./Programs/functions",full.names = TRUE)
my.functions <- sapply(files.sources, source)
##################################

#dates for the historical WRs
start_date_synoptic="1948-01-01"; end_date_synoptic="2021-12-31"
dates.synoptic <- seq(as.Date(start_date_synoptic),as.Date(end_date_synoptic), by="days")
#create dates for the simulated WRs
my.num.sim = ceiling(number.years.long/length(unique(format(dates.synoptic,'%Y')))) # number of chunks of historical periods; e.g., 1 is one set of simulation equal to the historical
long.dates.sim <- rep(dates.synoptic,times=my.num.sim)


  • The overarching parameters of the simulation are listed in Lines 14-23 of the code. Here you can adjust paths to directories, how many iterations you want to run (set to “1” currently) and the length of the simulation.
  • Lines 26-31: These are the thermodynamic climate change parameters. Here you can impose a temperature increase (tc), a precipitation scaling factor for the upper quantile of the distribution (pccc), and a mean shift to the simulated precipitation distribution (pmuc). Currently, the simulation is set to impose no climate changes, but copula-based jittering is applied, so jitter is kept as TRUE.
  • Lines 33-45: Depending on which method was used to simulate the WRs, load in the WRs accordingly, using the blocks of code under either A or B. Be sure to run Line 38 no matter which method was used.  
  3. Lines 62-98 contain more details on precipitation characteristics.
#########Precipitation characteristics#########

#location of obs weather data (RData format): weather data (e.g., precip and temp) as matrices (time x lat|lon: t-by-number of grids); dates vector for time; basin average precip (see the example meteohydro file)
processed.data.meteohydro <- paste0("./Data/processed.data.files/processed.meteohydro/processed.meteohydro.myPilot.RData")
load(processed.data.meteohydro) #load in weather data


qq <- .99              # percentile threshold to separate Gamma and GPD distributions
thshd.prcp <- apply(prcp.site,2,function(x) {quantile(x[x!=0],qq,na.rm=T)})


#Bootstrapping choices###
window.size <- rep(3,length(months))   #the size of the window (in days) from which runs can be bootstrapped around the current day of simulation, by month: Jan -- Dec

trace <- 0.25     #trace prcp threshold. 0.25 mm (for Livneh dataset); or 0.01 inches


#The spearman correlation between basin and site precipitation, used in the copula-based jitters
prcp.basin.site <- cbind(prcp.basin,prcp.site)
S <- cor(prcp.basin.site,method="spearman")
n.sites <- dim(prcp.site)[2] # Number of gridded points for precipitation


#fit emission distributions to each site by month. sites along the columns, parameters for month down the rows
emission.fits.site <- fit.emission(prcp.site=prcp.site,
                                   months=months,
                                   months.weather=months.weather,
                                   n.sites=n.sites,
                                   thshd.prcp=thshd.prcp)


#how often is prcp under threshold by month and site
qq.month <- sapply(1:n.sites,function(i,x=prcp.site,m,mm) {
  sapply(m,function(m) {
    length(which(as.numeric(x[x[,i]!=0 & mm==m & !is.na(x[,i]),i])<=thshd.prcp[i]))/length(as.numeric(x[x[,i]!=0 & mm==m & !is.na(x[,i]),i]))
  })},
  m=months, mm=months.weather)

  • thshd.prcp: the threshold precipitation quantile that dictates the separation between extreme precipitation and the rest of the distribution (set at .99). Precipitation scaling is applied to this upper quantile.
  • trace: the minimum precipitation below which a day is classified as “dry”
  • window.size: the window around which data are bootstrapped from the historical dataset
  • S: the Spearman rank correlation between individual basin sites and the basin average (for the copula-based jittering)
  • emission.fits.site: output of the fit.emission function, which fits gamma and GPD distributions to the prcp.site data. A GPD is fit to the precipitation that lies above thshd.prcp at each site, while the gamma distribution is fit to the rest of the precipitation data by month (a sketch of this hybrid fit is given after this list).
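As a rough sketch of what such a hybrid gamma/GPD fit could look like for a single site (the repo’s fit.emission handles all sites and months at once; the function and object names here are illustrative):

#Hedged sketch of a gamma-body / GPD-tail emission fit for one site (illustrative only).
library(MASS)       #fitdistr() for the gamma body
library(fExtremes)  #gpdFit() for the tail above the threshold

fit_site_emission <- function(x, thresh) {
  x <- x[!is.na(x) & x > 0]                    #non-zero precipitation only
  body <- fitdistr(x[x <= thresh], "gamma")    #gamma fit below the threshold
  tail <- gpdFit(x, u = thresh, type = "mle")  #GPD fit to exceedances of the threshold
  list(gamma = body$estimate, gpd = tail@fit$par.ests)
}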
  4. Line 122 is where the bootstrapping process occurs (wgen.simulator.R)
##########################Simulate model with perturbations#######################################

#decide whether or not to use historic or simulated WRs
if(use.non_param.WRs) {
  
  dates.sim <- np.list.sim.weather.state.assignments$dates.sim
  markov.chain.sim <- np.list.sim.weather.state.assignments$markov.chain.sim
  identical.dates.idx <- dates.synoptic%in%dates.weather
  weather.state.assignments <- weather.state.assignments[identical.dates.idx]
  
} else {
  dates.sim <- long.dates.sim
  markov.chain.sim <- as.list(data.frame(sim.weather.state.assignments))
}

#run the daily weather generate num.iter times using the num.iter Markov chains
mc.sim <- resampled.date.sim <- resampled.date.loc.sim <- prcp.site.sim <- tmin.site.sim <- tmax.site.sim <-  list()
start_time <- Sys.time()
for (k in 1:num.iter) {
  my.itertime <- Sys.time()
  my.sim <- wgen.simulator(weather.state.assignments=weather.state.assignments,mc.sim=markov.chain.sim[[k]],
                           prcp.basin=prcp.basin,dates.weather=dates.weather,
                           first.month=first.month,last.month=last.month,dates.sim=dates.sim,
                           months=months,window.size=window.size,min.thresh=trace)
  print(paste(k,":", Sys.time()-my.itertime))
  
  #each of these is a list of length iter
  mc.sim[[k]] <- my.sim[[1]]
  resampled.date.sim[[k]] <- my.sim[[2]]
  resampled.date.loc.sim[[k]] <- my.sim[[3]]
  prcp.site.sim[[k]] <- prcp.site[resampled.date.loc.sim[[k]],]
  tmin.site.sim[[k]] <- tmin.site[resampled.date.loc.sim[[k]],]
  tmax.site.sim[[k]] <- tmax.site[resampled.date.loc.sim[[k]],]
}
end_time <- Sys.time(); run.time.sim <- end_time - start_time
print(paste("SIMULATION started at:",start_time,", ended at:",end_time)); print(round(run.time.sim,2))
#remove for memory
rm(my.sim)
  5. Once baseline weather is generated, perturbations can be applied in Line 156 using the perturb.climate.R function. In this function, temperature trends and precipitation scaling are applied if applicable; if not, just the copula-based jittering is applied. This output is then saved for analysis.
#once the simulations are created, we now apply post-process climate changes (and jitters)
for (change in 1:nrow(change.list)) {
  start_time <- Sys.time()
  cur.tc <- change.list$tc[change]
  cur.jitter <- change.list$jitter[change]
  cur.pccc <- change.list$pccc[change]
  cur.pmuc <- change.list$pmuc[change]
  
  #precipitation scaling (temperature change dependent)
  perc.q <- (1 + cur.pccc)^cur.tc    #scaling in the upper tail for each month of non-zero prcp
  perc.mu <- (1 + cur.pmuc)          #scaling in the mean for each month of non-zero prcp
  
  #perturb the climate from the simulations above (the longest procedure in this function is saving the output files)
  set.seed(1)   #this ensures the copula-based jitters are always performed in the same way for each climate change
  perturbed.sim <-  perturb.climate(prcp.site.sim=prcp.site.sim,
                                    tmin.site.sim=tmin.site.sim,
                                    tmax.site.sim=tmax.site.sim,
                                    emission.fits.site=emission.fits.site,
                                    months=months,dates.sim=dates.sim,n.sites=n.sites,
                                    qq=qq,perc.mu=perc.mu,perc.q=perc.q,S=S,cur.jitter=cur.jitter,cur.tc=cur.tc,
                                    num.iter=num.iter,thshd.prcp=thshd.prcp,qq.month=qq.month)
  set.seed(NULL)
  prcp.site.sim.perturbed <- perturbed.sim[[1]]
  tmin.site.sim.perturbed <- perturbed.sim[[2]]
  tmax.site.sim.perturbed <- perturbed.sim[[3]]
  
  #remove for memory
  rm(perturbed.sim)
  end_time <- Sys.time(); run.time.qmap <- end_time - start_time
  print(paste("QMAPPING started at:",start_time,", ended at:",end_time)); print(round(run.time.qmap,2))
  
  #how to name each file name to track perturbations in each set of simulations
  file.suffix <- paste0(".temp.",cur.tc,"_p.CC.scale.",cur.pccc,"_p.mu.scale.",cur.pmuc,"_hist.state.",use.non_param.WRs,"_jitter.",cur.jitter,"_s",num.states,"_with_",num.iter,".",basin.cnt)
  
  print(paste("|---start saving---|"))
  write.output.large(dir.to.sim.files,
                     prcp.site.sim=prcp.site.sim.perturbed,
                     tmin.site.sim=tmin.site.sim.perturbed,
                     tmax.site.sim=tmax.site.sim.perturbed,
                     mc.sim=mc.sim,resampled.date.sim=resampled.date.sim,
                     dates.sim=dates.sim,file.suffix=file.suffix)
  print(paste("|---finished saving---|"))
  
}

#remove for memory
rm(resampled.date.sim,resampled.date.loc.sim,markov.chain.sim,mc.sim)
##################################################################################################
print(paste0("--- done.  state= ", num.states," --- ensemble member:",num.iter))
gc()

# The End
#####################################################################################
  6. The R script plot.statistics.full.diagnostics.long.R can then be used to generate a variety of figures that demonstrate the quality of the generator in preserving and/or expanding around the observed statistics of each site. Options include:
  • Exceedance probability curves for all 12 sites
  • Simulated vs. observed maximum precipitation for all 12 sites
  • Simulated vs. observed 1-20 year droughts consolidated across all locations

And much more! Code is provided for 11 different diagnostic assessments of the quality of the generator. Look out for additions to the repository over the next months and feel free to open a GitHub issue if you have questions!

References

Charlton‐Perez, A. J., Ferranti, L., & Lee, R. W. (2018). The influence of the stratospheric state on North Atlantic weather regimes. Quarterly Journal of the Royal Meteorological Society, 144(713), 1140-1151.

Guirguis, K., Gershunov, A., DeFlorio, M.J., Shulgina, T., Delle Monache, L., Subramanian, A.C., Corringham, T.W. and Ralph, F.M., 2020. Four atmospheric circulation regimes over the North Pacific and their relationship to California precipitation on daily to seasonal timescales. Geophysical Research Letters, 47(16), p.e2020GL087609.

Gupta, R.S., Steinschneider S., Reed, P.M. Understanding Contributions of Paleo-Informed Natural Variability and Climate Changes on Hydroclimate Extremes in the Central Valley Region of California. Authorea. March 13, 2023. DOI: 10.22541/essoar.167870424.46495295/v1

Meseguer-Ruiz, O., Cortesi, N., Guijarro, J. A., & Sarricolea, P. (2020). Weather regimes linked to daily precipitation anomalies in Northern Chile. Atmospheric Research, 236, 104802.

Najibi, N., Mukhopadhyay, S., & Steinschneider, S. (2021). Identifying weather regimes for regional‐scale stochastic weather generators. International Journal of Climatology, 41(4), 2456-2479.

Rahat, S. H., Steinschneider, S., Kucharski, J., Arnold, W., Olzewski, J., Walker, W., … & Ray, P. (2022). Characterizing hydrologic vulnerability under nonstationary climate and antecedent conditions using a process-informed stochastic weather generator. Journal of Water Resources Planning and Management, 148(6), 04022028.

Steinschneider, S., Ray, P., Rahat, S. H., & Kucharski, J. (2019). A weather‐regime‐based stochastic weather generator for climate vulnerability assessments of water systems in the western United States. Water Resources Research, 55(8), 6923-6945.

Copula Coding Exercise

This Halloween, you may be scared by many things, but don’t let one of them be copulas. Dave Gold wrote an excellent post on the theory behind copulas and I’ll just build off of that post with a simple coding exercise. When I first heard about copulas, they sounded really relevant to helping me come up with a way to capture joint behavior across variables (which I then wanted to use to capture joint hydrologic behavior across watersheds). However, because I hadn’t learned about them formally in a class, I had a hard time translating what I was reading in dense statistics textbooks into practice and found myself blindly trying to use R packages and not making much progress. It was very helpful for me to start to wrap my head around copulas by just starting with coding a simple trivariate Gaussian copula from scratch. If copulas are new to you, hopefully this little exercise will be helpful to you as well.

Let’s pose a scenario that will help orient the reason why a copula will be helpful. Let’s imagine that we have historical daily full natural flow for three gauged locations in the Central Valley region of California: Don Pedro Reservoir, Millerton Lake, and New Melones Lake. Because these three locations are really important for water supply for the region, it’s helpful to identify the likelihood of joint flooding or drought tendencies across these basins. For this post, let’s focus on joint flooding.

First let’s fit distributions for each basin individually. I calculate the water year and add it as a column, then take the daily flow and determine the peak three-day flow for each year and each basin. Given the different residence times in each basin, a three-day aggregate flow can better capture concurrent events than a one-day flow. Let’s term the flows in each basin [x_{t,1}, …, x_{t,n}], the 3-day peak annual flows in each of the n basins in year t.

library(evd)
library(mvtnorm)
library(extRemes)
library(mnormt)

#Read in annual observed flows 
TLG_flows=read.table("C:/Users/rg727/Downloads/FNF_TLG_mm.txt")
MIL_flows=read.table("C:/Users/rg727/Downloads/FNF_MIL_mm.txt")
NML_flows=read.table("C:/Users/rg727/Downloads/FNF_NML_mm.txt")

#Calculate water year column
TLG_flows$water_year=0

for (i in 1:dim(TLG_flows)[1]){
  if (TLG_flows$V2[i]==10|TLG_flows$V2[i]==11|TLG_flows$V2[i]==12){
    TLG_flows$water_year[i]=TLG_flows$V1[i]+1
  } else{
    TLG_flows$water_year[i]=TLG_flows$V1[i]}
}


MIL_flows$water_year=0

for (i in 1:dim(MIL_flows)[1]){
  if (MIL_flows$V2[i]==10|MIL_flows$V2[i]==11|MIL_flows$V2[i]==12){
    MIL_flows$water_year[i]=MIL_flows$V1[i]+1
  } else{
    MIL_flows$water_year[i]=MIL_flows$V1[i]}
}


NML_flows$water_year=0

for (i in 1:dim(NML_flows)[1]){
  if (NML_flows$V2[i]==10|NML_flows$V2[i]==11|NML_flows$V2[i]==12){
    NML_flows$water_year[i]=NML_flows$V1[i]+1
  } else{
    NML_flows$water_year[i]=NML_flows$V1[i]}
}



#Calculate 3-day total flow


TLG_flows$three_day=0

for (i in 1:length(TLG_flows$V4)){
  if (i == 1){
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i],TLG_flows$V4[i+1])
  }
  else
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i-1],TLG_flows$V4[i],TLG_flows$V4[i+1])
}

MIL_flows$three_day=0

for (i in 1:length(MIL_flows$V4)){
  if (i == 1){
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i],MIL_flows$V4[i+1])
  }
  else
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i-1],MIL_flows$V4[i],MIL_flows$V4[i+1])
}

NML_flows$three_day=0

for (i in 1:length(NML_flows$V4)){
  if (i == 1){
    NML_flows$three_day[i]=sum(NML_flows$V4[i],NML_flows$V4[i+1])
  }
  else
    NML_flows$three_day[i]=sum(NML_flows$V4[i-1],NML_flows$V4[i],NML_flows$V4[i+1])
}



#Calculate peak annual flow
TLG_flow_peak_annual=aggregate(TLG_flows,by=list(TLG_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)
MIL_flow_peak_annual=aggregate(MIL_flows,by=list(MIL_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)
NML_flow_peak_annual=aggregate(NML_flows,by=list(NML_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)

#Remove extra year (1986) from TLG
TLG_flow_peak_annual=TLG_flow_peak_annual[2:33,]


Next, we need to fit individual marginal distributions for each location. Since we are interested in capturing flood risk across the basins, a common marginal distribution to use is the Generalized Extreme Value (GEV) distribution. We fit a separate GEV distribution to each basin and then write a bit of code to determine the peak three-day flow associated with a historical 10-year return period event in each basin.

#Determine level of a 10-year return period

getrlpoints <- function(fit){
  
  xp2 <- ppoints(fit$n, a = 0)
  ytmp <- datagrabber(fit)
  y <- c(ytmp[, 1])
  sdat <- sort(y)
  npy <- fit$npy
  u <- fit$threshold
  rlpoints.x <- -1/log(xp2)[sdat > u]/npy
  rlpoints.y <- sdat[sdat > u]
  rlpoints <- data.frame(rlpoints.x, rlpoints.y)
  
  return(rlpoints)
}


getcidf <- function(fit){
  
  rperiods = c(2,5,10,20,50,100,500,1000)
  bds <- ci(fit, return.period = rperiods)
  c1 <- as.numeric(bds[,1])
  c2 <- as.numeric(bds[,2])
  c3 <- as.numeric(bds[,3])
  ci_df <- data.frame(c1, c2, c3, rperiods)
  
  return(ci_df)
}


gevfit_MIL <- fevd(MIL_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_MIL)
ci_df <- getcidf(gevfit_MIL)

gevfit_NML <- fevd(NML_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_NML)
ci_df <- getcidf(gevfit_NML)

gevfit_TLG <- fevd(TLG_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_TLG)
ci_df <- getcidf(gevfit_TLG)



#These are the historical thresholds defined from the GEV fit to the three-day sum of the peak flows 
ten_yr_event_TLG=58.53738	
ten_yr_event_MIL=40.77821
ten_yr_event_NML=58.95369

Now that we have our marginals fit, we need to use them in some way to fit a Gaussian copula. By definition, a copula is a multivariate cumulative distribution function for which the marginal probability distribution of each variable is uniform on the interval [0, 1]. So we need to transform our observations to be pseudo-uniform. To do so, we push the values of [x_{t,1}, …, x_{t,n}] through the CDF of the respective fitted GEV distribution. These marginals are now uniformly distributed between 0 and 1. Let’s term these values [u_{t,1}, …, u_{t,n}].

#Now we want to fit the copula on our historical flows. First create u's

u_TLG = pgev(TLG_flow_peak_annual$three_day, loc=gevfit_TLG$results$par[1], scale=gevfit_TLG$results$par[2], shape=gevfit_TLG$results$par[3],lower.tail = TRUE)
u_MIL = pgev(MIL_flow_peak_annual$three_day, loc=gevfit_MIL$results$par[1], scale=gevfit_MIL$results$par[2], shape=gevfit_MIL$results$par[3],lower.tail=TRUE)
u_NML = pgev(NML_flow_peak_annual$three_day, loc=gevfit_NML$results$par[1], scale=gevfit_NML$results$par[2], shape=gevfit_NML$results$par[3],lower.tail=TRUE)

#Let's also convert the thresholds to u's for each basin 
u_TLG_event = pgev(ten_yr_event_TLG, loc=gevfit_TLG$results$par[1], scale=gevfit_TLG$results$par[2], shape=gevfit_TLG$results$par[3], lower.tail = TRUE)
u_MIL_event = pgev(ten_yr_event_MIL, loc=gevfit_MIL$results$par[1], scale=gevfit_MIL$results$par[2], shape=gevfit_MIL$results$par[3],lower.tail=TRUE)
u_NML_event = pgev(ten_yr_event_NML, loc=gevfit_NML$results$par[1], scale=gevfit_NML$results$par[2], shape=gevfit_NML$results$par[3],lower.tail=TRUE)

Now we address the Gaussian part of the copula. Recall the following in Dave’s post:

The Gaussian copula is

C(u_1, …, u_n) = Φ_R( Φ^{-1}(u_1), …, Φ^{-1}(u_n) )

where Φ_R is the joint standard normal CDF with mean zero and correlation matrix R (with entries ρ_{i,j}, the correlation between random variables X_i and X_j), and Φ^{-1} is the inverse standard normal CDF.

So we have our u vectors that we now need to push through an inverse standard normal distribution to get Z scores.

#Now takes these u's and push them through a standard normal to get Z scores 
 
Z_TLG=qnorm(u_TLG)
Z_MIL=qnorm(u_MIL)
Z_NML=qnorm(u_NML)

#Let's do the same for the historical events
Z_TLG_event=qnorm(u_TLG_event)
Z_MIL_event=qnorm(u_MIL_event)
Z_NML_event=qnorm(u_NML_event)

The last part that we need for our copula is a Spearman rank correlation matrix that captures the dependence structure of the peak flows across the three basins.

cor_matrix=cor(cbind(Z_TLG,Z_MIL,Z_NML),method = "spearman")

Finally, we have all the components of our copula. What question can we ask now that we have this copula? How about “What is the likelihood of exceeding the historical 10-year event in all three basins simultaneously?”. We code up the following formula, which expresses this joint exceedance probability; more details can be found in this very helpful book chapter:
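Written out to match the R line below, the joint exceedance probability follows from inclusion-exclusion applied to the Gaussian copula:

P(X_TLG > x_TLG, X_MIL > x_MIL, X_NML > x_NML)
  = 1 − Φ(z_TLG) − Φ(z_MIL) − Φ(z_NML)
    + Φ_R(z_TLG, z_MIL) + Φ_R(z_TLG, z_NML) + Φ_R(z_MIL, z_NML)
    − Φ_R(z_TLG, z_MIL, z_NML)

where z_i = Φ^{-1}(u_i) is the standard-normal score of the 10-year threshold in basin i, Φ is the univariate standard normal CDF, and Φ_R denotes the bivariate/trivariate standard normal CDF with the corresponding correlation (sub)matrix.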

We calculate this with the following line and we find an exceedance probability of 0.0564.

#Now calculate the probability of exceeding the historical threshold in all basins

exceedance=1-pnorm(Z_TLG_event)-pnorm(Z_MIL_event)-pnorm(Z_NML_event)+
  pmnorm(c(Z_TLG_event,Z_MIL_event),mean=rep(0,2),varcov = cor(cbind(Z_TLG,Z_MIL)))+
  pmnorm(c(Z_TLG_event,Z_NML_event),mean=rep(0,2),varcov = cor(cbind(Z_TLG,Z_NML)))+
  pmnorm(c(Z_MIL_event,Z_NML_event),mean=rep(0,2),varcov = cor(cbind(Z_MIL,Z_NML)))-
  pmnorm(c(Z_TLG_event,Z_MIL_event,Z_NML_event),mean=rep(0,3),varcov = cor(cbind(Z_TLG,Z_MIL,Z_NML)))

Now this intuitively makes sense because a 10-year event has a 10% (0.1) probability of being exceeded in any given year. The joint likelihood of exceeding this event across multiple basins must be even smaller, so our lower probability of 0.0564 tracks. Note that you can also create synthetic realizations of correlated peak flows by simulating from a multivariate normal distribution with a mean of 0 and the Spearman rank correlation matrix defined above (a sketch is given below). Hopefully this was a simple but helpful example to demonstrate the key components of the Gaussian copula. The script and corresponding datasets can be found in this repo.
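As a hedged sketch of that last point (not part of the original script), correlated Z scores can be drawn, converted to pseudo-uniforms, and pushed back through the fitted GEV quantile functions:

#Sketch: generate synthetic correlated 3-day peak flows from the fitted Gaussian copula,
#reusing the gevfit_* objects and cor_matrix defined above.
library(mvtnorm)
library(evd)

n_sim <- 1000
Z_sim <- rmvnorm(n_sim, mean = rep(0, 3), sigma = cor_matrix)  #correlated standard normals
U_sim <- pnorm(Z_sim)                                          #pseudo-uniform marginals

syn_TLG <- qgev(U_sim[,1], loc=gevfit_TLG$results$par[1], scale=gevfit_TLG$results$par[2], shape=gevfit_TLG$results$par[3])
syn_MIL <- qgev(U_sim[,2], loc=gevfit_MIL$results$par[1], scale=gevfit_MIL$results$par[2], shape=gevfit_MIL$results$par[3])
syn_NML <- qgev(U_sim[,3], loc=gevfit_NML$results$par[1], scale=gevfit_NML$results$par[2], shape=gevfit_NML$results$par[3])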

Efficient Storage and Querying of Geospatial Data with Parquet and DuckDB

This post is dedicated to Chris Vernon and Travis Thurber from PNNL who taught me how to use these tools!

Lately, I’ve been generating synthetic weather scenarios for basins in California. The weather scenarios are created at a daily time step and are informed by tree-ring products which span across 600 years. We want to develop multiple ensembles of these scenarios that are representative of plausible future climate changes that the region could experience. It became clear to me that I would need to make sure that I was storing data efficiently and that I was able to parse through these data quickly to generate plots and develop metrics.

Below, I wanted to discuss two points that have really helped my workflow: (1) focusing on file structure and compression and (2) changing the way that I explore the resulting scenarios.

File choice and compression

To illustrate the importance of considering how you choose to save your data, I grab the historical trace of daily precipitation over the last 50 years for the Tuolumne River Basin and load it into R as a dataframe. Below I show the code to take this dataframe and save it in various formats.

#Traditional RDS
system.time(saveRDS(prcp_tuolumne,"E:/blog/prcp_tuolumne.rds"))

#Compressed RDS
system.time(saveRDS(prcp_tuolumne,"E:/blog/prcp_tuolumne_compressed.rds",compress = 'xz'))

#Text 
system.time(write.table(prcp_tuolumne,"E:/blog/prcp_tuolumne.txt"))

#CSV
system.time(write.csv(prcp_tuolumne,"E:/blog/prcp_tuolumne.csv"))

#Parquet
library(arrow)
system.time(write_parquet(prcp_tuolumne,"E:/blog/prcp_tuolumne.parquet",compression = "gzip"))

#HDF5 
library(rhdf5)
system.time(h5write(prcp_tuolumne,"E:/blog/prcp_tuolumne.h5","df"))


#NetCDF
library(ncdf4)
# path and file name, set dname
ncpath <- "E:/blog/"
ncname <- "prcp_tuolumne"  
ncfname <- paste(ncpath, ncname, ".nc", sep="")
dname <- "prcp"  # note: tmp means temperature (not temporary)
timedim <- ncdim_def("time","days since 1950-11-1",as.numeric(dates))
# define variables
fillvalue <- 1e32
dlname <- "observed precipitation"
prcp_def <- ncvar_def("prcp","mm",list(timedim),fillvalue,dlname,prec = "double")
# create netCDF file and put arrays
ncout <- nc_create(ncfname,list(prcp_def),force_v4=TRUE)
# put variables and close the file
system.time(ncvar_put(ncout,prcp_def,as.matrix(prcp_tuolumne[ ,1])))
nc_close(ncout)

Here, I summarize the resulting size and write times for each format.

File Type | Size (KB) | Writing Time (seconds)
RDS | 9,065 | 2.13
Compressed RDS | 4,012 | 17.73
HDF5 | 5,204 | 323.03
Text | 26,851 | 12.69
CSV | 27,024 | 12.14
netCDF | 141,360 | 3.12
Parquet | 8,396 | 0.98

Note that the original file size was on the smaller side, so this exercise may seem trivial, but when you are creating many ensembles of both precipitation and temperature, across hundreds of years, across many basins, across many climate change scenarios, these files add up and can potentially limit the scope of the experiments you can conduct if you don’t have enough file storage. Of these file types, one of my recent favorites has been Apache Parquet. Parquet is an open-source column-oriented format that is specifically designed for efficient storage and accessibility of data. Whereas the compressed RDS file and HDF5 beat Parquet in terms of size, it takes much longer to write to these files and subsequently read them back in. Another advantage of Parquet is that it is recognized by common coding languages (R, Matlab, Python), which allows for a more seamless workflow between models written in different languages. If you have a little more storage to work with, Parquet is a good choice to balance size and writing time tradeoffs.
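To see the read-back side of that tradeoff yourself, a quick (hedged) check with arrow::read_parquet and readRDS on the files written above looks like this; timings will of course depend on your machine.

#Quick read-back comparison for the files written above (timings are machine dependent).
library(arrow)
system.time(prcp_from_parquet <- read_parquet("E:/blog/prcp_tuolumne.parquet"))
system.time(prcp_from_rds     <- readRDS("E:/blog/prcp_tuolumne_compressed.rds"))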

Querying data with SQL and DuckDB

Once your data are in an efficient format like Parquet, the next order of business is to make sure that you can easily sort through it and, as Chris would say, “talk to your data and let it talk back”. One way to communicate with your data and ask questions in a more semantically meaningful way is to use Structured Query Language (SQL). There are many methods of querying in R, but a useful data management tool that I have enjoyed interacting with is DuckDB which is a relational database management system. DuckDB has an R API and allows you to use the SQL syntax to query data in an efficient way. It should be noted that you can perform similar aggregation and subsetting using other R base functions, but DuckDB allows you to perform these tasks faster and across multiple files that you may have in a directory.

Let’s use a case where you have developed 30 traces of daily basin-averaged synthetic precipitation for the Tuolumne based on the historical period, and each trace is stored in a respective parquet file (“ensemble1.parquetgzip”, “ensemble2.parquetgzip”, …, “ensemble30.parquetgzip”). Each ensemble looks something like this:

Example of the parquet file structure

Now let’s state a question that I want to ask to the data and see how to structure these questions using SQL and DuckDB.

“Hey DuckDB, can you create a dataframe for me that provides an annual average across all my synthetic traces? Let’s plot that with respect to the observed annual average and show the bounds of the synthetic traces.”

First we need to open a connection to DuckDB. Then we are going to have to create annual averages from the daily historical and synthetic traces and then create max/min bounds to show the additional variability we are getting from the generator.

In order to answer this question, DuckDB will query all our parquet files simultaneously (!) and return a dataframe with annual averages, and we can do some clever sub-querying to get the max/min of the averages. I really enjoy that I don’t have to read the files into my R workspace. R stores the objects it reads in memory, so those of you who have worked with larger datasets might have found that you are limited in how many datasets you can have open at once, which can be frustrating! This is not an issue with DuckDB.

#load DuckDB and its DBI interface (needed for dbConnect and dbGetQuery below)
library(duckdb)
library(DBI)

#Historical trace 
historical=data.frame(dates.year,dates.month,dates.day,rowMeans(prcp_sacsma))
historical_yearly=aggregate(historical,by=list(dates.year),FUN=mean)

# open connection to DuckDB
con <- dbConnect(duckdb::duckdb())

# query to find annual average across all synthetic traces
df=dbGetQuery(con, 
              "SELECT year,
           AVG(precipitation) AS yearly_average
           FROM 'E:/NHMM/blog/*.parquet'
           GROUP BY year
           ORDER BY year")

#For the max/min, we need to find the average first and then return the max/min values in something like a nested approach 

#First create a dataframe with all files 
all_files=dbGetQuery(con,"SELECT *
                FROM 'E:/NHMM/blog/*.parquet'")

# register the dataset as a DuckDB table, and give it a name
duckdb::duckdb_register_arrow(con, "all_files_table", all_files)


annual_average=dbGetQuery(con, 
               "SELECT year,
           AVG(precipitation) AS yearly_average
           FROM all_files_table
           GROUP BY year,sample
           ORDER BY year")

# register the dataset as a DuckDB table, and give it a name
duckdb::duckdb_register_arrow(con, "annual_table", annual_average)

#query to find max
max=dbGetQuery(con, 
               "SELECT year,
           MAX(yearly_average) AS max_yearly_average
           FROM annual_table
           GROUP BY year
           ORDER BY year")

#query to find min
min=dbGetQuery(con, 
               "SELECT year,
           MIN(yearly_average) AS min_yearly_average
           FROM annual_table
           GROUP BY year
           ORDER BY year")
#Plot the results!
library(ggplot2)
library(ggridges) # provides theme_ridges()

ggplot(df, aes(x = year, y = yearly_average)) +
  geom_ribbon(aes(ymin = min$min_yearly_average,
                  ymax = max$max_yearly_average),
              alpha = 0.2, fill = "blue") +
  geom_line(color = "blue", size = 2) +
  geom_line(data = basin.wide.yearly[3:64,], aes(x = year, y = precipitation), color = "black", size = 2) +
  ggtitle("Annual Average Precipitation") +
  xlab("Year") + ylab("Annual Average Precipitation (mm)") + theme_ridges()

Here’s the resulting figure:

Synthetic generation in blue compared to the historical trace in black
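
As a side note, the two-step max/min calculation above can also be collapsed into a single nested query, so the intermediate registered table is not strictly necessary. Here is a sketch, assuming the same ‘year’, ‘sample’, and ‘precipitation’ columns:

bounds <- dbGetQuery(con,
           "SELECT year,
                   MAX(yearly_average) AS max_yearly_average,
                   MIN(yearly_average) AS min_yearly_average
            FROM (SELECT year, sample, AVG(precipitation) AS yearly_average
                  FROM 'E:/NHMM/blog/*.parquet'
                  GROUP BY year, sample) AS annual
            GROUP BY year
            ORDER BY year")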

“Hey DuckDB, from all of the synthetic traces that I have generated, how many instances are there where we are producing daily precipitation that is above 10 mm for specifically the years 1980 or 2000?”

In order to answer this question, DuckDB will once again query all our parquet files simultaneously and return a dataframe with all the instances of >10 mm and only for the years 1980 or 2000. This is how you would write this question in “code”.

library(arrow)
library(duckdb)
library(fs)
library(tidyverse)
library(DBI)
library(glue)

# open connection to DuckDB
con <- dbConnect(duckdb::duckdb())

df=dbGetQuery(con,"SELECT *
                FROM 'E:/blog/*.parquet'
                WHERE precipitation > 10
                AND (year = 1980 OR year = 2000)
                order BY year")

If we look at the size of the resulting dataframe it looks like we have generated 1335 instances of daily precipitation that are greater than 10 mm in specifically the years 1980 or 2000:
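
If all you need is the count rather than the full rows, the tally can also be computed directly in SQL; a quick sketch:

# count the daily exceedances of 10 mm per year across all traces
counts <- dbGetQuery(con,
           "SELECT year, COUNT(*) AS n_days_over_10mm
            FROM 'E:/blog/*.parquet'
            WHERE precipitation > 10
              AND year IN (1980, 2000)
            GROUP BY year
            ORDER BY year")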

“Hey DuckDB, I want to think about meteorological drought. In how many instances am I half a standard deviation below the long-term average monthly precipitation?”

To try this out, let’s just look at one synthetic trace. First we need to find the long term average across the trace and let’s plot what a drought might look like.

# query to grab the first synthetic trace
scenario1=dbGetQuery(con, 
              "SELECT month,year,
           AVG(precipitation) AS monthly_avg
           FROM 'E:/NHMM/blog/ensemble1.parquet'
           GROUP by year, month")

# query to find mean
long_term_average=dbGetQuery(con, 
              "SELECT AVG(precipitation) AS monthly_avg
           FROM 'E:/NHMM/blog/ensemble1.parquet'") 

# query to find the standard deviation
stdev=dbGetQuery(con, 
              "SELECT STDDEV(precipitation) AS monthly_stdev
           FROM 'E:/NHMM/blog/ensemble1.parquet'")

We can then plot the synthetic trace (blue), mean (black), and a 1/2 standard deviation (dashed line) below the mean.
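
The plotting code for that figure is not shown here, but below is a minimal sketch of how it might be built with ggplot2; the column names follow the queries above and the styling choices are assumptions.

library(ggplot2)

ggplot(scenario1, aes(x = year + (month - 1)/12, y = monthly_avg)) +
  geom_line(color = "blue") +                                      # synthetic trace
  geom_hline(yintercept = long_term_average$monthly_avg) +         # long-term mean
  geom_hline(yintercept = long_term_average$monthly_avg -
               0.5 * stdev$monthly_stdev, linetype = "dashed") +   # drought threshold
  xlab("Year") + ylab("Monthly average precipitation (mm)")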

Now we can count the number of months in which we are classified as being in a drought across the synthetic trace. Based on the stats that we calculated in the last block, I hard-coded the drought threshold, but in theory you can nest more complicated sub-queries to compute it on the fly (a sketch of this is shown after the results below).

# register the dataframe as a virtual DuckDB table, and give it a name
duckdb::duckdb_register(con, "scenario_1", scenario1)


instances=dbGetQuery(con, 
                     "SELECT month,year
           FROM scenario_1
           WHERE monthly_avg < 3.125291
           GROUP by year, month
           ORDER by year,month")

It looks like 115/325 months or about 35% of the time we are in a drought in this particular synthetic trace. Yikes! But it makes for a good way to assess the vulnerability of the system to future plausible meteorological drought conditions.
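
To avoid hard-coding the threshold, the whole classification can also be expressed as one nested query. The sketch below computes the mean and standard deviation over the monthly averages themselves, which is an assumption about the intended definition, so the resulting count may differ slightly from the hard-coded version:

instances_nested <- dbGetQuery(con,
           "WITH monthly AS (
                SELECT year, month, AVG(precipitation) AS monthly_avg
                FROM 'E:/NHMM/blog/ensemble1.parquet'
                GROUP BY year, month)
            SELECT year, month
            FROM monthly
            WHERE monthly_avg < (SELECT AVG(monthly_avg) - 0.5 * STDDEV(monthly_avg)
                                 FROM monthly)
            ORDER BY year, month")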

I hope that these approaches can be as useful to you as they have been for me!

The ma-R-velous tidyverse

I am currently taking a class in environmental economics, which of late seems to be doubling as a crash course in the R tidyverse. This blog post will discuss several R packages and demonstrate the application of several tidyverse sub-packages and commands that I have learned to use through this class. Two datasets will be used: the storms dataset, which contains data on the types of storms that occurred in the United States from 1975 to 2015, and the damages dataset, which is a record of the dollar cost of the damages caused by every tropical storm that occurred from 1975 to 2015. The datasets required for this example can be found in this GitHub repository.

Tidy data and the tidyverse?

‘Tidy’ datasets provide a standardized way to link the physical layout of a dataset with the meaning associated with it [1]. A ‘tidy’ dataset has the following characteristics:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each cell contains only one value of only one data type

On the other hand, the tidyverse is a meta-package that bundles the packages facilitating data ‘tidying’, or data cleaning such that the final form of the dataset is ‘tidy’. You can check the full set of packages bundled into the tidyverse using tidyverse_packages(), which will show you the following output in the console:

Within this set, we will be using the dplyr, tidyr and lubridate packages. The dplyr and tidyr packages are “workhorse” packages – their main purpose is to clean and wrangle large datasets, and will be the two packages that we use most often in this example. The lubridate package is used to convert strings into dates, which we will use later in this example.

In this example, we will transform a ‘messy’ dataset into a ‘tidy’ one, and perform several other operations that are enabled by the R tidyverse.

The pacman package management tool

No, this is not the Toru Iwatani-created, dot-munching yellow puck released in 1980 by Namco. Instead, the pacman tool is a convenient library and package wrapper that combines the functionality of base library-related functions into intuitively named functions [2]. It was created to improve workflow by reducing the time required in calling and re-calling obscurely named functions. Ideally, it should be called at the beginning of the R script, like so:

# Run this line
if (!require("pacman")) install.packages("pacman")

This enables you to automatically install and load packages throughout the base R libraries and the R tidyverse using the p_load function. For this example, we will only require the tidyverse package. Load this package using the following script:

# Then run p_load to load the necessary packages
pacman::p_load(
 tidyverse
)

As previously mentioned, this automatically loads the dplyr, tidyr and lubridate packages that we will need for this example.

Before beginning, make sure that you are running the correct versions of dplyr and tidyr. Check the package versions using the packageVersion('package_name') function. All packages should be at least version 1.0.0.
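
For example, a quick check along those lines:

# confirm that the key packages are recent enough (>= 1.0.0)
packageVersion("dplyr")
packageVersion("tidyr")
packageVersion("lubridate")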

Working with the datasets

Some key tidyr verbs

With the explanations out of the way, we begin by loading our two datasets:

# Load dataset
storms <- read.csv("storms.csv")
damages <- read.csv("damages.csv")

Note that the damages dataset is in an inconvenient ‘wide’ shape. To convert damages into ‘tidy’ format, we use pivot_longer(). pivot_longer() is one of the four key tidyr verbs, the remaining three being pivot_wider(), separate() and unite(). These verbs perform the following uses:

  1. pivot_longer(): Vertically stretches the data, increasing the number of rows and decreasing the number of columns
  2. pivot_wider(): Horizontally stretches the data, increasing the number of columns and decreasing the number of rows
  3. separate(): Splits a single character column into multiple columns
  4. unite(): Pastes multiple columns into one

For this example, pivot_longer() is used to turn damages from its original wide shape into a long, 3-column dataframe where each column contains data on the storm’s name, type, and total dollar cost of damages. The code is as follows:

# tidy-up damages
damages <- damages %>% mutate(status = "tropical storm") %>%
  pivot_longer(-status, names_to="name", values_to="damages")

This script first categorizes all the storms within damages as tropical storms, and then assigns the names of each storm to the column ‘name’ and the cost of their damages to ‘damages’. This turns the original dataset:

into the following shape:

This results in a more readable dataset!
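
As a quick sanity check (not part of the original walkthrough), pivot_wider() would reverse this reshaping and recover something close to the original wide layout:

# reverse the reshaping: one column per storm name, values taken from 'damages'
damages_wide <- damages %>%
  pivot_wider(names_from = name, values_from = damages)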

It’s a lubridate!

Next, we will visit our storms dataset. Note that the information regarding the date and time of each storm is inconveniently split across four columns:

We use the as_datetime() function within the lubridate package to combine these columns into one easy-to-interpret column called ‘date’. Use the following script to do so:

# Paste four columns and convert to date-time format
storms <- storms %>% mutate(date=as_datetime(
  paste0(year, "-", month, "-", day, " ", hour, ":00:00")))

This pastes the data in the ‘year’, ‘month’, ‘day’ and ‘hour’ columns together and inserts the formatted date and time into the newly added ‘date’ column:

Great! Our storm dataset is now more readable.
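
As an aside (not what the original code uses), lubridate’s make_datetime() takes the components directly and avoids the string pasting; a minimal sketch:

# equivalent result without building a date string first
storms <- storms %>% mutate(date = make_datetime(year, month, day, hour))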

Some dplyr verbs and operations

There are five key dplyr verbs, some of which you have already seen:

  1. filter(): Obtains a subset of rows based on their column values
  2. arrange(): Reorders rows (in ascending or descending order) based on their values
  3. select(): Selects columns or variables of interest
  4. mutate(): Creates new columns or variables and automatically appends them to the dataframe
  5. summarize(): Collapses multiple rows into a single summary value, commonly by grouping a pre-selected variable

The dplyr package also comes with a set of join operations, which enable the merging of two dataframes, x and y. The types of operations include:

  1. inner_join(): Keeps pairs of observations whenever their key values are equal in both x and y
  2. left_join(): Keeps all observations in x, adding matching columns from y
  3. right_join(): Keeps all observations in y, adding matching columns from x
  4. full_join(): Keeps all observations that appear in either x or y
  5. semi_join(): Keeps all observations in x that have a match in y
  6. anti_join(): Drops all observations in x that have a match in y

In this example, we will only be using the inner_join() function to merge a subset of storms and the whole of damages.
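
The other join verbs follow the same pattern. For instance, a hedged one-liner (assuming ‘name’ is the shared key) that would list storms with no matching damages record:

# storms present in 'storms' but absent from 'damages'
missing_damages <- anti_join(storms, damages, by = "name")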

The verbs can be used to modify datasets using ‘pipes’, represented in R as %>%. We have previously seen the application of the mutate verb when we added the new columns ‘status’ and ‘date’ to the damages and storms datasets, respectively. Now, let’s apply the filter() verb to obtain only tropical storm data from storms:

# filter out only the tropical storms
ts <- storms %>% filter(
    stringr::str_detect(status, "tropical storm")
)

Here, we use stringr::str_detect() to detect the string ‘tropical storm’ within the ‘status’ column of storms, keeping only the matching rows and storing them in the new ts dataframe.

Next, we use the select() function to select all columns but the columns containing the diameter of the hurricane/storm, since this is data that we will not use in this example:

# select a subset of columns
ts <- ts %>% select(!ts_diameter:hu_diameter)

Using ! indicates to the function that you would like to select the complement of the set following the exclamation point. Using : is useful when there are consecutive columns you would like to (de)select. Following this, you will have a pared-down dataset:

Now it’s time to apply our inner_join() operation to merge ts with damages in a new dataframe joined_sd that contains both geophysical information of each tropical storm, as well as its dollar cost of damages.

# joining tropical storms and damages
joined_sd <- inner_join(ts, damages)

where joined_sd has the following form:

Now, we perform some simple operations to obtain the mean dollar cost of damages per windspeed, and arrange each tropical storm in order of decreasing damage per windspeed:

# calculate average cost of damage per windspeed
joined_sd <- joined_sd %>% 
  mutate(damage_per_mph = mean(damages)/wind) %>% 
  arrange(desc(damage_per_mph))

This results in the following final dataset:

From here, you can see that tropical storm Amy resulted in the most expensive damages. Finally, you can also use the summarize() verb to identify the mean of the cost of damages per windspeed of all the tropical storms:

# summarize by mean and sd of cost per windspeed
summary_sd <- joined_sd %>% summarize(
  damage_per_mph=mean(damage_per_mph))

You will find that, on average, the cost of damages per windspeed comes out to about USD 11,244,531 per mph. Plotting the cost of damages with respect to time using

# plot damages wrt time
ggplot(joined_sd, aes(year, damages)) + 
  geom_point() + ylim(min(joined_sd$damages), max(joined_sd$damages)) +
  geom_smooth(method = lm) + ggtitle("Damages ($) with respect to time")

you will find a slight increase in the cost of damages incurred by tropical storms as the years go by.

Conclusion

In this blog post, we walked through two example datasets to demonstrate three sub-packages within the tidyverse: tidyr, lubridate and dplyr. For the full version of this example, please visit the author’s GitHub repository.

References

[1] Tidy data. (2021). Retrieved 14 November 2021, from https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

[2] pacman package – RDocumentation. (2021). Retrieved 14 November 2021, from https://www.rdocumentation.org/packages/pacman/versions/0.5.1

[3] Rudik, I. (2021). AEM 6510 Chapter 10: R and the tidyverse. Presentation.

Tips for Creating Watershed Maps in R

There have been a few posts on this blog on creating watershed maps (here, here, and here), but this post is going to be focused on some of my recent experiences on creating watershed maps in R with files that may be missing attributes, are in the wrong projection, and contain data that need to be clipped to a specific boundary shapefile. There are lots of packages that exist to do one or more of these things, but anyone who has ever tried to create watershed maps in R knows that there isn’t one package that does it all. My main goal for this post is to outline the most efficient workflow and use of packages that also allow for the most compatibility when plotting shapefiles and raster files in one figure.

In this post, we are going to be creating a map of the Tuolumne River Basin boundary and plot elevation data within the basin. All the data are found here. First we will read in the Tuolumne boundary shape file (.shp) and the elevation raster file (.asc which is an ASCII file) using the appropriate functions and do some preliminary plotting to see what we have.

#Import libraries 

library(rgdal)
library(ggplot2)
library(raster)

#Read in Tuolumne shapefile

tuolumne.basin <- readOGR(dsn = "doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/Tuolumne_utm.shp")

#Read in elevation raster

elevation.raster = raster("doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/merced_tuolumne_100mdem_utm.asc")

#Plot the files
 
ggplot() +  geom_polygon(data = tuolumne.basin, aes(x = long, y = lat, group = group), colour = "dark red", fill = NA)
plot(elevation.raster)
Raw shapefile and raster data

So we have the pieces that we need to build the map, but notice that the latitude and longitude are in the wrong projection. We can use the following command to check what projection the shapefile is in:

proj4string(tuolumne.basin) 

We see that the output is: “+proj=utm +zone=11 +datum=NAD83 +units=m +no_defs”. So we are in the Universal Transverse Mercator coordinate system, but we should change to WGS 84. We can do this using the function “spTransform” which will swap out the projection by adjusting the CRS (Coordinate Reference System) attribute of the shapefile. You can use “proj4string” to verify that the transformation took place.

 tuolumne.basin.transformed <- spTransform(tuolumne.basin, CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))

Now we need to transform the raster file’s coordinates. Note that the raster file doesn’t have an associated coordinate reference system listed. If you try to change the projection at this point, you will get an error. This is a minor inconvenience since we know that the coordinate system should match that of the raw Tuolumne shapefile, so we can just insert the original coordinate system as a string under the projargs attribute. Then we can transform it to match the coordinate system of the transformed shapefile using “projectRaster”.

elevation.raster@crs@projargs <- "+proj=utm +zone=11 +datum=NAD83 +units=m +no_defs" 
  
elevation.raster.transformed <- projectRaster(elevation.raster,crs=crs(tuolumne.basin.transformed))

Great, now we have all the data in the right projection. Next we have to clip the raster layer to show only the data within the bounds of our shapefile. We first use the “crop” function in the raster library to clip the layer based on the extent of the shapefile boundary, and then use the “mask” function to set the cells outside the boundary to NA. It is important to do both, otherwise the clip will not work!

elevation.raster.transformed.cropped <- crop(elevation.raster.transformed, extent(tuolumne.basin.transformed))
elevation.raster.transformed.cropped <- mask(elevation.raster.transformed.cropped, tuolumne.basin.transformed)

Now we need to get the appropriate elevation values and coordinates from the raster object so that we can plot it using ggplot. When we use ggplot here, notice that we only need geom_raster and the elevation data, since the clipped data will perfectly follow the shapefile boundary.

#Isolate elevation values from the raster file

val <- getValues(elevation.raster.transformed.cropped)
xy <-as.data.frame(xyFromCell(elevation.raster.transformed.cropped,1:ncell(elevation.raster.transformed.cropped)))
xy <- cbind(xy,val)

#Plot it!

ggplot()+geom_raster(data=xy, aes(x=x, y=y, fill=val))+ scale_fill_viridis_c()+theme_bw()
A not so pretty watershed figure

It’s almost perfect, aside from that gray box that results from clipping and masking. After we clip and mask, all values outside the boundary of the shapefile are converted to NAs, which fall outside the bounds of our color scale. To fix this, we simply insert an additional argument to scale_fill_viridis_c() and we also make some additional aesthetic changes to the theme.

#Final plot function

ggplot()+geom_raster(data=xy, aes(x=x, y=y, fill=val))+ scale_fill_viridis_c(na.value=NA,name = "Elevation (m)")+theme_bw()+ggtitle("Tuolumne River Basin Elevation (m)")+xlab("Longitude") + ylab("Latitude")+theme(text = element_text(size = 20)) 
A pretty watershed figure!
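
As a final aside, the rgdal package used above has since been retired; the same workflow can be sketched with the sf and terra packages. This is an assumption-laden sketch of an alternative route (run in a fresh session), not the approach used in this post.

library(sf)
library(terra)
library(ggplot2)

# read, reproject, crop, and mask with the newer spatial stack
basin <- st_read("doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/Tuolumne_utm.shp")
basin <- st_transform(basin, crs = 4326)  # WGS 84
dem <- rast("doi_10.6071_M3FH3D__v5/tuolumne_merced_data_2009-2015/Merced_Tuolumne_Dataset_SpatialData/SpatialData/merced_tuolumne_100mdem_utm.asc")
crs(dem) <- "+proj=utm +zone=11 +datum=NAD83 +units=m +no_defs"  # assign the missing CRS
dem <- project(dem, "EPSG:4326")
dem <- mask(crop(dem, vect(basin)), vect(basin))

# terra rasters convert cleanly to dataframes for ggplot
dem_df <- as.data.frame(dem, xy = TRUE, na.rm = TRUE)
names(dem_df)[3] <- "elevation"
ggplot() +
  geom_raster(data = dem_df, aes(x = x, y = y, fill = elevation)) +
  scale_fill_viridis_c(name = "Elevation (m)") + theme_bw()
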
Basics of data visualization with ggplot2

In my previous post, I showed how wonderful the ggplot2 library in R is for visualizing complex networks. I realized that while there are several posts on this blog going over the advanced visualization capabilities of the ggplot2 library, there isn’t a primer on structuring code for creating graphs in R…yet. In this post, I will go over the syntax for creating pretty ggplot2 graphs and tweaking various parameters. I am a self-declared Python aficionado, but love using ggplot2 because it is intuitive to use, beginner-friendly, and highly customizable all at the same time.

Dataset and setup

For this tutorial, I will be using one of the built-in datasets in R called mtcars which was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Further documentation on this dataset can be found here. We import the data into our RStudio workspace.

# import the library into our workspace
library(ggplot2)

# import dataset
data(mtcars)
head(mtcars)

The resultant dataset looks something like this.

Basic plot

Now that we have the data, we can get to plotting with ggplot2. We can declaratively create graphics using this library. We just have to provide the data, specify how to map properties to graph aesthetics, and the library takes care of the rest for us! We need to specify three things for each ggplot — 1) the data, 2) the aesthetics, and 3) the geometry.

Let us start by creating a basic scatterplot of the mileage (mpg) of each car as a function of its horsepower (hp). In this case the data is our dataframe mtcars, and the aesthetics x and y will be defined as the names of the columns we wish to plot along each axis — hp and mpg. We can also set the color aesthetic to indicate the number of cylinders (cyl) in each car. One of the reasons ggplot2 is so user-friendly is because each graph property can be tacked on to the same line of code with a + sign. Since we want a scatterplot, the geometry will be defined using geom_point().

# basic scatterplot
g <- ggplot(data = mtcars, aes(x = hp, y = mpg, color=cyl))
g + geom_point()

Excellent! The library automatically assigns the column names as axis labels, and uses the default theme and colors, but all of this can be modified to suit our tastes and to create pretty graphs. It is also important to note that we could have visualized the same data (less helpfully) as a line plot instead of a scatterplot, just by tweaking the geometry function.

# basic line plot
g + geom_line()

Well, this looks unpleasant. But wait, we can do so much more. We can also layer multiple geometries on the same graph to make more interesting plots.

# basic scatter+line plot
g + geom_line() + geom_point()

Additionally, we can tweak the geometry properties in each graph. Here is how we can transform the lines to dotted, and specify line widths and marker shape and size.

# change properties of geometry
g + geom_point(shape = "diamond", size = 3) +
  geom_line(color = "black", linetype = "dotted", size = .3) 

While our graph looks much neater now, using a line plot is actually pretty unhelpful for our dataset since each data point is a separate car. We will stick with a scatterplot for the rest of this tutorial. However, the above sort of graph would work great for time series data or other data that measures change in one variable.

Axis labels

One of the cardinal rules of good data visualization is to add axis labels to your graphs. While R automatically sets the axis labels to be the column headers, we can override this to make the axis labels more informative with just one extra function.

# change axis titles
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)")

Title

This graph is in serious need of a title to provide a reader some idea of what they’re looking at. There are actually multiple ways to add a graph title here, but I find it easiest to use ggtitle().

# add title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") 

Alright, having a title is helpful, but I don’t love its placement on the graph. R automatically left-aligns the title, where it clashes with the y-axis. I would much rather have the title right-aligned, in a bigger font, and bolded. Here is how to do that.

# change position of title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

Theme

There are ways to manually change the background and gridlines of ggplot2 graphs using theme(), but an easy workaround is to use the built-in themes. Which theme you use depends greatly on the graph type and formatting guidelines, but I personally like a white background, faint gridlines, and a bounding box. One thing to note here though is that theme_bw() overrides theme() so the order of these two matters.

# add theme
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

We can also use the theme() function to change the base font size and font family. Shown below is how to increase the base font size to 15 and change the base font family to Courier.

# use theme to change base font family and font size
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme_bw(base_size = 15, base_family = "Courier") +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

Legend

It has been bothering me for the last seven paragraphs that my legend title still uses the column name. However, this is an easy fix. All I have to do is add a label to the color aesthetic in the labs() function.

# change legend title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

We can also change the position of the legend. R automatically places legends on the right, and while I like having it to the right in this case, I could also place the legend at the bottom of the graph. This automatically changes the aspect ratio of the graph.

# change legend position
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "bottom")

Margins

The theme() function is of endless use in ggplot2, and can be used to manually adjust the graph margins and add/remove white space padding. The order of arguments in margin() is clockwise: top, right, bottom, left (helpfully remembered by the mnemonic TRouBLe).

# add plot margins
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "right") +
  theme(plot.margin = margin(t = 1, r = 1, b = 1, l = 2, unit = "cm"))
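
One last step that often follows (not covered above) is writing the figure to disk; a minimal sketch with ggsave(), where the filename and dimensions are arbitrary:

# save the most recently displayed ggplot to a file
ggsave("mileage_vs_horsepower.png", width = 7, height = 5, dpi = 300)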

Conclusion

I have barely scratched the surface of what can be achieved using ggplot2 in this post. There are hundreds of excellent tutorials online that dive deeper into ggplot2, like this blog post by Cedric Scherer. I have yet to learn so much about this library and data visualization in general, but have hopefully made a solid case for using ggplot2 to create clean and aesthetically-pleasing data visualizations.

Custom Plotting Symbols in R

R has a lot of options for plotting symbols, but sometimes you might want something that is not in the base set. This post will cover a way to get a custom plotting symbol in R. The code is available on GitHub.

I will use data for the 24 nations that competed in the 2021 Euros (men’s) tournament, specifically the FIFA rankings of the men’s and women’s national teams (accessed 12 Aug 2021). Using a flag to represent each country seems reasonable, and easier than text or plotting symbol and color combinations. Plus, it might add a bit of visual interest.

Although this might seem specialized, it highlights changing backgrounds in R base graphics, controlling aspect ratios, and adding mathematical notation to plots.

Some prep work before coding comes in handy. First, I downloaded PNG files of each country’s flag and saved each one with the country’s name. Next, I checked Wikipedia for the aspect ratio of each flag and added it to my CSV file of the FIFA rankings.

The code begins by loading the library we will be using for a custom PCH and then reading in my data from the CSV. The men’s rankings range from 1 to 62 and the women’s rankings from 2 to 131.

library(png) # library for a custom pch

df <- read.csv('/Users/calvinwhealton/Documents/GitHub/euros_2021_rankings/fifa_national_team_rankings.csv',
               header=T) # read-in data from csv

# calculating ranges for men's and women's teams
men_range <- range(df$Men.s.National.Team)
women_range <- range(df$Women.s.National.Team)

Next, we set the basic parameters of the plot. Using the png() function allows me to save the plot directly to a file and have a stable image I can fine-tune. The height and width needed to be modified until the axes were close to the same scale (10 ranks on x = 10 ranks on y). This will make life easier when trying to scale the flags with the aspect ratio.

# setting plot to save as a png file
# checked the output file to make sure
# axes were on the same scale, or close to it
png(filename='/Users/calvinwhealton/Documents/GitHub/euros_2021_rankings/fifa_euro_plot.png'
    ,width=round(women_range[2]/15,2)
    ,height=round(men_range[2]/11.5,2)
    ,units='in'
    ,res=300)

par(mar=c(5,5,5,5)) # plot margins

Now, we can initialize the plot. Because we are not plotting the data yet, we can simply label the axes. Default axis ticks would include 0, so I used more natural axis ticks. The gray background makes flags with white more visible.

# setting up the base scatter plot
# used \ to get the "'" in titles
plot(x=NA
     ,y=NA
     ,ylab=expression('Men\'s Ranking')
     ,xlab=expression('Women\'s Ranking')
     ,main='Euros 2021 FIFA Nations\nMen\'s vs Women\'s National Teams'
     ,xlim=rev(women_range) # reverse axes to rank 1 is top right
     ,ylim=rev(men_range) # reverse axes to rank 1 is top right
     ,las=1 # rotate vertical axis labels to be horizontal
     ,xaxt='n' # no default x axis
     ,yaxt='n' # no default y axis
)

# adding axes
axis(side=1, at=c(1,seq(10,women_range[2],10)),las=1)
axis(side=2, at=c(1,seq(10,men_range[2],12)),las=1)

# adding a gray background to the plot
# many flags have a white portion that would disappear otherwise
rect(xleft=par('usr')[1],xright=par('usr')[2]
     ,ybottom=par('usr')[3],ytop=par('usr')[4]
     ,col='lightgray')

Now comes the fun part. Not all flags are shaped the same. The two extremes in these countries are Switzerland (aspect ratio = 1) and Croatia, Hungary, and North Macedonia (aspect ratio = 2). We will be telling the function the corners of where to plot the flag, so using the same x and y offset for all flags will lead to distortions.

To solve this problem, I set a value for the area of all flags. This will ensure that each country has the same visual impact. After some math, that means the x and y dimensions for each flag can be found based on the aspect ratio. Once we have all these, we can loop through the countries, calculate the coordinates for the flag corners, and use the country’s flag as the plotting symbol.

# setting approximate "area" for each country flag
flag_area <- 25

# looping over all the rankings
for(i in 1:nrow(df)){
  # read in the image and set it to a variable
  img <- readPNG(paste('/Users/calvinwhealton/Documents/GitHub/euros_2021_rankings/flags/',df$Country[i],'.png',sep=''))
  
  # math for same flag areas (approximately)
  # area = dx*dy = (dy^2)*(aspect_ratio)
  # dy = sqrt(area/aspect_ratio)
  # dx = sqrt(area*aspect_ratio)
  dy <- sqrt(flag_area/df$Aspect.Ratio[i])
  dx <- sqrt(flag_area*df$Aspect.Ratio[i])
  
  # plot the png image as a raster
  # need the image and coordinates
  # men's team on y axis, women's team on x axis
  rasterImage(image=img
              ,ytop=df$Men.s.National.Team[i]+dy/2
              ,ybottom=df$Men.s.National.Team[i]-dy/2
              ,xleft=df$Women.s.National.Team[i]-dx/2
              ,xright=df$Women.s.National.Team[i]+dx/2)
}

For a little bit of fun, and to illustrate one way to get Greek letters in a plot notation, I calculated the correlation of these ranks. The last line of this code closes the figure file.

# adding a text correlation
# using linear correlation of ranks, not rank correlation
correl <- round(cor(df$Men.s.National.Team,df$Women.s.National.Team),2)

# adding text notation for correlation
text(x=women_range[2]-20,
     y=men_range[1]+10,
     labels = bquote(paste("Linear Corr. (", rho, ") = ", .(correl)))
     )

# close file
dev.off()

And we are done! We see that the flags do look to be about the same area and have approximately correct aspect ratios. (I won’t say how long it took me to get that part figured out…) I hope this provides a fun way to change up your plots and help an audience better understand the data. Although I used flags here because it seemed the most natural for this data, any PNG file could be used based on your data and needs.

fifa_euro_plot

Visualizing large directed networks with ggraph in R

How you choose to visualize complex multidimensional data can significantly shape the insights your audience derives from the plots. My colleague Antonia has written a couple of excellent blog posts on analyzing geospatial networks in Python using the NetworkX library which can be found here and here. I generally lean towards Python for coding but have recently come around on R, mostly because of how easy it is in R to make pretty network visualizations. I will go over some basic network visualizations in R using the igraph and ggraph libraries in this blog post. All the code and data I am using can be found here.

The data I will be using in this post is a processed and cleaned csv-file of Upper Colorado River Basin (UCRB) user interactions obtained from CDSS. The Colorado River is one of the most important river systems in North America, supplying water to millions of Americans. The river has been facing a record 20-year drought, a situation that is further complicated by the prior appropriation doctrine, under which senior users can put junior ones out of priority until their own needs are met.

Shown below is a snippet of the dataset. Column A shows the user whose water right was put out of priority in 2002 by a water right in column C. Column E shows the total number of days that a water right in column A was put out of priority by one in column C in 2002. The rest of the columns contain user attributes.

Now on to turning this lifeless spreadsheet into some pretty pictures. We begin by importing all the necessary libraries.

library(tidyverse)
library(igraph)
library(ggraph)
library(dplyr)
library(ggplot2)

Next we will import the csv-file shown above, and create a list of nodes and edges. This information can be used by the igraph library to create a directed network object. In this directed network, the source node of each edge is priorityWdid (column C) while the destination node is analysisWdid (column A), since the former is putting a call on the latter to divert flow away.

# read network .csv file
data <- read.csv('network_csv_files/priorityWdid/v2_network_2002.csv')

# create nodes
from <- unique(data[c('priorityWdid','priorityStructure', 'priorityNetAbs', 
                      'priorityStreamMile')]) %>% rename(wdid = priorityWdid) %>%
  rename(structure = priorityStructure) %>% rename(netAbs = priorityNetAbs) %>%
  rename(streamMile = priorityStreamMile)
to <- unique(data[c('analysisWdid','analysisStructure', 'analysisNetAbs',
                    'analysisStreamMile')]) %>% rename(wdid = analysisWdid) %>% 
  rename(structure = analysisStructure) %>% rename(netAbs = analysisNetAbs) %>%
  rename(streamMile = analysisStreamMile)
nodes <- unique(rbind(from, to))

# create edges
edges <- data[c('priorityWdid', 'analysisWdid', 'sumWtdCount')] %>% rename(from = priorityWdid) %>% rename(to = analysisWdid)

# create network using igraph package
network <- graph_from_data_frame(d = edges, vertices = nodes, directed = TRUE)

This network has over 1400 nodes! That would be an unreadable mess of a visualization, so let us filter down this network to only include interactions between the 200 senior-most water rights in the UCRB.

# only include top 200 users by seniority
users <- read.csv('../data/CDSS_WaterRights.csv')
users <- head(users[order(users$Priority.Admin.No),], 200)

top_users <- users$WDID
for (i in 1:length(top_users)){
  top_users[i] <- toString(top_users[i])
}

# create subnetwork with only top 200 users by seniority
sub_nodes <- intersect(nodes$wdid, top_users)
subnet <- induced.subgraph(network, sub_nodes)

Excellent! Only 37 of the top 200 users in the UCRB interacted with each other in 2002. This is a much more manageable number for plotting. Let us start with the most basic visualization. In keeping with ggplot syntax, we can conveniently just keep adding plot specifications with a “+”.

# basic network graph
ggraph(subnet, layout = 'stress') +
  ggtitle('2002') + 
  geom_edge_link() +
  geom_node_point() +
  theme_graph()

Alright this is nice, but not particularly insightful. We have no idea which user each node corresponds to, and this figure would have us believe that all nodes and edges were created equal. When we constructed the network we actually added in a bunch of node and edge attributes, which we can now use to make our visuals more informative.

Shown below is a more helpful visualization. I went ahead and added some attributes to the network, and labelled all the nodes with their structure names. The nodes are sized by their out-degree and colored by their stream mile. The edge widths are determined by the total number of days the source node put the destination out of priority. In order to do this, I leveraged an amazing feature of ggplot2 and ggraph called aesthetic mapping, which is a quick way to map variables to visual cues on a plot. It automatically scales the data and creates a legend, which we can then further customize.
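
One note: the out-degree attribute (deg) used in the code below is not computed in the snippets above. Here is a minimal sketch of how it might have been added with igraph; the attribute name is an assumption chosen to match the plotting code.

# store each node's out-degree as a vertex attribute for aesthetic mapping
V(subnet)$deg <- degree(subnet, mode = "out")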

# plot graph in circular layout
ggraph(subnet, layout = "circle") +
  ggtitle('2002: "circle" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue', 
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") + 
  geom_node_point(aes(size = deg, colour=streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = structure), repel = TRUE, size=2) +
  theme_graph()

The above network has a circle layout because it’s the easiest to read and is replicable. But there are actually a number of layouts available to choose from. Here is another one of my favorites, graphopt. While this layout is harder to read, it does a better job of revealing clusters in the network. The only change I had to make to the code above was swap out the word ‘circle’ for ‘graphopt’!

# plot graph in graphopt layout
set.seed(1998)
ggraph(subnet, layout = "graphopt") +
  ggtitle('2002: "graphopt" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue', 
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") + 
  geom_node_point(aes(size = deg, colour=streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = structure), repel = TRUE, size=2) +
  theme_graph()

The above graph would be a lot easier to read if it weren’t for the long labels cluttering everything up. One way to deal with this is to adjust the opacity (alpha) of the text by the degree of the node. This way only the important and central nodes will have prominent labels. Again, all I have to do is add two extra words to the geom_node_text() call in the code block above. Notice that I also set show.legend to FALSE because I don’t want a legend entry for text opacity in my plot.

ggraph(subnet, layout = "graphopt") +
  ggtitle('2002: "graphopt" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue', 
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") + 
  geom_node_point(aes(size = deg, colour=streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = structure, alpha = deg), repel = TRUE, size=2, show.legend = F) +
  theme_graph()

This is just a small sampling of the possibilities for network visualization in R. I have only just begun exploring the igraph and ggraph libraries, but the syntax is fairly intuitive, and the resultant plots are highly customizable. The data-to-viz blog is a pretty incredible resource to look at other network visualizations in R, if you are interested.

ggplot (Part 4) – Animated Geospatial Maps

Most of the time, we need to deal with complex systems that have several components, each with different properties and behavior. Usually, these properties vary with time and space, and understanding their spatiotemporal dynamics plays a key role in our understanding of the system performance and its vulnerabilities. Therefore, aggregation in time and space would hide variations that might be important in our analysis and understanding of the system behavior. In this blog post, I am going to show how we can add bar plots onto a map at different locations for better visualization of variables. Some examples for when this type of visualization is useful are irrigation districts that have different water rights or cropping patterns that affect their crop production and water requirements in dry or wet years; another example would be the sensitivity indices of specific model parameters that are clustered based on their location and varying based on the wetness and dryness of the system. To demonstrate this, I am going to create an animated graph that shows annual crop yield variations in different counties in the State of Washington. You can download the Washington counties’ data layer (WA_County_Boundaries-shp.zip) and NASS historical yield data for corn, winter wheat, and spring wheat (NASS.txt) from here.

library(rgdal)  
shp_counties <- readOGR(dsn = file.path("(your directory)/WA_County_Boundaries-shp/WA_County_Boundaries.shp"))

“shp_counties” is a SpatialPolygonsDataFrame object and has a different format than a regular dataframe. It has a few slots that contain different types of information. For example, the “data” slot holds the non-geographic attributes. We can explore the information for each polygon with:

head(shp_counties@data)
#The "JURISDIC_2" and "JURISDIC_3" columns both contain the names of counties.

To visualize it with ggplot, it has to first be converted into a dataframe. Then, we can use it with the geom_polygon() function:

library(ggplot2)
counties <- fortify(shp_counties, region="JURISDIC_2")
head(counties)
#long     lat order  hole piece    id   group
#1 -13131351 5984710     1 FALSE     1 Adams Adams.1
#2 -13131340 5983102     2 FALSE     1 Adams Adams.1
ggplot() +  geom_polygon(data = counties, aes(x = long, y = lat, group = group), colour = "dark red", fill = NA)

Now, I am going to extract the center of each polygon so that I can later add the bar plots to these coordinates:

library(rgeos)
counties_centroids<- as.data.frame(gCentroid(shp_counties, byid=T))

We are going to extract just a few years of data from the yield dataset to show on the map:

library(data.table)
yield<- fread("(your directory)/NASS.txt")
#   Year  Ag_District     Ag_District_Code  Data_Item  Value  County  JURISDIC_5 Group
#1: 1947 EAST CENTRAL               50     CORN, GRAIN  41.0   Adams    53001   CORN
#2: 1948 EAST CENTRAL               50     CORN, GRAIN  40.0   Adams    53001   CORN
years<- as.data.frame(unique(yield$Year))
years<- as.data.frame(years[34:42,])

For the final map, I am going to need a common legend for all of the bar plots in all of the counties and years in which I am interested. So, I need to know all of the categories that are available:

unique(yield$Group)
unique(yield$Data_Item)

Based on the “Group” column, we know that there are three main groups of crops: corn, winter wheat and spring wheat. Based on the “Data_Item” column, we know that there are three different types of winter and spring wheat (non-irrigated, non-irrigated with continuous cropping (CC), and non-irrigated with summer fallow (SF)). Note that we do not have all of these crop types for every county and year, and the common legend should be created from a location where all the crop types are available, so that it is applicable to all of the counties and years that I am eventually going to plot. For this reason, I am going to subset the data and create a bar plot for one county and year in which all of the crop types are available:

sample_yield<- subset(yield, Year=="1981" & County=="Adams")

I want to use custom colors so that each crop Group has a different color and each Data_Item corresponding to a Group shares the same color theme, gradually changing from darker to brighter. First, I create three color functions, each corresponding to one of the 3 Groups in my dataset.

colfunc1 <- colorRampPalette(c("darkRED"))
colfunc2 <- colorRampPalette(c("darkBLUE", "lightBLUE"))
colfunc3 <- colorRampPalette(c("darkGREEN", "lightgreen"))

Now I am going to use these functions to create a color code for each Group and save it in “colors”:

colors<-c(colfunc1(nrow(subset(sample_yield,sample_yield$Group=="CORN"))),colfunc2(nrow(subset(sample_yield,sample_yield$Group=="SW"))),colfunc3(nrow(subset(sample_yield,sample_yield$Group=="WW"))))

Next, in the template bar plot, I can use the customized “colors” in scale_fill_manual() function:

ggplot(sample_yield,aes(x=Group,y=Value,fill=factor(Data_Item)))+
  scale_fill_manual(values=colors)+
  geom_bar(stat='identity', color = "black")

From this plot, we just need to extract the legend. Also, we need to change its font size since we are going to add it to another plot. You should try different sizes to find out which one is more appropriate for your dataset/figure. I am saving this plot in “tmp” while I remove the legend title and change the font size. Then, I extract the legend section by using the get_legend() function and convert it to a ggplot by using the as_ggplot() function.

tmp <- ggplot(sample_yield,aes(x=Group,y=Value,fill=factor(Data_Item)))+
  scale_fill_manual(values=colors)+
  geom_bar(stat='identity', color = "black")+theme(legend.title = element_blank(),legend.text =element_text(size=19,face="bold",colour="black") )
library(cowplot)
legend<- get_legend(tmp)
library(ggpubr)
tmp_legend<- as_ggplot(legend)

Now, I am going to save the counties map with the appropriate font size and title and add the extracted legend from the yield data to it with the geom_subview() function. You need to adjust the coordinates where you want to show the legend on the map based on your data.

tmp_map <- ggplot(counties)+
  geom_polygon(aes(long, lat, group=group), fill="white")+
  geom_path(color="black", mapping=aes(long, lat, group=group), size=1.5)+
  theme(axis.text.y=element_text(size=17,face="bold",colour="black"),
        axis.text.x=element_text(size=17,face="bold",colour="black"),
        axis.title.y =element_text(size=17,face="bold",colour="black"),
        axis.title.x =element_text(size=17,face="bold",colour="black"),
        plot.title =element_text(size=25,face="bold",colour="black"))+
  labs(title = "NASS Yield Data (1980-1995)")
library(ggimage)
tmp_map_final<- tmp_map+ geom_subview(x=-13830000, y=5730000, subview=tmp_legend)

In the below section, I am going to show how we can loop through all of the years and then all of the counties in the county map, use the center coordinate of each county (polygon), plot the corresponding bar plot, and print it in the center of each polygon. We can use the ggplotGrob() function to extract different pieces of the bar plot created with ggplot. With this function, we can treat each part of the bar plot (background, axis, labels, main plot panel, etc.) as a graphical object and move it to the coordinates that interest us. For example, if we just want to use the main panel and not any other components, we can extract the panel and adjust the other components as we wish for the final graph.

In this example, I add all of the bar plots for the counties in one year as graphical objects in a list, “barplot_list”. Then, using the annotation_custom() function, I add each item in “barplot_list” at the coordinates of the center of its polygon (county), which I already extracted. Note that the order of the center coordinates matches the order of the plots in “barplot_list”.

At the end, I add the base map with the customized legend (“tmp_map_final”) together with the list of all bar plots at their customized locations (“barplots_list”). Then, I add them all to a list (“all_list”) and repeat this process for every year. The last step is to save this list with saveGIF() to create an animated gif. Note that we can use the same procedure but replace the bar plots with other types of plots such as pie charts.

counties_list<- as.data.frame(unique(counties$id))  #list of counties  
all_list<- list()
for (y in 1:nrow(years)){   #loop through all of the years
nass_oneyear<- subset(yield,yield$Year==years[y,])   #extract one year  
barplot_list <- 
  ##create bar plot for each county
  lapply(1:length(shp_counties$JURISDIC_2), function(i) { 
    #extract one county  
    nass_oneyear_onecounty<- subset(nass_oneyear,nass_oneyear$County==counties_list[i,])
    # As I mentioned, for each county and year the number of crop types might be different. So, I need to customize the color for each sub-dataset using the manual color ramp that I previously defined for each item.
    nass_oneyear_onecounty$itemcolor<- "NA"
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="CORN, GRAIN"]<- colors[1]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED"]<- colors[2]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED, CC"]<- colors[3]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED, SF"]<- colors[4]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED"]<- colors[5]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED, CC"]<- colors[6]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED, SF"]<- colors[7]
    plots_comp <- ggplotGrob(
      ggplot(nass_oneyear_onecounty,aes(x=Group,y=Value,group=(itemcolor),fill=factor(Data_Item)))+
        scale_fill_manual(values=nass_oneyear_onecounty$itemcolor)+
        geom_bar(stat='identity', color = "black") +
        labs(x = NULL, y = NULL) + 
        theme(legend.position = "none", rect = element_blank(), axis.title.x = element_blank(), 
              axis.title.y = element_blank(),
              axis.text.x= element_blank(),
              axis.ticks = element_blank(),
              axis.text.y=element_text(size=14,face="bold",colour="black")) + coord_cartesian(expand=FALSE) 
    )})
barplots_list <- lapply(1:length(shp_counties), function(i) 
  annotation_custom(barplot_list[[i]], xmin = counties_centroids[i,1]- 28000, ymin = counties_centroids[i,2]- 28000, xmax =  counties_centroids[i,1]+ 28000, ymax = counties_centroids[i,2]+ 28000))
# xmin, ymin, xmax and ymax can be used to adjust the horizontal and vertical location of the bar plots
all_list[[y]] <- list(tmp_map_final + barplots_list)
}

library(animation)
saveGIF(
  {lapply(all_list, print)}
  , "(your directory)/final.gif", interval = 2, ani.width = 1600, ani.height = 1200)

If we want to have labels for our bar plots (instead of only having yield values on the y-axis), we may want to show the yield value corresponding to each item in a group. In this case, I can use the command lines below within the loop, before we create the graphical object with ggplotGrob (before plots_comp <- ggplotGrob(….)), to add a label column that shows the cumulative yield value in each group.

nass_oneyear_onecounty <- nass_oneyear_onecounty %>%  
  group_by(Group) %>%
mutate(label_y = cumsum(Value)-10)  #I subtracted 10 from these cumulative values just to print them inside of the bar plot sections. 

Or we can just have one label that shows the total yield in each group:

nass_oneyear_onecounty <- nass_oneyear_onecounty %>%  
group_by(Group) %>%
 mutate(total = sum(Value))

Then we can add these labels to the bar plots by adding the geom_text() function in the ggplot section within ggplotGrob(), and specifying the column of interest: 

geom_text(aes(y = label_y, label = round(Value)),colour = "white",size=3,fontface='bold')

Instead of manually adjusting the position the of label (such as in the first example), “vjust” can be added to the geom_text() for modifying text alignment:

geom_text(aes(y = total, label = round(total)),colour = "black",size=3,fontface='bold',vjust = -0.35)