This Halloween, you may be scared by many things, but don’t let one of them be copulas. Dave Gold wrote an excellent post on the theory behind copulas, and I’ll build off of that post with a simple coding exercise. When I first heard about copulas, they sounded like exactly what I needed to capture joint behavior across variables (which I then wanted to use to capture joint hydrologic behavior across watersheds). However, because I hadn’t learned about them formally in a class, I had a hard time translating what I was reading in dense statistics textbooks into practice and found myself blindly trying R packages without making much progress. What finally helped me wrap my head around copulas was coding a simple trivariate Gaussian copula from scratch. If copulas are new to you, hopefully this little exercise will be helpful to you as well.
Let’s pose a scenario that illustrates why a copula is helpful. Imagine that we have historical daily full natural flow for three gauged locations in California’s Central Valley: Don Pedro Reservoir, Millerton Lake, and New Melones Lake. Because these three locations are so important for the region’s water supply, it’s useful to quantify the likelihood of joint flooding or drought tendencies across the basins. For this post, let’s focus on joint flooding.
First, let’s fit a distribution for each basin individually. I calculate the water year, store it in a new column, and then use the daily flows to determine the peak three-day flow for each year and each basin. Given the different residence times of the basins, a three-day aggregate flow captures concurrent events better than a one-day flow. Let’s term these flows [x_(t,1), …, x_(t,n)], the peak annual three-day flows in each of the n basins in year t.
library(evd)
library(mvtnorm)
library(extRemes)
library(mnormt)
#Read in annual observed flows
TLG_flows=read.table("C:/Users/rg727/Downloads/FNF_TLG_mm.txt")
MIL_flows=read.table("C:/Users/rg727/Downloads/FNF_MIL_mm.txt")
NML_flows=read.table("C:/Users/rg727/Downloads/FNF_NML_mm.txt")
#Calculate water year column
TLG_flows$water_year=0
for (i in 1:dim(TLG_flows)[1]){
  if (TLG_flows$V2[i]==10|TLG_flows$V2[i]==11|TLG_flows$V2[i]==12){
    TLG_flows$water_year[i]=TLG_flows$V1[i]+1
  } else{
    TLG_flows$water_year[i]=TLG_flows$V1[i]
  }
}
MIL_flows$water_year=0
for (i in 1:dim(MIL_flows)[1]){
  if (MIL_flows$V2[i]==10|MIL_flows$V2[i]==11|MIL_flows$V2[i]==12){
    MIL_flows$water_year[i]=MIL_flows$V1[i]+1
  } else{
    MIL_flows$water_year[i]=MIL_flows$V1[i]
  }
}
NML_flows$water_year=0
for (i in 1:dim(NML_flows)[1]){
  if (NML_flows$V2[i]==10|NML_flows$V2[i]==11|NML_flows$V2[i]==12){
    NML_flows$water_year[i]=NML_flows$V1[i]+1
  } else{
    NML_flows$water_year[i]=NML_flows$V1[i]
  }
}
#Calculate the 3-day total flow centered on each day
#(the first and last days use a 2-day sum to avoid indexing off the ends)
TLG_flows$three_day=0
for (i in 1:length(TLG_flows$V4)){
  if (i == 1){
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i],TLG_flows$V4[i+1])
  } else if (i == length(TLG_flows$V4)){
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i-1],TLG_flows$V4[i])
  } else{
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i-1],TLG_flows$V4[i],TLG_flows$V4[i+1])
  }
}
MIL_flows$three_day=0
for (i in 1:length(MIL_flows$V4)){
  if (i == 1){
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i],MIL_flows$V4[i+1])
  } else if (i == length(MIL_flows$V4)){
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i-1],MIL_flows$V4[i])
  } else{
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i-1],MIL_flows$V4[i],MIL_flows$V4[i+1])
  }
}
NML_flows$three_day=0
for (i in 1:length(NML_flows$V4)){
  if (i == 1){
    NML_flows$three_day[i]=sum(NML_flows$V4[i],NML_flows$V4[i+1])
  } else if (i == length(NML_flows$V4)){
    NML_flows$three_day[i]=sum(NML_flows$V4[i-1],NML_flows$V4[i])
  } else{
    NML_flows$three_day[i]=sum(NML_flows$V4[i-1],NML_flows$V4[i],NML_flows$V4[i+1])
  }
}
#Calculate the peak annual 3-day flow for each water year
TLG_flow_peak_annual=aggregate(TLG_flows,by=list(TLG_flows$water_year),FUN=max,na.rm=TRUE)
MIL_flow_peak_annual=aggregate(MIL_flows,by=list(MIL_flows$water_year),FUN=max,na.rm=TRUE)
NML_flow_peak_annual=aggregate(NML_flows,by=list(NML_flows$water_year),FUN=max,na.rm=TRUE)
#Remove extra year (1986) from TLG
TLG_flow_peak_annual=TLG_flow_peak_annual[2:33,]
Next, we need to fit a marginal distribution for each location. Since we are interested in capturing flood risk across the basins, a common choice of marginal is the Generalized Extreme Value (GEV) distribution. We fit a separate GEV distribution to each basin, and from there we write a bit of code to determine the peak three-day flow associated with the historical 10-year return period event in each basin.
#Helper functions to extract return level points and confidence intervals
#from a GEV fit, used to determine the level of the 10-year return period
getrlpoints <- function(fit){
  xp2 <- ppoints(fit$n, a = 0)
  ytmp <- datagrabber(fit)
  y <- c(ytmp[, 1])
  sdat <- sort(y)
  npy <- fit$npy
  u <- fit$threshold
  rlpoints.x <- -1/log(xp2)[sdat > u]/npy
  rlpoints.y <- sdat[sdat > u]
  rlpoints <- data.frame(rlpoints.x, rlpoints.y)
  return(rlpoints)
}
getcidf <- function(fit){
  rperiods = c(2,5,10,20,50,100,500,1000)
  bds <- ci(fit, return.period = rperiods)
  c1 <- as.numeric(bds[,1])
  c2 <- as.numeric(bds[,2])
  c3 <- as.numeric(bds[,3])
  ci_df <- data.frame(c1, c2, c3, rperiods)
  return(ci_df)
}
gevfit_MIL <- fevd(MIL_flow_peak_annual$three_day)
rlpoints_MIL <- getrlpoints(gevfit_MIL)
ci_df_MIL <- getcidf(gevfit_MIL)
gevfit_NML <- fevd(NML_flow_peak_annual$three_day)
rlpoints_NML <- getrlpoints(gevfit_NML)
ci_df_NML <- getcidf(gevfit_NML)
gevfit_TLG <- fevd(TLG_flow_peak_annual$three_day)
rlpoints_TLG <- getrlpoints(gevfit_TLG)
ci_df_TLG <- getcidf(gevfit_TLG)
#These are the historical 10-year thresholds taken from the GEV fits to the peak three-day flows
ten_yr_event_TLG=58.53738
ten_yr_event_MIL=40.77821
ten_yr_event_NML=58.95369
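As an aside, if you would rather not hard-code these thresholds, extRemes can report return levels directly from the fitted objects; this short sketch (my addition) should reproduce the values above:
#Alternatively, pull the 10-year return levels straight from the GEV fits
return.level(gevfit_TLG, return.period=10)
return.level(gevfit_MIL, return.period=10)
return.level(gevfit_NML, return.period=10)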
Now that we have our marginals fit, we need to use them to fit a Gaussian copula. By definition, a copula is a multivariate cumulative distribution function for which the marginal probability distribution of each variable is uniform on the interval [0, 1]. So we need to transform our observations to be pseudo-uniform. To do so, we push the values of [x_(t,1), …, x_(t,n)] through the CDF of the respective fitted GEV distribution (the probability integral transform). The transformed values are now approximately uniform on [0, 1]; let’s term them [u_(t,1), …, u_(t,n)].
#Now we want to fit the copula on our historical flows. First create u's
u_TLG = pgev(TLG_flow_peak_annual$three_day, loc=gevfit_TLG$results$par[1], scale=gevfit_TLG$results$par[2], shape=gevfit_TLG$results$par[3],lower.tail = TRUE)
u_MIL = pgev(MIL_flow_peak_annual$three_day, loc=gevfit_MIL$results$par[1], scale=gevfit_MIL$results$par[2], shape=gevfit_MIL$results$par[3],lower.tail=TRUE)
u_NML = pgev(NML_flow_peak_annual$three_day, loc=gevfit_NML$results$par[1], scale=gevfit_NML$results$par[2], shape=gevfit_NML$results$par[3],lower.tail=TRUE)
#Let's also convert the thresholds to u's for each basin
u_TLG_event = pgev(ten_yr_event_TLG, loc=gevfit_TLG$results$par[1], scale=gevfit_TLG$results$par[2], shape=gevfit_TLG$results$par[3], lower.tail = TRUE)
u_MIL_event = pgev(ten_yr_event_MIL, loc=gevfit_MIL$results$par[1], scale=gevfit_MIL$results$par[2], shape=gevfit_MIL$results$par[3],lower.tail=TRUE)
u_NML_event = pgev(ten_yr_event_NML, loc=gevfit_NML$results$par[1], scale=gevfit_NML$results$par[2], shape=gevfit_NML$results$par[3],lower.tail=TRUE)
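As a quick sanity check (my addition, not part of the original workflow), the transformed values should look roughly uniform; a histogram or a one-sample Kolmogorov-Smirnov test against the uniform distribution is an easy way to verify that the transform behaved:
#Optional sanity check: the u's should be approximately uniform on [0,1]
hist(u_TLG)
ks.test(u_TLG,"punif")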
Now we address the Gaussian part of the copula. Recall the definition of the Gaussian copula from Dave’s post:
C(u_1, …, u_n) = Φ_R(Φ^-1(u_1), …, Φ^-1(u_n))
where Φ_R is the joint CDF of a multivariate standard normal distribution with correlation matrix R, ρ_(i,j) is the correlation between random variables X_i and X_j, and Φ^-1 is the inverse standard normal CDF.
So we take our u vectors and push them through the inverse standard normal CDF to get Z scores.
#Now take these u's and push them through the inverse standard normal CDF to get Z scores
Z_TLG=qnorm(u_TLG)
Z_MIL=qnorm(u_MIL)
Z_NML=qnorm(u_NML)
#Let's do the same for the historical events
Z_TLG_event=qnorm(u_TLG_event)
Z_MIL_event=qnorm(u_MIL_event)
Z_NML_event=qnorm(u_NML_event)
The last ingredient we need for our copula is a Spearman rank correlation matrix that captures the dependence structure of the peak flows across the three basins.
cor_matrix=cor(cbind(Z_TLG,Z_MIL,Z_NML),method = "spearman")
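One hedged aside: strictly speaking, a Gaussian copula is parameterized by a Pearson correlation matrix, and the exceedance calculation below recomputes Pearson correlations of the Z scores for that reason. If you would rather work from the Spearman matrix, a standard identity for the multivariate normal maps rank correlations to the implied Pearson correlations:
#Optional: convert Spearman rank correlations to the Pearson correlations
#implied for a multivariate normal (the diagonal stays equal to 1)
cor_matrix_pearson=2*sin(pi*cor_matrix/6)
For peak flows like these, the two matrices are typically close, so either choice gives a similar answer.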
Finally, we have all the components of our copula. What question can we ask now that we have it? How about: “What is the likelihood of exceeding the historical 10-year event in all three basins in the same year?” By inclusion-exclusion, this joint exceedance probability is:
P(Z_1 > z_1, Z_2 > z_2, Z_3 > z_3) = 1 − Φ(z_1) − Φ(z_2) − Φ(z_3) + Φ_R(z_1, z_2) + Φ_R(z_1, z_3) + Φ_R(z_2, z_3) − Φ_R(z_1, z_2, z_3)
where Φ is the univariate standard normal CDF and Φ_R is the bivariate or trivariate standard normal CDF with the corresponding correlation submatrix. More details can be found in this very helpful book chapter. We calculate this with the following lines and find an exceedance probability of 0.0564.
#Now calculate the probability of exceeding the historical threshold in all basins
exceedance=1-pnorm(Z_TLG_event)-pnorm(Z_MIL_event)-pnorm(Z_NML_event)+
  pmnorm(c(Z_TLG_event,Z_MIL_event),mean=rep(0,2),varcov=cor(cbind(Z_TLG,Z_MIL)))+
  pmnorm(c(Z_TLG_event,Z_NML_event),mean=rep(0,2),varcov=cor(cbind(Z_TLG,Z_NML)))+
  pmnorm(c(Z_MIL_event,Z_NML_event),mean=rep(0,2),varcov=cor(cbind(Z_MIL,Z_NML)))-
  pmnorm(c(Z_TLG_event,Z_MIL_event,Z_NML_event),mean=rep(0,3),varcov=cor(cbind(Z_TLG,Z_MIL,Z_NML)))
Now this intuitively makes sense, because a 10-year event has a 10% (0.1) probability of being exceeded in any given year in a single basin. Requiring the event to be exceeded jointly across multiple basins can only be less likely, so our lower probability of 0.0564 tracks. Note that you can also create synthetic realizations of correlated peak flows by simulating from a multivariate normal distribution with a mean of 0 and the Spearman rank correlation matrix defined above, pushing the simulated Z scores through the standard normal CDF to get u’s, and then pushing those through the inverse GEV CDFs (as sketched below). Hopefully this was a simple but helpful example to demonstrate the key components of the Gaussian copula. The script and corresponding datasets can be found in this repo.
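To make that last point concrete, here is a minimal sketch of that simulation (my addition, using the mvtnorm and evd packages already loaded above); it also doubles as a Monte Carlo sanity check on the analytical exceedance probability:
#Sketch: simulate correlated Z scores, transform to u's with the standard
#normal CDF, then back-transform through the inverse GEV CDFs
set.seed(1)
Z_sim=rmvnorm(100000,mean=rep(0,3),sigma=cor_matrix)
u_sim=pnorm(Z_sim)
TLG_sim=qgev(u_sim[,1],loc=gevfit_TLG$results$par[1],scale=gevfit_TLG$results$par[2],shape=gevfit_TLG$results$par[3])
MIL_sim=qgev(u_sim[,2],loc=gevfit_MIL$results$par[1],scale=gevfit_MIL$results$par[2],shape=gevfit_MIL$results$par[3])
NML_sim=qgev(u_sim[,3],loc=gevfit_NML$results$par[1],scale=gevfit_NML$results$par[2],shape=gevfit_NML$results$par[3])
#Empirical joint exceedance of the three 10-year thresholds; since the
#simulation uses the Spearman matrix, this should land in the neighborhood
#of the analytical 0.0564 above rather than match it exactly
mean(TLG_sim>ten_yr_event_TLG & MIL_sim>ten_yr_event_MIL & NML_sim>ten_yr_event_NML)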