Copula Coding Exercise

This Halloween, you may be scared by many things, but don’t let one of them be copulas. Dave Gold wrote an excellent post on the theory behind copulas, and I’ll build off of that post with a simple coding exercise. When I first heard about copulas, they sounded highly relevant to capturing joint behavior across variables (in my case, joint hydrologic behavior across watersheds). However, because I hadn’t learned about them formally in a class, I had a hard time translating what I was reading in dense statistics textbooks into practice, and I found myself blindly trying R packages without making much progress. What finally helped me wrap my head around copulas was coding a simple trivariate Gaussian copula from scratch. If copulas are new to you, hopefully this little exercise will be helpful to you as well.

Let’s pose a scenario that will help orient why a copula is useful. Imagine that we have historical daily full natural flow for three gauged locations in the Central Valley region of California: Don Pedro Reservoir, Millerton Lake, and New Melones Lake. Because these three locations are really important to water supply in the region, it’s helpful to identify the likelihood of joint flooding or drought across these basins. For this post, let’s focus on joint flooding.

First let’s fit distributions for each basin individually. I calculate the water year, add it as a new column, and then determine the peak three-day flow for each water year in each basin. Given the different residence times in each basin, a three-day aggregate flow better captures concurrent events than a one-day flow. Let’s term these flows [xt,1…xt,n], the peak three-day annual flows in each of the n basins in year t.

library(evd)
library(mvtnorm)
library(extRemes)
library(mnormt)

#Read in annual observed flows 
TLG_flows=read.table("C:/Users/rg727/Downloads/FNF_TLG_mm.txt")
MIL_flows=read.table("C:/Users/rg727/Downloads/FNF_MIL_mm.txt")
NML_flows=read.table("C:/Users/rg727/Downloads/FNF_NML_mm.txt")

#Calculate water year column
TLG_flows$water_year=0

for (i in 1:dim(TLG_flows)[1]){
  if (TLG_flows$V2[i]==10|TLG_flows$V2[i]==11|TLG_flows$V2[i]==12){
    TLG_flows$water_year[i]=TLG_flows$V1[i]+1
  } else{
    TLG_flows$water_year[i]=TLG_flows$V1[i]}
}


MIL_flows$water_year=0

for (i in 1:dim(MIL_flows)[1]){
  if (MIL_flows$V2[i]==10|MIL_flows$V2[i]==11|MIL_flows$V2[i]==12){
    MIL_flows$water_year[i]=MIL_flows$V1[i]+1
  } else{
    MIL_flows$water_year[i]=MIL_flows$V1[i]}
}


NML_flows$water_year=0

for (i in 1:dim(NML_flows)[1]){
  if (NML_flows$V2[i]==10|NML_flows$V2[i]==11|NML_flows$V2[i]==12){
    NML_flows$water_year[i]=NML_flows$V1[i]+1
  } else{
    NML_flows$water_year[i]=NML_flows$V1[i]}
}



#Calculate 3-day total flow


TLG_flows$three_day=0

for (i in 1:length(TLG_flows$V4)){
  if (i == 1){
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i],TLG_flows$V4[i+1])
  } else if (i == length(TLG_flows$V4)){
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i-1],TLG_flows$V4[i])
  } else{
    TLG_flows$three_day[i]=sum(TLG_flows$V4[i-1],TLG_flows$V4[i],TLG_flows$V4[i+1])
  }
}

MIL_flows$three_day=0

for (i in 1:length(MIL_flows$V4)){
  if (i == 1){
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i],MIL_flows$V4[i+1])
  } else if (i == length(MIL_flows$V4)){
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i-1],MIL_flows$V4[i])
  } else{
    MIL_flows$three_day[i]=sum(MIL_flows$V4[i-1],MIL_flows$V4[i],MIL_flows$V4[i+1])
  }
}

NML_flows$three_day=0

for (i in 1:length(NML_flows$V4)){
  if (i == 1){
    NML_flows$three_day[i]=sum(NML_flows$V4[i],NML_flows$V4[i+1])
  } else if (i == length(NML_flows$V4)){
    NML_flows$three_day[i]=sum(NML_flows$V4[i-1],NML_flows$V4[i])
  } else{
    NML_flows$three_day[i]=sum(NML_flows$V4[i-1],NML_flows$V4[i],NML_flows$V4[i+1])
  }
}



#Calculate peak annual flow
TLG_flow_peak_annual=aggregate(TLG_flows,by=list(TLG_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)
MIL_flow_peak_annual=aggregate(MIL_flows,by=list(MIL_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)
NML_flow_peak_annual=aggregate(NML_flows,by=list(NML_flows$water_year),FUN=max,na.rm=TRUE, na.action=NULL)

#Remove extra year (1986) from TLG
TLG_flow_peak_annual=TLG_flow_peak_annual[2:33,]


Next, we need to fit individual marginal distributions for each location. Since we are interested in capturing flood risk across the basins, a common marginal distribution to use is the Generalized Extreme Value (GEV) distribution. We fit a separate GEV distribution for each basin, and from there we write a bit of code to determine the peak three-day flow associated with a historical 10-year return period event.

#Determine level of a 10-year return period

getrlpoints <- function(fit){
  
  xp2 <- ppoints(fit$n, a = 0)
  ytmp <- datagrabber(fit)
  y <- c(ytmp[, 1])
  sdat <- sort(y)
  npy <- fit$npy
  u <- fit$threshold
  rlpoints.x <- -1/log(xp2)[sdat > u]/npy
  rlpoints.y <- sdat[sdat > u]
  rlpoints <- data.frame(rlpoints.x, rlpoints.y)
  
  return(rlpoints)
}


getcidf <- function(fit){
  
  rperiods = c(2,5,10,20,50,100,500,1000)
  bds <- ci(fit, return.period = rperiods)
  c1 <- as.numeric(bds[,1])
  c2 <- as.numeric(bds[,2])
  c3 <- as.numeric(bds[,3])
  ci_df <- data.frame(c1, c2, c3, rperiods)
  
  return(ci_df)
}


gevfit_MIL <- fevd(MIL_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_MIL)
ci_df <- getcidf(gevfit_MIL)

gevfit_NML <- fevd(NML_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_NML)
ci_df <- getcidf(gevfit_NML)

gevfit_TLG <- fevd(TLG_flow_peak_annual$three_day)
rlpoints <- getrlpoints(gevfit_TLG)
ci_df <- getcidf(gevfit_TLG)



#These are the historical 10-year events (return levels) from the GEV fits to the annual peak three-day flows
ten_yr_event_TLG=58.53738	
ten_yr_event_MIL=40.77821
ten_yr_event_NML=58.95369
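
These values come from the 10-year row of the ci_df data frame returned by getcidf() for each fit. As a quick sketch (assuming the column naming used in getcidf() above, where c2 holds the point estimate), you could also pull them out programmatically rather than hard-coding them; the ci_df_TLG name here is just for illustration:

#Pull the 10-year return level (point estimate, column c2) out of ci_df for one basin
ci_df_TLG=getcidf(gevfit_TLG)
ten_yr_event_TLG=ci_df_TLG$c2[ci_df_TLG$rperiods==10]
#...and similarly for MIL and NML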

Now that we have our marginals fit, we need to use them to fit a Gaussian copula. By definition, a copula is a multivariate cumulative distribution function for which the marginal probability distribution of each variable is uniform on the interval [0, 1]. So we need to transform our observations to be pseudo-uniform. To do so, we push the values of [xt,1…xt,n] through the CDF of the respective fitted GEV distribution (the probability integral transform). These transformed values are now uniformly distributed between 0 and 1. Let’s term them [ut,1…ut,n].

#Now we want to fit the copula on our historical flows. First create u's

u_TLG = pgev(TLG_flow_peak_annual$three_day, loc=gevfit_TLG$results$par[1], scale=gevfit_TLG$results$par[2], shape=gevfit_TLG$results$par[3],lower.tail = TRUE)
u_MIL = pgev(MIL_flow_peak_annual$three_day, loc=gevfit_MIL$results$par[1], scale=gevfit_MIL$results$par[2], shape=gevfit_MIL$results$par[3],lower.tail=TRUE)
u_NML = pgev(NML_flow_peak_annual$three_day, loc=gevfit_NML$results$par[1], scale=gevfit_NML$results$par[2], shape=gevfit_NML$results$par[3],lower.tail=TRUE)

#Let's also convert the thresholds to u's for each basin 
u_TLG_event = pgev(ten_yr_event_TLG, loc=gevfit_TLG$results$par[1], scale=gevfit_TLG$results$par[2], shape=gevfit_TLG$results$par[3], lower.tail = TRUE)
u_MIL_event = pgev(ten_yr_event_MIL, loc=gevfit_MIL$results$par[1], scale=gevfit_MIL$results$par[2], shape=gevfit_MIL$results$par[3],lower.tail=TRUE)
u_NML_event = pgev(ten_yr_event_NML, loc=gevfit_NML$results$par[1], scale=gevfit_NML$results$par[2], shape=gevfit_NML$results$par[3],lower.tail=TRUE)

Now we address the Gaussian part of the copula. Recall the form of the Gaussian copula from Dave’s post:

$$C(u_1,\dots,u_n)=\Phi_R\!\left(\Phi^{-1}(u_1),\dots,\Phi^{-1}(u_n)\right)$$

where Φ_R is the joint standard normal CDF with mean vector 0 and covariance given by the correlation matrix R, ρ_(i,j) is the correlation between random variables X_i and X_j, and Φ^-1 is the inverse standard normal CDF.

So we have our u vectors that we now need to push through an inverse standard normal distribution to get Z scores.

#Now take these u's and push them through the inverse standard normal CDF to get Z scores 
 
Z_TLG=qnorm(u_TLG)
Z_MIL=qnorm(u_MIL)
Z_NML=qnorm(u_NML)

#Let's do the same for the historical events
Z_TLG_event=qnorm(u_TLG_event)
Z_MIL_event=qnorm(u_MIL_event)
Z_NML_event=qnorm(u_NML_event)

The last component we need for our copula is a Spearman rank correlation matrix that captures the dependence structure of the peak flows across the three basins.

cor_matrix=cor(cbind(Z_TLG,Z_MIL,Z_NML),method = "spearman")
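
As a quick sanity check, we can evaluate the copula itself: the probability that all three basins stay at or below their 10-year thresholds in the same year is just the joint standard normal CDF evaluated at the three event Z scores. Here is a minimal sketch using pmnorm() from the mnormt package and the objects defined above (the joint_non_exceedance name is just for illustration):

#Probability that all three basins stay below their 10-year thresholds in the same year
joint_non_exceedance=pmnorm(c(Z_TLG_event,Z_MIL_event,Z_NML_event),mean=rep(0,3),varcov=cor_matrix)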

Finally, we have all the components of our copula. What question can we ask now that we have it? How about “What is the likelihood of jointly exceeding the historical 10-year event in all three basins?” We code up the following formula, which expresses this exact likelihood; more details can be found in this very helpful book chapter:
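
In terms of the event Z scores, this is the inclusion-exclusion expansion of the joint survival probability:

$$
\begin{aligned}
P(Z_1 > z_1, Z_2 > z_2, Z_3 > z_3) ={}& 1 - \Phi(z_1) - \Phi(z_2) - \Phi(z_3) \\
&+ \Phi_2(z_1,z_2) + \Phi_2(z_1,z_3) + \Phi_2(z_2,z_3) \\
&- \Phi_R(z_1,z_2,z_3)
\end{aligned}
$$

where Φ is the univariate standard normal CDF, Φ_2 is the bivariate standard normal CDF with the corresponding pairwise correlation, and Φ_R is the trivariate standard normal CDF with correlation matrix R.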

We calculate this with the following code and find an exceedance probability of 0.0564.

#Now calculate the probability of exceeding the historical threshold in all basins

exceedance=1-pnorm(Z_TLG_event)-pnorm(Z_MIL_event)-pnorm(Z_NML_event)+
  pmnorm(c(Z_TLG_event,Z_MIL_event),mean=rep(0,2),varcov=cor(cbind(Z_TLG,Z_MIL)))+
  pmnorm(c(Z_TLG_event,Z_NML_event),mean=rep(0,2),varcov=cor(cbind(Z_TLG,Z_NML)))+
  pmnorm(c(Z_MIL_event,Z_NML_event),mean=rep(0,2),varcov=cor(cbind(Z_MIL,Z_NML)))-
  pmnorm(c(Z_TLG_event,Z_MIL_event,Z_NML_event),mean=rep(0,3),varcov=cor(cbind(Z_TLG,Z_MIL,Z_NML)))

Now this intuitively makes sense, because a 10-year event has a 10% (0.1) probability of being exceeded in any given year. A joint exceedance of this event across multiple basins is even less likely, so our lower probability of 0.0564 tracks. Hopefully this was a simple but helpful example to demonstrate the key components of the Gaussian copula. The script and corresponding datasets can be found in this repo. Note that you can also create synthetic realizations of correlated peak flows by simulating from a multivariate normal distribution with a mean of 0 and the Spearman rank correlation matrix defined above, then pushing the samples back through the fitted GEV marginals, as sketched below.
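
Here is a minimal sketch of that simulation step, assuming the fitted GEV objects and cor_matrix from above (n_sim and the *_sim variable names are just for illustration):

#Simulate 1000 years of synthetic, correlated peak three-day flows
n_sim=1000
Z_sim=rmvnorm(n_sim,mean=rep(0,3),sigma=cor_matrix) #correlated standard normal samples
U_sim=pnorm(Z_sim) #transform to pseudo-uniform marginals

#Back-transform each basin through its fitted GEV quantile function
TLG_sim=qgev(U_sim[,1],loc=gevfit_TLG$results$par[1],scale=gevfit_TLG$results$par[2],shape=gevfit_TLG$results$par[3])
MIL_sim=qgev(U_sim[,2],loc=gevfit_MIL$results$par[1],scale=gevfit_MIL$results$par[2],shape=gevfit_MIL$results$par[3])
NML_sim=qgev(U_sim[,3],loc=gevfit_NML$results$par[1],scale=gevfit_NML$results$par[2],shape=gevfit_NML$results$par[3])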

Markdown-Based Scientific and Computational Note Taking with Obsidian

Motivation

Over the last year, being in the Reed Research Group, I have been exposed to new ideas more rapidly than I can manage. During this period, I was filling my laptop storage with countless .docx, .pdf, .txt, and .html files semi-sporadically stored across different cloud and local storage locations.

Finding a good method for organizing my notes and documentation has been very beneficial for my productivity, processing of new ideas, and ultimately staying sane. For me, this has been done through Obsidian.

Obsidian is a free, markdown (.md) based note taking and organization tool, with strengths being:

  • Connected and relational note organization
  • Rendering code and mathematical (LaTeX) syntax
  • Community-developed plugins
  • Light-weight
  • A nice aesthetic

The fact that all notes are simply .md files is very appealing to me. I don’t need to waste time searching through different local and cloud directories for my notes, and can avoid having OneNote, Notepad, and 6 Word windows open at the same time.

Also, I like having localized storage of the notes and documentation that I am accumulating. I store my Obsidian directory on GitHub, and push-pull copies between my laptop and desktop, giving me the security of having three copies of my precious notes at all times.

I’ve been using Obsidian as my primary note-taking app for a few months now, and it has become central to my workflow. In this post, I show how Obsidian extends the utility of Markdown features, helps to organize topical .md libraries, and generally makes the documentation process more enjoyable. I also try to explain why I believe Obsidian is so effective for researchers and code developers.

Note Taking in Markdown

Markdown (.md) text is an efficient and effective way not just of documenting code (e.g., README.md files), but also of writing tutorials (e.g., in Jupyter Notebooks), taking notes, and even designing a website (as demonstrated in Andrew’s blog post last week).

In my opinion, the beauty is that only a few syntax tricks are necessary for producing a very clean .md document. This is particularly relevant as a programmer, when notes often require mathematical and code syntax.

I am not going to delve into Markdown syntax. For those who have not written in Markdown, I suggest you see the Markdown Documentation here. Instead, I focus on how Obsidian enables a pleasant environment in which to write .md documentation.

Obsidian Key Features

Below, I hit on what I view as the key strengths of Obsidian:

  • Connected and relational note organization
  • Rendering code and mathematical (LaTeX) syntax
  • Community-developed plugins

Relational Note Organization

If you take a quick look at any of the Obsidian forums, you will come across this word: Zettelkasten.

Zettelkasten, meaning "slip-box" in German and also referred to as card file, is a note taking strategy which encourages the use of many highly-specific notes which are connected to one another via references.

Obsidian is designed to facilitate this style of note taking, with the aim of building what they refer to as a "second brain". The goal of this relational note taking is to store not just information about a single topic but rather a network of connected information.

To this end, Obsidian helps to easily navigate these connections and visualize relationships stored within a Vault (which is just a fancy word for one large folder directory). Below is a screen cap of my note network developed over the last ~4 months. On the left is a visualization of my entire note directory, with a zoomed-in view on the right.

Notice the scenario discovery node, which has direct connections to methodological notes on Logistic Regression, Boosted Trees, PRIM, and literature review notes on Bryant & Lempert 2010, a paper which was influential in advocating for participatory, computer-assisted scenario discovery.

Each of these nodes is then connected to the broader network of notes and documentation.

These relations are easily constructed by linking other pages within the directory, or subheadings within those pages. When writing the notes, you can then click through the links to flip between files in the directory.

Links to Internal Pages and Subheadings

Referencing other pages (.md files) in your library is done with double square brackets on either side: [[Markdown cheatsheet]]

You can get down to finer resolution and reference specific levels of sub-headings within a page by adding a hashtag to the internal link, such as: [[Markdown cheatsheet#Basics#Bold]]

Tags and Tag Searches

Another tool that helps facilitate relational note taking is the use of #tags. Attach a # to any word within any of your notes, and that word becomes connected to other instances of the word throughout the directory.

Once tags have been created, they can be searched. The following Obsidian syntax generates a live list of occurrences of that tag across your entire vault:

```query
tag: #scenarios
```

Which produces the window:

Rendering Code and Math Syntax

Language-Specific Code Snippets

Obsidian will beautifully stylize code snippets using language-specific formatting, and if you don’t agree with the default styles, you can change your style settings.

A block of code for a specific language is specified using the triple-backtick syntax, as such:

```language
<Enter code here>
```

The classic HelloWorld() function can be stylistically rendered in Python or C++:

LaTeX

As per usual, the $ characters are used to render LaTeX equations. Single-$ characters result in in-line equations ($<Enter LaTeX>$), while double-$$ are used for centered equations.

Obsidian is not limited to short LaTeX equations, and has plugins designed to allow for the inclusion of other LaTeX packages or HTML syntax.

```latex
$$
\phi_{t,i} = \left\{ \begin{array}{ll}
\phi_{t-1,i} + 1 & \text{if } z_t < \text{limit}\\
0 & \text{otherwise.}
\end{array} \right.
$$
```

will produce:

Plugins

Obsidian boasts an impressive 667 community-developed plugins with a wide variety of functionality. A glimpse at the webpage shows plugins that give more control over the visual interface, allow for alternative LaTeX environments, or allow for pages to be exported to various file formats.

Realistically, I have not spent a lot of time working with the plugins. But, if you are someone who likes the idea of a continuously evolving and modifiable documentation environment then you may want to check them out in greater depth.

Conclusion: Why did I write this?

This is not a sponsored post in any way; I just like the app.

When diving into a new project or starting a research program, it is critical to find a way of taking notes, documenting resources, and storing ideas that works for you. Obsidian is one tool which has proven to be effective for me. It may help you.

Best of luck with your learning.

Enhancing Jupyter Notebooks for Teaching

In our research group, we often use Jupyter Notebooks as teaching tools in the classroom and to train new students. Jupyter Notebooks are useful because they integrate code and its output into a single document that also supports visualizations, narrative text, mathematical equations, and other forms of media, which often helps students better contextualize what they are learning.

There are some aspects of Jupyter Notebooks that students sometimes struggle with as they shift from IDEs to this new environment. Many have noted difficulty tracking the value of variables through the notebook and getting stuck in a mindset of just pressing “Shift+Enter”. Since the advent of Project Jupyter, the community has put out a variety of “unofficial” extensions and widgets that I have found can help enhance the learning experience and make Jupyter Notebooks less frustrating. It’s just not apparent that these extensions exist until you go searching for them. I will demonstrate some below that have been particularly helpful when working with our undergrads, and hopefully they will be helpful for you too!

Variable Inspector

One very easy way to remedy the lack of a variable display is to enable the Variable Inspector extension, which allows you to see the values of all of the variables in an external floating window. In your command prompt, type:

#Installation
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

#Enable extension
jupyter nbextension enable varInspector/main

Here we are installing a massive library of extensions called Nbextensions, several of which we will enable in this post. When you open your notebooks, you will now see a small icon at the top that you can click to create a variable inspector window that will reside in the margins of the notebook.

Variable Inspector Icon

As you step through the notebook, your variable inspector will become populated.

Variable Inspector Floating Window

Snippets of Code

The Snippets menu is pretty fantastic for students who are new to coding, or for those of us who go back and forth between coding languages and keep having to look up how to make arrays or find code for example plots.

We can just enable this extension as follows right in Jupyter Notebook under the Nbextensions tab:

Enabling the Snippets Menu

This gif provides an example of some of the options that are available for NumPy, Matplotlib, and Pandas.

Jupyter Snippets Menu

If, for instance, you want to create a histogram but can’t remember the form of the code, a snippet for this exists under the Matplotlib submenu.

Making a histogram with Matplotlib Snippets

Rubberband (Highlight Multiple Cells)

If you enable the Rubberband extension in the Nbextensions tab, you can highlight and execute multiple cells, which just gives a bit more precision in selecting a subset of cells that you want to run at once. You hold down “Ctrl+Shift” while dragging the mouse. Then you can execute the highlighted cells using “Shift+Enter”.

Execution Time

You can wrap your code in a variety of timing functions, or just enable the Execution Time extension in the Nbextensions tab. For every cell that you execute, the elapsed time will now be recorded under the cell. For me, this is particularly useful when some cells take a long time to execute. If you are creating training materials, it’s helpful for users to get a breakdown of how long the code should take to run (so that they know whether the timing is normal or there is an issue with the code).

Elapsed time is shown below each cell

Creating a Jupyter Notebook Slideshow with RISE (a reveal.js extension)

The last teaching tool that I want to demonstrate is how to turn your Jupyter Notebook into an interactive slideshow, which has been infinitely more useful as a teaching tool than just scrolling through a Jupyter Notebook at a meeting.

First, installation is easy:

pip install RISE

Now, we have to prepare our notebook for a slideshow. Go to View -> Cell Toolbar -> Slideshow. We need to select the Slide Type for each of our cells.

Selecting the slide type for the slideshow

At the top of the screen, we now have a new icon that allows us to enter slideshow mode when we are ready. We can now step through the slideshow that we have made and interactively execute the code cells as well.

I hope these tips help you to have a more pleasant experience while using Jupyter Notebooks. If there are other extensions that you enjoy using regularly, please post below!