Basics of data visualization with ggplot2

In my previous post, I showed how wonderful the ggplot2 library in R is for visualizing complex networks. I realized that while there are several posts on this blog going over the advanced visualization capabilities of the ggplot2 library, there isn’t a primer on structuring code for creating graphs in R…yet. In this post, I will go over the syntax for creating pretty ggplot2 graphs and tweaking various parameters. I am a self-declared Python aficionado, but love using ggplot2 because it is intuitive to use, beginner-friendly, and highly customizable all at the same time.

Dataset and setup

For this tutorial, I will be using one of the built-in datasets in R called mtcars which was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Further documentation on this dataset can be found here. We import the data into our RStudio workspace.

# import the library into our workspace
library(ggplot2)

# import dataset
data(mtcars)
head(mtcars)

The resultant dataset looks something like this.
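Since this post only plots the mpg, hp, and cyl columns, here is a quick look at those three columns for the first six cars (the full head(mtcars) output has eleven columns):

head(mtcars[, c("mpg", "cyl", "hp")])
#                    mpg cyl  hp
# Mazda RX4         21.0   6 110
# Mazda RX4 Wag     21.0   6 110
# Datsun 710        22.8   4  93
# Hornet 4 Drive    21.4   6 110
# Hornet Sportabout 18.7   8 175
# Valiant           18.1   6 105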

Basic plot

Now that we have the data, we can get to plotting with ggplot2. We can declaratively create graphics using this library. We just have to provide the data, specify how to map properties to graph aesthetics, and the library takes care of the rest for us! We need to specify three things for each ggplot — 1) the data, 2) the aesthetics, and 3) the geometry.

Let us start by creating a basic scatterplot of the mileage (mpg) of each car as a function of its horsepower (hp). In this case the data is our dataframe mtcars, and the aesthetics x and y will be defined as the names of the columns we wish to plot along each axis — hp and mpg. We can also set the color aesthetic to indicate the number of cylinders (cyl) in each car. One of the reasons ggplot2 is so user-friendly is because each graph property can be tacked on to the same line of code with a + sign. Since we want a scatterplot, the geometry will be defined using geom_point().

# basic scatterplot
g <- ggplot(data = mtcars, aes(x = hp, y = mpg, color=cyl))
g + geom_point()

Excellent! The library automatically assigns the column names as axis labels, and uses the default theme and colors, but all of this can be modified to suit our tastes and to create pretty graphs. It is also important to note that we could have visualized the same data (less helpfully) as a line plot instead of a scatterplot, just by tweaking the geometry function.

# basic line plot
g + geom_line()

Well, this looks unpleasant. But wait, we can do so much more. We can also layer multiple geometries on the same graph to make more interesting plots.

# basic scatter+line plot
g + geom_line() + geom_point()

Additionally, we can tweak the geometry properties in each graph. Here is how we can transform the lines to dotted, and specify line widths and marker shape and size.

# change properties of geometry
g + geom_point(shape = "diamond", size = 3) +
  geom_line(color = "black", linetype = "dotted", size = .3) 

While our graph looks much neater now, using a line plot is actually pretty unhelpful for our dataset since each data point is a separate car. We will stick with a scatterplot for the rest of this tutorial. However, the above sort of graph would work great for time series data or other data that measures change in one variable.

Axis labels

One of the cardinal rules of good data visualization is to add axis labels to your graphs. While ggplot2 automatically sets the axis labels to the column headers, we can override this to make the axis labels more informative with just one extra function.

# change axis titles
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)")

Title

This graph is in serious need of a title to provide a reader some idea of what they’re looking at. There are actually multiple ways to add a graph title here, but I find it easiest to use ggtitle().

# add title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") 

Alright, having a title is helpful, but I don’t love its placement on the graph. R automatically left-aligns the title, where it clashes with the y-axis. I would much rather have the title right-aligned, in a bigger font, and bolded. Here is how to do that.

# change position of title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

Theme

There are ways to manually change the background and gridlines of ggplot2 graphs using theme(), but an easy workaround is to use the built-in themes. Which theme you use depends greatly on the graph type and formatting guidelines, but I personally like a white background, faint gridlines, and a bounding box. One thing to note here is that a complete theme like theme_bw() overrides any earlier theme() call, so theme_bw() needs to come before theme() in the chain.

# add theme
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

We can also use the theme() function to change the base font size and font family. Shown below is how to increase the base font size to 15 and change the base font family to Courier.

# use theme to change base font family and font size
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme_bw(base_size = 15, base_family = "Courier") +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

Legend

It has been bothering me for the last seven paragraphs that my legend title still uses the column name. However, this is an easy fix. All I have to do is add a label to the color aesthetic in the labs() function.

# change legend title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

We can also change the position of the legend. R automatically places legends on the right, and while I like having it to the right in this case, I could also place the legend at the bottom of the graph. This automatically changes the aspect ratio of the graph.

# change legend position
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "bottom")

Margins

The theme() function is of endless use in ggplot2, and can be used to manually adjust the graph margins and add/remove white space padding. The order of arguments in margin() is clockwise — top, right, bottom, left (helpfully remembered by the mnemonic TRouBLe).

# add plot margins
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "right") +
  theme(plot.margin = margin(t = 1, r = 1, b = 1, l = 2, unit = "cm"))

Conclusion

I have barely scratched the surface of what can be achieved using ggplot2 in this post. There are hundreds of excellent tutorials online that dive deeper into ggplot2, like this blog post by Cedric Scherer. I have yet to learn so much about this library and data visualization in general, but have hopefully made a solid case for using ggplot2 to create clean and aesthetically-pleasing data visualizations.

Custom Plotting Symbols in R

R has a lot of options for plotting symbols, but sometimes you might want something not in the base set. This post will cover a way to get a custom plotting symbol in R. The code is available on github.

The data I will use is for the 24 nations that competed in the 2021 Euros (men’s) tournament, specifically the FIFA rankings of the men’s and women’s national teams (accessed 12 Aug 2021). Using a flag to represent each country seems reasonable, and easier than text or plotting symbol and color combinations. Plus, it might add a bit of visual interest.

Although this might seem specialized, it highlights changing backgrounds in R base graphics, controlling aspect ratios, and adding mathematical notation to plots.

Some prep work before coding comes in handy. First, I downloaded PNG files of each country’s flag and saved each one with the country’s name. Next, I checked Wikipedia for the aspect ratio of each flag and added it to my CSV file of the FIFA rankings.

The code begins by loading the library we will be using for a custom PCH and then reading in my data from the CSV. The men’s rankings range from 1 to 62 and the women’s rankings from 2 to 131.

library(png) # library for a custom pch

df <- read.csv('/Users/calvinwhealton/Documents/GitHub/euros_2021_rankings/fifa_national_team_rankings.csv',
               header=T) # read-in data from csv

# calculating ranges for men's and women's teams
men_range <- range(df$Men.s.National.Team)
women_range <- range(df$Women.s.National.Team)

Next, we set the basic parameters of the plot. Using the png() function allows me to save the plot directly to a file and have a stable image size I can fine-tune. The height and width needed to be modified until the axes were close to the same scale (10 ranks on x = 10 ranks on y). This will make life easier when trying to scale the flags with the aspect ratio.

# setting plot to save as a png file
# checked the output file to make sure
# axes were on the same scale, or close to it
png(filename='/Users/calvinwhealton/Documents/GitHub/euros_2021_rankings/fifa_euro_plot.png'
    ,width=round(women_range[2]/15,2)
    ,height=round(men_range[2]/11.5,2)
    ,units='in'
    ,res=300)

par(mar=c(5,5,5,5)) # plot margins

Now, we can initialize the plot. Because we are not plotting the data yet, we can simply label the axes. Default axis ticks would include 0, so I used more natural axis ticks. The gray background makes flags with white more visible.

# setting up the base scatter plot
# used \ to get the "'" in titles
plot(x=NA
     ,y=NA
     ,ylab=expression('Men\'s Ranking')
     ,xlab=expression('Women\'s Ranking')
     ,main='Euros 2021 FIFA Nations\nMen\'s vs Women\'s National Teams'
     ,xlim=rev(women_range) # reverse axes to rank 1 is top right
     ,ylim=rev(men_range) # reverse axes to rank 1 is top right
     ,las=1 # rotate vertical axis labels to be horizontal
     ,xaxt='n' # no default x axis
     ,yaxt='n' # no default y axis
)

# adding axes
axis(side=1, at=c(1,seq(10,women_range[2],10)),las=1)
axis(side=2, at=c(1,seq(10,men_range[2],10)),las=1)

# adding a gray background to the plot
# many flags have a white portion that would disappear otherwise
rect(xleft=par('usr')[1],xright=par('usr')[2]
     ,ybottom=par('usr')[3],ytop=par('usr')[4]
     ,col='lightgray')

Now comes the fun part. Not all flags are shaped the same. The two extremes in these countries are Switzerland (aspect ratio = 1) and Croatia, Hungary, and North Macedonia (aspect ratio = 2). We will be telling the function the corners of where to plot the flag, so using the same x and y offset for all flags will lead to distortions.

To solve this problem, I set a value for the area of all flags. This will ensure that each country has the same visual impact. After some math, that means the x and y dimensions for each flag can be found based on the aspect ratio. Once we have all these, we can loop through the countries, calculate the coordinates for the flag corners, and use the country’s flag as the plotting symbol.

# setting approximate "area" for each country flag
flag_area <- 25

# looping over all the rankings
for(i in 1:nrow(df)){
  # read in the image and set it to a variable
  img <- readPNG(paste('/Users/calvinwhealton/Documents/GitHub/euros_2021_rankings/flags/',df$Country[i],'.png',sep=''))
  
  # math for same flag areas (approximately)
  # area = dx*dy = (dy^2)*(aspect_ratio)
  # dy = sqrt(area/aspect_ratio)
  # dx = sqrt(area*aspect_ratio)
  dy <- sqrt(flag_area/df$Aspect.Ratio[i])
  dx <- sqrt(flag_area*df$Aspect.Ratio[i])
  
  # plot the png image as a raster
  # need the image and coordinates
  # men's team on y axis, women's team on x axis
  rasterImage(image=img
              ,ytop=df$Men.s.National.Team[i]+dy/2
              ,ybottom=df$Men.s.National.Team[i]-dy/2
              ,xleft=df$Women.s.National.Team[i]-dx/2
              ,xright=df$Women.s.National.Team[i]+dx/2)
}

For a little bit of fun, and to illustrate one way to get Greek letters in a plot notation, I calculated the correlation of these ranks. The last line of this code closes the figure file.

# adding a text correlation
# using linear correlation of ranks, not rank correlation
correl <- round(cor(df$Men.s.National.Team,df$Women.s.National.Team),2)

# adding text notation for correlation
text(x=women_range[2]-20,
     y=men_range[1]+10,
     labels = bquote(paste(" Linear Corr. (", rho, ") = ", .(correl)))
     )

# close file
dev.off()

And we are done! We see that the flags do look to be about the same area and have approximately correct aspect ratios. (I won’t say how long it took me to get that part figured out…) I hope this provides a fun way to change up your plots and help an audience better understand the data. Although I used flags here because it seemed the most natural for this data, any PNG file could be used based on your data and needs.

Fitting and Simulating from NHMMs in R

The purpose of this post is to give a workflow that you could follow if your goal is to fit Non-Homogeneous Hidden Markov Models (NHMMs) and to simulate sequences of states. Julie Quinn wrote a series of great posts on fitting Hidden Markov Models (HMMs) here, but the goal of this post is to discuss the non-homogeneous counterparts of HMMs. NHMMs are distinguished from the former because they involve non-stationary transition probabilities between states. These dynamic transition probability matrices are conditioned on one or more external covariates that influence transitions between states. One example of the use of NHMMs is to model and simulate precipitation. Precipitation models that do not use atmospheric information cannot be expected to perform well under conditions that deviate from those on which the model was fit. Thus using a covariate that captures changes in atmospheric circulation (such as geopotential height) can help capture some of the nonstationarity that a solely precipitation-based model could not.

I have had a lot of time to work with NHMMs in R; however, at the beginning I found surprisingly few examples to work from, so this post is meant to outline a workflow that I have settled on that you can apply to your own application. First off, there are tons of packages available for HMMs, but very few that handle the dynamic transition matrices of NHMMs. My favorite is depmixS4, found here. This package has all the tools needed to fit your NHMM, but it takes some experimenting to understand the syntax and to piece together the functions to create a full workflow.

First we want to fit a depmix model. In the first line of the code, I am creating a model called modNHMM. The data that I am fitting the NHMM with are the first 9 principal components of geopotential height. It is important to list these components using this syntax, and they should match the column names of your dataset, which is synoptic.pcs in my case. The number of states is how many states you want to fit your model with. For a precipitation/streamflow application, you could fit a two-state NHMM to represent wet/dry conditions, or if you are trying to identify multiple regimes, you may have more.

library(depmixS4)

modNHMM <- depmix(list(PC1~1, PC2~1, PC3~1, PC4~1, PC5~1, PC6~1, PC7~1, PC8~1, PC9~1),
                  nstates = num.states,
                  family = list(gaussian(), gaussian(), gaussian(), gaussian(), gaussian(),
                                gaussian(), gaussian(), gaussian(), gaussian()),
                  ntimes = nrow(synoptic.pcs),
                  data = data.frame(synoptic.pcs),
                  transition = ~predictor$PC1 + predictor$PC2 + predictor$PC3 + predictor$PC4)

fit.modNHMM.depmix.paleo.check <- fit(modNHMM)

synoptic.state.assignments <- posterior(fit.modNHMM.depmix.paleo.check)$state # state sequence using the Viterbi algorithm

Then we choose a distribution for our responses, which I choose to be a Gaussian distribution. The argument, ntimes, refers to the length of our time series, which is the number of rows in our dataset. Next we specify the dataset that contains the PC data and the transition which is dictated by our external covariates. In this case, my covariates are in a dataframe called predictor and each column corresponds to the 4 covariates (which happen to be PCs of a different dataset) that will influence my transitions. Then we fit the model with our prescribed attributes using the fit function. Finally, we want to calculate the Viterbi sequence, which is the most probable path or sequence of states over the period that the model is fit. This step (last line of code) will return a vector of the length of the dataset with each day classified into one of num.states.

Now we need to locate the transition probability matrices. We use the following command:

fit.modNHMM.depmix.paleo.check@transition

If we were fitting an HMM, we would get one transition probability matrix. However, we get an output that looks like this:

Transition probability matrix coefficients

Now instead of having constant values for transitions, we have equations of the form Intercept + b1*PC1 + b2*PC2 + b3*PC3 + b4*PC4, where the b values are the coefficients listed in the table. If we were to look at the first block [[1]], this block dictates transitions from State 1 to each of the 5 states. The transitions from the states are as follows:

State 1 to State 1: 0+0+0+0+0 (State 1 is used as a reference variable in this case and the probability would be found by subtracting the other probabilities from 1 at the end)

State 1 to State 2: -3.11+-0.22*PC1+0.22*PC2+0.014*PC3-0.13*PC4

and so on. To turn these values into probabilities, you exponentiate each one and divide by the sum of the exponentiated values across the row (a multinomial-logit transform), as in the short sketch below.
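Here is a minimal sketch of that transform for transitions out of State 1 on a single day. The State 1 to State 2 coefficients come from the equation above; the coefficients for States 3–5 and the covariate values are hypothetical, for illustration only.

# State 1 is the reference category (all zeros); State 2 coefficients are from above,
# States 3-5 coefficients and the covariate values are hypothetical
intercepts <- c(0, -3.11, -2.40, -1.95, -3.60)
b1 <- c(0, -0.22,  0.10, -0.05,  0.18)   # coefficients on PC1
b2 <- c(0,  0.22, -0.08,  0.11,  0.04)   # coefficients on PC2
b3 <- c(0,  0.014, 0.20, -0.09,  0.12)   # coefficients on PC3
b4 <- c(0, -0.13,  0.07,  0.02, -0.05)   # coefficients on PC4

pc1 <- 0.5; pc2 <- -1.2; pc3 <- 0.3; pc4 <- 0.0  # covariate values for one day

eta <- intercepts + b1*pc1 + b2*pc2 + b3*pc3 + b4*pc4  # linear predictors
p_row <- exp(eta)/sum(exp(eta))  # probabilities of moving from State 1 to States 1-5
sum(p_row)  # equals 1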

Once you have these symbolic equations for the transition probability matrices, you can create a list of matrices which will allow you to simulate sequences of states for new sets of the PC1, PC2, PC3, PC4 covariates. You can get a sense of how you might create n different transition matrices if you have time series of length n of the covariates. Below I am just representing those symbolic equations in code, using the getpars function to acquire the coefficients and storing the resulting daily matrices in a list called mm. Depending on the number of covariates or states, you will need to adjust the indices accordingly.

# store the fitted coefficients once instead of calling getpars() repeatedly
pars <- getpars(fit.modNHMM.depmix.paleo.check)

n <- nrow(df)            # length of the covariate time series
mm <- vector("list", n)  # list that will hold one transition matrix per day

for (j in 1:n){
  transition_matrix <- matrix(NA, nrow = 5, ncol = 5)
  for (r in 1:5){
    # coefficients for transitions out of state r start at index 6 + 25*(r-1):
    # 5 intercepts, followed by 5 coefficients for each of PC1-PC4
    start <- 6 + 25*(r - 1)
    for (s in 1:5){
      i <- start + (s - 1)
      transition_matrix[r, s] <- pars[i] +
        pars[i + 5]*df$PC1[j] + pars[i + 10]*df$PC2[j] +
        pars[i + 15]*df$PC3[j] + pars[i + 20]*df$PC4[j]
    }
    # multinomial-logit transform so each row sums to 1
    transition_matrix[r, ] <- exp(transition_matrix[r, ])/sum(exp(transition_matrix[r, ]))
  }
  mm[[j]] <- transition_matrix
}

Once we have these matrices, we can then simulate state sequences that can result from the chain of transition matrices. For this part, we need to create markov lists with our transition matrices:

library(markovchain)
library(parallel)  # mclapply() comes from the parallel package

mcObject <- mclapply(X=1:iter, mc.preschedule=TRUE, mc.cores=1, FUN=function(j){

  mcObject.time.varying <- mclapply(X=1:n.sim, mc.preschedule=TRUE, mc.cores=1, FUN=function(t){
    tr.prob <- as.matrix(mm[[t]])
    mcObject.time.varying.out <- new("markovchain", states = c("1","2","3","4","5"),
                                     transitionMatrix = tr.prob, name = paste("mcObject", t, sep=""))
    return(mcObject.time.varying.out)
  })

  mcObject.final <- new("markovchainList", markovchains = mcObject.time.varying, name = "mcObject.nh")
  return(mcObject.final)
})

Finally we simulate using the following:

simulate.mc <- function(mcObject,num.states,dates.sim,last.month,last.day,n.sim,iter) {
  
  #this function will simulate the Markov chain iter times
  
  #Arguments:
  #mcObject = a Markov chain object from the markovchain package
  #num.states = the number of states 
  #dates.sim = a time series of dates for the simulation period
  #last.month = last month of the season
  #last.day = last day of the last month of the season
  #iter = the number of iterations
  
  #day and month sequences for the simulation period
  days.sim <- as.numeric(format(dates.sim,"%d"))
  months.sim <- as.numeric(format(dates.sim,"%m"))
  n.sim <- length(dates.sim)  #length of simulation
  
  final.mc.sim <- mclapply(X=1:iter,mc.preschedule=TRUE,mc.cores=1,FUN=function(i){  
    
    mc.sim <- as.numeric(rmarkovchain(n=1,object=mcObject[[i]][[1]]))
    end.state <- paste(mc.sim[length(mc.sim)])
    for (t in 1:n.sim) {
      mc.sim <- c(mc.sim,as.numeric(rmarkovchain(n=1,object=mcObject[[i]][[t]],t0=end.state)))
      end.state <- paste(mc.sim[length(mc.sim)])
      if(months.sim[t]==last.month & days.sim[t]==last.day) {end.state <- paste(sample(1:num.states,size=1))}
    }    
    
    #here is the final mc simulation
    final.mc.sim.iter <- mc.sim[2:(n.sim+1)]
    return(final.mc.sim.iter)
  }
  )
  return(final.mc.sim)
  
}



simulations=matrix(list(), 1,1000)
for (i in 1:1000){

  simulations[[i]] <- simulate.mc(mcObject=mcWeather.Regime,num.states=num.states,
                                  dates.sim=dates.sim,last.month=last.month,last.day=last.day,iter=iter)
}

And that’s it! You can simulate for many different iterations (1000 in my case) and you will be returned a large list with your 1000 sequences of states over the simulation period.

Visualizing large directed networks with ggraph in R

How you choose to visualize complex multidimensional data can significantly shape the insights your audience derives from the plots. My colleague Antonia has written a couple of excellent blog posts on analyzing geospatial networks in Python using the NetworkX library which can be found here and here. I generally lean towards Python for coding but have recently come around on R, mostly because of how easy it is in R to make pretty network visualizations. I will go over some basic network visualizations in R using the igraph and ggraph libraries in this blog post. All the code and data I am using can be found here.

The data I will be using in this post is a processed and cleaned csv-file of Upper Colorado River Basin (UCRB) user interactions obtained from CDSS. The Colorado River is one of the most important river systems in North America, supplying water to millions of Americans. The river has been facing a record 20-year drought — a situation that is further complicated by the prior appropriation doctrine, under which senior users can put junior ones out of priority until their own needs are met.

Shown below is a snippet of the dataset. Column A shows the user whose water right was put out of priority in 2002 by a water right of column C. Column E shows the total number of days that a water right of column A was put out of priority by one of column C in 2002. The rest of the columns contain user attributes.

Now on to turning this lifeless spreadsheet into some pretty pictures. We begin by importing all the necessary libraries.

library(tidyverse)
library(igraph)
library(ggraph)
library(dplyr)
library(ggplot2)

Next we will import the csv-file shown above, and create a list of nodes and edges. This information can be used by the igraph library to create a directed network object. In this directed network, the source node of each edge is priorityWdid (column C) while the destination node is analysisWdid (column A), since the former is putting a call on the latter to divert flow away.

# read network .csv file
data <- read.csv('network_csv_files/priorityWdid/v2_network_2002.csv')

# create nodes
from <- unique(data[c('priorityWdid','priorityStructure', 'priorityNetAbs', 
                      'priorityStreamMile')]) %>% rename(wdid = priorityWdid) %>%
  rename(structure = priorityStructure) %>% rename(netAbs = priorityNetAbs) %>%
  rename(streamMile = priorityStreamMile)
to <- unique(data[c('analysisWdid','analysisStructure', 'analysisNetAbs',
                    'analysisStreamMile')]) %>% rename(wdid = analysisWdid) %>% 
  rename(structure = analysisStructure) %>% rename(netAbs = analysisNetAbs) %>%
  rename(streamMile = analysisStreamMile)
nodes <- unique(rbind(from, to))

# create edges
edges <- data[c('priorityWdid', 'analysisWdid', 'sumWtdCount')] %>% rename(from = priorityWdid) %>% rename(to = analysisWdid)

# create network using igraph package
network <- graph_from_data_frame(d = edges, vertices = nodes, directed = TRUE)

This network has over 1400 nodes! That would be an unreadable mess of a visualization, so let us filter down this network to only include interactions between the 200 senior-most water rights in the UCRB.

# only include top 200 users by seniority
users <- read.csv('../data/CDSS_WaterRights.csv')
users <- head(users[order(users$Priority.Admin.No),], 200)

top_users <- users$WDID
for (i in 1:length(top_users)){
  top_users[i] <- toString(top_users[i])
}

# create subnetwork with only top 200 users by seniority
sub_nodes <- intersect(nodes$wdid, top_users)
subnet <- induced.subgraph(network, sub_nodes)

Excellent! Only 37 of the top 200 users in the UCRB interacted with each other in 2002. This is a much more manageable number for plotting. Let us start with the most basic visualization. In keeping with ggplot syntax, we can conveniently just keep adding plot specifications with a “+”.

# basic network graph
ggraph(subnet, layout = 'stress') +
  ggtitle('2002') + 
  geom_edge_link() +
  geom_node_point() +
  theme_graph()

Alright this is nice, but not particularly insightful. We have no idea which user each node corresponds to, and this figure would have us believe that all nodes and edges were created equal. When we constructed the network we actually added in a bunch of node and edge attributes, which we can now use to make our visuals more informative.

Shown below is a more helpful visualization. I went ahead and added some attributes to the network, and labelled all the nodes with their structure names. The nodes are sized by their out-degree and colored by their stream mile. The edge widths are determined by the total number of days the source node put the destination out of priority. In order to do this, I leveraged an amazing feature of ggplot2 and ggraph called aesthetic mapping, which is a quick way to map variables to visual cues on a plot. It automatically scales the data and creates a legend, which we can then further customize.

# plot graph in circular layout
ggraph(subnet, layout = "circle") +
  ggtitle('2002: "circle" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue', 
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") + 
  geom_node_point(aes(size = deg, colour=streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = structure), repel = TRUE, size=2) +
  theme_graph()

The above network has a circle layout because it’s the easiest to read and is replicable. But there are actually a number of layouts available to choose from. Here is another one of my favorites, graphopt. While this layout is harder to read, it does a better job of revealing clusters in the network. The only change I had to make to the code above was swap out the word ‘circle’ for ‘graphopt’!

# plot graph in graphopt layout
set.seed(1998)
ggraph(subnet, layout = "graphopt") +
  ggtitle('2002: "graphopt" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue', 
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") + 
  geom_node_point(aes(size = deg, colour=streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = structure), repel = TRUE, size=2) +
  theme_graph()

The above graph would be a lot easier to read if it weren’t for the long labels cluttering everything up. One way to deal with this is to adjust the opacity (alpha) of the text by the degree of the node. This way only the important and central nodes will have prominent labels. Again, all I have to do is add two extra words in line 12 of the code block above. Notice that I set show.legend to FALSE because I don’t want a legend entry for text opacity in my plot.

ggraph(subnet, layout = "graphopt") +
  ggtitle('2002: "graphopt" Layout') +
  geom_edge_link(aes(width = sumWtdCount), alpha = 0.8, color = 'skyblue', 
                 arrow = arrow(length = unit(2, 'mm')), end_cap = circle(2, 'mm')) +
  labs(edge_width = "Number of days") + 
  geom_node_point(aes(size = deg, colour=streamMile)) +
  labs(colour = "Stream mile") +
  labs(size = "Out-degree") +
  scale_color_gradient(low = "skyblue", high = "darkblue") +
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = structure, alpha = deg), repel = TRUE, size=2, show.legend = F) +
  theme_graph()

This is just a small sampling of the possibilities for network visualization in R. I have only just begun exploring the igraph and ggraph libraries, but the syntax is fairly intuitive, and the resultant plots are highly customizable. The data-to-viz blog is a pretty incredible resource to look at other network visualizations in R, if you are interested.

ggplot (Part 2)

This is the second part of the ggplot introduction. In this blog post, I am going to go over how you can make a decent density plot in ggplot. Density plots are smoothed versions of histograms; they show the distribution of your data as a probability density estimated with the kernel density estimation procedure. For example, when we have a regional data set, it is important to look at the distribution of our data across the region instead of just considering the regional average. In our example (download the data set from here), we are going to visualize the regional distribution of simulated average winter wheat yield for 30 years from 1981 to 2010. The “ID” column in the data set represents one grid cell in the region, and there are 1,812 total grid cells. For each grid cell, the average historical yield and the standard deviation of yield over the 30 years are given. First, we need to load the library; then, in the general code structure of “ggplot(dataframe, aes(x, y, fill)),” we set the x-axis to the yield column. The y-axis will be calculated and added through geom_density(). Then, we can add a color, title, and labels and customize the background.

example1<- read.csv("(your directory)/example_1.csv")
library(ggplot2)   
ggplot(example1, aes(x=example1$period_ave_Y))+ 
geom_density(fill="blue")+
 theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"))+
  labs(title = paste("Density Plot of Regional Average Historical Yield (30 years)"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")

Now, we want to know how the standard deviation of 30 years’ average yield for all the grid cells in the region can be mapped into this density plot.

We can add another column (name it “SD_class”) to the data set and classify the standard deviations. The maximum and minimum standard deviations among all the grid cells are the following.

max(example1$period_sd_Y)
# [1] 3.605131
min(example1$period_sd_Y)
# [1] 0.8645882

For example, I want to see this plot categorized by standard deviations between 0.8 to 1.5, 1.5 to 2.5, and 2.5 to the maximum value. Here, I am writing a simple loop to go over each row and check the standard deviation value for each row (corresponding to each grid cell in a region); I fill the newly added column (“SD_class”) with the correct class that I specify in the “if statement.”

example1$SD_class<- NA
for (i in 1:nrow(example1)){
  if(example1[i,2]>0.8 && example1[i,2]<= 1.5) {example1[i,4]<- c("0.8-1.5")}
  if(example1[i,2]>1.5 && example1[i,2]<= 2.5) {example1[i,4]<- c("1.5-2.5")}
  if(example1[i,2]>2.5) {example1[i,4]<- c("2.5-3.6")}
}

Now, we just need to add “fill” to the aesthetics section of the code, specify the column with the classifications, and add “alpha” to make the color transparent in order to see the shapes of the graphs and whether they have overlaps.

ggplot(example1, aes(x=example1$period_ave_Y,fill =SD_class))+
  geom_density(alpha=0.4)+
  theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
        axis.text=element_text(size=16),axis.title=element_text(size=16,face="bold"),plot.title = element_text(size = 20, face = "bold"),
        legend.text=element_text(size=13),legend.title=element_text(size=14))+
  labs(title = paste("Density Plot of Regional Average Historical Yield (30 years)"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")

We can also use the “facet_grid()” option, like the plot in Part (1), and specify the column with classification to show each of these classes in a separate panel.

ggplot(example1, aes(x=example1$period_ave_Y,fill =SD_class))+
  geom_density(alpha=0.4)+facet_grid(example1$SD_class ~ .)+
  theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
        axis.text=element_text(size=16),axis.title=element_text(size=16,face="bold"),plot.title = element_text(size = 20, face = "bold"),
        legend.text=element_text(size=13),legend.title=element_text(size=14))+
  labs(title = paste("Density Plot of Regional Average Historical Yield (30 years)"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")

The other interesting variables that we can explore are different percentiles of our data set that correspond to the density plot. For this, we need to obtain the density values (y-axis on the plot) for the percentiles that we are interested in—for example 10%, 25%, 50%, 75%, and 90%. Also we need to find out the actual yield value corresponding to each percentile:

quantiles_yield <- quantile(example1$period_ave_Y, prob=c(0.1, 0.25, 0.5, 0.75, 0.9))
#     10%      25%      50%      75%      90% 
#  4.229513 5.055070 5.582192 5.939071 6.186014

Now, we are going to estimate the density value for each of the yields at the 10th, 25th, 50th, 75th, and 90th percentiles.

df <- approxfun(density(example1$period_ave_Y))

The above function will give us the approximate density value for each point (yield) in which we are interested—in our case, yields for the above percentiles:

df(c(quantiles_yield))
#[1] 0.1176976 0.3267841 0.6129621 0.6615790 0.4345247

Now, we can add several vertical segments to the density plot that show where each percentile is located on this graph. The limits of these segments on the y-axis are based on the density values for each percentile that we got above. Also, note that I used those values to adjust the positions of the labels for the segments.

ggplot()+ 
      geom_density(aes(x=example1$period_ave_Y),fill="blue",alpha=0.4) + 
    geom_segment(aes(x=quantiles_yield, y=0, xend =quantiles_yield,
                     yend= df(c(quantiles_yield))),size=1,colour =c("red","green","blue","purple","orange"),linetype='dashed')+
      theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
            axis.text=element_text(size=16),axis.title=element_text(size=16,face="bold"),plot.title = element_text(size = 20, face = "bold"),
            legend.text=element_text(size=13),legend.title=element_text(size=14))+
      labs(title = paste("Density Plot of Regional Average Historical Yield (30 years) and Percentiles"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")+
    annotate("text", x=4.229513, y=0.15, label=paste("10%"),size=5)+
    annotate("text", x=5.055070, y=0.36, label=paste("25%"),size=5)+
    annotate("text", x=5.582192, y=0.65, label=paste("50%"),size=5)+
    annotate("text", x=5.939071, y=0.7, label=paste("75%"),size=5)+
    annotate("text", x=6.186014, y=0.47, label=paste("90%"),size=5) 

ggplot (Part 1)

In this blog post, I am going to introduce a powerful plotting package in R called ggplot2. This library allows us to quickly create different plots (scatter plots, boxplots, histograms, density plots, time series plots: you name it!) while also customizing them to produce elegant graphics beyond regular line or bar charts. First, we need to download the library and then activate it:

install.packages("ggplot2")
library(ggplot2)

I am going to outline how to build two different types of map: (1) a calendar heat map and (2) an alluvial map. The first is used to present the variations of a variable or an activity over a long period of time through color coding, so that we can easily recognize the trend, the seasonality, or any patterns or anomalies. The alluvial map provides better visualization for categorical and multidimensional data; it shows how the magnitude of a variable changes across different categories, which can be any discrete indexes. To create this type of map, we also need the “ggalluvial” library, which has to be loaded along with ggplot2.

The general code structure for plotting calendar heat map is the following:

ggplot(dataframe, aes(x, y, fill)) + geom_tile() + facet_grid()

With “aes,” which stands for aesthetics, we describe how variables in the data frame are mapped to visual properties, so we specify the x- and y-axes. We use the fill aesthetic to specify the fill color for different variables.

“Geom” specifies the geometric objects that define the graph type—which can be point, line, etc.—and shows the value we assigned in “aes(fill)”. The “geom_tile” geometry tiles the plane with rectangles, using the center of each tile and its size; a rectangle can be divided into smaller rectangles or squares.

The “facet” command creates a trellis graph, or panel layout, by specifying one or two faceting variables on top of the aesthetics.
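Before moving to real data, here is a minimal, self-contained sketch of that structure using made-up toy values (the data frame toy and its columns are hypothetical):

# toy calendar-style heat map: months across the x-axis, one panel per year
library(ggplot2)
toy <- expand.grid(Month = month.abb[1:6], Year = 2019:2021)
toy$value <- runif(nrow(toy), 0, 100)  # random values standing in for a real variable
ggplot(toy, aes(x = Month, y = 1, fill = value)) +
  geom_tile() +
  facet_grid(Year ~ .)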

We can show the daily, weekly, or monthly values in the calendar heat map. As an example of a calendar heat map, I am using weather data for the Yakima River Basin in central Washington State; these data were originally downloaded from the Northwest Knowledge Network. The data set includes three downscaled global climate models (GCMs) from 2019 to 2061 at a resolution of 1/24th degree; the data are also aggregated over the basin and to a monthly time step. You can download the data here. Now, let’s read the data.

gcm1 <- read.csv("(your directory)/CanESM2_rcp85.csv")
gcm2 <- read.csv("(your directory)/inmcm4_rcp85.csv")
gcm3 <- read.csv("(your directory)/GFDL-ESM2G_rcp85.csv")

By running head(gcm1) or colnames(gcm1), you can see the different columns in each data set, including “Year”; “Month”; the name of the “GCM”; and weather variables including tasmin, tasmax, pr, and PotEvap, corresponding to minimum and maximum temperature, precipitation, and potential evapotranspiration. The goal is to visualize the difference between these realizations of monthly precipitation for 21 future years from 2020 to 2040. To lay out panels in a facet_grid, I want to show years and months. In each month, I am going to show precipitation values for each GCM.

gcms<- rbind(gcm1,gcm2,gcm3)  # Join three data frames into one

# Add a new column to the data frame and call it “Month_name”; then, fill it with the name of the months for each row
tst<- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct",
"Nov","Dec")  
gcms$Month_name <- tst[gcms$Month]  
gcms$nmonth<- as.factor(1)  # add a new column to a data frame and fill it with value of 1, as a factor
gcm_2040<- subset(gcms, gcms$Year<2041) # select just the years before 2041
prec_fut<- ggplot(gcm_2040, aes(x=gcm_2040$gcm,nmonth,fill = pr)) +
  geom_tile()+
  facet_grid(Year~Month_name) +
  theme(strip.text.x = element_text(colour = "black",size =9 ,margin=margin(0.2,0,0.2,0,"cm")),
        strip.text.y = element_text(colour = "black",size = 9,angle =360),
        strip.background = element_rect(colour="black", size=1, linetype="solid"),
        axis.text.x=element_text(size=9,angle = 90),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        panel.background = element_rect(fill = "white"))+
scale_fill_gradient(low="green",high="red",limits=c(0,230),breaks=c(0,25,50,75,100,125,150,175,200,230),labels=c(0,25,50,75,100,125,150,175,200,230)) +
  labs(x="",
       y="",
       title = "Time-Series Calendar Heatmap",
       subtitle="Projected Monthly Precipitation-Yakima Rive Basin",
       fill="Precipitation (mm)") 
ggsave("(your directory)/name.png", prec_fut)

With theme(), you can modify the non-data parts of your plot. For example, “strip.text.x and .y” and “strip.background“ adjust facet labels of each panel along horizontal and vertical directions and background format.

The “axis.text” and “axis.ticks” commands adjust the tick labels and marks along axes, and by using element_blank(), you can remove specific labels. With “panel.background,” the underlying background of the plot can be customized.

With argument “limits” in “scale_fill_gradient,” you can extend the color bar to the minimum and maximum values you need. In our example, these limits are obtained by the following commands.

max(gcm_2040$pr) 
min(gcm_2040$pr)

With the labs() command, you can change axis labels, plot, and legend titles. Finally, ggsave() is used to save a plot; the scale, size, and resolution of the saved plot can also be specified. Ggsave() supports various formats including “eps,” “ps,” “tex” (pictex), “pdf,” “jpeg,” “tiff,” “png,” “bmp,” “svg,” and “wmf” (Windows only).

What can we learn from this graph? Well, we can see the interannual variability of precipitation based on three GCMs and how monthly precipitation varies in different GCMs. Compared to precipitation values for wet months, when variations are generally higher, precipitation values for dry months are more consistent among GCMs.

Now, let’s create an alluvial diagram. For that, we need to prepare the essential components. The data should be categorical, and for each row, there should be a frequency for a category that we are interested in presenting. In this example, I am going to show simulated (by a crop model) winter wheat yield changes in dryland low rainfall zone of the Pacific Northwest during the future period (2055–2085) compared to the historical period (1980–2010). The low zone in this region includes 1,384 grid cells with the dimension of 4 by 4 km. Different GCMs projected different weather scenarios, which we showed in the calendar heat map plot. As you can imagine, if you force your crop model with different GCMs, you will get different projections for the crop yield. The example data set can be read in R by the following:

L_zones<- read.csv("(your directory)/yield_gcms.csv",head=T)

This data set includes a few columns: “fre_yield,” “GCM,” “Zone,” “RCP,” and “Ratio.”

For each row, there is a GCM name that corresponds to an RCP and a Zone. Then, there is a Ratio column, which shows the category of the yield ratio. These yield ratio categories correspond to the average winter wheat yield during the future period, divided by the average yield during the historical period. For each of the categories under the Ratio column, the number of grid cells was counted and is reported under the “fre_yield” column. For example, Row 2 (L_zones[2,]) shows that 976 grid cells out of 1,384 cells in the low-rainfall zone are projected to have a yield ratio between 1.2 and 1.5 during 2055–2085 compared to 1980–2010, under the CanESM2 and RCP 4.5 future weather scenarios.

Ggplot and ggalluvial provide an easy way to illustrate this type of data set. At each category on the x-axis, we can have multiple groups, and they are called “strata.” Alluvial diagrams have horizontal splines that span across the categories at the x-axis, and they are called “alluvia.” A fixed value is assigned to an alluvium at each category at the x-axis that can be represented by a fill color.

install.packages("ggalluvial")
library(ggalluvial)

yield_ratio<- ggplot(data = L_zones, aes(axis=GCM, axis2=RCP, y = fre_yield)) + 
 # We can add more (axis4, …) to have more groups in the x-axis
  scale_x_discrete(limits = c("GCM","RCP"), expand = c(.1, .1)) +
  geom_alluvium(aes(fill = Ratio)) +
  geom_stratum(width = 1/12, fill = "lightgrey", color = "black") + 
  geom_label(stat = "stratum", label.strata = TRUE) +
  scale_fill_brewer(type = "qual", palette = "Set1")+
  labs(x="",
       y="Number of Grid cells in the Zone",
       title = "Average Change in Winter Wheat Yield During 2055-2085 Compared to 1980-2010",
       subtitle="Low Rainfall Zone in Pacific Northwest")
ggsave("(your directory)/Future_WW_Yield.jpeg",yield_ratio, width=10, height=10)

Y in the aes() controls the heights of the alluvia and is aggregated across equivalent observations.

“Scale_x_discrete” allows you to place labels between discrete position scales. You can use “limit” to define values of the scale and also their order, and “expand” adds some space around each value of the scale.

“Geom_alluvium” receives the x and y from the data set from ggplot and plots the flows between categories.

“Geom_stratum” plots the rectangles for the categories, and we can adjust their appearance.

 Labels can be assigned to strata by adding “stat = stratum” and “label.strata = TRUE” to the geom_label. Then, the unique values within each stratum are shown on the map.

“Scale_fill_brewer” is useful for displaying discrete values on a map. The type can be seq (sequential), div (diverging), or qual (qualitative). The “palette” can be a string of the named palette or a number, which will index into the list of palettes of appropriate type.

Now, we can easily see in this graph that the three GCMs used in the crop model produced different results. The changes in winter wheat yield during the future period compared to the historical period are not predicted with the same magnitude based on different future weather scenarios, and these differences are more profound under RCP 8.5 compared to RCP 4.5.

R-Markdown

What Is R-Markdown? Why Are We Interested in It?

A few years ago, a very dear friend of mine told me about R-Markdown. I was working on a report, and he said that I should try this very cool tool. I did so but not immediately. I started with an “if it ain’t broke, don’t fix it” attitude. However, I quickly realized that R-Markdown really is helpful—well, at least in many situations.

What is R-Markdown? It is a script-based text-development platform for preparing high-quality papers and reports. This powerful tool is especially effective for complicated documents that contain various types of diagrams and tables. R-Markdown is a distribution of the Markdown language for R. More information about Markdown can be found here.
Personally, I’ve found R-Markdown to be a powerful tool for creating tutorial documents that include figures, tables, blocks of code, and more. R-Markdown can also be very helpful for working on papers; you can have everything in the same place. For example, as you will see in this tutorial, you can generate your figures and tables within documents. Because it is script-based, R-Markdown is reproducible; you will always get the same text format and figure quality. Therefore, if you want to have a professional-looking CV or are working on a paper or report, I suggest giving R-Markdown a try. The tool might become your new best friend.

Install R-Markdown

There are two steps to install R-Markdown:

1- Install R Markdown


# 1- Install R Markdown

install.packages("rmarkdown")
library(rmarkdown)


2- You also need to install “tinytex”. You can use the following commands to install and load it:

tinytex::install_tinytex()
library(tinytex)

Create an R-Markdown Document

To create your first R-Markdown document, start by installing R-Markdown. Then, open the “File” menu, and click on “New File.” From the dropdown menu, select “R-Markdown.” Doing so will open an R-Markdown file in your RStudio. The file comes with very simple and informative instructions.

On a side note, I use RStudio, which is a popular and user-friendly integrated development environment (IDE) for R. You can find more information about it (here).

Publish your document

The final format of your output document can be PDF, HTML, or Word. To select your favorite output and generate your final document, click on “Knit,” which opens a dropdown menu. Select the output format—for example, pdf—and it will generate your document.
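You can also knit from the R console instead of the button; here is a minimal example using the rmarkdown package (the file name below is just a placeholder):

rmarkdown::render("my_report.Rmd", output_format = "pdf_document")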

Components of an R-Markdown Code

R-Markdown documents usually include meta-data, text, and code chunks. The following sections briefly describe the components, and more information can be found on R-Markdown’s website.

Meta-Data

When generating documents, R-Markdown requires some initial information and instructions. These can include general data about the documents—for example, date, title, output format, and author’s name.
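For example, a minimal YAML header at the top of an .Rmd file might look like the following (all values are placeholders):

---
title: "My First R-Markdown Document"
author: "Your Name"
date: "August 2021"
output: pdf_document
---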

Text

Text parts in R-Markdown follow the tradition of other document-markup languages such as LaTeX (see here). However, R-Markdown is easier than LaTeX. Basically, authors can use scripts to adjust document formatting. Many details can be listed about R-Markdown’s text-formatting commands, but I am not going to explain them all in this short tutorial. These cheat sheets here and here provide enough information to get you started on writing an R-Markdown document. A few examples: # Header creates a header, [text](url) creates a hyperlink, and you can use $ to insert equations (e.g., $y=ax^{2}+bx+c$).
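As a small illustration, the following lines in the body of an .Rmd file produce a header, emphasized text, a hyperlink, and an inline equation (the link target is just an example):

# A Section Header

Some *italic* and **bold** text, a [link to the R-Markdown site](https://rmarkdown.rstudio.com),
and an inline equation $y = ax^{2} + bx + c$.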

Code Chunks

Different types of code chunks can be used in R-Markdown; the types depend on the application of the code. You might want to show your code when you develop an instruction. You can also write code solely for generating a figure, but you may not want to show the code itself. The following generates a timeline figure of Michael Jackson’s life; you can see the code.

#You need to uncomment ``` lines

#```{r timeline}

# This code chunk generates a timeline of Michael Jackson's life.

library(timelineS)
timelineS(mj_life, main = "Life of Michael Jackson",label.cex =   0.7)

#```

Adding Tables

There are different libraries available in R for generating nice-looking tables. Here I use knitr.

#You need to uncomment ``` lines

#```{r table}

library(ggplot2)

library(knitr)
kable(mpg[1:8,])

#```

Adding Figures to Your Document

R-Markdown allows you to generate plots in your documents. For example, you can use ggplot2, which is a powerful figure-creation library in R, to create and insert a plot into your document. See the following. If you include echo = FALSE in the header of your code chunk, the code itself will not appear in your final pdf file.

#You need to uncomment ``` lines

#```{r ggplot} 

library(ggplot2)

# MPG dataset is already available in ggplot2, I use it to generate the following figure

ggplot(mpg, aes(x=cyl, y=cty)) + geom_boxplot(aes(fill=factor(cyl))) + 
    labs(title="Mileage vs Number of Cylinders", 
         x="Number of Cylinders",
         y="City Mileage",
         fill="City Mileage")

#```

Geospatial Mapping in R

Introduction

Let me start this blog post by stating the obvious: Geospatial maps are interesting to look at and certainly make papers and presentations prettier and more impressive; however, those are not the only reasons that such maps exist. They are used to communicate various types of information including geographical locations of regions in the world.

Why R?

Several available platforms have been used for drawing spatial maps and conducting geospatial data management. An eminent example is ArcGIS, which is a popular, flexible, and user-friendly geospatial mapping tool. Although ArcGIS is powerful and has many features, I am personally interested in open source, and Linux-friendly software.

Although there are several GIS tools such as Python, GRASS, QGIS, and UbuntuGIS, in this blog post, I will explain how R can be used as an alternative tool for geospatial analysis and for drawing spatial maps. R offers several advantages. First, R is an open-source platform, whereas ArcGIS is relatively expensive. Second, R is script based. In some situations, you might have to generate several hundred maps from post-processed results; a tool such as R offers faster and more flexible data processing. You can run R on Linux machines and computer clusters and link it to other models that work under the Linux operating system. Different packages in R have been developed for geospatial analysis. In this exercise, I am going to focus on “rgdal,” a widely used R package that provides bindings to the Geospatial Data Abstraction Library (GDAL).

I recently moved to Cornell University, and I am eager to learn more about this region, so I decided to focus on the Susquehanna River Basin (SRB), located in the US mid-Atlantic. The SRB drains parts of New York, Pennsylvania, and Maryland to the Chesapeake Bay. Before I get entirely sidetracked by my interest in the SRB, let’s go back to the original intention of this blog post, which is making geospatial maps in R.

Prerequisites

Download all necessary data from the following links, then unzip and save them in your preferred folder.

1- Susquehanna River Basin Boundary from here

2- Major Watersheds in the Susquehanna River Basin from here

3- Susquehanna River from here

Open a new R Script in your R-Studio, then install the following R packages, you can use the following commands to install and load the packages:

# install.packages("rgdal")
# install.packages("ggplot2")
# install.packages("RColorBrewer")

library(rgdal)
library(ggplot2)
library(RColorBrewer)

Step 1- Map of Susquehanna River Basin

The first map of this exercise is a simple map of Susquehanna River Basin.

# I) The first step is to draw the map of SRB using the following code

SRB_Boundary <- readOGR(dsn = "spatial maps/Code/Shapefiles/srb/srb.shp")
plot(SRB_Boundary, col="gray90",
     main="Figure 1", sub="Susquehanna River Basin", cex.main=2.5, cex.sub=2.5)

# II) Then we can add Susquehanna River to the map

SR=readOGR("spatial maps/Code/Shapefiles/WtrTrails/WtrTrails.shp")
plot(SR, col="skyblue3", add=T, lwd=2)

# III) Adding information from the attribute table
#Shapefiles usually contain helpful information, such as name of objects, 
#sub-basins, area/length of objects, etc. 
#We are often interested in adding some of that information to our maps. 
#Here is how we can do it in R:

text(SRB_Boundary$NAME, x=coordinates(SRB_Boundary)[1],
     y=coordinates(SRB_Boundary)[2]*1.2, cex=1.2, col="darkblue", font=2)

text(paste("Area=27,500 square miles"), x=coordinates(SRB_Boundary)[1],
     y=(coordinates(SRB_Boundary)[2]*1.15), font=3, cex=1, col="darkblue")

# Let's add coordinates to the map

llgridlines(SRB_Boundary, plotLabels = T, cex=1.5)

Step 2- Selection of objects from an attribute table

If you have already worked with ArcGIS, you have probably used its selection tools. What we are doing here is equivalent to selecting features from the attribute table. If you are not familiar with attribute tables, this short explanation from Esri should be helpful. A quick way to inspect an attribute table in R is sketched after the selection code below.

#I) Let's add SRB map again

plot(SRB_Boundary, col="gray90", 
     main="Figure 2", sub="Sub-basins greater than 800 square kilometer", 
     cex.main=2.5, cex.sub=2.5)

#II) Then we can add all the subbasins in SRB to the map

Subbasin <- readOGR(dsn = "spatial maps/Code/Shapefiles/wshedmjr/wshedmjr.shp")
plot(Subbasin, add =T, col=alpha("darkolivegreen1", 0.9))

# III) For this exercise, we are going to select the large sub-basins of the SRB
# with an area greater than 800 square kilometers

LargestBasins=which(Subbasin$SQM>800) # square kilometer

# IV) Now we are going to change the color of these selected features on the map

plot(Subbasin[LargestBasins,], add =T, col=alpha("seagreen", 0.9))
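
Since the selection above is driven entirely by a field in the attribute table (SQM), it can help to look at that table directly. Here is a minimal sketch, assuming the Subbasin layer loaded above (field names will vary from shapefile to shapefile):

# Peek at the attribute table behind the selection
head(Subbasin@data)  # the data slot of the loaded shapefile holds its attribute table
names(Subbasin)      # list just the field names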

Step 3- Adding a legend to the map

In this part of the exercise, we are going to add a legend to the map

# I) SRB map 

 plot(SRB_Boundary, col="gray70",lwd=4,
       main="Figure 3", sub="Precipitation contour lines", cex.main=2.5, cex.sub=2.5)

# Now let's add precipitation contours to the SRB map

 isohyet=readOGR(dsn = "spatial maps/Code/Shapefiles/precip_iso/precip_iso.shp")
 plot(isohyet, add=T, col=bpy.colors(11), lwd=4)
   
# Add coordinate grid lines again
llgridlines(SRB_Boundary, plotLabels = T, cex=1.5)

# We can use the following script to add a legend to the map

legend("right",box.col = "white", legend = unique(isohyet$INCHES), 
       fill=bpy.colors(11), cex=1.75, title = "Precipitation (inches)")

In this short tutorial, we went over some basic features of rgdal. However, R can be used for more sophisticated geospatial analysis tasks, which I might cover in future blog posts.

Root finding in MATLAB, R, Python and C++

In dynamical systems, we are often interested in finding stable points, or equilibria. Some systems have multiple equilibria. As an example, take the lake problem, which is modeled by the equation below, where X_t is the lake P concentration, a_t are the anthropogenic P inputs, Y_t ~ LN(μ, σ²) are random natural P inputs, b is the P loss rate, and q is a shape parameter controlling the rate of P recycling from the sediment. The anthropogenic, natural, and recycling terms on the right-hand side make up the "Inputs," while the loss term bX_t represents the "Outputs." A lake is in equilibrium when the inputs are equal to the outputs and the lake P concentration therefore is not changing over time.

X_{t+1} = X_t + a_t + Y_t + \frac{X_t^q}{1+X_t^q} - bX_t

For irreversible lakes this occurs at three locations, even in the absence of anthropogenic and natural inputs: an oligotrophic equilibrium, an unstable equilibrium (called the critical P threshold) and a eutrophic equilibrium (see figure below).

[Figure: lake P inputs and outputs as a function of lake P concentration, showing the oligotrophic equilibrium, the critical P threshold, and the eutrophic equilibrium]

The unstable equilibrium in this case is called the critical P threshold because once it is crossed, it is impossible to return to an oligotrophic equilibrium by reducing anthropogenic and natural P inputs alone. In irreversible lakes like this, we would therefore like to keep the lake P concentration below the critical P threshold. How do we find the critical P threshold? With a root finding algorithm!

As stated earlier, the system above will be in equilibrium when the inputs are equal to the outputs and the P concentration is not changing over time, i.e. when

X_{t+1} - X_t = \frac{X^q_t}{1+X^q_t} - bX_t = 0

Therefore we simply need to find the zero, or "root," of the above equation. Most of the methods for this require either an initial estimate or upper and lower bounds on the location of the root. These are important, since an irreversible lake will have three roots. If we are only interested in the critical P threshold, we have to make sure that we provide an estimate which leads to the unstable equilibrium, not either of the stable equilibria. If possible, you should plot the function whose root you are finding to make sure you are giving a good initial estimate or bounds, and check afterward to ensure the root that was found is the one you want! Here are several examples of root-finding methods in different programming languages.
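
Before diving into the individual languages, here is a quick sketch in R of plotting the function to see all three roots, assuming b = 0.42 and q = 2 (the values used in the examples below):

# Plot the right-hand side of the equilibrium equation to see where it crosses zero
b <- 0.42
q <- 2.0
f <- function(x) x^q/(1 + x^q) - b*x
curve(f, from = 0, to = 2.5, xlab = "Lake P concentration, X", ylab = "f(X)")
abline(h = 0, lty = 2)  # the three zero crossings are the three equilibria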

In MATLAB, roots can be found with the function fzero(fun,x0) where ‘fun’ is the function whose root you want to find, and x0 is an initial estimate. This function uses Brent’s method, which combines several root-finding methods: bisection, secant, and inverse quadratic interpolation. Below is an example using the lake problem.

myfun = @(x,b,q) x^q/(1+x^q)-b*x;
b = 0.42;
q = 2.0;
fun = @(x) myfun(x,b,q);
pcrit = fzero(fun,0.75);

This returns pcrit = 0.5445, which is correct. If we had provided an initial estimate of 0.25 instead of 0.75, we would get pcrit = 2.6617E-19, basically 0, which is the oligotrophic equilibrium in the absence of anthropogenic and natural P inputs. If we had used 1.5 as an initial estimate, we would get pcrit = 1.8364, the eutrophic equilibrium.

[Screenshot: MATLAB output]

In R, roots can be found with the function uniroot, which also uses Brent’s method. Dave uses this on line 10 of the function lake.eval in his OpenMORDM example. Instead of taking in an initial estimate of the root, this function takes in a lower and upper bound. This is safer, as you at least know that the root estimate will lie within these bounds. Providing an initial estimate that is close to the true value should do well, but is less predictable; the root finding algorithm may head in the opposite direction from what is desired.

b <- 0.42
q <- 2.0
pcrit <- uniroot(function(x) x^q/(1+x^q) - b*x, c(0.01, 1.5))$root

This returns pcrit = 0.5445145. Good, we got the same answer as we did with MATLAB! If we had used bounds of c(0.75, 2.0) we would have gotten 1.836426, the eutrophic equilibrium.

What if we had given bounds that included both of these equilibria, say c(0.5, 2.0)? In that case, R returns an error: 'f() values at end points not of opposite sign'. That is, if the value returned by f(x) is greater than 0 for the lower bound, it must be less than 0 for the upper bound and vice versa. In this case both f(0.5) and f(2.0) are less than 0, so the algorithm fails. What if we gave bounds for which one endpoint is positive and the other negative, but within which there are multiple roots, say c(-0.5, 2.0)? Then R just reports the first root it finds, in this case pcrit = 1.836437, the eutrophic equilibrium. So it's important to make sure you pick bounds narrow enough to include the root you want, but not roots you don't!

[Screenshot: R output]
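
Here is a minimal sketch of the two situations just described (with b and q as defined above):

# Bounds where f has the same sign at both endpoints: uniroot() throws an error
f <- function(x) x^q/(1 + x^q) - b*x
# uniroot(f, c(0.5, 2.0))      # error: f() values at end points not of opposite sign

# Bounds that bracket several roots: uniroot() simply returns the root it converges to
uniroot(f, c(-0.5, 2.0))$root

# Narrow bounds that bracket only the critical P threshold
uniroot(f, c(0.1, 1.0))$root   # ~0.5445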

In Python, you can use either scipy.optimize.root or scipy.optimize.brentq, which is what Jon uses on line 14 here. scipy.optimize.root can be used with several different algorithms, but the default is Powell’s hybrid method, also called Powell’s dogleg method. This function only requires an initial estimate of the root.

from scipy.optimize import root
b = 0.42
q = 2.0
pcrit = root(lambda x: x**q/(1+x**q) - b*x, 0.75)

scipy.optimize.root returns an object with several attributes. The attribute of interest to us is the root, represented by x, so we want pcrit.x. In this case, we get the correct value of 0.54454. You can play around with initial estimates to see how pcrit.x changes.

[Screenshot: Python output (scipy.optimize.root)]

Not surprisingly, scipy.optimize.brentq uses Brent’s method and requires bounds as an input.

from scipy.optimize import brentq as root
b = 0.42
q = 2.0
pcrit = root(lambda x: x**q/(1+x**q) - b*x, 0.01, 1.5)

This just returns the root itself, pcrit = 0.5445. Again, you can play around with the bounds to see how this estimate changes.

[Screenshot: Python output (scipy.optimize.brentq)]

In C++, Dave again shows how this can be done in the file 'main-lake.cpp' provided in the Supplementary Material to OpenMORDM, linked from this page under the "Publications" section. On lines 165-168 he uses the bisect tool to find the root of the function given on lines 112-114. I've copied the relevant sections of his code into 'find_Pcrit.cpp' below.


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>
#include <boost/math/tools/roots.hpp>

namespace tools = boost::math::tools;
using namespace std;

double b, q, pcrit;

double root_function(double x) {
  return pow(x,q)/(1+pow(x,q)) - b*x;
}

bool root_termination(double min, double max) {
  return abs(max - min) <= 0.000001;
}

int main(int argc, char* argv[])
{
  b = 0.42;
  q = 2.0;

  std::pair<double, double> result = tools::bisect(root_function, 0.01, 1.0, root_termination);
  pcrit = (result.first + result.second)/2;
  cout << pcrit << endl;
}

This yields the desired root of pcrit = 0.54454, but of course, changing the bounds may result in different estimates. In case you missed it, the take home message is to be careful about your initial estimate and bounds ;).

[Screenshot: C++ output]

 

From Writing NetCDF Files in C to Loading NetCDF Files in R

So much data from such little models…

It's been my experience that even simple models can generate lots of data. If you're a regular reader of this blog, I imagine you've had similar experiences as well. My most recent experience with this is the work I've done with the Dynamic Integrated Climate-Economy model (DICE). I had inherited a port of the 2007 version of the model, which would print relevant output to the screen. During my initial runs with the model, I would simply redirect the output to ascii files for post-processing. I knew that eventually I would be adding all sorts of complexity to this model, ultimately leading to high-dimensional model output and rendering the use of ascii files impractical. I knew that I would need a better way to handle all this data. So in updating the model to the 2013 version, I decided to incorporate support for netCDF file generation.

You can find details about the netCDF file format through Unidata (a University Corporation for Atmospheric Research [UCAR] Community Program) and through some of our previous blog posts (here, here, and here). What's important to note here is that netCDF is a self-describing file format designed to manage high-dimensional hierarchical data sets.

I had become accustomed to netCDF files in my previous life as a meteorologist. Output from complex numerical weather prediction models would often come in netCDF format. While I had never needed to generate my own netCDF output files, I found it incredibly easy and convenient to process them in R (my preferred post-processing and visualization software). Trying to incorporate netCDF output support in my simple model seemed daunting at first, but after a few examples I found online and a little persistence, I had netCDF support incorporated into the DICE model.

The goal of this post is to guide you through the steps to generate and process a netCDF file. Some of our earlier posts go through a similar process using the Python and Matlab interfaces to the netCDF library. While I use R for post-processing, I generally use C/C++ for the modeling; thus I’ll step through generating a netCDF file in C and processing the generated netCDF file in R on a Linux machine.

Edit: I originally put a link to the following code at the bottom of this post. For convenience, here's a link to the bitbucket repository that contains the code examples below.

Writing a netCDF file in C…

Confirm netCDF installation

First, be sure that netCDF is installed on your computing platform. Most scientific computing clusters will have the netCDF library already installed. If not, contact your system administrator to install the library as a module. If you would like to install it yourself, Unidata provides the source code and great documentation to step you through the process. The example I provide here isn’t all that complex, so any recent version (4.0+) should be able to handle this with no problem.

Setup and allocation

Include the header files

With the netCDF libraries installed, you can now begin to code netCDF support into your model. Again, I’ll be using C for this example. Begin by including the netCDF header file with your other include statements:

#include <stdlib.h>
#include <stdio.h>
#include <netcdf.h>

Setup an error handler

The netCDF library includes a nice way of handling possible errors from the various netCDF functions. I recommend writing a simple wrapper function that can take the returned values of the netCDF functions and produce the appropriate error message if necessary:

void ncError(int val)
{
  printf("Error: %s\n", nc_strerror(val));
  exit(2);
}

Generate some example data

Normally, your model will have generated important data at this point. For the sake of the example, let’s generate some data to put into a netCDF file:

  // Loop control variables
  int i, j, k;
  
  // Define the dimension sizes for
  // the example data.
  int dim1_size = 10;
  int dim2_size = 5;
  int dim3_size = 20;
  
  // Define the number of dimensions
  int ndims = 3;
  
  // Allocate the 3D vectors of example data
  float x[dim1_size][dim2_size][dim3_size]; 
  float y[dim1_size][dim2_size][dim3_size];
  float z[dim1_size][dim2_size][dim3_size];
  
  // Generate some example data
  for(i = 0; i < dim1_size; i++) {
      for(j = 0; j < dim2_size; j++) {
          for(k = 0; k < dim3_size; k++) {
              x[i][j][k] = (i+j+k) * 0.2;
              y[i][j][k] = (i+j+k) * 1.7;
              z[i][j][k] = (i+j+k) * 2.4;
          }
      }
    }

This generates three variables, each with three different size dimensions. Think of this, for example, as variables on a 3-D map with dimensions of [latitude, longitude, height]. In my modeling application, my dimensions were [uncertain state-of-the-world, BORG archive solution, time].

Allocate variables for IDs

Everything needed in creating a netCDF file depends on integer IDs, so the next step is to allocate variables for the netCDF file id, the dimension ids, and the variable ids:

// Allocate space for netCDF dimension ids
int dim1id, dim2id, dim3id;
  
// Allocate space for the netcdf file id
int ncid;
  
// Allocate space for the data variable ids
int xid, yid, zid;

Each one of these IDs will be returned through reference by the netCDF functions. While we’re at it, let’s make a variable to hold the return status of the netCDF function calls:

// Allocate return status variable
int retval;

Define the meta-data

Now we will start to build the netCDF file. This is a two-part process. The first part is defining the meta-data for the file and the second part is assigning the data.

Create an empty netCDF file

First, create the file:

// Setup the netcdf file
if((retval = nc_create("example.nc", NC_NETCDF4, &ncid))) { ncError(retval); }

Note that we store the return status of the function call in retval and test the return status for an error. If there’s an error, we pass retval to our error handler. The first parameter to the function call is the name of the netCDF file. The second parameter is a flag that determines the type of netCDF file. Here we use the latest-and-greatest type of NETCDF4, which includes the HDF5/zlib compression features. If you don’t need these features, or you need a version compatible with older versions of netCDF libraries, then use the default or 64-bit offset (NC_64BIT_OFFSET) versions. The third parameter is the netCDF integer ID used for assigning variables to this file.

 Add the dimensions

Now that we have a clean netCDF file to work with, let’s add the dimensions we’ll be using:

 // Define the dimensions in the netcdf file
 if((retval = nc_def_dim(ncid, "dim1_size", dim1_size, &dim1id))) { ncError(retval); }
 if((retval = nc_def_dim(ncid, "dim2_size", dim2_size, &dim2id))) { ncError(retval); }
 if((retval = nc_def_dim(ncid, "dim3_size", dim3_size, &dim3id))) { ncError(retval); }
  
 // Gather the dimids into an array for defining variables in the netcdf file
 int dimids[ndims];
 dimids[0] = dim1id;
 dimids[1] = dim2id;
 dimids[2] = dim3id;

Just as before, we catch and test the function return status for any errors. The function nc_def_dim() takes four parameters. First is the netCDF file ID returned when we created the file. The second parameter is the name of the dimension. Here we’re using “dimX_size” – you would want to use something descriptive of this dimension (i.e. latitude, time, solution, etc.). The third parameter is the size of this dimension (i.e. number of latitude, number of solutions, etc.). The last is the ID for this dimension, which will be used in the next step of assigning variables. Note that we create an array of the dimension IDs to use in the next step.

 Add the variables

The last step in defining the meta-data for the netCDF file is to add the variables:

// Define the netcdf variables
if((retval = nc_def_var(ncid, "x", NC_FLOAT, ndims, dimids, &xid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "y", NC_FLOAT, ndims, dimids, &yid))) { ncError(retval); }
if((retval = nc_def_var(ncid, "z", NC_FLOAT, ndims, dimids, &zid))) { ncError(retval); }

The nc_def_var() function takes 6 parameters. These include (in order) the netCDF file ID, the variable name to be displayed in the file, the type of data the variable contains, the number of dimensions of the variable, the IDs for each of the dimensions, and the variable ID (which is returned through reference). The type of data in our example is NC_FLOAT, which is a 32-bit floating point. The netCDF documentation describes the full set of data types covered. The IDs for each dimension are passed as that combined array of dimension IDs we made earlier.

 Optional: Add variable attributes

This part is optional, but is incredibly useful and true to the spirit of making a netCDF file. When sharing a netCDF file, the person receiving the file should have all the information they need about the data within the file itself. This can be done by adding “attributes”. For example, let’s add a “units” attribute to each of the variables:

 // OPTIONAL: Give these variables units
 if((retval = nc_put_att_text(ncid, xid, "units", 2, "cm"))) { ncError(retval); }
 if((retval = nc_put_att_text(ncid, yid, "units", 4, "degC"))) { ncError(retval); }
 if((retval = nc_put_att_text(ncid, zid, "units", 1, "s"))) { ncError(retval); }

The function nc_put_att_text() puts a text-based attribute onto a variable. The function takes the netCDF file ID, the variable ID, the name of the attribute, the length of the string of characters for the attribute, and the text associated with the attribute. In this case, we're adding an attribute called "units". Variable 'x' has units of "cm", which has a length of 2. Variable 'y' has units of "degC", which has a length of 4 (and so on). You can apply text-based attributes as shown here or numeric-based attributes using the appropriate nc_put_att_X() function (see the documentation for the full list of numeric attribute functions). You can also apply attributes to dimensions by using the appropriate dimension ID, or set a global attribute by using NC_GLOBAL in place of a variable ID.

 End the meta-data definition portion

At this point, we’ve successfully created a netCDF file and defined the necessary meta-data. We can now end the meta-data portion:

 // End "Metadata" mode
  if((retval = nc_enddef(ncid))) { ncError(retval); }

…and move on to the part 2 of the netCDF file creation process.

Populate the file with data

Put your data into the netCDF file

Here, all we do is put data into the variables we defined in the file:

 // Write the data to the file
 if((retval = nc_put_var(ncid, xid, &x[0][0][0]))) { ncError(retval); }
 if((retval = nc_put_var(ncid, yid, &y[0][0][0]))) { ncError(retval); }
 if((retval = nc_put_var(ncid, zid, &z[0][0][0]))) { ncError(retval); }

The function nc_put_var() takes three parameters: the netCDF file ID, the variable ID, and the memory address of the start of the multi-dimensional data array. At this point, the data will be written to the variable in the netCDF file. There is a way to write to the netCDF file in data chunks, which can help with memory management, and a way to use parallel I/O for writing data in parallel to the file, but I have no experience with that (yet). I refer those interested in these features to the netCDF documentation.

Finalize the netCDF file

That’s it! We’re done writing to the netCDF file. Time to close it completely:

 // Close the netcdf file
 if((retval = nc_close(ncid))) { ncError(retval); }

Compile and run the code

Let’s compile and run the code to generate the example netCDF file:

gcc -o netcdf_example netcdf_write_example.c -lnetcdf

Some common problems people run into here are not including the netCDF library flag at the end of the compilation call, not having the header files in the include-path, and/or not having the netCDF library in the library-path. Check your user environment to make sure the netCDF paths are included in your C_INCLUDE_PATH and LIBRARY_PATH:

env | grep -i netcdf

Once the code compiles, run it to generate the example netCDF file:

./netcdf_example

If everything goes according to plan, there should be a file called “example.nc” in the same directory as your compiled code. Let’s load this up in R for some post-processing.

 Reading a netCDF file in R…

Install and load the “ncdf4” package

To start using netCDF files in R, be sure to install the netCDF package “ncdf4”:

install.packages("ncdf4")
library(ncdf4)

Note that there's also an older "ncdf" package, which reads and writes only the classic (default) and 64-bit offset versions of netCDF files. I recommend against using it, since the newer "ncdf4" package handles the old file versions as well as the new netCDF4 version. In fact, the "ncdf" package has since been removed from the CRAN repository, so "ncdf4" is the one to use.


Open the netCDF file

With the library installed and sourced, let’s open the example netCDF file we just created:

 nc <- nc_open("example.nc")

This stores an open file handle to the netCDF file.

View summary of netCDF file

Calling or printing the open file handle will produce a quick summary of the contents of the netCDF file:

 print(nc)

This summary produces the names of the available variables, the appropriate dimensions, and any global/dimension/variable attributes.

Extract variables from the netCDF file

To extract those variables, use the command:

x <- ncvar_get(nc, "x")
y <- ncvar_get(nc, "y")
z <- ncvar_get(nc, "z")

At this point, the data you extracted from the netCDF file are loaded into your R environment as 3-dimensional arrays. You can treat these the same as you would any multi-dimensional array of data (i.e. subsetting, plotting, etc.). Note that the dimensions are reported in reverse order from which you created the variables.

dim(x)
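
As a minimal sketch of treating the extracted variable like any other array (note that, because of the reversed dimension order, dim(x) here is 20 x 5 x 10 rather than 10 x 5 x 20):

# Take a 2D slice (all of dim3 and dim2 at the first dim1 index) and plot it
slice <- x[, , 1]
image(slice, main = "x at the first dim1 index")

# Basic subsetting and summaries work the same way
mean(x[1:10, , 1])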

 Close the netCDF file

When you’re done, close the netCDF file:

nc_close(nc)

And there you have it! Hopefully this step-by-step tutorial has helped you incorporate netCDF support into your project. The code I described here is available through bitbucket.

Happy computing!

~Greg