ggplot (Part 4) – Animated Geospatial Maps

Most of the time, we deal with complex systems that have several components, each with different properties and behaviors. These properties usually vary in time and space, and their spatiotemporal dynamics play a key role in our understanding of system performance and vulnerabilities. Aggregating in time and space can therefore hide variations that matter for our analysis and understanding of system behavior. In this blog post, I am going to show how we can add bar plots onto a map at different locations for better visualization of variables. This type of visualization is useful, for example, for irrigation districts that have different water rights or cropping patterns that affect their crop production and water requirements in dry or wet years, or for sensitivity indices of model parameters that are clustered by location and vary with the wetness or dryness of the system. To demonstrate this, I am going to create an animated graph that shows annual crop yield variations in different counties in the State of Washington. You can download the Washington counties’ data layer (WA_County_Boundaries-shp.zip) and NASS historical yield data for corn, winter wheat, and spring wheat (NASS.txt) from here.

library(rgdal)  
shp_counties <- readOGR(dsn = file.path("(your directory)/WA_County_Boundaries-shp/WA_County_Boundaries.shp"))

“shp_counties” is a SpatialPolygonsDataFrame object and has a different format than a regular dataframe. It has several slots that contain different types of information; for example, the “data” slot holds the non-geographic attributes. We can explore the information for each polygon with:

head(shp_counties@data)
#The "JURISDIC_2" and "JURISDIC_3" columns both contain the names of counties.

To visualize it with ggplot, it first has to be converted into a dataframe. Then, we can use it with the geom_polygon() function:

library(ggplot2)
counties <- fortify(shp_counties, region="JURISDIC_2")
head(counties)
#long     lat order  hole piece    id   group
#1 -13131351 5984710     1 FALSE     1 Adams Adams.1
#2 -13131340 5983102     2 FALSE     1 Adams Adams.1
ggplot() +  geom_polygon(data = counties, aes(x = long, y = lat, group = group), colour = "dark red", fill = NA)

Now, I am going to extract the center of each polygon so that I can later add the bar plots to these coordinates:

library(rgeos)
counties_centroids<- as.data.frame(gCentroid(shp_counties, byid=T))

We are going to extract just a few years of data from the yield dataset to show on the map:

library(data.table)
yield<- fread("(your directory)/NASS.txt")
#   Year  Ag_District     Ag_District_Code  Data_Item  Value  County  JURISDIC_5 Group
#1: 1947 EAST CENTRAL               50     CORN, GRAIN  41.0   Adams    53001   CORN
#2: 1948 EAST CENTRAL               50     CORN, GRAIN  40.0   Adams    53001   CORN
years<- as.data.frame(unique(yield$Year))
years<- as.data.frame(years[34:42,])

For the final map, I am going to need a common legend for all of the bar plots in all of the counties and years in which I am interested. So, I need to know all of the categories that are available:

unique(yield$Group)
unique(yield$Data_Item)

Based on the “Group” column, we know that there are three main groups of crops: corn, winter wheat, and spring wheat. Based on the “Data_Item” column, we know that there are three different types of winter and spring wheat (non-irrigated, non-irrigated with continuous cropping (CC), and non-irrigated with summer fallow (SF)). Note that not all of these crop types exist for every county and year, so the common legend should be created from a location where all the crop types are available; that way it applies to all of the counties and years that I am eventually going to plot. For this reason, I am going to subset the dataset and create a bar plot for one county and year in which all of the crop types are available:

sample_yield<- subset(yield, Year=="1981" & County=="Adams")

I want to use custom colors so that each crop Group has a different color and each Data_Item within a Group shares that Group’s color theme, gradually changing from darker to brighter. First, I create three color functions, one for each of the three Groups in my dataset.

colfunc1 <- colorRampPalette(c("darkred"))
colfunc2 <- colorRampPalette(c("darkblue", "lightblue"))
colfunc3 <- colorRampPalette(c("darkgreen", "lightgreen"))

Now I am going to use these functions to create a color code for each Group and save it in “colors”:

colors<- c(colfunc1(nrow(subset(sample_yield, sample_yield$Group=="CORN"))),
           colfunc2(nrow(subset(sample_yield, sample_yield$Group=="SW"))),
           colfunc3(nrow(subset(sample_yield, sample_yield$Group=="WW"))))

Next, in the template bar plot, I can use the customized “colors” in the scale_fill_manual() function:

ggplot(sample_yield,aes(x=Group,y=Value,fill=factor(Data_Item)))+
  scale_fill_manual(values=colors)+
  geom_bar(stat='identity', color = "black")

From this plot, we just need to extract the legend. Also, we need to change its font size since we are going to add it to another plot. You should try different sizes to find out which one is more appropriate for your dataset/figure. I am saving this plot in “tmp” while I remove the legend title and change the font size. Then, I extract the legend section by using the get_legend() function and convert it to a ggplot by using the as_ggplot() function.

tmp <- ggplot(sample_yield,aes(x=Group,y=Value,fill=factor(Data_Item)))+
  scale_fill_manual(values=colors)+
  geom_bar(stat='identity', color = "black")+theme(legend.title = element_blank(),legend.text =element_text(size=19,face="bold",colour="black") )
library(cowplot)
legend<- get_legend(tmp)
library(ggpubr)
tmp_legend<- as_ggplot(legend)

Now, I am going to save the counties map with the appropriate font size and title and add the extracted legend from the yield data to it with the geom_subview() function. You need to adjust the coordinates at which the legend is shown on the map based on your data.

tmp_map <- ggplot(counties)+
  geom_polygon(aes(long, lat, group=group), fill="white")+
  geom_path(color="black", mapping=aes(long, lat, group=group), size=1.5)+
  theme(axis.text.y=element_text(size=17,face="bold",colour="black"),
        axis.text.x=element_text(size=17,face="bold",colour="black"),
        axis.title.y =element_text(size=17,face="bold",colour="black"),
        axis.title.x =element_text(size=17,face="bold",colour="black"),
        plot.title =element_text(size=25,face="bold",colour="black"))+
  labs(title = "NASS Yield Data (1980-1995)")
library(ggimage)
tmp_map_final<- tmp_map+ geom_subview(x=-13830000, y=5730000, subview=tmp_legend)

In the section below, I am going to show how we can loop through all of the years and then all of the counties in the county map, use the center coordinate of each county (polygon), build the corresponding bar plot, and print it at the center of that polygon. We can use the ggplotGrob() function to extract different pieces of a bar plot created with ggplot. With this function, we can treat each part of the bar plot (background, axis, labels, main plot panel, etc.) as a graphical object and move it to the coordinates of interest. For example, if we just want to use the main panel and not any other components, we can extract the panel and adjust the other components as we wish for the final graph.
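As a quick standalone illustration of that idea (a minimal sketch, not part of the final loop; it only reuses the “sample_yield” test data from above), the gtable object returned by ggplotGrob() can be filtered so that only the main panel is kept:

library(gtable)
library(grid)
g_full <- ggplotGrob(
  ggplot(sample_yield, aes(x = Group, y = Value, fill = factor(Data_Item))) +
    geom_bar(stat = 'identity')
)
g_panel <- gtable_filter(g_full, "panel")  # keep only the main plotting panel grob
grid.newpage()
grid.draw(g_panel)                         # draw the extracted panel on its own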

In this example, I add the bar plots for all counties in one year as graphical objects to a list, “barplot_list”. Then, using the annotation_custom() function, I add each item in “barplot_list” at the coordinates of the polygon (county) centers that I already extracted. Note that the order of the center coordinates and the order of the plots in “barplot_list” are the same.

At the end, I combine the base map with the customized legend (“tmp_map_final”) and the list of all bar plots with their customized locations (“barplots_list”). Then, I add the result to a list (“all_list”) and repeat this process for every year. The last step is to save this list with saveGIF() to create an animated gif. Note that we can use the same procedure but replace the bar plots with other types of plots, such as pie charts.

counties_list<- as.data.frame(unique(counties$id))  #list of counties  
all_list<- list()
for (y in 1:nrow(years)){   #loop through all of the years
nass_oneyear<- subset(yield,yield$Year==years[y,])   #extract one year  
barplot_list <- 
  ##create bar plot for each county
  lapply(1:length(shp_counties$JURISDIC_2), function(i) { 
    #extract one county  
    nass_oneyear_onecounty<- subset(nass_oneyear,nass_oneyear$County==counties_list[i,])
    # As I mentioned, for each county and year the number of crop types might be different. So, I need to customize the color for each sub-dataset using the manual color ramp that I previously defined for each item.
    nass_oneyear_onecounty$itemcolor<- "NA"
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="CORN, GRAIN"]<- colors[1]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED"]<- colors[2]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED, CC"]<- colors[3]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="SW, NON-IRRIGATED, SF"]<- colors[4]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED"]<- colors[5]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED, CC"]<- colors[6]
    nass_oneyear_onecounty$itemcolor[nass_oneyear_onecounty$Data_Item=="WW, NON-IRRIGATED, SF"]<- colors[7]
    plots_comp <- ggplotGrob(
      ggplot(nass_oneyear_onecounty,aes(x=Group,y=Value,group=(itemcolor),fill=factor(Data_Item)))+
        scale_fill_manual(values=nass_oneyear_onecounty$itemcolor)+
        geom_bar(stat='identity', color = "black") +
        labs(x = NULL, y = NULL) + 
        theme(legend.position = "none", rect = element_blank(), axis.title.x = element_blank(), 
              axis.title.y = element_blank(),
              axis.text.x= element_blank(),
              axis.ticks = element_blank(),
              axis.text.y=element_text(size=14,face="bold",colour="black")) + coord_cartesian(expand=FALSE) 
    )})
barplots_list <- lapply(1:length(shp_counties), function(i) 
  annotation_custom(barplot_list[[i]], xmin = counties_centroids[i,1]- 28000, ymin = counties_centroids[i,2]- 28000, xmax =  counties_centroids[i,1]+ 28000, ymax = counties_centroids[i,2]+ 28000))
# xmin, ymin, xmax and ymax can be used to adjust the horizontal and vertical position and extent of the bar plots
all_list[[y]] <- list(tmp_map_final + barplots_list)
}

library(animation)
saveGIF(
  {lapply(all_list, print)}
  , "(your directory)/final.gif", interval = 2, ani.width = 1600, ani.height = 1200)

If we want to have labels for our bar plots (instead of reading yield values off the y-axis), we may want to show the yield value corresponding to each item in a group. In this case, I can use the command lines below within the loop, before the graphical object of the ggplot is created (before plots_comp <- ggplotGrob(….)), to add a label column that holds the cumulative yield value in each group.

library(dplyr)
nass_oneyear_onecounty <- nass_oneyear_onecounty %>%
  group_by(Group) %>%
  mutate(label_y = cumsum(Value)-10)  # I subtracted 10 from these cumulative values just to print them inside the bar plot sections.

Or we can just have one label that shows the total yield in each group:

nass_oneyear_onecounty <- nass_oneyear_onecounty %>%
  group_by(Group) %>%
  mutate(total = sum(Value))

Then we can add these labels to the bar plots by adding the geom_text() function in the ggplot section within ggplotGrob(), and specifying the column of interest: 

geom_text(aes(y = label_y, label = round(Value)),colour = "white",size=3,fontface='bold')

Instead of manually adjusting the position of the label (as in the first example), “vjust” can be added to geom_text() to modify the text alignment:

geom_text(aes(y = total, label = round(total)),colour = "black",size=3,fontface='bold',vjust = -0.35)  

ggplot (Part 3) – Animating Sensitivity Analysis Indices

For part three of this introduction to ggplot, I will go over an example of a user-friendly library that can easily animate your plot: “gganimate”. It works with different types of graphs; I will apply it to a bar plot in order to visualize the variations in sensitivity indices. This is helpful for time-varying sensitivity analysis, which prevents information loss and gives more insight into system behavior than analyzing only the aggregated sensitivity indices. You can download the test-case data from here. These are first and total order sensitivity indices corresponding to annual crop yield for several parameters, categorized into different groups (such as parameters related to estimating crop phenology and biomass, or parameters related to capturing the effect of temperature or soil hydrological variables on yield). We can create a subset of just one year of data to explore, using the general command line to make a bar plot based on different groups:

library(ggplot2)
library(gganimate)

annual_S1T<- read.csv("(your directory)/testCase_ST_S1.csv",sep="\t")
head(annual_S1T)
#   S1_sobol   ST_sobol Year Parameters Group_orig Group
#1 0.007892755 0.01102304 1990         a1  Hydrology     A
#2 0.019468069 0.02543356 1998         a1  Hydrology     A
#3 0.004058817 0.00962275 1999         a1  Hydrology     A

one_year<- subset(annual_S1T,annual_S1T$Year==2000)
g1<- ggplot(one_year, aes(x=Parameters, y=ST_sobol,fill=Group))+
  geom_bar(stat = "identity", position=position_dodge()) +
  theme(panel.background = element_blank(), axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
        panel.grid.major.y = element_blank(),
        axis.title=element_text(size=12,face="bold"),
        plot.title = element_text(size=16),plot.subtitle=element_text(size=18, hjust=0.5,  color="black"),
        axis.text.y=element_text(size = 12,colour = "black"),
        axis.text.x=element_text(size=12,colour = "black"),
    legend.text=element_text(size=18),legend.title=element_text(size=20),legend.key.height=unit(3,"line"))+
  labs(x ="Parameters",y="Total Order Indices",fill="Group")

We can flip Cartesian coordinates so that the horizontal becomes vertical and the vertical becomes horizontal, by adding coord_flip(). Next, we will use the whole dataset to visualize the annual variations in total order indices. By adding transition_time() and specifying the column corresponding to the variations (in our case, “Year”), we can animate our graph. We can also add a label for each time frame in the “labs/title” section.
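As a quick aside, the coordinate flip mentioned above can be tried directly on the single-year plot before animating anything:

g1 + coord_flip()   # same bar plot, with Parameters on the vertical axis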

g2<- ggplot(annual_S1T, aes(x=Parameters, y=ST_sobol,fill=Group)) + 
  geom_bar(stat = "identity", position=position_dodge())  + 
  theme(panel.background = element_blank(), axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
        panel.grid.major.y = element_blank(),
        axis.title=element_text(size=12,face="bold"),
        plot.title = element_text(size=16),plot.subtitle=element_text(size=18, hjust=0.5,  color="black"),
        axis.text.y=element_text(size = 12,colour = "black"),
        axis.text.x=element_text(size=12,colour = "black",angle = 90),
        legend.text=element_text(size=18),legend.title=element_text(size=20),legend.key.height=unit(2,"line"))+
  labs(x ="Parameters",y="Total Order Indices",fill="Group",title = 'Year: {frame_time}')+
  transition_time(Year) 

With the animate() function, we can specify how long each frame is shown in the animated graph and how long to pause between repetitions of the time series. The result can then be written to a gif file with the anim_save() function.
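Here is a minimal sketch of that step (the frame and timing settings are arbitrary choices, and the gif name is just a placeholder):

anim <- animate(g2, nframes = 100, fps = 5, end_pause = 10)   # render the animation
anim_save("(your directory)/total_order_indices.gif", animation = anim)   # write it to a gif file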

It is also possible to compare the three different sensitivity indices. Total order and first order indices are provided in the dataset; by subtracting the first order indices from the total order, we can calculate the total interactions each parameter has with the rest.

annual_S1T$Total_Interaction<- annual_S1T$ST_sobol-annual_S1T$S1_sobol

One of the best ways to provide data to ggplot functions is the long (“melted”) format, which is in fact the format that ggplot prefers. We can reshape the data frame (using the reshape2 library) so that a single ggplot command can produce several layout panels or different types of graphs. The melt() function performs this task, reshaping the data frame so that all the index variables appear in rows instead of separate columns, together with the corresponding year, group, and parameter labels. To use the melt function, we need to specify the columns that identify the data values, which is done with the id.vars argument. More information about the reshape2 package can be found here.

library(reshape2)
annual_S1T_melted<- melt(annual_S1T, id.vars = c("Parameters","Group","Year"), measure.vars = c("Total_Interaction", "S1_sobol","ST_sobol"))
head(annual_S1T_melted)
#  Parameters Group Year          variable       value
#1         a1     A 1990 Total_Interaction 0.003130285
#2         a1     A 1998 Total_Interaction 0.005965492
#3         a1     A 1999 Total_Interaction 0.005563933
#4         a1     A 1991 Total_Interaction 0.007473186
#5         a1     A 1995 Total_Interaction 0.006950200
#6         a1     A 1992 Total_Interaction 0.001914845

Now, we can create the same graph that we saw above for three variables. In this case, we can use facet_grid().

ggplot(annual_S1T_melted, aes(x=Parameters, y=value,fill=Group)) + facet_grid(~variable)+
geom_bar(stat = "identity", position=position_dodge())  +
  coord_flip() +
  theme(panel.background = element_blank(), axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
        panel.grid.major.y = element_blank(),
        axis.title=element_text(size=12,face="bold"),
        plot.title = element_text(size=16),plot.subtitle=element_text(size=18, hjust=0.5,  color="black"),
        axis.text.y=element_text(size = 12,colour = "black"),
        axis.text.x=element_text(size=12,colour = "black",angle = 90),
        legend.text=element_text(size=18),legend.title=element_text(size=20),legend.key.height=unit(2,"line"))+
  labs(x ="Parameters",y="Total Order Indices",fill="Group",title = 'Year: {frame_time}')+
  transition_time(Year) 

Spatial statistics (Part 3): Geographically Weighted (GW) models

Geographically weighted (GW) models are useful when there is non-stationarity across a spatial region; in that case, global models cannot represent the local variations across the region. Instead, locally weighted regression coefficients, based on a specified distance, can be used to adjust the global values. In this blog post, I am going to introduce the R package “GWmodel”, which handles this procedure and offers further functionality, such as principal components analysis, that can be used as an exploratory tool for evaluating spatial heterogeneity in data. It also provides some summary statistics that I will cover in this post.

The spatial weighting function is the most important part of GW modeling, as it defines the spatial dependency between the target data. We can define a matrix with the same dimensions as the target data to indicate the geographical weighting of each data point at each location. Users have to specify the type of distance, the kernel function, and the bandwidth to build this matrix. We can consider different methods of distance calculation (Euclidean, Manhattan, great-circle, or generalized Minkowski distance) and commonly used kernel functions (Gaussian, exponential, box-car, bi-square, tri-cube).

Gaussian and exponential kernels are continuous functions of the distance between two observation points, while box-car, bi-square, and tri-cube kernels are discontinuous, meaning that observations farther than the specified distance (bandwidth) are excluded entirely. For both types of kernel, the bandwidth can be specified either as a fixed distance or as a fixed (adaptive) number of local data points; for the continuous kernels, however, every observation keeps a non-zero weight, so the actual local sample size equals the full sample size.
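As a minimal sketch of how such a weighting matrix can be built by hand (assuming the GWmodel helper functions gw.dist() and gw.weight(); the coordinates below are made up purely for illustration):

library(GWmodel)
coords <- cbind(runif(10, -120, -117), runif(10, 46, 48))   # 10 hypothetical point locations
dist_mat <- gw.dist(dp.locat = coords)                      # pairwise Euclidean distances
w_mat <- gw.weight(dist_mat, bw = 5, kernel = "bisquare", adaptive = TRUE)  # adaptive bandwidth of 5 nearest neighbors
dim(w_mat)   # one row of weights per location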

We can examine the potential local relationships between the variables by applying the summary statistics function gwss(), which provides the GW mean, standard deviation, a measure of skew, and Pearson’s correlation coefficient at any location. In addition to these basic summaries, we can compute robust statistics in which the effects of outliers on the local statistics are reduced. The robust statistics include GW medians, inter-quartile ranges, and quantile imbalances. Local bivariate summary statistics are also available, with Pearson’s and Spearman’s correlations as the basic and robust forms, respectively. I am going to use this function to explore some statistics. The sample data, which is similar to the data in my previous blog post, is available here. First, we need to convert it to spatial data with coordinates by specifying the latitude and longitude columns:

library("sp")
data<- read.table("---your path---/data_new.csv",header = T)
sample_dataset <- SpatialPointsDataFrame(data[, 1:2], data)

Now, I am going to calculate the summary statistics for a few variables by considering three kernels. For this function we need to specify the bandwidth. Ideally this should be estimated by applying cross-validation across a range of bandwidths to reach the most accurate predictions.

library("GWmodel")
sample_dataset_bx <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"), kernel = "boxcar", adaptive = TRUE, bw = 300, quantile = TRUE)
sample_dataset_bs <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"),  kernel = "bisquare", adaptive = TRUE, bw = 300, quantile = TRUE)
sample_dataset_gu <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"),  kernel = "gaussian", adaptive = TRUE, bw = 300, quantile = TRUE)

As an example, we can compare the basic measures of the local variability in yield based on the three kernels.

library("RColorBrewer")
spplot(sample_dataset_bx$SDF, "WW_Yield_LSD", key.space = "right",col.regions = brewer.pal(8, "Set1") ,cuts = c(345,525,705,885,1065,1245,1425,1605,1770),  main = "GW Standard Deviations for Yield (boxcar)")

Or we can plot the basic local correlation between yield and soil water profile using a box-car kernel. The graphs below present the same concept for all three kernels.

mypalette= c("#FFFFCC","#C7E9B4","#7FCDBB","#41B6C4","#1D91C0","#225EA8","#0C2C84")
spplot(sample_dataset_bx$SDF, "Corr_WW_Yield.Soil_water", key.space = "right",col.regions = mypalette, cuts=c(-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),
        main = "GW correlations: WW Yield and Soil-water Profile (boxcar)")

As we see in the standard deviation graphs, yield appears highly variable. It looks like the bi-square kernel with a bandwidth of 300 (~26% of the data) is more efficient than the other two kernels, and the relationship between yield and the soil water profile is non-stationary.

I am going to compare the robust GW correlations between yield and the soil water profile using a bi-square kernel with the basic ones that we just created, using a different color scheme for better visualization:

spplot(sample_dataset_bs$SDF, "Corr_WW_Yield.Soil_water", key.space = "right", cuts=c(-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),
       main = "GW correlations: WW Yield and Soil-water Profile (bisquare_basic)")
spplot(sample_dataset_bs$SDF, "Spearman_rho_WW_Yield.Soil_water",key.space = "right",cuts=c(-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),
        main = "GW correlations: WW Yield and Soil-water Profile (bisquare_robust)")

Principal components analysis (PCA) is another type of analysis that we can apply to our multivariate data to find linear combinations of variables that reveal the main sources of variation. The “GWmodel” package provides the functionality to account for spatial heterogeneity in PCA. Below is an example of the command lines for basic and robust PCA. Before that, we need to standardize our independent variables by re-scaling them to a similar magnitude, so that all variables have equal importance in the analysis.

scaled_dataset <- scale(as.matrix(sample_dataset@data[,3:10]))
#basic
pca_basic <- princomp(scaled_dataset, cor = F)
(pca_basic$sdev^2 / sum(pca_basic$sdev^2))*100    #percentage of total variance’ (PTV)
Comp.1       Comp.2       Comp.3       Comp.4       Comp.5       Comp.6 
5.956296e+01 1.891630e+01 1.202148e+01 6.965907e+00 2.219094e+00 3.114074e-01 
Comp.7       Comp.8 
2.586703e-03 2.653528e-04 

#robust
library("robustbase")
pca_robust <- covMcd(scaled_dataset, cor = F)
pca.robust <- princomp(scaled_dataset, covmat = pca_robust$cov, cor = F)
(pca.robust$sdev^2 / sum(pca.robust$sdev^2))*100   #percentage of total variance’ (PTV)
pca.robust$loadings

   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8 
42.758379 32.184879 11.967754  5.399829  4.168437  1.731579  1.190092  0.599050 

With bw.gwpca() we can automatically select the bandwidth for the GW PCA analysis. The function uses a cross-validation approach to find the optimal bandwidth. However, we need to decide the number of components (k) to include in the analysis. We also need to convert our scaled dataset to a spatial object.

Coords <- as.matrix(cbind(sample_dataset$LON, sample_dataset$LAT))
scaled_dataset.spdf <- SpatialPointsDataFrame(Coords,as.data.frame(scaled_dataset))
bw.gwpca.basic <- bw.gwpca(scaled_dataset.spdf,vars = colnames(scaled_dataset.spdf@data), k = 4, robust = FALSE,adaptive = TRUE)
#bw.gwpca.basic = 986
bw.gwpca.robust <- bw.gwpca(scaled_dataset.spdf,vars=colnames(scaled_dataset.spdf@data), k = 4, robust = TRUE, adaptive = TRUE)
#bw.gwpca.robust = 767

Once the bandwidth is estimated, we can use gwpca()  to calibrate the basic and robust GW PCA fits. It should be noted that we use all of the components in the fitted model at this step.

gwpca.basic <- gwpca(scaled_dataset.spdf,vars = colnames(scaled_dataset.spdf@data), bw = bw.gwpca.basic, k = 8,robust = FALSE, adaptive = TRUE)
gwpca.robust <- gwpca(scaled_dataset.spdf,vars = colnames(scaled_dataset.spdf@data), bw = bw.gwpca.robust, k = 8,robust = TRUE, adaptive = TRUE)

Now, as an example, we can visualize how data dimensionality varies spatially for the first two components, by extracting the sum of total variance (%) or PTV at each location.

var_pca_basic <- (rowSums(gwpca.basic$var[, 1:2])/rowSums(gwpca.basic$var))*100
sample_dataset$var_pca_basic <- var_pca_basic
var_pca_robust <- (rowSums(gwpca.robust$var[, 1:2])/rowSums(gwpca.robust$var))*100
sample_dataset$var_pca_robust <- var_pca_robust
spplot(sample_dataset, "var_pca_basic", key.space = "right",col.regions = brewer.pal(8, "YlGnBu"), cuts=8, main = "PTV for local components 1 to 2 (basic)")
spplot(sample_dataset, "var_pca_robust", key.space = "right",col.regions = brewer.pal(8, "YlGnBu"), cuts=8, main = "PTV for local components 1 to 2 (robust)")

The differences between these two plots show the effect of local multivariate outliers.

References:

Gollini, I., Lu, B., Charlton, M., Brunsdon, C., Harris, P., 2015. GWmodel: An R Package for Exploring Spatial Heterogeneity Using Geographically Weighted Models. Journal of Statistical Software 63, 1–50. https://doi.org/10.18637/jss.v063.i17

Lu, B., Harris, P., Charlton, M., Brunsdon, C., Nakaya, T., Murakami, D., Gollini, I., 2020. GWmodel: Geographically-Weighted Models.

Spatial statistics (Part 2): Spatial Regression Models

Regression is one of the main techniques of data analysis. A regression model that can incorporate spatial dependency in a dependent variable is called a spatial regression model. It can be used as a simple surrogate model for prediction when data are not available for some locations, or for understanding the factors behind observed patterns. In this blog post, I am going to create a simple regression model for crop yield, check the residuals for signs of relationships with nearby areas, and try to remove the potential spatial dependencies in the residuals by applying a spatial regression model. Autocorrelation in the residuals is a sign that the underlying process being studied varies systematically across the study area; in this situation, the resulting estimates of a fitted model are biased. Spatial regression models have applications in different fields such as agriculture (e.g., farm management, policy issues), natural sciences (e.g., species patterns), public health (e.g., air pollution), and social sciences (e.g., population forecasts). The datasets that I am going to use (ww.* and WW_ave_hist.txt) are available here. The .txt file includes historical winter wheat yield for some locations (4*4 km grid cells) with distinct IDs, and the “ww” shapefile (which includes 6 files) has some information for each location based on its ID as well. We will merge these two files, apply linear regression, and check whether we can use some explanatory variables from our data (predictors) to explain the variation in yield (the dependent variable) across the region. We assume that we can predict yield from annual potential evapotranspiration, precipitation, and available water in the soil profile.

library("rgdal")
setwd("---your path--- ")
Annual_var<- readOGR(".","ww") # This is a SpatialPolygonsDataFrame object that brings the spatial representation of the polygons together with the data.
yield_ww<- read.table("---your path---/ww_ave_hist.txt",header = T)
yield<- merge(Annual_var,yield_ww,by="ID")
names (yield)
# fit the linear model
lm_yield <- lm(ww_ave_yield ~  ET_pot  +  precipitat +soil_water, data=yield) 
summary(lm_yield)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5058.6326   291.9022   17.33   <2e-16 ***
ET_pot        -5.0868     0.2147  -23.70   <2e-16 ***
precipitat    12.8215     0.1626   78.86   <2e-16 ***
soil_water    -5.9440     0.5295  -11.23   <2e-16 ***
Adjusted R-squared:  0.9035

After fitting a regression and checking that all of the coefficients are statistically significant, one should check the residuals to make sure they are independent. If there is any correlation, it means that our regression’s coefficient estimates could be wrong and cannot be assumed constant across the study area; the fitted model is then not a good representation of the dependent variable. From our fitted model, we can extract the residuals:

yield$residuals_lm <- residuals(lm_yield)

Here, we look at the spatial patterning in the distribution of the residuals:

library("magrittr")
library("ggplot2")
yield %>% as.data.frame %>% 
  ggplot(aes(LON, LAT)) + geom_tile(aes(fill=residuals_lm), alpha=3/4) + 
  ggtitle("") + coord_equal() + theme_bw()+scale_fill_gradientn(colours=c("black","yellow","red"),breaks=seq(-1660,1360,400)) 

Besides visual inspection of the residuals, a more formal test is required to decide whether spatial autocorrelation is present. In the first part of this series, we used linkages based on physical distance to examine spatial autocorrelation. As before, we are going to create a list of neighbors using the Queen criterion. In my previous blog post, we calculated the local Moran’s I test statistic for the actual data. Here, we want to apply it to the residuals, so we need a function that takes into account that the variable under consideration is a residual of a regression. Also, in this example, we look at the global rather than the local Moran’s I, which is based on both feature locations and feature values simultaneously.

library("spdep")
neighbor <- poly2nb(yield)
lw <- nb2listw(neighbor)
lm.morantest(lm_yield, lw) 
# 0.7908136752

The result shows a statistically significant value for Moran’s I. We can also use the “neighbor” object to get the average residual value of the neighbors of each location (grid cell), look at the correlation between these and the residuals, or create a scatter plot for visual inspection.

mean_function <- sapply(neighbor, function(x) mean(yield$residuals_lm[x]))
cor(yield$residuals_lm, mean_function) 
# 0.8847948
plot(yield$residuals_lm, mean_function, xlab='Residuals', ylab='Mean adjacent residuals')

We clearly see that the spatial dependencies in the residuals are significant. This means that, if we use this model, the predicted values will be systematically underestimated or overestimated. Therefore, we need to use spatial autoregressive models that account for spatial dependencies. Two models are commonly used as spatial regression models: the spatially lagged model and the spatial error model. The first uses a spatial lag variable that averages the neighboring values for each location and accounts for autocorrelation using a weights matrix. In the second, the spatial dependence is handled through the errors rather than through the systematic component of the model. To decide whether to fit the spatial error or the spatially lagged model, the Lagrange Multiplier (LM) test is used. The R function lm.LMtests() performs this test by considering these statistics: the LM test for error dependence (LMerr) and for the spatially lagged dependent variable (LMlag), as well as their robust forms (RLMerr and RLMlag; e.g., RLMerr examines spatially autocorrelated residuals in the possible presence of an omitted lagged dependent variable).

 lm.LMtests(lm_yield, lw, test = c("LMerr","LMlag","RLMerr","RLMlag"))
#LMerr = 2142.1, df = 1, p-value < 2.2e-16
#LMlag = 1025, df = 1, p-value < 2.2e-16
#RLMerr = 1121.8, df = 1, p-value < 2.2e-16
#RLMlag = 4.6952, df = 1, p-value = 0.03025

Since both LMerr and LMlag have significant p-values, we compare the p-values of the robust forms RLMerr and RLMlag. In doing so, we see that RLMerr is far more strongly significant than RLMlag. Therefore, the LM test suggests that we should run a spatial error model.

fit_err <- errorsarlm(ww_ave_yield ~  ET_pot  +  precipitat +soil_water, data=yield, lw)
# lagsarlm() is a function that creates a  spatial lag model
summary(fit_err)
#Lambda: 0.93674, LR test value: 1912.7, p-value: < 2.22e-16
#AIC: 15166, (AIC for lm: 17077)

The results show that the likelihood ratio (LR) test is highly significant (p-value < 2.22e-16), further evidence that the spatial error model is a good fit. Also, the Akaike Information Criterion (AIC, an estimate of out-of-sample prediction error and therefore of the relative quality of statistical models for a given dataset) of this new model is 15166, a better fit than the original linear model with no spatial error dependencies (AIC of 17077).

We can now check the residuals of this new model. In addition, we can take a look at the Moran’s I statistic one more time. Note that we previously used a Moran’s I test for spatial autocorrelation in the residuals of an estimated linear model. Now we no longer have a linear model, so we can use a permutation test for the Moran’s I statistic: the moran.mc() function uses random permutations of x for the given spatial weighting scheme. The residuals graph and the Moran’s I statistic both show that there is no correlation left in the residuals:

yield$residuals_error_model <- residuals(fit_err)
mean_function_error_model <- sapply(neighbor, function(x) mean(yield$residuals_error_model[x]))
cor(yield$residuals_error_model, mean_function_error_model)
plot(yield$residuals_error_model, mean_function_error_model, xlab='Residuals', ylab='Mean adjacent residuals')
moran.mc(yield$residuals_error_model, lw, 1000)   
# 1000 is the number of permutations

If we had used the other (spatially lagged) model instead, the residuals would still show some correlation:
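Here is a sketch of that comparison, reusing the objects created above and the lagsarlm() function mentioned earlier:

fit_lag <- lagsarlm(ww_ave_yield ~ ET_pot + precipitat + soil_water, data=yield, lw)   # spatially lagged model
yield$residuals_lag_model <- residuals(fit_lag)
mean_function_lag <- sapply(neighbor, function(x) mean(yield$residuals_lag_model[x]))
cor(yield$residuals_lag_model, mean_function_lag)   # compare with the error-model correlation above
plot(yield$residuals_lag_model, mean_function_lag, xlab='Residuals', ylab='Mean adjacent residuals')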

References

Srinivasan, S., 2008. Spatial Regression Models, in: Shekhar, S., Xiong, H. (Eds.), Encyclopedia of GIS. Springer US, Boston, MA, pp. 1102–1105. https://doi.org/10.1007/978-0-387-35973-1_1294

https://rspatial.org/raster/analysis/index.html

Spatial statistics (Part 1): Spatial Autocorrelation

Exploratory spatial data analysis and statistics help us interpret maps more efficiently by finding trends, enabling pattern mining in space and time, identifying spatial outliers, etc. This information beyond the maps helps us understand the characteristics of places, phenomena, and the relationships between them, and can therefore be used for prediction and decision-making.

For analyzing spatial data, many tools and libraries are available, such as ArcGIS, GDAL, etc., that utilize raster and vector geospatial data formats. In my next blog posts, I am going to introduce some types of spatial statistics using different libraries in R. In this blog post, I am going to explore the autocorrelation in the county-level 10-year average (2006-2016) of total potential evapotranspiration (ET) from March to the end of August. These data are aggregated from 4*4 km grid cells for each county across the US and can be downloaded from here. Applying spatial autocorrelation explores whether the potential ET values of counties near each other are more similar; in other words, it measures how distance affects our variable of interest. The existence of autocorrelation in data might lead to incorrect statistical inferences for some spatial statistical analyses, so it is important to test for spatial autocorrelation.

First, we are going to inspect the distribution of the data. To read the spatial data (shapefile), we need to use the “rgdal” library. Set your working directory to the path where you downloaded the data.

install.packages("rgdal")
library("rgdal")
setwd("---your path---")
ET<- readOGR(".","US_ETp_County")

We can see the names of the columns with names(ET), and the coordinate system of the data with crs(ET). To plot the spatial objects, I am using spplot() from the “sp” library, which provides specialized plot methods for spatial data with attributes. The first argument is an object of a Spatial class, which has coordinates, and the second argument, “zcol”, is the attribute name or column number in the attribute table associated with the data.

library("sp")
spplot(ET, zcol='MEAN_ETp_n',col.regions = topo.colors(20), main="Average Potential ET during March-August (mm)") 

We can also use quantiles in order to better distinguish the distribution of data and potentially dilute the effect of outliers. To do this we need another library:

library("classInt")
breaks_quantile <- classIntervals(ET$MEAN_ETp_n, n = 10, style = "quantile")
breaks <- breaks_quantile$brks

ET$MEAN_ETp_qun<- cut(ET$MEAN_ETp_n, breaks)
p2<- spplot(ET, "MEAN_ETp_qun", col.regions = topo.colors(20), main = "Average Potential ET during March-August (mm)_Quantile")

As shown in both maps, the counties in the southern half of the US have higher potential ET during March-August. To explore the spatial autocorrelation we need to define the neighbors of counties.

install.packages("spdep ")
library(“spdep”)
neighbor<- poly2nb(ET,queen = FALSE)
#With “queen” option we define if a single shared boundary point meets the contiguity condition (queen =TRUE), or more than one shared point is required (queen =FALSE).
plot(ET, border = 'lightgrey')
plot(neighbor, coordinates(ET), add=TRUE, col='red')

Now, we are going to look at local spatial autocorrelation, which explores spatial clustering across the US. First, we need to convert the neighbor data to a listw object. The nb2listw() function supplements a neighbors list with spatial weights for the chosen coding scheme. Here, I chose “W”, which is row-standardized (sums over all links to n).

lw <- nb2listw(neighbor, style = "W")

To get the autocorrelations, a Moran’s test is used. The local Moran’s index estimates a correlation for each county based on the spatial weights object used. It is a statistical way to identify potential local clusters and spatial outliers. A positive index indicates that a feature has neighboring features with similarly high or low attribute values and can be part of a cluster; a negative value indicates an outlier feature.

local_moran <- localmoran(x = ET$MEAN_ETp_n, listw = nb2listw(neighbor, style = "W"))

From this, we get several statistical measures: Ii, the local Moran statistic; E.Ii, its expectation; Var.Ii, its variance; Z.Ii, its standard deviate; and Pr(), the p-value of the local Moran statistic. Now we can add these results to our shapefile (ET) and plot the local Moran statistic. I am also going to add the US_States boundary to the map.

moran_local_map <- cbind(ET, local_moran)
States<- readOGR(".","US_States")
spplot(moran_local_map, zcol='Ii',col.regions = topo.colors(20),main="Local Moran's I statistic for Potential ET (March-August)", sp.layout =list("sp.polygons",States,lwd=3, first = FALSE))

Again, we can present this data with quantiles:

breaks_quantile_new  <- classIntervals(moran_local_map$Ii, n = 10, style = "quantile")
breaks_new  <- breaks_quantile_new$brks 
moran_local_map$Ii_qun <- cut(moran_local_map$Ii, breaks_new)
spplot(moran_local_map, zcol='Ii_qun',col.regions = topo.colors(20),main="Local Moran's I statistic for Potential ET (March-August)_Quantile", sp.layout =list("sp.polygons",States,lwd=3, first = FALSE))

Since the indices include negative and zero values and there is large variation among the positive values, I modified the breaks manually in order to see the variations and the potential clustering at the higher Moran values:

breaks_fix<- classIntervals(moran_local_map$Ii, n=17, style="fixed",fixedBreaks=c(-0.2, -0.0001, 0.0001, 0.2, 0.4,0.6,0.8,1,4,5,6,7,8,9,10,11,12,13))
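To apply these fixed breaks, the same cut()/spplot() pattern used for the quantile map can be reused, for example:

moran_local_map$Ii_fix <- cut(moran_local_map$Ii, breaks_fix$brks)
spplot(moran_local_map, zcol='Ii_fix', col.regions = topo.colors(20), main="Local Moran's I statistic for Potential ET (March-August)_Fixed breaks", sp.layout =list("sp.polygons",States,lwd=3, first = FALSE))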

Note that the p-value should be small enough (statistically significant) for the feature to be considered as part of a cluster.

References

Anselin, L., 1995. Local indicators of spatial association. Geographical Analysis, 27, 93–115.

Getis, A. and Ord, J. K., 1996. Local spatial statistics: an overview. In P. Longley and M. Batty (eds), Spatial Analysis: Modelling in a GIS Environment (Cambridge: Geoinformation International), 261–277.

Sokal, R. R., Oden, N. L. and Thomson, B. A., 1998. Local Spatial Autocorrelation in a Biological Model. Geographical Analysis, 30, 331–354.

Bivand, R. S. and Wong, D. W. S., 2018. Comparing implementations of global and local indicators of spatial association. TEST, 27(3), 716–748. https://doi.org/10.1007/s11749-018-0599-x

Displaying Interactions with chordDiagram in R

Sensitivity analysis is a powerful tool for finding out which parameters in a model have the largest effect on the results. Besides the impact of the main parameters, exploring the interactions between parameters is also important, because complex systems usually have many active interactions. Learning about them improves our understanding of how a model works under different conditions, and can potentially improve our inferences from the model’s results and our model evaluation. Previously, Antonio showed how to create a radial convergence plot for visualizing Sobol indices in Python. In this blog post, I am going to use an R library to create a chord diagram of Sobol interaction indices for annual baseflow. Then, I will create an animated GIF file that shows the interactions for a few years. The dataset (download from here) that I am using came from the simulation of an agro-biophysical model that has 33 parameters.

install.packages("circlize")
library(circlize)
baseflow_S2<- read.csv(paste("...(your path).../sobol_ Baseflow_S2_1989.csv",sep=""))
baseflow_S2$X<- NULL

To work with this library, the format of the data should be a matrix or list.

baseflow_S2<- as.matrix(baseflow_S2)

As I mentioned above, the interactions are between 33 parameters, so we have 33 rows and columns in this dataset. I am going to assign a name to each row and column based on the actual order in the dataset:

rownames(baseflow_S2) = c("H_1","H_2","H_3","H_4","B_1","B_2","B_3","B_4","B_5","W_4","W_5","T_4","W_1","W_2","W_6","W_7","Phy_1","Phy_2","T_1","T_2","W_3","Phy_3","Phy_4","T_3","Phy_5","Phy_6","H_5","Ph_1","Ph_2","Ph_3","Ph_4","Ph_5","Ph_6")                   
colnames(baseflow_S2)= c("H_1","H_2","H_3","H_4","B_1","B_2","B_3","B_4","B_5","W_4","W_5","T_4","W_1","W_2","W_6","W_7","Phy_1","Phy_2","T_1","T_2","W_3","Phy_3","Phy_4","T_3","Phy_5","Phy_6","H_5","Ph_1","Ph_2","Ph_3","Ph_4","Ph_5","Ph_6") 	       

Now, we are going to assign a color for each parameter. For better visualization and interpretability of the results, these parameters are classified into six main groups (H, Ph, B, T, W, and Phy) based on how they are related to the model. I am going to assign the same color to all of the parameters within each group.

grid.col=c(H_1="deepskyblue4",H_2="deepskyblue4",H_3="deepskyblue4",H_4="deepskyblue4",H_5="deepskyblue4",Ph_1="darkorange3",Ph_2="darkorange3",Ph_3="darkorange3",Ph_4="darkorange3",Ph_5="darkorange3",Ph_6="darkorange3",B_1="brown4",B_2="brown4",B_3="brown4",B_4="brown4",B_5="brown4",T_1="coral2",T_2="coral2",T_3="coral2",T_4="coral2",W_1="cyan3",W_2="cyan3",W_3="cyan3",W_4="cyan3",W_5="cyan3",W_6="cyan3",W_7="cyan3",Phy_1="darkgreen",Phy_2="darkgreen",Phy_3="darkgreen",Phy_4="darkgreen",Phy_5="darkgreen",Phy_6="darkgreen")

Now we can look at the plot:

interactions_plot<- chordDiagram(baseflow_S2,grid.col = grid.col )

As we see, there are many small interactions between parameters. Therefore, you may want to set a limit so that only the larger interactions are shown, or simply hide the interactions that fall below a specific value. To do this, we can use the object returned by the function above, which already has the final format of the dataset created inside the function, and modify it based on what we want to show. Although calling chordDiagram() draws the plot, the returned “interactions_plot” object is a data frame describing each link; its “col” column holds the color code assigned to each interaction. We can extract that column.

 col2<- interactions_plot$col

Then get the indices of the values that are less than, for example, 0.005:

idx <- which(interactions_plot$value2 < 0.005, arr.ind = TRUE)

And we can replace the actual color in “col2” with “transparent” for the rows whose indices were selected above.

col2[idx] <- 'transparent'

Now, we can create the final graph with some more adjustments. With “order”, we can manually sort the grid (sectors). “link.sort” and “link.decreasing” sort the links based on the value of the interaction (shown by the width of the link), which can help in detecting interesting interactions. Setting “annotationTrack” to “grid” prints only the sectors, and “preAllocateTracks” pre-allocates an empty track. circos.track() creates plotting regions for a track: track.index is the index of the track that is going to be updated, and “panel.fun” is used to add graphics and customize the sector labels (x and y correspond to the data points in each cell).

chordDiagram(baseflow_S2,order = c("H_1","H_2","H_3","H_4","H_5","Ph_1","Ph_2","Ph_3","Ph_4","Ph_5","Ph_6","B_1","B_2","B_3","B_4","B_5","T_1","T_2","T_3","T_4","W_1","W_2","W_3","W_4","W_5","W_6","W_7","Phy_1","Phy_2","Phy_3","Phy_4","Phy_5","Phy_6"),grid.col = grid.col ,col=col2, link.sort = TRUE, link.decreasing = TRUE,annotationTrack = "grid", preAllocateTracks = list(1))
title(main=paste(1989),adj=0)
    circos.track (track.index = 1, panel.fun = function(x, y) {
      xlim = get.cell.meta.data("xlim")
      ylim = get.cell.meta.data("ylim")
      sector.name = get.cell.meta.data("sector.index")
      circos.text(mean(xlim), ylim[1]+.1, sector.name, facing = "clockwise", niceFacing = TRUE, adj = c(-0.5,0.5),cex = 0.6)
      circos.axis(h = "top", labels.cex = 0.6, major.tick.percentage = 0.2, track.index = 2)}, bg.border = NA) 

We may want to add another layer to highlight a specific sector and add a group name. In this case, we can use “highlight.sector” to specify the color, label, and components of this new layer. With “padding,” we can also change the width of this layer.

highlight.sector(rownames(baseflow_S2[c("H_1","H_2","H_3","H_4","H_5"),]), track.index = 1, col = "deepskyblue4",text = "H", cex = 0.8, text.col = "black", niceFacing = TRUE,padding = c(-0.9, 0, 0.27, 0),font=2) 

highlight.sector(rownames(baseflow_S2[c("Ph_1","Ph_2","Ph_3","Ph_4","Ph_5","Ph_6"),]), track.index = 1, col = "darkorange3", text = "PH", cex = 0.8, text.col = "black", niceFacing = TRUE,padding = c(-0.9, 0, 0.27, 0),font=2)

highlight.sector(rownames(baseflow_S2[c("T_1","T_2","T_3","T_4"),]), track.index = 1, col = "coral2", text = "T", cex = 0.8, text.col = "black", niceFacing = TRUE,padding = c(-0.9, 0, 0.27, 0),font=2)   

highlight.sector(rownames(baseflow_S2[c("Phy_1","Phy_2","Phy_3","Phy_4","Phy_5","Phy_6"),]), track.index = 1, col = "darkgreen",text = "PHY", cex = 0.8, text.col = "black", niceFacing = TRUE,padding = c(-0.9, 0, 0.27, 0),font=2)

highlight.sector(rownames(baseflow_S2[c("B_1","B_2","B_3","B_4","B_5"),]), track.index = 1, col = "brown4", text = "B", cex = 0.8, text.col = "black", niceFacing = TRUE,padding =c(-0.9, 0, 0.27, 0),font=2) 

highlight.sector(rownames(baseflow_S2[c("W_1","W_2","W_3",	"W_4","W_5","W_6","W_7"),]), track.index = 1, col = "cyan3",text = "W", cex = 0.8, text.col = "black", niceFacing = TRUE,padding = c(-0.9, 0, 0.27, 0),font=2)

And here is the final plot:

To save the graph, use this line of code before calling chordDiagram():

png(file = paste("…(your path)…./sobol_Baseflow_S2_1989.png",sep="") ,height = 7, width = 7, units = "in", res = 500) 

And at the end, run dev.off(). Now, I am going to run this code for all of the other years and save the figures to make a GIF file that shows the interactions for all years; a sketch of that loop is shown below. To make an animation from these plots, we first need to install the “magick” package.
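Here is a rough sketch of how that per-year loop could look; it simply reuses the plotting and saving code above, and the file-name pattern mirrors the one already used:

for (yr in 1989:1997) {
  S2 <- read.csv(paste0("...(your path).../sobol_ Baseflow_S2_", yr, ".csv"))
  S2$X <- NULL
  S2 <- as.matrix(S2)
  rownames(S2) <- rownames(baseflow_S2)
  colnames(S2) <- colnames(baseflow_S2)
  png(file = paste0("...(your path).../sobol_Baseflow_S2_", yr, ".png"), height = 7, width = 7, units = "in", res = 500)
  chordDiagram(S2, grid.col = grid.col)   # plus the same circos.track()/highlight.sector() code as above
  title(main = paste(yr), adj = 0)
  dev.off()
}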

The simple code below then reads the first image and adds the rest of the images to it in a loop. You can also add a message for each image (here, I added the year). The “delay” option is used to adjust the delay after each frame.

install.packages("magick")
library(magick)
img <- image_read(path = paste0("...(your path).../sobol_Baseflow_S2_1989.png"))
years<- as.data.frame(1990:1997)  # the 1989 image was already read above

for(Y in 1:nrow(years)) {
img0 <- image_read(path = paste0("...(your path).../sobol_Baseflow_S2_",years[Y,],".png"))
img <- c(img, img0) 
}  
img1 <- image_scale(image = img, geometry = "720x720")
ani0 <- image_animate(image = img1, delay = 200)
image_write(image = ani0, path = paste0("...(your path).../sobol_Baseflow_S2.gif"))

Parallel File Compressing in Linux

If you are dealing with big data and need to move it to different directories or archive it for long-term storage, you have to think about how to do so efficiently; without an efficient method, you will probably spend days organizing and finishing the work. Several utilities are popular for this task. I had more than 3 TB of data that I needed to move to free up some space on the disk, so I thought about a strategy for moving my files. My smallest subdirectory was about 175 GB, and it took about 2 hours to compress with normal tar and gzip. I then learned that gzip has options for the level (and therefore speed) of compression, which I did not know before, and this can be helpful. I did a simple test with a smaller data set (about 1.94 GB) by applying the different speed options from 1 to 9; 1 indicates the fastest but weakest compression, and 9 indicates the slowest but best compression:

GZIP=-Option tar cvzf OUTPUT_FILE.tar.gz ./Paths_to_Archive/

Here is the result: the default is 6, but you can play with the speed option and, based on your priority, choose the timing and compression ratio. If your file or folder is much bigger than what I used here as an example, different speed options can really save you time. Here is the graph that I created from this experiment:

You can further speed this up by using more than one core/processor. pigz is an alternative gzip implementation that compresses files/folders on multiple processors and cores.

tar cf - ./Paths_to_Archive | ./pigz-2.4/pigz -1 -p 20 > OUTPUT_FILE.tar.gz

In this case, I used “1” as the speed option, specified 20 processors for the task, and gave the path where I downloaded pigz. The good news is that you can write a simple bash script to run the same command on a compute node rather than on the login node. Therefore, you can use all the available processors on a node without slowing down the head node.

#!/bin/bash										
#SBATCH -t 24:00:00					
#SBATCH --job-name=test				
#SBATCH --mail-type=end				
#SBATCH -p normal					
#SBATCH --export=ALL				
#SBATCH --nodes=1			
#SBATCH --output="test.txt"				
#SBATCH --cpus-per-task=32

tar cf - ./Paths_to_Archive | ./pigz-2.4/pigz -1 -p 32 > OUTPUT_FILE.tar.gz

Save the lines above in a test.sh, and run it with sbatch test.sh.

I used a larger subset of my data (16.5 GB) to check the different speed options in parallel, and here is the result: the slowest option was almost threefold faster (168 vs. 478 seconds) than my previous test on one processor; however, my previous test folder was much smaller (1.96 vs. 16.5 GB).

Parallel processing with R on Windows

Parallel programming can save you a lot of time when you are processing large amounts of data. Modern computers provide multiple processors and cores and hyper-threading, and R can take advantage of this to run multiple computations simultaneously on all of these resources. There is some debate about when to parallelize, because there is no linear relationship between the number of processors and cores used simultaneously and the computational time saved. In this blog post, I am going to use two packages in R that allow parallelization, for a basic example in which each computation is standalone and there is no need for communication between the cores being used in parallel.

install.packages("parallel")
install.packages("doParallel")
library(doParallel)
library(parallel)

If you press Ctrl+Shift+Esc and click on the Performance tab in the Task Manager window, you will see how many logical processors (the combination of processors and cores) are available on your local Windows machine and can be used simultaneously for your analysis. We can also detect this number with the following command:

no_cores <- detectCores(logical = TRUE)  # returns the number of available hardware threads, and if it is FALSE, returns the number of physical cores

Now, we need to make a cluster with a number of these available cores and register it so R can use it. If you give all of the cores to R, you may have trouble doing anything else on your machine, so it is better not to use all of the resources in R.

cl <- makeCluster(no_cores-1)  
registerDoParallel(cl)  

First, we are going to create a list of files that we want to analyze. You can download the example dataset here.

all_samples<- as.data.frame(list.files("your directory/R_Parallel_example/"))
seq_id_all<- seq_len(nrow(all_samples))

Then, we will create the function that we will use for processing our data. Each file in this set has daily values of several variables over 31 years. Columns 1, 2, and 3 are year, month, and day, respectively. I am going to extract the yield value for each year from the column “OUT_CROP_BIOMYELD” and calculate the average yield over the entire period. All the libraries that you are going to use for your data processing should be called inside the function. I am going to use the “data.table” library to read my data into R efficiently. At the end of the function, I put the two outputs that I am interested in (“annual_yield” and “average_yield”) into one list to return from the function.

myfunction<- function(...) {
  i<-(...)
  library(data.table)
  setwd(paste("your directory/R_Parallel_example/"))
  sample<- fread(paste(all_samples[i,]))
      annual_yield<- subset(sample,sample$OUT_CROP_BIOMYELD>0)  # this column (OUT_CROP_BIOMYELD) is always zero except when the yield is reported which should be a value above zero.
      annual_yield$No<-  as.numeric(gsub("_47.65625_-117.96875","",all_samples[i,]))  # extract some part of the file name, use it as an identification for this dataset
      annual_yield<- annual_yield[,c(1,13,17)]  # extract just “Year”,”Yield” and “No” columns.
      colnames(annual_yield)<- c("Year","Yield","No")
  average_yield<- colMeans(annual_yield[,c("Yield","No")])  #calculate the average yield for each dataset
  return( list(annual_yield,average_yield))
}
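Before exporting anything to the cluster, it can help to confirm that the function works on a single file. This is just a quick serial sanity check (assuming the example directory above is set correctly; "test_out" is simply a name I chose for this check):

test_out <- myfunction(1)   # run the function on the first file only, outside the cluster
head(test_out[[1]])         # annual yield table for the first dataset
test_out[[2]]               # average yield over the whole period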

Now, we need to export our function to the cluster. Because the function uses the "all_samples" data frame, which was created outside of it, that object has to be exported to the cluster as well:

clusterExport(cl,list('myfunction','all_samples'))

With the command line below, we are running the function across the number of cores that we specified earlier, and with “system.time,” the process time will be printed at the end:

system.time(
  results <- parLapply(cl, seq_id_all, fun = myfunction)
)
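As a side note, since "doParallel" has already registered the cluster, the same computation could also be written with the "foreach" package and its %dopar% operator. This is just an alternative sketch, not part of the original workflow; it relies on the objects already exported to the workers with clusterExport():

library(foreach)   # loaded automatically with doParallel, shown here for clarity
system.time(
  results_foreach <- foreach(i = seq_id_all) %dopar% myfunction(i)   # returns a list, like parLapply
)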

The function outputs are saved in the list “results” that we can extract:

k_1 <- list()
k_2 <- list()
for (k in 1:nrow(all_samples)){
  k_1[[k]] <- results[[k]][[1]]
  k_2[[k]] <- as.list(results[[k]][[2]])  # convert the named vector to a list so rbindlist() can bind it as a row
}
annual_yield <- data.table::rbindlist(k_1)
period_yield <- data.table::rbindlist(k_2)
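When the parallel work is finished, it is good practice to shut down the cluster and release the worker processes:

stopCluster(cl)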

More on simple Bash Shell scripts (examples of “find” and “sed”)

When you conduct a large ensemble of computer simulations with several scenarios, you are going to deal with a lot of data, including inputs and outputs. You also need to create several directories and subdirectories where you can put or generate the inputs and outputs of your model. For example, you may want to run a cropping system model across a large region, for 3500 grid cells, and you need to feed your model with the input files for each grid cell; each grid cell has its own weather, soil, crop, and management input files. Or you may want to run your model 100,000 times, each time with one set of crop parameters as input, to conduct a sensitivity analysis. Another common task in large simulation studies is hunting for hidden errors: your jobs might look normal, without any obvious crash, but you may still find some kind of "error" or "warning" in your log files. You then need to find those runs, correct them, delete the wrong files, and rerun them to obtain a full set of outputs. These tasks are basic but can be challenging and very time-consuming if you do not know how to complete them efficiently. The Linux environment provides facilities that make our lives easier, as Dave said in his blog post, and Bernardo also provided some examples for this type of task. Here are a few more instances of simple but useful commands with "find" and "sed."

find

Sometimes, you want to know how many files with a specific pattern exist in all the subdirectories of a folder. You can type the command below at the top-level folder. "-type f" restricts the search to files, and after "-name" we specify the pattern, for example files whose names start with "218". We can also look for all files that have a specific extension [e.g. *.csv] or contain a specific string in their name [e.g. *yield*].

find . -type f -name "218*"

Then we can pipe [|] the list of results to the word-count command [wc], where the [-l] flag counts the number of lines:

find . -type f -name "218*" |  wc -l

You may want to find and delete all files with a specific pattern [e.g. 218_wheat.csv] in all of the directories in which they might exist. So, after we find the files, we can execute [-exec] the remove command [rm]:

find . -type f -name "218_wheat*" -exec rm {} \;

If these files are located in different directories and we don’t want to delete them all, we can also filter the find results by specifying a pattern for the path [e.g. output] and then delete them:

find . -type f -path "*/output/*" -name "218_wheat*" -exec rm {} \;

Sometimes, we need to find specific paths instead of files. For example, I have a text file, and I want to copy that into the “CO2” folder, which is located in the “Inputs” folders of several scenario folders:

find . -type d -path "*/Inputs/*" -name "CO2" -exec cp ../CO2_concentration_8.5.txt {} \;

“-type d” means directories, so we are searching for directories named “CO2” that sit inside an “Inputs” folder. Then, we execute the copy command [cp], which copies the text file into each directory that was found.

If we are looking for a specific string inside some files, we can combine “find” and “grep.” For example, here I am searching every error file [*.err] whose name starts with “218cs” for this specific warning: “unable to find”

find . -type f -name "218cs*.err" -exec grep -i "unable to find" {} \;

Or we can list the files that do not contain “success” [grep -L prints the names of files with no matching lines]:

find . -type f -name "218cs*.err" -exec grep -L "success" {} \;

sed

“sed” is a powerful text editor. Most of the time it is used to replace a specific string in a text file:

sed -i 's/1295/1360/' 218cs.txt

Here, we edit the file in place [-i] and substitute [s] the original string [1295] with a new string [1360]. There might be several occurrences of “1295” in the file, and we may want to replace just one of them, for example the one located at line 5:

sed -i '5s/1295/1360/' 218cs.txt

There might be more modifications that have to be done, so we can add them in one line using “-e”:

sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/' 218cs.txt

find + sed

If you want to find specific text files (i.e., all the 218cs.txt files, inside RCP8.5 folders) and edit some lines in them by replacing them with new strings, this line will do it:

find . -type f -path "*/RCP8.5/*" -name "218*" -exec sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/'  {} \;

Sometimes, you want to replace an entire line in a text file with a long string, like a path, or with a line that needs to keep some whitespace. For example, I want to replace a line in a text file with the following line, which combines spaces and a path:

“FORCING1         /home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_”

For this modification, I am going to assign the new line to a shell variable and then use it to replace the entire line 119 of the text files [global_param_default.txt] that are located in specific directories [with this pattern “RCP4.5/451812118”].

path_new="FORCING1	/home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_"
find . -type f -path "*RCP4.5/451812118/*" -name "global_param_*" -exec sed -i "119s|.*|$path_new|" {} +

Sometimes, you want to add a new line at a specific position (e.g., line 275) in some text files (e.g., global_param_default.txt).

find . -type f -name "global_param_*" -exec sed -i "275i\OUTVAR    OUT_CROP_EVAP  %.4f OUT_TYPE_FLOAT  1" {} \;

Now, all of the “global_param_default” files have a new line with this content: “OUTVAR    OUT_CROP_EVAP  %.4f OUT_TYPE_FLOAT  1”.

It is also possible that you want to take a specific section of a path and use it as the name of a variable or a file. For example, I am searching for directories named “output”; this path would be one of them: ./453911731_CCF/output/. Now, I want to extract “453911731” and use it as the new name of a file [output__46.34375_-119.90625] that is already inside that path:

for P in $(find . -type d -name "output"); do (new_name="$(echo "${P}" | sed -r 's/..(.{9}).*/\1/')" && cd "${P}" && mv output__46.34375_-119.90625 "$new_name"); done

With this command line, we repeat the whole process for each directory matching the “output” pattern by using “for,” “do,” and “done.” The command substitution $( ) around the “find” section produces the list of matching paths, and each path in turn is assigned to the variable “P.” Then, the result of “sed -r” is assigned to another variable [new_name]; the “-r” flag enables extended regular expressions in sed.

With the “sed” command, we capture the 9 characters after “./” and drop everything else. Each “.” matches any single character, the parentheses create a capture group, “{9}” means 9 occurrences of the preceding pattern (here “.”, i.e. any character), and “\1” refers back to the first captured substring.

“&&” is used to chain commands together. “cd” allows you to change into a specified path, which is stored in $P, and “mv” renames the file in this path from “output__46.34375_-119.90625” to “453911731,” which is stored in $new_name.

ggplot (Part 2)

This is the second part of the ggplot introduction. In this blog post, I am going to go over how you can make a decent density plot in ggplot. Density plots are essentially smoothed versions of the histogram: they show the distribution of your data as a continuous curve estimated with a kernel density estimation procedure. For example, when we have a regional data set, it is important to look at the distribution of our data across the region instead of just considering the regional average. In our example (download the data set from here), we are going to visualize the regional distribution of simulated average winter wheat yield over the 30 years from 1981 to 2010. The “ID” column in the data set identifies a grid cell in the region, and there are 1,812 grid cells in total. For each grid cell, the average historical yield and the standard deviation of yield over the 30 years are given. First, we need to load the library; then, in the general code structure of “ggplot(dataframe, aes(x, y, fill)),” we map the x-axis to yield. The y-axis (the density) is calculated and added by “geom_density()”. Then, we can add a color, title, and labels and customize the background.

example1 <- read.csv("(your directory)/example_1.csv")
library(ggplot2)
ggplot(example1, aes(x = period_ave_Y)) +
  geom_density(fill = "blue") +
  theme(panel.background = element_rect(fill = 'white'),
        axis.line = element_line(size = 0.5, linetype = "solid", colour = "black")) +
  labs(title = "Density Plot of Regional Average Historical Yield (30 years)",
       x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color = "black")

Now, we want to see how the standard deviation of yield over the 30 years for each grid cell in the region can be mapped onto this density plot.

We can add another column (name it “SD_class”) to the data set and classify the standard deviations. The maximum and minimum standard deviations among all the grid cells are the following.

max(example1$period_sd_Y)
# [1] 3.605131
min(example1$period_sd_Y)
# [1] 0.8645882

For example, I want to see this plot categorized by standard deviations from 0.8 to 1.5, from 1.5 to 2.5, and from 2.5 to the maximum value. Here, I am writing a simple loop that goes over each row (corresponding to each grid cell in the region), checks its standard deviation value, and fills the newly added column (“SD_class”) with the class that I specify in the “if” statements.

example1$SD_class <- NA
for (i in 1:nrow(example1)){
  if (example1$period_sd_Y[i] > 0.8 && example1$period_sd_Y[i] <= 1.5) {example1$SD_class[i] <- "0.8-1.5"}
  if (example1$period_sd_Y[i] > 1.5 && example1$period_sd_Y[i] <= 2.5) {example1$SD_class[i] <- "1.5-2.5"}
  if (example1$period_sd_Y[i] > 2.5)                                   {example1$SD_class[i] <- "2.5-3.6"}
}
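As a side note, the same classification can be done without a loop. Here is a minimal sketch using base R’s cut(), with break points chosen to match the classes above (the upper bound 3.61 is just above the maximum standard deviation):

# vectorized alternative to the loop above (produces a factor instead of a character column)
example1$SD_class <- cut(example1$period_sd_Y,
                         breaks = c(0.8, 1.5, 2.5, 3.61),
                         labels = c("0.8-1.5", "1.5-2.5", "2.5-3.6"))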

Now, we just need to add “fill” to the aesthetics section of the code, specify the column with the classifications, and add “alpha” to make the colors transparent so that we can see the shapes of the distributions and whether they overlap.

ggplot(example1, aes(x = period_ave_Y, fill = SD_class)) +
  geom_density(alpha = 0.4) +
  theme(panel.background = element_rect(fill = 'white'),
        axis.line = element_line(size = 0.5, linetype = "solid", colour = "black"),
        axis.text = element_text(size = 16), axis.title = element_text(size = 16, face = "bold"),
        plot.title = element_text(size = 20, face = "bold"),
        legend.text = element_text(size = 13), legend.title = element_text(size = 14)) +
  labs(title = "Density Plot of Regional Average Historical Yield (30 years)",
       x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color = "black")

We can also use the “facet_grid()” option, like the plot in Part (1), and specify the column with classification to show each of these classes in a separate panel.

ggplot(example1, aes(x = period_ave_Y, fill = SD_class)) +
  geom_density(alpha = 0.4) + facet_grid(SD_class ~ .) +
  theme(panel.background = element_rect(fill = 'white'),
        axis.line = element_line(size = 0.5, linetype = "solid", colour = "black"),
        axis.text = element_text(size = 16), axis.title = element_text(size = 16, face = "bold"),
        plot.title = element_text(size = 20, face = "bold"),
        legend.text = element_text(size = 13), legend.title = element_text(size = 14)) +
  labs(title = "Density Plot of Regional Average Historical Yield (30 years)",
       x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color = "black")

Another interesting thing to explore is where different percentiles of our data set fall on the density plot. For this, we need the actual yield value corresponding to each percentile that we are interested in (for example, the 10th, 25th, 50th, 75th, and 90th percentiles), and then the density value (y-axis on the plot) at each of those yields:

quantiles_yield <- quantile(example1$period_ave_Y, prob=c(0.1, 0.25, 0.5, 0.75, 0.9))
#     10%      25%      50%      75%      90% 
#  4.229513 5.055070 5.582192 5.939071 6.186014

Now, we are going to estimate the density value for each of the yields at the 10th, 25th, 50th, 75th, and 90th percentiles.

df <- approxfun(density(example1$period_ave_Y))

The above function will give us the approximate density value for each point (yield) in which we are interested—in our case, yields for the above percentiles:

df(c(quantiles_yield))
#[1] 0.1176976 0.3267841 0.6129621 0.6615790 0.4345247

Now, we can add several vertical segments to the density plot that show where each percentile is located on this graph. The limits of these segments on the y-axis are based on the density values for each percentile that we got above. Also, note that I used those values to adjust the positions of the labels for the segments.

ggplot() +
  geom_density(aes(x = example1$period_ave_Y), fill = "blue", alpha = 0.4) +
  geom_segment(aes(x = quantiles_yield, y = 0, xend = quantiles_yield, yend = df(quantiles_yield)),
               size = 1, colour = c("red", "green", "blue", "purple", "orange"), linetype = 'dashed') +
  theme(panel.background = element_rect(fill = 'white'),
        axis.line = element_line(size = 0.5, linetype = "solid", colour = "black"),
        axis.text = element_text(size = 16), axis.title = element_text(size = 16, face = "bold"),
        plot.title = element_text(size = 20, face = "bold"),
        legend.text = element_text(size = 13), legend.title = element_text(size = 14)) +
  labs(title = "Density Plot of Regional Average Historical Yield (30 years) and Percentiles",
       x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color = "black") +
  annotate("text", x = 4.229513, y = 0.15, label = "10%", size = 5) +
  annotate("text", x = 5.055070, y = 0.36, label = "25%", size = 5) +
  annotate("text", x = 5.582192, y = 0.65, label = "50%", size = 5) +
  annotate("text", x = 5.939071, y = 0.70, label = "75%", size = 5) +
  annotate("text", x = 6.186014, y = 0.47, label = "90%", size = 5)