Spatial statistics (Part 3): Geographically Weighted (GW) models

Geographically weighted (GW) models are useful when a spatial process is non-stationary, i.e., when a single global model cannot represent the local variations across a region. Instead, regression coefficients are estimated locally, with observations weighted according to their distance from each target location. In this blog post, I introduce the R package "GWmodel", which implements this approach and offers further functionality, such as a geographically weighted principal components analysis that can be used as an exploratory tool for evaluating spatial heterogeneity in the data. It also provides GW summary statistics, which I will cover in this post.

The spatial weighting function is the most important part of GW modeling, as it defines the spatial dependency between the data. A weighting matrix, with the same dimension as the data, indicates the geographical weight of every data point at each target location. To build this matrix, the user has to specify the type of distance, a kernel function, and a bandwidth. Distances can be calculated in different ways (Euclidean, Manhattan, Great Circle, or the generalized Minkowski distance), and commonly used kernel functions include the Gaussian, exponential, box-car, bi-square, and tri-cube kernels.

The Gaussian and exponential kernels are continuous functions of the distance between two observation points, while the box-car, bi-square, and tri-cube kernels are discontinuous: observations farther away than the bandwidth receive zero weight and are effectively excluded. For both types of kernel, the bandwidth can be specified either as a fixed distance or as an adaptive number of nearest neighbors; for the continuous kernels, however, the actual local sample size always equals the full sample size, since every observation receives some weight.
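To make this concrete, below is a minimal sketch of how such a weighting matrix could be built with the gw.dist() and gw.weight() functions from "GWmodel". The coordinates here are hypothetical placeholders for your own point locations.

library("GWmodel")

# hypothetical coordinates (longitude, latitude) for five observations
coords_demo <- cbind(lon = c(-93.1, -93.3, -92.8, -93.0, -92.9),
                     lat = c(41.5, 41.7, 41.6, 41.4, 41.8))

# pairwise distance matrix (Euclidean by default; p, theta and longlat control the metric)
d_mat <- gw.dist(dp.locat = coords_demo)

# geographical weights: bi-square kernel with an adaptive bandwidth of 3 nearest neighbours
w_mat <- gw.weight(d_mat, bw = 3, kernel = "bisquare", adaptive = TRUE)
w_mat   # weights of every observation for each target location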

We can examine potential local relationships between variables with the summary statistics function gwss(), which computes a GW mean, standard deviation, measure of skew, and Pearson correlation coefficient at each location. Beyond these basic summaries, we can also compute robust statistics, in which the effect of outliers on the local statistics is reduced: GW medians, inter-quartile ranges, and quantile imbalances. Local bivariate summaries are available in both basic and robust forms, as Pearson's and Spearman's correlation coefficients, respectively. I am going to use this function to explore some of these statistics. The sample data, which is similar to that of my previous blog post, is here. First, we need to convert it to spatial data with coordinates by specifying the longitude and latitude columns:

library("sp")      # provides SpatialPointsDataFrame()

# read the sample data; the first two columns hold the point coordinates (LON, LAT)
data <- read.table("---your path---/data_new.csv", header = TRUE)
sample_dataset <- SpatialPointsDataFrame(data[, 1:2], data)

Now, I am going to calculate the summary statistics for a few variables using three different kernels. The gwss() function requires a bandwidth; ideally, this should be chosen by cross-validation across a range of bandwidths, so that local predictions are as accurate as possible.

library("GWmodel")
sample_dataset_bx <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"), kernel = "boxcar", adaptive = TRUE, bw = 300, quantile = TRUE)
sample_dataset_bs <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"),  kernel = "bisquare", adaptive = TRUE, bw = 300, quantile = TRUE)
sample_dataset_gu <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"),  kernel = "gaussian", adaptive = TRUE, bw = 300, quantile = TRUE)
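As an aside, GWmodel can select a bandwidth by cross-validation for its regression models. A minimal sketch with bw.gwr(), using an illustrative formula (yield as a function of the soil-water profile), could look like this:

# cross-validated selection of an adaptive bandwidth for a GW regression
# (illustrative formula only; swap in the variables of interest)
bw_cv <- bw.gwr(WW_Yield ~ Soil_water, data = sample_dataset,
                approach = "CV", kernel = "bisquare", adaptive = TRUE)
bw_cv   # optimal number of nearest neighbours found by cross-validation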

As an example, we can compare the basic measures of local variability in yield obtained with the three kernels.

library("RColorBrewer")
spplot(sample_dataset_bx$SDF, "WW_Yield_LSD", key.space = "right",col.regions = brewer.pal(8, "Set1") ,cuts = c(345,525,705,885,1065,1245,1425,1605,1770),  main = "GW Standard Deviations for Yield (boxcar)")

We can also plot the basic local correlation between yield and the soil-water profile using a box-car kernel; the graphs below present the same measure for all three kernels.

mypalette <- c("#FFFFCC", "#C7E9B4", "#7FCDBB", "#41B6C4", "#1D91C0", "#225EA8", "#0C2C84")
spplot(sample_dataset_bx$SDF, "Corr_WW_Yield.Soil_water", key.space = "right",
       col.regions = mypalette,
       cuts = c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1),
       main = "GW correlations: WW Yield and Soil-water Profile (boxcar)")

As the standard deviation maps show, yield is highly variable across the region. The bi-square kernel with an adaptive bandwidth of 300 observations (~26% of the data) appears more effective than the other two kernels, and the relationship between yield and the soil-water profile is clearly non-stationary.

Next, I am going to compare the robust GW correlations between yield and the soil-water profile (bi-square kernel) with the basic correlations we just computed, this time with a different color scheme for better visualization:

spplot(sample_dataset_bs$SDF, "Corr_WW_Yield.Soil_water", key.space = "right", cuts=c(-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),
       main = "GW correlations: WW Yield and Soil-water Profile (bisquare_basic)")
spplot(sample_dataset_bs$SDF, "Spearman_rho_WW_Yield.Soil_water",key.space = "right",cuts=c(-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),
        main = "GW correlations: WW Yield and Soil-water Profile (bisquare_robust)")

Principal components analysis (PCA) is another technique we can apply to multivariate data to find linear combinations of variables that capture the main sources of variation. The "GWmodel" package provides the functionality to account for spatial heterogeneity in a PCA. Below are example commands for basic and robust (global) PCA. Before that, we need to standardize our independent variables by re-scaling them so that they have a similar magnitude and therefore equal weight in the analysis.

# standardize the independent variables (columns 3 to 10)
scaled_dataset <- scale(as.matrix(sample_dataset@data[, 3:10]))

# basic (global) PCA
pca_basic <- princomp(scaled_dataset, cor = FALSE)
(pca_basic$sdev^2 / sum(pca_basic$sdev^2)) * 100    # percentage of total variance (PTV)
##       Comp.1       Comp.2       Comp.3       Comp.4       Comp.5       Comp.6 
## 5.956296e+01 1.891630e+01 1.202148e+01 6.965907e+00 2.219094e+00 3.114074e-01 
##       Comp.7       Comp.8 
## 2.586703e-03 2.653528e-04 

# robust (global) PCA, based on a minimum covariance determinant (MCD) estimate
library("robustbase")                               # provides covMcd()
mcd_cov <- covMcd(scaled_dataset, cor = FALSE)
pca_robust <- princomp(scaled_dataset, covmat = mcd_cov, cor = FALSE)
(pca_robust$sdev^2 / sum(pca_robust$sdev^2)) * 100  # percentage of total variance (PTV)
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8 
## 42.758379 32.184879 11.967754  5.399829  4.168437  1.731579  1.190092  0.599050 
pca_robust$loadings

With bw.gwpca() we can automatically select the bandwidth for the GW PCA. The function uses a cross-validation approach to find the optimal bandwidth, but we need to decide how many components (k) to retain in the analysis. We also need to convert the scaled dataset into a spatial object.

# use the longitude/latitude columns as coordinates for the scaled data
coords <- as.matrix(cbind(sample_dataset$LON, sample_dataset$LAT))
scaled_dataset.spdf <- SpatialPointsDataFrame(coords, as.data.frame(scaled_dataset))

# cross-validated adaptive bandwidths for the basic and robust GW PCA (k = 4 components)
bw.gwpca.basic <- bw.gwpca(scaled_dataset.spdf, vars = colnames(scaled_dataset.spdf@data),
                           k = 4, robust = FALSE, adaptive = TRUE)
# bw.gwpca.basic = 986
bw.gwpca.robust <- bw.gwpca(scaled_dataset.spdf, vars = colnames(scaled_dataset.spdf@data),
                            k = 4, robust = TRUE, adaptive = TRUE)
# bw.gwpca.robust = 767

Once the bandwidth has been estimated, we can use gwpca() to calibrate the basic and robust GW PCA fits. Note that at this step we retain all of the components in the fitted model.

gwpca.basic <- gwpca(scaled_dataset.spdf, vars = colnames(scaled_dataset.spdf@data),
                     bw = bw.gwpca.basic, k = 8, robust = FALSE, adaptive = TRUE)
gwpca.robust <- gwpca(scaled_dataset.spdf, vars = colnames(scaled_dataset.spdf@data),
                      bw = bw.gwpca.robust, k = 8, robust = TRUE, adaptive = TRUE)

Now, as an example, we can visualize how the data's dimensionality varies spatially by extracting, at each location, the percentage of total variance (PTV) accounted for by the first two local components.

# PTV accounted for by the first two local components at each location
var_pca_basic <- (rowSums(gwpca.basic$var[, 1:2]) / rowSums(gwpca.basic$var)) * 100
sample_dataset$var_pca_basic <- var_pca_basic
var_pca_robust <- (rowSums(gwpca.robust$var[, 1:2]) / rowSums(gwpca.robust$var)) * 100
sample_dataset$var_pca_robust <- var_pca_robust

spplot(sample_dataset, "var_pca_basic", key.space = "right",
       col.regions = brewer.pal(8, "YlGnBu"), cuts = 8,
       main = "PTV for local components 1 to 2 (basic)")
spplot(sample_dataset, "var_pca_robust", key.space = "right",
       col.regions = brewer.pal(8, "YlGnBu"), cuts = 8,
       main = "PTV for local components 1 to 2 (robust)")

The differences between these two plots show the effect of local multivariate outliers.
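A related way to explore the gwpca() output is through the local loadings, for example to map which variable contributes most to the first local component at each location. A minimal sketch, assuming the loadings returned by gwpca() are indexed as location x variable x component:

# local loadings of component 1 at every location (basic fit)
local_loadings <- gwpca.basic$loadings[, , 1]

# variable with the largest absolute loading on component 1 at each location
lead_var <- colnames(local_loadings)[max.col(abs(local_loadings))]
sample_dataset$lead_var <- factor(lead_var)

spplot(sample_dataset, "lead_var", key.space = "right",
       main = "Leading variable of local component 1 (basic)")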

References:

Gollini, I., Lu, B., Charlton, M., Brunsdon, C., Harris, P., 2015. GWmodel: An R Package for Exploring Spatial Heterogeneity Using Geographically Weighted Models. Journal of Statistical Software 63, 1–50. https://doi.org/10.18637/jss.v063.i17

Lu, B., Harris, P., Charlton, M., Brunsdon, C., Nakaya, T., Murakami, D., Gollini, I., 2020. GWmodel: Geographically-Weighted Models.
