Spatial statistics (Part 3): Geographically Weighted (GW) models

Geographically weighted (GW) models are useful when there is non-stationarity across a spatial region, i.e., when a single global model cannot represent the local variations across the region. Instead, regression coefficients are estimated locally, using observations weighted by their distance from each location, so that the coefficients can depart from their global values. In this blog post, I introduce the R package "GWmodel", which implements this procedure and offers additional functionality, such as a geographically weighted principal components analysis that can be used as an exploratory tool for evaluating spatial heterogeneity in the data. It also provides GW summary statistics, which I will cover in this post.

The spatial weighting function is the most important part of GW modeling, as it defines the spatial dependency between the data. We build a matrix, with one row per target location and one column per data point, that indicates the geographical weighting of each data point at each location. To build this matrix, users have to specify the type of distance, the kernel function, and the bandwidth. Distances can be calculated in different ways (Euclidean, Manhattan, great-circle, or generalized Minkowski distance), and commonly used kernel functions include the Gaussian, exponential, box-car, bi-square, and tri-cube.

The Gaussian and exponential kernels are continuous functions of the distance between two observation points, while the box-car, bi-square, and tri-cube kernels are discontinuous: observations farther away than the specified distance (the bandwidth) receive zero weight and are effectively excluded. For both types of kernel, the bandwidth can be specified either as a fixed distance or as a fixed number of local data points (an adaptive bandwidth); for the continuous kernels, however, the actual local sample size is always the full sample size, since every observation receives some weight.
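To make the distinction concrete, here is a minimal sketch in plain R (not GWmodel code; the kernel formulas follow the standard definitions) comparing a continuous Gaussian kernel with a discontinuous bi-square kernel for a bandwidth of 300:

# Gaussian kernel: weights decay smoothly and never reach exactly zero
gauss_kernel <- function(d, bw) exp(-0.5 * (d / bw)^2)
# Bi-square kernel: weights fall to zero for observations beyond the bandwidth
bisquare_kernel <- function(d, bw) ifelse(d < bw, (1 - (d / bw)^2)^2, 0)

d <- seq(0, 500, by = 10)
plot(d, gauss_kernel(d, bw = 300), type = "l", xlab = "distance", ylab = "weight")
lines(d, bisquare_kernel(d, bw = 300), lty = 2)
legend("topright", c("Gaussian", "Bi-square"), lty = 1:2)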

We can examine potential local relationships between variables with the summary statistics function gwss(), which computes a GW mean, standard deviation, measure of skew, and Pearson's correlation coefficient at any location. In addition to these basic summaries, we can compute robust statistics, in which the effect of outliers on the local statistics is reduced: GW medians, inter-quartile ranges, and quantile imbalances. Local bivariate summary statistics are also available in both basic and robust forms (Pearson's and Spearman's correlations, respectively). I am going to use this function to explore some of these statistics. The sample data, similar to the one in my previous blog post, is available here. First we need to convert it to spatial data with coordinates, by specifying the latitude and longitude columns:

library("sp")
data <- read.table("---your path---/data_new.csv", header = T)
sample_dataset <- SpatialPointsDataFrame(data[, 1:2], data)  # columns 1-2 hold the coordinates

Now, I am going to calculate the summary statistics for a few variables by considering three kernels. For this function we need to specify the bandwidth. Ideally this should be estimated by applying cross-validation across a range of bandwidths to reach the most accurate predictions.

library("GWmodel")
sample_dataset_bx <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"), kernel = "boxcar", adaptive = TRUE, bw = 300, quantile = TRUE)
sample_dataset_bs <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"),  kernel = "bisquare", adaptive = TRUE, bw = 300, quantile = TRUE)
sample_dataset_gu <- gwss(sample_dataset, vars = c("WW_Yield","ET_pot", "T_act", "Soil_evap","Soil_water"),  kernel = "gaussian", adaptive = TRUE, bw = 300, quantile = TRUE)

As an example we can compare the basic measures of the local variability in yield based on three kernels.

library("RColorBrewer")
spplot(sample_dataset_bx$SDF, "WW_Yield_LSD", key.space = "right",col.regions = brewer.pal(8, "Set1") ,cuts = c(345,525,705,885,1065,1245,1425,1605,1770),  main = "GW Standard Deviations for Yield (boxcar)")

Or we can plot the basic local correlation between yield and soil water profile using a box-car kernel. The graphs below present the same concept for all three kernels.

mypalette= c("#FFFFCC","#C7E9B4","#7FCDBB","#41B6C4","#1D91C0","#225EA8","#0C2C84")
spplot(sample_dataset_bx$SDF, "Corr_WW_Yield.Soil_water", key.space = "right",col.regions = mypalette, cuts=c(-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),
        main = "GW correlations: WW Yield and Soil-water Profile (boxcar)")

As the standard deviation maps show, yield is highly variable across the region. The bi-square kernel with a bandwidth of 300 (an adaptive bandwidth covering roughly 26% of the data) appears to perform better than the other two kernels, and the relationship between yield and the soil water profile is clearly non-stationary.

Next, I am going to compare the robust GW correlations between yield and the soil water profile (bi-square kernel) with the basic ones we just created, this time with a different color scheme for better visualization:

spplot(sample_dataset_bs$SDF, "Corr_WW_Yield.Soil_water", key.space = "right", cuts=c(-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),
       main = "GW correlations: WW Yield and Soil-water Profile (bisquare_basic)")
spplot(sample_dataset_bs$SDF, "Spearman_rho_WW_Yield.Soil_water",key.space = "right",cuts=c(-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1),
        main = "GW correlations: WW Yield and Soil-water Profile (bisquare_robust)")

Principal components analysis (PCA) is another technique we can apply to our multivariate data to find linear combinations of variables that reveal the main sources of variation. The "GWmodel" package provides functionality to account for spatial heterogeneity in the PCA. Below are example commands for basic and robust (non-geographically-weighted) PCA. Before that, we need to standardize our variables by re-scaling them so that they have similar magnitudes and therefore equal importance in the analysis.

scaled_dataset <- scale(as.matrix(sample_dataset@data[,3:10]))
#basic
pca_basic <- princomp(scaled_dataset, cor = F)
(pca_basic$sdev^2 / sum(pca_basic$sdev^2))*100    # percentage of total variance (PTV)
Comp.1       Comp.2       Comp.3       Comp.4       Comp.5       Comp.6 
5.956296e+01 1.891630e+01 1.202148e+01 6.965907e+00 2.219094e+00 3.114074e-01 
Comp.7       Comp.8 
2.586703e-03 2.653528e-04 

#robust
library("robustbase")                        # provides covMcd()
R.COV <- covMcd(scaled_dataset, cor = F)     # robust (MCD) covariance estimate
pca.robust <- princomp(scaled_dataset, covmat = R.COV, cor = F)
(pca.robust$sdev^2 / sum(pca.robust$sdev^2))*100   # percentage of total variance (PTV)
pca.robust$loadings

   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8 
42.758379 32.184879 11.967754  5.399829  4.168437  1.731579  1.190092  0.599050 

With bw.gwpca() we can automatically select the bandwidth for the GW PCA. The function uses a cross-validation approach to find the optimal bandwidth, but we need to decide how many components (k) to include in the analysis. We also need to convert our scaled dataset to spatial data.

Coords <- as.matrix(cbind(sample_dataset$LON, sample_dataset$LAT))
scaled_dataset.spdf <- SpatialPointsDataFrame(Coords,as.data.frame(scaled_dataset))
bw.gwpca.basic <- bw.gwpca(scaled_dataset.spdf,vars = colnames(scaled_dataset.spdf@data), k = 4, robust = FALSE,adaptive = TRUE)
#bw.gwpca.basic = 986
bw.gwpca.robust <- bw.gwpca(scaled_dataset.spdf,vars=colnames(scaled_dataset.spdf@data), k = 4, robust = TRUE, adaptive = TRUE)
#bw.gwpca.robust = 767

Once the bandwidth is estimated, we can use gwpca()  to calibrate the basic and robust GW PCA fits. It should be noted that we use all of the components in the fitted model at this step.

gwpca.basic <- gwpca(scaled_dataset.spdf,vars = colnames(scaled_dataset.spdf@data), bw = bw.gwpca.basic, k = 8,robust = FALSE, adaptive = TRUE)
gwpca.robust <- gwpca(scaled_dataset.spdf,vars = colnames(scaled_dataset.spdf@data), bw = bw.gwpca.robust, k = 8,robust = TRUE, adaptive = TRUE)

Now, as an example, we can visualize how data dimensionality varies spatially for the first two components, by extracting the sum of total variance (%) or PTV at each location.

var_pca_basic <- (rowSums(gwpca.basic$var[, 1:2])/rowSums(gwpca.basic$var))*100
sample_dataset$var_pca_basic <- var_pca_basic
var_pca_robust <- (rowSums(gwpca.robust$var[, 1:2])/rowSums(gwpca.robust$var))*100
sample_dataset$var_pca_robust <- var_pca_robust
spplot(sample_dataset, "var_pca_basic", key.space = "right",col.regions = brewer.pal(8, "YlGnBu"), cuts=8, main = "PTV for local components 1 to 2 (basic)")
spplot(sample_dataset, "var_pca_robust", key.space = "right",col.regions = brewer.pal(8, "YlGnBu"), cuts=8, main = "PTV for local components 1 to 2 (robust)")

The differences between these two plots show the effect of local multivariate outliers.

References:

Gollini, I., Lu, B., Charlton, M., Brunsdon, C., Harris, P., 2015. GWmodel: An R Package for Exploring Spatial Heterogeneity Using Geographically Weighted Models. Journal of Statistical Software 63, 1–50. https://doi.org/10.18637/jss.v063.i17

Lu, B., Harris, P., Charlton, M., Brunsdon, C., Nakaya, T., Murakami, D., Gollini, I., 2020. GWmodel: Geographically-Weighted Models.

Introduction to Wavelets

The post is a brief introduction and overview of wavelets. Wavelets are a powerful tool for time series analysis, de-noising, and data compression, but have recently exploded in fields like hydrology and geophysics. We are often interested in discovering periodic phenomena or underlying frequencies that are characteristic in a signal of interest, such as low-frequency teleconnections. We may also want to better understand very rapidly changing seismic signals during an earthquake or ocean wave dispersion. Wavelets can help us answer many questions in these applications!

Basic Background

The typical approach to understanding what frequencies are present in a signal involves transforming the time-domain signal into the frequency domain by means of the Fourier Transform. A well-known deficiency of the Fourier Transform is that it provides information about the frequencies that are present, but all information about the time at which these frequencies are present is lost. Thus, signals that are non-stationary, or exhibit changing frequencies over time, cannot be analyzed by the Fourier Transform (Deshpande, 2015). One solution proposed to address this limitation was the Short-Time Fourier Transform (STFT). In a very basic sense, the STFT involves applying a window function to segment the signal in time and then performing a Fourier transform on each segment. The frequencies for each segment can then be plotted to better understand the changing spectra (Smith, 2011). A considerable amount of research between the 1940s and 1970s went into determining appropriate windowing functions. However, limitations still existed with this approach. The STFT uses the same windowing function across the whole time series, which may not be appropriate for all the different frequencies that may characterize a signal. When a signal has very high frequency content, a narrow window is preferable, but this results in poor frequency resolution. When a signal has lower frequency content, a wider window is preferable, but this results in poor time resolution. This tradeoff is often termed an example of the Heisenberg Uncertainty Principle (Marković et al., 2012).

The Short-Time Fourier Transform assumes that frequencies are present uniformly across the time series. Wavelet bases of different scales (frequency bands) can be influential at select times in the series.

In order to properly analyze signals that are non-stationary or transient (most signals of interest) and to appropriately address both large- and small-scale features in frequency space, we can make use of wavelets! Wavelet transforms are especially useful for analyzing signals that have low-frequency content over long durations and high-frequency content over short durations. Furthermore, the basis functions are not restricted to sinusoids, which makes it easier to approximate functions with sharp peaks or corners.

In a continuous wavelet transform, expressed below in Equation (1), the basis function or window is termed the “mother” wavelet, which is designated by Ψ .

W(s,\tau) = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{\infty} x(t)\, \Psi^{*}\left(\frac{t-\tau}{s}\right) dt \qquad (1)

This mother wavelet can be scaled and translated with the s and τ  parameters, respectively, to produce “daughter” or “child” wavelets, which are simply variations of the mother wavelet. The wavelets are not only translated across the signal but can be dilated or contracted to capture longer or shorter term events.

By definition, a continuous wavelet transform is a convolution of the input signal with the daughter wavelets, which is inherently a measure of similarity. The wavelet is “slid” over the signal of interest and similarities at each time point are measured. Therefore, the wavelet will be correlated with the parts of the series that contain a similar frequency.

Convolution with an arbitrary selection of wavelets can lead to redundant information when the basis functions are not orthogonal. Families of orthonormal wavelet basis functions have been developed to address this limitation. The simplest orthonormal wavelet is the Haar wavelet, a discrete, square-shaped wavelet. The Haar wavelet is neither continuous nor differentiable, but it is particularly useful for approximating the response of systems that experience a sudden transition. The Morlet wavelet is a complex sinusoid enclosed in a Gaussian envelope and may be more useful in applications such as music, hearing/vision, and the analysis of electrocardiograms.

Common wavelets: a) Haar, b) Gaussian, c) Daubechies, d) Morlet (Baker, 2007)
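As a quick, package-free illustration of two of these shapes, the sketch below defines the Haar mother wavelet and the real part of a Morlet wavelet (the normalization constant is omitted) and plots them side by side:

# Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere
haar <- function(t) ifelse(t >= 0 & t < 0.5, 1, ifelse(t >= 0.5 & t < 1, -1, 0))
# Real part of the Morlet wavelet: a cosine carrier inside a Gaussian envelope
morlet <- function(t, omega0 = 6) exp(-t^2 / 2) * cos(omega0 * t)

t1 <- seq(-0.5, 1.5, by = 0.01)
t2 <- seq(-4, 4, by = 0.01)
par(mfrow = c(1, 2))
plot(t1, haar(t1), type = "s", main = "Haar", xlab = "t", ylab = "")
plot(t2, morlet(t2), type = "l", main = "Morlet (real part)", xlab = "t", ylab = "")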

In a discrete wavelet transform (DWT), the translation and scale parameters, τ and s, are discretized in such a way that each successive wavelet is twice the width of the one before, covering all but the very lowest frequencies. This is termed a dyadic filter bank. Each daughter wavelet can therefore be thought of as a bandpass filter that represents a specific frequency band, or scale, of interest.

Dyadic filter bank frequency response magnitudes (Smith, 2011)
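To see the dyadic idea in action, here is a minimal base-R sketch of a single Haar DWT step applied twice (a toy signal of my own, not from the post); each application halves the frequency band covered by the remaining approximation:

# One level of the Haar DWT: split a signal into a coarse approximation
# (scaled pairwise sums) and a detail (scaled pairwise differences)
haar_dwt_step <- function(x) {
  odd  <- x[seq(1, length(x), by = 2)]
  even <- x[seq(2, length(x), by = 2)]
  list(approx = (odd + even) / sqrt(2),
       detail = (odd - even) / sqrt(2))
}

x <- c(4, 6, 10, 12, 8, 6, 5, 5)        # toy signal, length a power of two
level1 <- haar_dwt_step(x)
level2 <- haar_dwt_step(level1$approx)  # the next, coarser scale of the filter bank
level1$detail
level2$detail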

The correlation across time associated with the signal and each daughter wavelet can be plotted in a scalogram as shown below.

Mapping the wavelet scalogram (Shoeb & Clifford, 2006)

The wavelet coefficients indicate the correlation between the wavelet and the signal of interest at a specific time and scale. The squared amplitude of a wavelet coefficient, |W_i|^2, defines the wavelet power and can be used to create a wavelet power spectrum. A larger power corresponds to a higher correlation; therefore, the regions of high power in the spectrum correspond to areas of interest [5].

WaveletComp

We will use the WaveletComp library in R to demonstrate how to compute the wavelet power spectrum. Many wavelet libraries in Python and MATLAB work equivalently and just as simply, but I like WaveletComp because of the extensive supplement in the package repository, which contains many examples of how to use the functions. For this post, I took quarterly El Niño 3 Region (NINO3) SST surface anomalies from NOAA recorded over 1871-1996 and applied a single function from the package to create the spectrum.

library(WaveletComp)
my.w = analyze.wavelet(mydata, "Index",
                       loess.span = 0,
                       dt = 0.25, dj = 1/250,
                       lowerPeriod = 0.25,
                       upperPeriod = 32,
                       make.pval = TRUE, n.sim = 30)
wt.image(my.w, n.levels = 250,
         legend.params = list(lab = "wavelet power levels"))

Here we specify the name of the dataframe that contains the data (mydata), the column that contains the index ("Index"), the sampling resolution (dt = 0.25, i.e., four observations per year due to the quarterly data), and the upper and lower periods that bound the search area. The range of periods is represented as a series of octaves of 2^j, and each octave is divided into 250 sub-octaves through the dj term. The make.pval = TRUE argument draws a white line around the areas that are deemed significant. Finally, we plot the wavelet object using wt.image.
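If you want to verify what the function is doing before pointing it at real data, a quick sanity check is to feed it a synthetic series with a known period. The sketch below (my own toy example, not part of the original analysis) builds a noisy 4-year sinusoid at quarterly resolution; the resulting spectrum should show a band of high power around period 4.

library(WaveletComp)
t <- seq(0, 125, by = 0.25)   # ~125 "years" sampled quarterly
synthetic <- data.frame(Index = sin(2 * pi * t / 4) + rnorm(length(t), sd = 0.3))
w.synth <- analyze.wavelet(synthetic, "Index",
                           loess.span = 0, dt = 0.25, dj = 1/50,
                           lowerPeriod = 0.5, upperPeriod = 32,
                           make.pval = TRUE, n.sim = 10)
wt.image(w.synth, n.levels = 250,
         legend.params = list(lab = "wavelet power levels"))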

Wavelet Power Spectrum

As seen in the power spectrum, the highest power across the time series occurs within the 2-7 year period band, which matches the interval at which El Niño events tend to occur. Interestingly, the El Niño signal appears strongest in the earliest and latest parts of the record and is notably less prominent between about 1920 and 1960, which has been observed in other work (Torrence & Webster, 1999). The wavelet spectrum lets us see how the periodicity of the El Niño signal has changed over time, with longer periods observed around 1920 and shorter periods closer to the beginning and end of the record.

Because the wavelets are shifted across each time point in the convolution operation, the coefficients at both edges (within half of the wavelet duration at each scale) are not reliable. Shorter periods use shorter wavelet durations, so less of the edge is affected at those scales, creating the "cone of influence" shown by the white shading at the edges of the plot. The cone of influence designates the areas of the plot that should be disregarded.

Coherence is a measure of the common oscillations that two signals share. Generally, coherence or cross-correlation is used to assess similarity in the time or frequency domain, but if the two signals being compared are non-stationary, the correlation can change over time. Therefore, the coherence must be represented in a way that shows changes across both frequency and time. We can create such cross-correlation plots in WaveletComp as well. I extracted data supplied by MathWorks that contains a monthly Niño3 index along with an average All-India Rainfall Index [6]. In order to assess how these two time series are linked, we use the analyze.coherency function in WaveletComp.

my.wc = analyze.coherency(data,
my.pair = c("Nino_Index", "Rainfall"),
loess.span = 0,
dt = 1/12, dj = 1/50,
lowerPeriod = 0.25, upperPeriod = 32,
make.pval = TRUE, n.sim = 10)

wc.image(my.wc, n.levels = 250,
legend.params = list(lab = "cross-wavelet power levels"),
color.key = "interval",show.date=TRUE)

Cross-Wavelet Power Spectrum

The maximum correlation aligns well with the 2-7 year band observed in the earlier plot. The orientation of the arrows in the figure signifies the delay between the two signals at that time and period. The vertical-to-horizontal arrows within the significant areas denote a ¼ to ½ cycle delay, i.e., roughly 0.5-3.5 years between the El Niño SSTs observed off the coast of South America and their influence on rainfall over the Indian subcontinent. Pretty cool!

There is quite a bit of solid math and proofs that one should go through to truly understand the theory behind wavelets, but hopefully this post serves as a good introduction to show how wavelets can be useful for your respective analysis. When I first learned about wavelets and tried out WaveletComp, I immediately began conducting a wavelet analysis on every time series that I had on hand to look for hidden frequencies! I hope that this tutorial motivates you to explore wavelets for analyzing your non-stationary signals.

Sources:

Baker, J. W. “Quantitative Classification of Near-Fault Ground Motions Using Wavelet Analysis.” Bulletin of the Seismological Society of America, vol. 97, no. 5, 2007, pp. 1486–1501., doi:10.1785/0120060255.

Deshpande, Jaidev. “Limitations of the Fourier Transform: Need For a Data Driven Approach.” Limitations of the Fourier Transform: Need For a Data Driven Approach – Pyhht 0.0.1 Documentation, 2015, pyhht.readthedocs.io/en/latest/tutorials/limitations_fourier.html.

Marković, Dejan, et al. “Time-Frequency Analysis: FFT and Wavelets.” DSP Architecture Design Essentials, 2012, pp. 145–170., doi:10.1007/978-1-4419-9660-2_8.

Smith, Julius O. Spectral Audio Signal Processing. W3K, 2011.

Torrence, Christopher, and Peter J. Webster. “Interdecadal Changes in the ENSO–Monsoon System.” Journal of Climate, vol. 12, no. 8, 1999, pp. 2679–2690., doi:10.1175/1520-0442(1999)0122.0.co;2.

[5] “Wavelet Power Spectrum.” Wavelet Toolkit Architecture, Dartmouth, 16 June 2005, northstar-www.dartmouth.edu/doc/idl/html_6.2/Wavelet_Power_Spectrum.html.

[6] “Wcoherence.” MATLAB & Simulink Example, The MathWorks, Inc., 2020, http://www.mathworks.com/help/wavelet/examples/compare-time-frequency-content-in-signals-with-wavelet-coherence.html.

Spatial statistics (Part 2): Spatial Regression Models

Regression is one of the main techniques of data analysis. A regression model that can incorporate spatial dependency in a dependent variable is called a spatial regression model. It can be used as a simple surrogate model for prediction when data are not available for some locations, or for understanding the factors behind observed patterns. In this blog post, I am going to create a simple regression model for crop yield, check the residuals for signs of relationships with nearby areas, and try to remove the potential spatial dependencies in the residuals by applying a spatial regression model. Autocorrelation in the residuals is a sign that the underlying process being studied varies systematically across the study area; in this situation, the estimates of the fitted model are biased. Spatial regression models have applications in different fields such as agriculture (e.g., farm management, policy issues), natural sciences (e.g., species patterns), public health (e.g., air pollution), and social sciences (e.g., population forecasting).

The datasets that I am going to use (ww.* and WW_ave_hist.txt) are available here. The .txt file includes historical winter wheat yield for some locations (4*4 km grid cells) with distinct IDs, and the "ww" shapefile (which consists of 6 files) has additional information for each location, keyed on the same ID. We will merge these two files, apply linear regression, and check whether we can use some explanatory variables from our data (predictors) to explain the variation in yield (the dependent variable) across the region. We assume that we can predict yield by knowing annual potential evapotranspiration, precipitation, and available water in the soil profile.

library("rgdal")
setwd("---your path--- ")
Annual_var<- readOGR(".","ww") # This is  SpatialPolygonsDataFrame objects that brings the spatial representations of the polygons with the  data.
yield_ww<- read.table("---your path---/ww_ave_hist.txt",header = T)
yield<- merge(Annual_var,yield_ww,by="ID")
names (yield)
# fit the linear model
lm_yield <- lm(ww_ave_yield ~  ET_pot  +  precipitat +soil_water, data=yield) 
summary(lm_yield)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5058.6326   291.9022   17.33   <2e-16 ***
ET_pot        -5.0868     0.2147  -23.70   <2e-16 ***
precipitat    12.8215     0.1626   78.86   <2e-16 ***
soil_water    -5.9440     0.5295  -11.23   <2e-16 ***
Adjusted R-squared:  0.9035

After fitting a regression and checking that all the coefficients are statistically significant, one should check the residuals to make sure they are independent. If there is any spatial correlation, our regression's coefficient estimates could be wrong and cannot be assumed constant across the study area; the fitted model is then not a good representation of the dependent variable. From our fitted model, we can extract the residuals:

yield$residuals_lm <- residuals(lm_yield)

Here, we look at the spatial patterning in the distribution of the residuals:

library("magrittr")
library("ggplot2")
yield %>% as.data.frame %>% 
  ggplot(aes(LON, LAT)) + geom_tile(aes(fill=residuals_lm), alpha=3/4) + 
  ggtitle("") + coord_equal() + theme_bw()+scale_fill_gradientn(colours=c("black","yellow","red"),breaks=seq(-1660,1360,400)) 

Besides the visual inspection of the residuals, a more formal test is required to decide whether spatial autocorrelation is present. In the first part of this series, we used linkages based on physical distance to examine spatial autocorrelation. As before, we are going to create a list of neighbors, here using the Queen criterion. In my previous blog post, we calculated the local Moran's I statistic for the data themselves. Here, we want to apply the test to the residuals, so we need a function that takes into account that the variable under consideration is a residual from a regression. Also, in this example we look at the global rather than the local Moran's I, which considers feature locations and feature values simultaneously across the whole region.

library("spdep")
neighbor <- poly2nb(yield)
lw <- nb2listw(neighbor)
lm.morantest(lm_yield, lw) 
# 0.7908136752

The result shows a statistically significant value for Moran's I. We can also use the "neighbor" object to get the average residual of the neighbors of each location (grid cell), look at the correlation between these averages and the residuals, and create a scatter plot for visual inspection.

mean_function <- sapply(neighbor, function(x) mean(yield$residuals_lm[x]))
cor(yield$residuals_lm, mean_function) 
# 0.8847948
plot(yield$residuals_lm, mean_function, xlab='Residuals', ylab='Mean adjacent residuals')

We clearly see that the spatial dependencies in the residuals are significant. This means that, if we use this model, the predicted values will be systematically under- or overestimated in parts of the region. Therefore, we need a spatial autoregressive model that accounts for spatial dependencies. Two models are commonly used as spatial regression models: the spatially lagged model and the spatial error model. The first uses a spatial lag variable that averages the neighboring values for each location and accounts for autocorrelation through a weights matrix. In the second, the spatial dependence is handled through the error term rather than through the systematic component of the model. To decide whether to fit the spatial error or the lagged model, the Lagrange Multiplier (LM) test is used to distinguish which is more appropriate. The R function lm.LMtests() performs this test by considering these statistics: the LM test for error dependence (LMerr) and for a spatially lagged dependent variable (LMlag), as well as their robust forms (RLMerr and RLMlag; e.g., RLMerr tests for spatially autocorrelated residuals in the possible presence of an omitted lagged dependent variable).

 lm.LMtests(lm_yield, lw, test = c("LMerr","LMlag","RLMerr","RLMlag"))
#LMerr = 2142.1, df = 1, p-value < 2.2e-16
#LMlag = 1025, df = 1, p-value < 2.2e-16
#RLMerr = 1121.8, df = 1, p-value < 2.2e-16
#RLMlag = 4.6952, df = 1, p-value = 0.03025

Since both LMerr and LMlag have significant p-values, we compare the p-values of the robust forms RLMerr and RLMlag. In doing so, it can be seen that RLMerr is significant. Therefore, the LM test suggests that we should run a spatial error model.

library("spatialreg")  # errorsarlm() and lagsarlm() are provided by spatialreg in current R releases (formerly spdep)
fit_err <- errorsarlm(ww_ave_yield ~ ET_pot + precipitat + soil_water, data = yield, lw)
# lagsarlm() is the corresponding function for a spatial lag model
summary(fit_err)
#Lambda: 0.93674, LR test value: 1912.7, p-value: < 2.22e-16
#AIC: 15166, (AIC for lm: 17077)

The results show that the likelihood ratio (LR) test is highly significant (p-value < 2.22e-16), which is further evidence that the spatial error model is a good fit. Also, the Akaike Information Criterion (AIC, an estimate of out-of-sample prediction error and therefore of the relative quality of statistical models for a given dataset) for this new model is 15166, a better fit than the original linear model with no spatial error dependencies (AIC of 17077).

We can now check the residuals of this new model and take a look at the Moran's I statistic one more time. Note that we previously used a Moran's I test designed for residuals from an estimated linear model. Now we no longer have a linear model, so we can use a permutation test for the Moran's I statistic: the moran.mc() function uses random permutations of x for the given spatial weighting scheme. The residuals graph and the Moran's I statistic both show that there is no correlation left in the residuals:

yield$residuals_error_model <- residuals(fit_err)
mean_function_error_model <- sapply(neighbor, function(x) mean(yield$residuals_error_model[x]))
cor(yield$residuals_error_model, mean_function_error_model)
plot(yield$residuals_error_model, mean_function_error_model, xlab='Residuals', ylab='Mean adjacent residuals')
moran.mc(yield$residuals_error_model, lw, 1000)   
# 1000 is a number of permutations

If we had used the spatially lagged model instead, the residuals would still show some correlation, as the sketch below illustrates.
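As a rough check of that claim, this sketch (reusing the yield data, neighbor list, and lw weights from above) fits the spatial lag model with lagsarlm() and recomputes the residual diagnostics; the residual correlation with the neighboring means should remain noticeably above zero.

fit_lag <- lagsarlm(ww_ave_yield ~ ET_pot + precipitat + soil_water, data = yield, lw)
yield$residuals_lag_model <- residuals(fit_lag)
mean_function_lag <- sapply(neighbor, function(x) mean(yield$residuals_lag_model[x]))
cor(yield$residuals_lag_model, mean_function_lag)
plot(yield$residuals_lag_model, mean_function_lag, xlab = 'Residuals', ylab = 'Mean adjacent residuals')
moran.mc(yield$residuals_lag_model, lw, 1000)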

References

Srinivasan, S., 2008. Spatial Regression Models, in: Shekhar, S., Xiong, H. (Eds.), Encyclopedia of GIS. Springer US, Boston, MA, pp. 1102–1105. https://doi.org/10.1007/978-0-387-35973-1_1294

https://rspatial.org/raster/analysis/index.html

Deeper Dive into Self-Organizing Maps (SOMs)

This blog post focuses on Self-Organizing Maps (SOMs) or Kohonen maps, first created by Teuvo Kohonen in the 1980s, and most recently introduced in Dave Gold’s latest blog post. SOMs are a type of artificial neural network (ANN) that are trained using unsupervised learning to produce a two-dimensional pattern representation of higher-dimensional datasets. SOMs are therefore a good algorithm for dimension reduction, and inputs that are closer in their higher-dimensional input vectors are also mapped closely together in the resulting 2D representation [1].

User-defined neuron grid that represents the basis of SOM  [1]

SOMs can be very effective for clustering/classifying outputs and are advantageous because they "self-organize" to identify clusters. The SOM is first initialized with a user-defined 2D grid of neurons instead of containing layers like traditional ANNs. The weight vector that characterizes each neuron can be initialized randomly or, more favorably, by sampling across the eigenvectors of the two largest principal components of the input dataset. The training process that SOMs use to build maps is called "competitive learning", which is also fundamentally different from the backpropagation used for ANNs. During competitive learning, an input training point is fed into the algorithm, its Euclidean distance from each of the neurons is calculated, and the closest neuron is termed the best matching unit, or BMU. The weights of that neuron and its surrounding neurons are adjusted to be closer to the input vector, using an update formula. This process is repeated for every input in the training data and for a large number of iterations until patterns and clusters begin to emerge [3]. The animation below illustrates competitive learning. Note how the locations of neurons are adjusted as training points get assigned to their respective BMUs and how the whole grid shifts in response to the new samples.

Competitive learning [2]
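To make the update rule concrete, here is a minimal sketch in plain R (independent of the kohonen package; the learning rate and tiny grid are arbitrary toy choices of mine) of a single competitive-learning step:

set.seed(1)
neurons <- matrix(runif(10), nrow = 5, ncol = 2)   # 5 neurons with 2-D weight vectors
x  <- c(0.8, 0.2)                                  # one training input
lr <- 0.5                                          # learning rate

# Best matching unit: the neuron closest to x in Euclidean distance
d   <- sqrt(rowSums((neurons - matrix(x, 5, 2, byrow = TRUE))^2))
bmu <- which.min(d)

# Pull the BMU toward the input, and its neighbors by a smaller amount
neurons[bmu, ] <- neurons[bmu, ] + lr * (x - neurons[bmu, ])
for (nb in c(bmu - 1, bmu + 1)) {
  if (nb >= 1 && nb <= nrow(neurons)) {
    neurons[nb, ] <- neurons[nb, ] + 0.5 * lr * (x - neurons[nb, ])
  }
}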

At a high level, this is how SOMs are trained. Next, we will go through a clustering example using the kohonen package offered in R.

Test Case: Maryland Trout

For this test case, we will be using a presence/absence dataset of Maryland trout. Each data point is characterized by a feature vector that includes information on agricultural land cover, logarithmic distance to the nearest road, riffle quality, dissolved oxygen, and spring water temperature. The label associated with each point is a 0 or 1, denoting whether trout were observed or not. We will train a SOM to cluster the feature vectors and then investigate the characteristics of the resulting clusters. Below is a snippet of our dataset; we use columns 3-7 as our feature vector, while column 2 denotes our label. SOMs can also be used for classification, which is not discussed in this blog post; in that case, you would train the SOM on only a subset of the dataset and use the trained SOM to predict the clusters that the testing points are assigned to.


First we must scale our data to make sure that variables in our feature vector that have different magnitudes are treated equally by the algorithm. Then we define the SOM grid. The choice of the number of neurons generally follows the rule of thumb 5*sqrt(# of datapoints); in our case we have 84 datapoints, so a grid of roughly 7×7 or smaller should work well. I decided to use 5×5, for reasons explained in the next paragraph, and I specify that I want the grid to be hexagonal. Finally, we train our model by feeding in the input data and the user-defined grid and specifying rlen = 1000, which means that the input data will be presented to the algorithm 1000 times.

require(kohonen)
data = read.csv("Maryland.Trout.csv")
# Create the training dataset
data_train <- data[, c(3:7)]
# Change the data frame with training data to a matrix
# Also center and scale all variables to give them equal importance during
# the SOM training process.
data_train_matrix <- as.matrix(scale(data_train))
# Create the SOM grid
som_grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")
# Finally, train the SOM
som_model <- som(data_train_matrix,
                 grid = som_grid, rlen = 1000,
                 keep.data = TRUE)

Once the SOM is trained, there are a variety of figures that can be easily created. We can plot the training progress and observe how the distance to the BMU decreases as a function of iteration and starts to level off as we reach many iterations. This is ideal behavior because it means that the training points are associating closely with their BMU.

plot(som_model, type="changes")
Average distance from a test point to its BMU

We can also plot a heat map of the node counts, which shows how many of the training inputs are associated with each node. Any node that is grey has no training points assigned to that BMU. This may indicate that the map is too big, so you can go back and adjust the grid size until there are no more grey nodes; for this dataset, that was around 5×5. Furthermore, we want the number of points associated with each node to be as uniform as possible.

plot(som_model, type="count", main="Node Counts")

Next we can view the neighbor distances for each neuron, which is also called the U-matrix. Small distances indicate similarity between neighboring nodes.

plot(som_model, type="dist.neighbours", main = "SOM neighbor distances")

We can also view the weight vectors of the nodes, which are termed codes. The fan in each node indicates the prominent variables linking the datapoints assigned to that neuron. In our dataset, you can see that in the upper left corner of the map, the algorithm is assigning datapoints to these neurons because of similarity in the agricultural area and water temperature variables of the feature vector. In the lower right corner, the datapoints are linked primarily by similarities in riffle quality and dissolved oxygen. This type of map is useful for understanding which features the points are actually clustering on.

plot(som_model, type="codes")

In my opinion, one of the most useful standard figures that the kohonen library creates are heatmaps that show the gradient of input values across the nodes. A heatmap can be created for each feature and when placed side by side, they can help demonstrate correlation between inputs. We can clearly see that areas characterized by low agricultural land tend to have a better riffle quality, lower water temperature, and a higher dissolved oxygen. While these patterns are expected and have a strong physical explanation, these maps can also help to discover other unexpected patterns. Note that the values on the side bar are normalized and can be transformed back to the original scale.

 

library(viridis)  # the magma() palette used below is assumed to come from the viridis package

# Plot heatmaps for each feature
plot(som_model, type = "property", property = getCodes(som_model)[,1], main = colnames(getCodes(som_model))[1], palette.name = magma)
plot(som_model, type = "property", property = getCodes(som_model)[,2], main = colnames(getCodes(som_model))[2], palette.name = magma)
plot(som_model, type = "property", property = getCodes(som_model)[,3], main = colnames(getCodes(som_model))[3], palette.name = magma)
plot(som_model, type = "property", property = getCodes(som_model)[,4], main = colnames(getCodes(som_model))[4], palette.name = magma)
plot(som_model, type = "property", property = getCodes(som_model)[,5], main = colnames(getCodes(som_model))[5], palette.name = magma)
# Unscale the variables - do this for each input column in data_train
var_unscaled <- aggregate(as.numeric(data_train[,1]), by = list(som_model$unit.classif), FUN = mean, simplify = TRUE)[,2]
plot(som_model, type = "property", property = var_unscaled, main = colnames(getCodes(som_model))[1], palette.name = magma)

Finally, we can cluster the neuron codes (weight vectors) to add some boundaries to our map. Typically, k-means clustering is used to determine an appropriate number of clusters that, in this case, minimizes the within-cluster sum of squares (WCSS). We can then plot the results and look for an elbow in the scree plot.

 

# k-means clustering code from [4]
mydata = som_model$codes[[1]]
wss = (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) {
  wss[i] = sum(kmeans(mydata, centers = i)$withinss)
}
par(mar = c(5.1,4.1,4.1,2.1))
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares", main = "Within cluster sum of squares (WCSS)")

The elbow isn’t so clear in this dataset, but I go with 8 clusters and then use hierarchical clustering to assign each neuron to a cluster. Finally, we can obtain the cluster that each point is assigned to and insert that back as a column in our dataset.

# Final clustering code from [4]
# Define a palette and cluster with the number of clusters determined using k-means
# (an eighth color is added here so that all 8 clusters get a distinct color)
pretty_palette <- c("#1f77b4", '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f')
som_cluster <- cutree(hclust(dist(som_model$codes[[1]])), 8)
# plot these results:
plot(som_model, type = "mapping", bgcol = pretty_palette[som_cluster], main = "Clusters")
add.cluster.boundaries(som_model, som_cluster)

# get vector with cluster value for each original data sample
cluster_assignment <- som_cluster[som_model$unit.classif]
# for ease of analysis, add the assignment as a column in the original data:
data$cluster <- cluster_assignment


Post-Processing

We’re now interested in understanding which of our points actually clustered together, so we generate summary statistics for each cluster. The points highlighted in blue represent the best values for the feature while the points in red represent the worst. Note that the clusters with a large amount of trout presence on average are characterized by best values of features that are favorable for aquatic life, while the clusters that exhibit a lower presence of trout do not have good values for these features. You can also see that clusters with similar characteristics are side by side on the clustering map.


Overall, I found the kohonen library to be a really useful tool to start to internalize how SOMs work and to fairly quickly be able to do a nice analysis on both the features and the resulting clusters. I would highly recommend this package as a starting point. There are packages in Python such as SimpSom and minisom that have snappier graphics and will be a good next step to explore.

References

[1]https://towardsdatascience.com/self-organizing-maps-ff5853a118d4

[2] By Chompinha – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=77822988

[3]http://fourier.eng.hmc.edu/e161/lectures/som/node2.html

[4] Much direction for structuring this tutorial comes from https://www.shanelynn.ie/self-organising-maps-for-customer-segmentation-using-r/

Spatial statistics (Part 1): Spatial Autocorrelation

Exploratory spatial data analysis and statistics help us interpret maps more efficiently by finding trends, enabling pattern mining in space and time, identifying spatial outliers, etc. This information beyond the maps helps us understand the characteristics of places and phenomena and the relationships between them; therefore, we can use it for prediction and decision-making.

For analyzing spatial data, many tools and libraries are available, such as ArcGIS, GDAL, etc., that handle raster and vector geospatial data formats. In my next blog posts, I am going to introduce some types of spatial statistics using different libraries in R. In this post, I am going to explore autocorrelation in the county-level 10-year average (2006-2016) of total potential evapotranspiration (ET) from March to the end of August. These values are aggregated from 4*4 km grid cells for each county across the US and can be downloaded from here. Applying spatial autocorrelation explores whether the potential ET of counties near each other is more similar. In other words, it measures how distance affects our variable of interest. The existence of autocorrelation in data might lead to incorrect statistical inferences in some spatial statistical analyses, so it is important to test for spatial autocorrelation.

First, we are going to inspect the distribution of the data. To read the spatial data (shapefile), we need to use the “rgdal” library. Set your working directory to the path where you downloaded the data.

install.packages("rgdal")
library(“rgdal”)
setwd("---your path---")
ET<- readOGR(".","US_ETp_County")

We can see the column names with names(ET) and the coordinate system of the data with crs(ET) (from the "raster" package). To plot the spatial objects, I am using spplot() from the "sp" library, which provides specialized plot methods for spatial data with attributes. The first argument is an object of a Spatial class, which has coordinates, and the second argument, "zcol", is the attribute name or column number in the attribute table associated with the data.

library("sp")
spplot(ET, zcol='MEAN_ETp_n',col.regions = topo.colors(20), main="Average Potential ET during March-August (mm)") 

We can also use quantiles in order to better distinguish the distribution of data and potentially dilute the effect of outliers. To do this we need another library:

library("classInt")
breaks_quantile <- classIntervals(ET$MEAN_ETp_n, n = 10, style = "quantile")
breaks <- breaks_quantile $brks 

ET$MEAN_ETp_qun<- cut(ET$MEAN_ETp_n, breaks)
p2<- spplot(ET, "MEAN_ETp_qun", col.regions = topo.colors(20), main = "Average Potential ET during March-August (mm)_Quantile")

As shown in both maps, the counties in the southern half of the US have higher potential ET during March-August. To explore the spatial autocorrelation we need to define the neighbors of counties.

install.packages("spdep ")
library(“spdep”)
neighbor<- poly2nb(ET,queen = FALSE)
#With “queen” option we define if a single shared boundary point meets the contiguity condition (queen =TRUE), or more than one shared point is required (queen =FALSE).
plot(ET, border = 'lightgrey')
plot(neighbor, coordinates(ET), add=TRUE, col='red')

Now, we are going to compute local spatial autocorrelation, which explores spatial clustering across the US. First, we need to convert the neighbor data to a listw object. The nb2listw() function supplements a neighbors list with spatial weights for the chosen coding scheme. Here, I chose "W", which is row standardized (sums over all links to n).

lw <- nb2listw(neighbor, style = "W")

To measure the autocorrelation, a Moran's I test is used. The local Moran's I statistic estimates a correlation for each county based on the spatial weights object used. It is a statistical way to identify potential local clusters and spatial outliers. A positive index indicates that a feature has neighboring features with similarly high or low attribute values and may be part of a cluster; a negative value indicates a likely outlier feature.

local_moran <- localmoran(x = ET$MEAN_ETp_n, listw = lw)   # lw was created above with nb2listw()

From this we get several statistics: Ii, the local Moran statistic; E.Ii, its expectation; Var.Ii, its variance; Z.Ii, its standard deviate; and Pr(), the p-value of the local Moran statistic. Now we can add these results to our shapefile (ET) and plot the local Moran statistic. I am going to add the US_States boundary to the map.

moran_local_map <- cbind(ET, local_moran)
States <- readOGR(".", "US_States")
spplot(moran_local_map, zcol = 'Ii', col.regions = topo.colors(20), main = "Local Moran's I statistic for Potential ET (March-August)", sp.layout = list("sp.polygons", States, lwd = 3, first = FALSE))

Again, we can present this data with quantiles:

breaks_quantile_new <- classIntervals(moran_local_map$Ii, n = 10, style = "quantile")
breaks_new <- breaks_quantile_new$brks
moran_local_map$Ii_qun <- cut(moran_local_map$Ii, breaks_new)
spplot(moran_local_map, zcol = 'Ii_qun', col.regions = topo.colors(20), main = "Local Moran's I statistic for Potential ET (March-August)_Quantile", sp.layout = list("sp.polygons", States, lwd = 3, first = FALSE))

Since the indices include negative and zero values and there is large variation among the positive values, I modified the breaks manually in order to be able to see the variation and potential clustering at the higher Moran's I values:

breaks_fix<- classIntervals(moran_local_map$Ii, n=17, style="fixed",fixedBreaks=c(-0.2, -0.0001, 0.0001, 0.2, 0.4,0.6,0.8,1,4,5,6,7,8,9,10,11,12,13))
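The corresponding plotting call is not shown in the post, but following the same pattern as the quantile map above, it would look roughly like this (Ii_fix is a hypothetical column name of my own):

moran_local_map$Ii_fix <- cut(moran_local_map$Ii, breaks_fix$brks)
spplot(moran_local_map, zcol = 'Ii_fix', col.regions = topo.colors(20), main = "Local Moran's I statistic for Potential ET (March-August)_Fixed breaks", sp.layout = list("sp.polygons", States, lwd = 3, first = FALSE))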

Note that the p-value should be small enough (statistically significant) for the feature to be considered as part of a cluster.
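As a sketch of how such a significance filter could be applied (the p-values sit in the fifth column of the localmoran() output; the 0.05 cutoff and the sig_Ii column name are my own choices):

# Keep the local Moran's I value only where the p-value is below 0.05
moran_local_map$sig_Ii <- ifelse(local_moran[, 5] < 0.05, moran_local_map$Ii, NA)
spplot(moran_local_map, zcol = 'sig_Ii', col.regions = topo.colors(20), main = "Statistically significant local Moran's I (p < 0.05)")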

References

Anselin, L. 1995. Local indicators of spatial association. Geographical Analysis, 27, 93–115.

Getis, A. and Ord, J. K. 1996. Local spatial statistics: an overview. In P. Longley and M. Batty (eds) Spatial Analysis: Modelling in a GIS Environment (Cambridge: GeoInformation International), 261–277.

Sokal, R. R., Oden, N. L. and Thomson, B. A. 1998. Local Spatial Autocorrelation in a Biological Model. Geographical Analysis, 30, 331–354.

Bivand, R. S. and Wong, D. W. S. 2018. Comparing implementations of global and local indicators of spatial association. TEST, 27(3), 716–748. https://doi.org/10.1007/s11749-018-0599-x

Packages for Hydrological Data Retrieval and Statistical Analysis

I recently searched for publicly available code to run statistical analyses that are commonly employed to evaluate (eco)hydrology datasets. My interests included data retrieval, data visualization, outlier detection, and regressions applied to streamflow, water quality, weather/climate, and land use data. This post serves as a time snapshot of available code resources.

Package Languages

I focused on packages in R and Python. R seems to have more data analysis and statistical modeling packages, whereas Python seems to have more physical hydrological modeling packages. As a result of my search interests, this post highlights the available R packages. Here are good package lists for both languages:

Hopefully, the time required to prepare your data and learn how to use these functions is less than the time to develop similar functions yourself. Most packages have vignettes (tutorials) that should help to learn the functions, and may also serve as teaching resources. In R, the function browseVignettes(‘packageName’) will open a webpage containing links to vignette PDFs (if available) and source code resources for the specified package. The datasets.load package in R is useful to discover if a specified package has sample datasets available to use.
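For example, using only base-R utilities (dataRetrieval is simply the package I happen to query here):

browseVignettes("dataRetrieval")     # opens an HTML page listing the package's vignettes
vignette(package = "dataRetrieval")  # lists vignette names in the console
data(package = "dataRetrieval")      # lists any datasets bundled with the package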

Example webpage output from browseVignettes(‘smwrQW’):

USGS R Packages

R is currently the primary development language for hydrological statistical analyses within the United States Geological Survey (USGS). There are nearly 100 USGS-R Github repositories, from which I’ve selected a subset that seem to be actively maintained and applicable to the interests listed above. I’m not sure if similar functions as the ones contained within these repositories are available and tested in other programming languages. These USGS-R packages have test functions, which provides a baseline level of confidence in their development. Many packages are contributed or maintained by Laura DeCicco, and by David Lorenz for statistical analysis and water quality.

Selected USGS R Packages

  • dataRetrieval for streamflow and water quality gauge/site data retrieval from the National Water Information System (NWIS) and the Water Quality Portal (WQP). This package has several great vignettes that demonstrate how to use the functions. The package is simple to use, but the data require processing to be useful for research (I'm currently developing a code repository for this purpose); a short retrieval example follows this list.
  • Statistical Methods in Water Resources (smwr) packages; all have vignettes:
    • smwrBase: Generic hydrology data processing tools
    • smwrStats: Statistical analysis and probability functions, and a possible companion book by Helsel and Hirsch (2002)

  • WREG: Predictions in ungauged basins w/ OLS, GLS, WLS
  • nsRFA: Regional Frequency Analysis, focusing on the index-value method
  • DVstats: Daily flow statistical analysis and visualization
  • HydroTools: Seems useful based on the function names, but no documentation outside of the functions themselves. Seems to be under development.
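As a quick illustration of the dataRetrieval workflow mentioned above, the sketch below pulls daily mean discharge for one NWIS gauge; the site number and date range are placeholders of my own, not values from this post.

library(dataRetrieval)
# Daily mean discharge (USGS parameter code 00060) for an example gauge
q <- readNWISdv(siteNumbers = "01491000", parameterCd = "00060",
                startDate = "2010-01-01", endDate = "2015-12-31")
q <- renameNWISColumns(q)   # friendlier column names (e.g., Flow)
summary(q$Flow)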

 

Additional R Hydrology Packages

  • hydroTSM: Excellent graphics package for time series data. Two examples below of one-line plot functions and summary statistics.
  • fasstr: hydrological time series analysis with a quick-start Cheat Sheet
  • FlowScreen: temporal trends and changepoint analysis
  • hydrostats: Computation of daily streamflow metrics (e.g. low flows, high flows, seasonality)
  • FAdist: Common probability distributions for frequency analysis. These distributions are not available in base R.
  • lmom and lmomRFA: L-moments for common probability distributions, and regional frequency analysis.
  • nhdR: National Hydrography Dataset downloads

 

Water Quality

 

Streamflow & Weather Generators

 

Weather Station Data

  • rnoaa: NOAA station data downloads using R (tip: the meteo_tidy_ghcnd() function provides nicely formatted year-month-day datasets)
    • GHCNpy: similar package in Python for GHCN-Daily records. Also has visualization functions.
  • countyweather: US County-level weather data
  • getMet: Seems like a SWAT model companion for weather data
  • meteo: Data Analysis and Visualization

 

Climate Assessment

  • musica: Climate model assessment tools

 

Hydrological Models in R:

  • topmodel: Topmodel for flow modeling
  • swmmr: SWMM model for stormwater
  • swatmodel: SWAT model for ecohydrological studies

 

Land Use / Land Cover Change

  • lulcc: Land Use and Land Cover Change modeling

 

If you know of other good packages, please add them to the comments or write to me so that I can add them to the lists.

Intro to Boosting

Introduction

Once upon a time, in a machine learning class project, student Michael Kearns asked if it is possible to convert a weak classifier (high bias classifier whose outputs are correct only slightly over 50% of the time) into a classifier of arbitrary accuracy by using an ensemble of such classifiers. This question was posed in 1988 and, two years later, in 1990, Robert Schapire answered that it is possible (Schapire, 1990). And so boosting was born.

The idea of boosting is to train an ensemble of weak classifiers, each of which is an expert in a specific region of the space. The ensemble of classifiers has the form below:
H(\vec{x}) = \sum_{t=1}^T \alpha_th_t(\vec{x})
where H is the ensemble of classifiers, \alpha_t is the weight assigned to weak classifier h_t around samples \vec{x} at iteration t, and T is the number of classifiers in the ensemble.

Boosting creates such an ensemble in a similar fashion to gradient descent. However, instead of:
\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \alpha \nabla \ell(\vec{x}_t)
as in gradient descent in real space, where \ell is a loss function, Boosting is trained via gradient descent in functional space, so that:
H_{t+1}(\boldsymbol{X}) = H_t(\boldsymbol{X}) + \alpha_t h_t(\boldsymbol{X})
where each new weak classifier h_t is chosen to point along the negative functional gradient -\nabla \ell(H_t(\boldsymbol{X})) of the loss.

The question then becomes how to find the \alpha_t's and the classifiers h_t. Before answering these questions, we should get a geometric intuition for Boosting first. The presentation in the following sections is based on these course notes.

Geometric Intuition

In the example below we have a set of blue crosses and red circles that we would like our ensemble of weak classifiers to classify correctly (panel "a"). Our weak classifier for this example will be a line, orthogonal to one of the axes, dividing a blue-cross region from a red-circle region. Such a classifier can also be called a CART tree (Breiman et al., 1984) with a depth of 1, hereafter called a tree stump. For now, let's assume all tree stumps will have the same weight \alpha_t in the final ensemble.


The first tree stump, a horizontal divide in panel "b," classified ten out of thirteen points correctly but failed on the remaining three. Since it incorrectly classified a few points in this attempt, we would like the next classifier to classify these points correctly. To make sure that is the case, boosting increases the weights of the points that were misclassified before training the new tree stump. The second tree stump, a vertical line in panel "c," correctly classifies the two blue crosses that were originally misclassified, although it now incorrectly classifies three crosses that were originally classified correctly. For the third classifier, Boosting will increase the weights of the three bottom misclassified crosses, as well as of the other two misclassified crosses and the circle, because they are still not correctly classified (technically they are tied, since the two classifiers disagree on them, but we treat that here as a wrong classification). The third iteration will then prioritize correcting the high-weight points again and ends up as the vertical line on the right of panel "d." Now, all points are correctly classified.

There are different ways to mathematically approach Boosting. But before getting to Boosting, it is a good idea to go over gradient descent, which is a basis of Boosting. Following that, this post will cover, as an example, the AdaBoost algorithm, which assumes an exponential loss function.

Minimizing a Loss Function

Standard gradient descent

Before getting into Boosting per se, it is worth going over the standard gradient descent algorithm. Gradient descent is a minimization algorithm (hence, descent). The goal is to move from an initial x_0 to the value of x with the minimum value of f(x), which in machine learning is a loss function, henceforth called \ell(x). Gradient descent does that by taking one step of length s at a time, starting from x_0, in the direction of the steepest downhill slope at location x_t, the value of x at the t^{th} iteration. This idea is formalized by the Taylor series expansion below:
\ell(x_{t + 1}) = \ell(x_t + s) \approx \ell(x_t) + sg(x_t)
where g(x_t)=\nabla\ell(x_t). Stepping against the gradient means choosing
s=-\alpha g(x_t)
which leads to:
\ell(x_{t+1}) = \ell\left[x_t -\alpha g(x_t)\right] \approx \ell(x_t)-\alpha g(x_t)^Tg(x_t)
where \alpha, called the learning rate, must be positive and can be set as a fixed parameter. The dot product in the last term, g(x_t)^Tg(x_t), will always be positive, which means that the loss should always decrease; the caveat is that too-high values of \alpha may make the algorithm diverge, so small values around 0.1 are recommended.
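As a tiny numerical illustration of these updates (a toy quadratic loss in R; the loss, starting point, and learning rate are arbitrary choices of mine):

loss <- function(x) (x - 3)^2   # minimum at x = 3
grad <- function(x) 2 * (x - 3)

x <- 0        # x_0
alpha <- 0.1  # learning rate
for (t in 1:50) x <- x - alpha * grad(x)
x             # close to 3 after 50 steps
loss(x)       # close to 0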

Gradient Descent in Functional Space

What if x is actually a function instead of a real number? This would mean that the loss function \ell(\cdot) is a function of a function, say \ell(H(\boldsymbol{x})), instead of a function of a real number x. As mentioned, gradient descent in real space works by adding small quantities to x_0 to find the final x_{min}, which amounts to an ensemble of small \Delta x's added together. By analogy, gradient descent in functional space works by adding functions to an ever-growing ensemble of functions. Using the definition of a functional gradient, which is beyond the scope of this post, this leads us to:
\ell(H+\alpha h) \approx \ell(H) + \alpha \langle\nabla \ell(H),h\rangle
where H is an ensemble of functions, h is a single function, and the \langle f, g\rangle notation denotes a dot product between f and g. Gradient descent in function space is an important component of Boosting. Based on that, the next section will talk about AdaBoost, a specific Boosting algorithm.

AdaBoost

Basic Definitions

The goal of AdaBoost is to find an ensemble function H of weak classifiers h that minimizes the exponential loss function below for a binary classification problem:
\ell(H)=\sum_{i=1}^ne^{-y(x_i)H(x_i)}
where (x_i, y(x_i)) is the i^{th} data point in the training set. The step size \alpha can be interpreted as the weight of each classifier in the ensemble, and it is optimized for each function h added to the ensemble. AdaBoost is an algorithm for binary classification, meaning the independent variables \boldsymbol{x} have a corresponding vector of dependent variables \boldsymbol{y}, in which each y(x_i) \in \{-1, 1\} is the classification of the corresponding point, with -1 and 1 representing the two classes to which a point may belong (say, -1 for red circles and 1 for blue crosses). The weak classifiers h in AdaBoost also return h(x) \in \{-1, 1\}.

Setting the Weights of Each Classifier

The weight \alpha_t of each weak classifier h can be found by performing the following minimization:
\alpha=argmin_{\alpha}\ell(H+\alpha h)
Since the loss function is defined as the summation of the exponential loss of each point in the training set, the minimization problem above can be expanded into:
\alpha=argmin_{\alpha}\sum_{i=1}^ne^{-y(\boldsymbol{x}_i)\left[H(\boldsymbol{x}_i)+\alpha h(\boldsymbol{x}_i)\right]}
Differentiating the error w.r.t. \alpha, equating it with zero and performing a few steps of algebra leads to:
\alpha = \frac{1}{2}ln\frac{1-\epsilon}{\epsilon}
where \epsilon is the classification error of weak classifier h_t. The error \epsilon can be calculated for AdaBoost as shown next.

The Classification Error of one Weak Classifier

Following from gradient descent in functional space, the next best classifier h_{t+1} will be the one that minimizes the term \langle\nabla \ell(H),h\rangle; when this term is zero, the zero-slope point has been reached, denoting the minimum value of the loss function for a convex problem such as this. Replacing the dot product by a summation, this minimization problem can be written as:
h(\boldsymbol{x}_i)=argmin_h\epsilon=argmin_h\langle\nabla \ell(H),h\rangle
h(\boldsymbol{x}_i)=argmin_h\sum_{i=1}^n\frac{\partial e^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}}{\partial H(\boldsymbol{x}_i)}h(\boldsymbol{x}_i)
which after some algebra becomes:
h(\boldsymbol{x}_i)=argmin_h\sum_{i:h(\boldsymbol{x}_i)\neq y(\boldsymbol{x}_i)}w_i
(the summation of the weights of misclassified points)

Comparing the last with the first expression, we have that the error \epsilon for iteration t is simply the summation of the weights of the points misclassified by h(\boldsymbol{x}_i)_t — e.g., in panel “b” the error would be the summation of the weights of the two crosses in the upper left corner and of the circle in the bottom right corner. Now let’s get to these weights.

The Weights of the Points

There are multiple ways we can think of for setting the weights of each data point at each iteration. Schapire (1990) found a great way of doing so:
\boldsymbol{w}_{t+1}=\frac{\boldsymbol{w}_{t}}{Z}e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}
where \boldsymbol{w} is a vector containing the weights of each point and Z is a normalization factor that ensures the weights sum up to 1. Be sure not to confuse \alpha_t, the weight of classifier t, with the weights of the points at iteration t, represented by \boldsymbol{w}_t. For the weights to sum up to 1, Z needs to be the sum of their pre-normalization values, which is actually identical to the loss function, so
Z=\sum_{i=1}^ne^{-y(\boldsymbol{x}_i)H(\boldsymbol{x}_i)}
Using the definition of the error \epsilon, the update for Z can be shown to be:
Z_{t+1}=Z_t\cdot2\sqrt{\epsilon(1-\epsilon)}
so that the complete update is:
w_{t+1}=w_t \frac{e^{-\alpha_th_t(\boldsymbol{x})y(\boldsymbol{x})}}{2\sqrt{\epsilon(1-\epsilon)}}
Now AdaBoost is ready for implementation following the pseudo-code below.

AdaBoost Pseudo-code

Below is a pseudo-code of AdaBoost. Note that it can be used with any weak-learner (high-bias) classifier. Again, shallow decision trees are a common choice for their simplicity and good performance.

Boosting_algorithm.png
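The pseudo-code can be translated almost line by line into Python. The sketch below is an illustrative implementation of the update rules derived above — not the post's original code — using scikit-learn decision stumps as the weak classifiers h_t (labels are assumed to be in {-1, +1}):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Train an ensemble of T decision stumps; y must be a NumPy array in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # point weights, initially uniform (1/n)
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)  # weak classifier h_t: a decision stump
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.sum(w[pred != y])               # weighted error of this stump
        alpha = 0.5 * np.log((1.0 - eps) / (eps + 1e-10))   # classifier weight alpha_t
        w = w * np.exp(-alpha * y * pred)        # increase weights of misclassified points
        w = w / w.sum()                          # normalize (the Z factor)
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Sign of the weighted vote H(x) = sum_t alpha_t * h_t(x)."""
    H = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.sign(H)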

Boosting Will Converge, and Fast

One fascinating fact about boosting is that any boosted algorithm is guaranteed to converge independently of which weak classifier is chosen. Recall that Z, the normalization factor for the point-weights update, equals the loss function. That being the case, we get the following relation:
\ell(H)=Z=n\prod_{t=1}^T2\sqrt{\epsilon_t(1-\epsilon_t)}
where n is the normalizing factor for all weights at step 0 (all of the weights are initially set to 1/n). To derive an expression for the upper bound of the error, let’s assume that the errors at all steps t equal their highest value, \epsilon_{max}. We have that:
\ell(H)\leq n\left[2\sqrt{\epsilon_{max}(1-\epsilon_{max})}\right]^T
Given that necessarily \epsilon_{max} \leq \frac{1}{2}, we have that
\epsilon_{max}(1-\epsilon_{max})\leq\frac{1}{4}
or
\epsilon_{max}(1-\epsilon_{max})=\frac{1}{4}-\gamma^2
for some \gamma in the interval \left[-\frac{1}{2},\frac{1}{2}\right]. Therefore, substituting the equation above into the loss inequality written as a function of \epsilon_{max}, we have that:
\ell(H)\leq n(1-4\gamma^2)^{T/2}
which means that the training error is bound by an exponential decay as you add classifiers to the ensemble. This is a fantastic result and applies to any boosted algorithm!
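As an illustrative example: if every weak classifier achieves \epsilon_t \leq 0.4, then \gamma = 0.1 at worst and \ell(H)\leq n(1-4\gamma^2)^{T/2}=n(0.96)^{T/2}. For n = 1000 training points, T = 400 weak classifiers already give \ell(H) \leq 1000 \cdot 0.96^{200} \approx 0.3, and since every misclassified point contributes at least 1 to the exponential loss, a loss below 1 implies zero training errors.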

(Next to) No Overfitting

Lastly, boosted algorithms are remarkably resistant to overfitting. According to Murphy (2012), a possible reason is that Boosting can be seen as a form of \ell_1 regularization, which tends to eliminate irrelevant features and thus reduce overfitting. Another explanation is related to the concept of margins: at least certain boosting algorithms classify a point into a class only if they can do so by a certain margin, which also helps prevent overfitting.

Final Remarks

In this post, I presented the general idea of boosting a weak classifier, emphasizing its use with shallow CART trees, and used the AdaBoost algorithm as an example. However, other loss functions can be used, and boosting can also be used for non-binary classification and for regression. The Python package scikit-learn in fact allows the user to use boosting with different loss functions and with different weak classifiers.
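For instance, a minimal scikit-learn usage sketch (illustrative, not from the original post) could look like the following; the default weak classifier in AdaBoostClassifier is a depth-1 decision tree, and GradientBoostingClassifier accepts other losses as well:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)

# AdaBoost; the default weak classifier is a decision stump
ada = AdaBoostClassifier(n_estimators=100).fit(X, y)

# Gradient boosting with the exponential loss (AdaBoost-like); other losses are available
gbm = GradientBoostingClassifier(loss='exponential', n_estimators=100).fit(X, y)

print(ada.score(X, y), gbm.score(X, y))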

Despite the theoretical proof that Boosting does not overfit, researchers running it for extremely long times on rather big supercomputers found that at some point it starts to overfit, although still very slowly. Still, that is not likely to happen in your application. Lastly, Boosting with shallow decision trees is also a great way to have fine control over how much bias there will be in your model, as all you need to do is choose the number of iterations T.

Bibliography

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees (Monterey, California: Wadsworth).
Schapire, R. E. (1990). The strength of weak learnability. Machine learning, 5(2), 197-227.

Intro to Machine Learning Part 6: Gaussian Naive Bayes and Logistic Regression

Machine learning problems often involve binary classification, which seeks to use a data point’s features, x, to correctly predict its label, y. In my last post I discussed binary classification with Support Vector Machines (SVM), which formulate the classification problem as a search for the maximum-margin hyperplane that divides two classes. Today we’ll take a different view on binary classification: we’ll use our training set to construct P(y|x), the probability of class y given a set of features x, and classify each point by determining which class it is more likely to be. We’ll examine two algorithms that use different strategies for estimating P(y|x): Naïve Bayes and Logistic Regression. I’ll demonstrate the two classifiers on an example data set I’ve created, shown in Figure 1 below. The data set contains features X = (X1, X2) and labels Y ∈ {+1, -1}; positive points are shown as blue circles and negative points as red triangles. This example was inspired by an in-class exercise in CS 5780 at Cornell, though I’ve created this data set and code myself using Python’s scikit-learn package.

raw_points

Figure 1: Example training set


Gaussian Naïve Bayes

Naïve Bayes is a generative algorithm, meaning that it uses a set of training data to generate P(x,y) and then uses Bayes Rule to find P(y|x):

P(y|x)=\frac{P(x|y)P(y)}{P(x)}                                (1)

To make estimating P(x|y) in equation 1 tractable, Naïve Bayes relies on the Naïve Bayes assumption, which states that feature values are independent given the label. While this is a strong assumption, it turns out that classifiers built on it can be effective even when it is violated.

To use Bayes rule to construct a classifier, we need a second assumption regarding the conditional distribution of each feature x on each label y. Here we’ll use a Gaussian distribution such that:

P(x|y) ~ N(\mu_y, \Sigma_y)                                                                                   (2)

Where \Sigma_y is a diagonal covariance matrix with [\Sigma_y]_{\alpha,\alpha}=\sigma^2_{\alpha, y} for each feature \alpha.

For each feature \alpha and each class c, we can then model P(x_\alpha|y) as:

P(x_\alpha|y=c) ~ N(\mu_{\alpha c},\sigma^2_{\alpha c})=\frac{1}{\sqrt{2\pi}\sigma_{\alpha c}}e^{-\frac{1}{2}(\frac{x_\alpha-\mu_{\alpha c}}{\sigma_{\alpha c}})^{2}}                              (3)

We can then estimate model parameters:

\mu_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)x_{i \alpha}                                                                   (4)

\sigma^2_{\alpha c} = \frac{1}{n_c}\sum^{n}_{i=1}I(y_i=c)(x_{i \alpha}-\mu_{\alpha c})^2                                                  (5)

Where:

n_c = \sum^{n}_{i=1}I(y_i=c)                                                                                (6)

Parameters can be estimated with Maximum likelihood estimation (MLE) or maximum a posteriori estimation (MAP).
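As an illustrative sketch (not part of the original post), the MLE estimates in equations 4–6 can be computed directly with NumPy; the function name and the returned class prior are my additions:

import numpy as np

def gaussian_nb_mle(X, y):
    """MLE of per-class, per-feature Gaussian parameters (equations 4-6)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]                            # points with label c
        n_c = Xc.shape[0]                         # eq. 6
        mu_c = Xc.mean(axis=0)                    # eq. 4
        var_c = ((Xc - mu_c) ** 2).mean(axis=0)   # eq. 5 (biased MLE variance)
        prior_c = n_c / len(y)                    # estimate of P(y = c)
        params[c] = (mu_c, var_c, prior_c)
    return params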

Once we have fit the conditional Gaussian model to our data set, we can derive a linear classifier, a hyperplane that separates the two classes,  which takes the following form:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                             (7)

Where w is a vector of coefficients that define the separating hyperplane and b is the hyperplane’s intercept. Both w and b are functions of the Gaussian moments derived in equations 4 and 5. For a full derivation of the linear classifier starting with the Naive Bayes assumption, see the excellent course notes from CS 5780.

Logistic Regression

Logistic regression is the discriminative counterpart to Naive Bayes: rather than modeling P(x,y) and using it to estimate P(y|x), logistic regression models P(y|x) directly:

P(y|x) = \frac{1}{1+e^{-y(w^T x+b)}}                                                                              (8)

Logistic regression uses MLE or MAP to directly estimate the parameters of the separating hyperplane, w and b, rather than deriving them from the moments of P(x,y). In other words, rather than seeking parameters that best describe the training data, logistic regression seeks the hyperplane that best separates the training data. For derivations of the MLE and MAP estimates of logistic regression parameters, see the class notes from CS 5780.

Comparing Gaussian Naive Bayes and Logistic Regression

Below I’ve plotted the estimated classifications by the two algorithms using the Scikit-learn package in Python. Results are shown in Figure 2.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import seaborn as sns
sns.set(style='whitegrid')

## create a test data set ##
pos = np.array([[1,5], [1,7], [1,9], [2,8], [3,7], [1,11], [3,3],
                [5,5], [4,8], [5,9], [2,6], [3,9], [4,4]])
neg = np.array([[4,1], [5,1], [3,2], [2,1], [8,4], [6,2], [5,3],
                [4,2], [7,1], [5,4], [6,3], [7,4], [4,3], [5,2], [8,5]])
all_points = np.concatenate((pos, neg), 0)
labels = np.array([1]*len(pos) + [-1]*len(neg))

## compare Naive Bayes and Logistic Regression ##

# Fit Naive Bayes
gnb = GaussianNB()
gnb.fit(all_points, labels)

# make NB predictions and plot
x1_mesh, x2_mesh = np.meshgrid(np.arange(0,11,1), np.arange(0,11,1))
Y_NB = gnb.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_NB = Y_NB.reshape(x1_mesh.shape)

fig1, axes = plt.subplots(1,2, figsize=(10,4))

axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1')
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')
#plt.legend(['Positive Points', 'Negative Points'], scatterpoints=1)
#.savefig('NB_classification.png', bbox_inches='tight')

# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(all_points, labels)

# Make predictions and plot
Y_LR = lr.predict_proba(np.c_[x1_mesh.ravel(), x2_mesh.ravel()])[:,1]
Y_LR = Y_LR.reshape(x1_mesh.shape)

axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1.1,3)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1'); 
axes[1].set_ylabel('X2'); axes[1].set_title("Logistic Regression")
plt.savefig('compare_classification.png', bbox_inches='tight')


compare_classification

Figure 2: Example classification with Gaussian Naive Bayes (left) and Logistic Regression (right). Blue shaded areas represent a prediction of positive labels for the data points, while red shaded areas represent predictions of negative labels.

Figure 2 illustrates an important difference in the treatment of outliers between the two classifiers. Gaussian Naive Bayes assumes that points close to the centroid of a class are likely to be members of that class, which leads it to mislabel the positive training points with features (3,3), (4,4) and (5,5). Logistic regression, on the other hand, is only concerned with correctly classifying points, so the signal from these outliers is more influential on its classification.

So which algorithm should you use? The answer, as usual, is that it depends. In this example, logistic regression is able to correctly classify the outliers with positive labels while Naïve Bayes is not. If these points are indeed an indicator of the underlying structure of positive points, then logistic regression has performed better. On the other hand, if they are truly outliers, then Naïve Bayes has performed better. In general, logistic regression has been found to outperform Naïve Bayes on large data sets but is prone to overfitting small data sets. The two algorithms will converge asymptotically if the Naïve Bayes assumption holds.

Visualizing P(y|x)

One advantage of these methods for classification is that they provide estimates of P(y|x), whereas other methods such as SVM only provide a separating hyperplane. These probabilities can be useful in decision-making contexts such as scenario discovery for water resources systems, as demonstrated in Quinn et al., 2018. Below, I use scikit-learn to plot the classification probabilities for both algorithms.

# plot Naive Bayes predicted probabilities
fig2, axes = plt.subplots(1,2, figsize=(12,4))
axes[0].contourf(x1_mesh, x2_mesh, Y_NB, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[0].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[0].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[0].set_xlim([0,10]); axes[0].set_ylim([0,10]); axes[0].set_xlabel('X1'); 
axes[0].set_ylabel('X2'); axes[0].set_title('Naive Bayes')

# plot Logistic Regression predicted probabilities
LRcont = axes[1].contourf(x1_mesh, x2_mesh, Y_LR, levels=(np.linspace(0,1,100)), \
cmap='RdBu')
axes[1].scatter(pos[:,0], pos[:,1], s=50, \
edgecolors='none')
axes[1].scatter(neg[:,0], neg[:,1], marker='^', c='r', s=100,\
edgecolors='none')
axes[1].set_xlim([0,10]); axes[1].set_ylim([0,10]); axes[1].set_xlabel('X1')
axes[1].set_ylabel('X2'); axes[1].set_title('Logistic Regression')
cb = fig2.colorbar(LRcont, ax=axes.ravel().tolist())
cb.set_label('Probability of Positive Classification')
cb.set_ticks([0, .25, .5, .75, 1])
cb.set_ticklabels(["0", "0.25", "0.5", "0.75", "1.0"])
plt.savefig('compare_probs.png', bbox_inches='tight')

compare_probs

Figure 3: Conditional probabilities P(y|x) generated by Naive Bayes (left) and Logistic Regression (right).

Further reading

This post has focused on Gaussian Naive Bayes as it is the direct counterpart of logistic regression for continuous data. It’s important to note, however, that Naive Bayes is frequently used on data with binomial or multinomial features. Examples include spam filters and language classifiers. For more information on Naive Bayes in these contexts, see these notes from CS 5780.

As mentioned above, logistic regression has been used for scenario discovery in water resources systems; for more detail and context see Julie’s blog post.

References

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Course Notes from CIS 520 (University of Pennsylvania): https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.Logistic

Course Notes from Cornell: http://www.cs.cornell.edu/courses/cs4780/2018fa/syllabus/index.html

Quinn, J. D., Reed, P. M., Giuliani, M., Castelletti, A., Oyler, J. W., & Nicholas, R. E. (2018). Exploring how changing monsoonal dynamics and human pressures challenge multireservoir management for flood protection, hydropower production, and agricultural water supply. Water Resources Research, 54, 4638–4662. https://doi.org/10.1029/2018WR022743

Discriminant analysis for classification


Data mining encompasses a variety of analytic techniques that can help us analyze, understand, and extract insight from large sets of data. These techniques generally fall into three broad categories:

Classification – Use a dataset of characteristics (variables) for an observation to determine in which discrete group/class the observation belongs. For example:

  • Loan applications: use data about an individual or business to determine whether the applicant is a “good risk” (approve loan) or a “bad risk” (refuse loan)
  • Project evaluation: sort possible capital investment projects into “priority”, “medium” and “unattractive” based on their characteristics

Prediction – Use data to predict the value (or a range of reasonable values) of a continuous numeric response variable. For example:

  • Predict sales volume of a product in a future time period
  • Predict annual heating/cooling expenses for a set of buildings

Segmentation – Separate a set of observations into some collection of groups that are “most alike” within a group and “different from” other groups. For example:

  • Identify groups of customers that have similar ordering patterns across seasons of the year
  • Identify groups of items that tend to be purchased together

This blogpost will focus on (linear) Discriminant Analysis (DA), one of the oldest techniques for solving classification problems. DA attempts to find a linear combination of features that separates two or more groups of observations. It is related to analysis of variance (ANOVA), the difference being that ANOVA attempts to predict a continuous dependent variable using one or more categorical independent variables, whereas DA attempts to predict a categorical dependent variable using one or more continuous independent variables. DA is very similar to logistic regression, with the fundamental difference that DA assumes that the independent variables are normally distributed within each group. DA is also similar to Principal Component Analysis (PCA), especially in their application. DA differs from PCA in that it tries to find a vector in the variable space that best discriminates among the different groups, by modelling the difference between the centroids of each group. PCA instead tries to find a subspace of the variable space, with a basis vector that best captures the variability among the different observations. In other words, DA deals directly with discrimination between groups, whereas PCA identifies the principal components of the data in its entirety, without particularly focusing on the underlying group structure [1].

To demonstrate how DA works, I’ll use an example adapted from Ragsdale (2018)[2] where potential employees are given a pre-employment test measuring their mechanical and verbal aptitudes and are then classified into having a superior, average, or inferior performance (Fig. 1).

DA_1

Fig. 1 – Employee classification on the basis of mechanical and verbal aptitude scores

The centroids of the three groups indicate the average value of each independent variable (the two test scores) for that group. The aim of DA is to find a classification rule that maximizes the separation between the group means, while making as few “mistakes” as possible. There are multiple discrimination rules available; in this post I will be demonstrating the maximum likelihood rule, as achieved through the application of the Mahalanobis distance. The idea is the following: an observation i from a multivariate normal distribution should be assigned to the group G_j that minimizes the Mahalanobis distance between i and the group centroid C_j.

DA_2

Fig. 2 – Independent variables have different variances

The reason to use the Mahalanobis distance (as opposed to, say, Euclidean) is twofold: if the independent variables have unequal variances or if they are correlated, a distance metric that does not account for that could allocate an observation to the wrong group. In Fig. 2, the independent variables are not correlated, but X2 appears to have much larger variance than X1, so the effects of small but important differences in X1 could be masked by large but unimportant differences in X2. The ellipses in the figure represent regions containing 99% of the values belonging to each group. By Euclidean distance, P1 would be assigned to group 2,

DA_3

Fig. 3 – Independent variables are correlated

however it is unlikely to be in group 2 because its location with respect to the X1 axis exceeds the typical values for group 2. If the two attributes were additionally correlated (Fig. 3), we should also adjust for this correlation in our distance metric. The Mahalanobis distance therefore uses the covariance matrix of the independent variables in its calculation of the distance of an observation to the group means.

It is given by: D_{ij}=\sqrt{(X_i-\mu_j)^TS^{-1}(X_i-\mu_j)}, where X_i is the vector of attributes for observation i, \mu_j is the vector of means for group j, and S is the covariance matrix for the variables. When using the Mahalanobis distance, points that are equidistant from a group mean lie on tilted ellipses:

DA_4

Fig. 4 – Using the Mahalanobis distance, points equidistant from a group mean lie on tilted ellipses


For further discussion on the two distance metrics and their application, there’s also this blogpost. Applying the formula to the dataset, we can calculate distances between each observation and each group, and use the distances to classify each observation to each of the groups:

DA_5

Table 1 – Distance of each observation to each group, using the Mahalanobis distance. Classification is based on the minimum of the three distances. Values in red indicate misclassifications.

If we tally up the correct vs. incorrect (in red) classifications, our classification model is right 83.33% of the time. Assuming that this classification accuracy is sufficient, we can then apply this simple model to classify new employees based on their scores on the verbal and mechanical aptitude tests.
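A minimal Python sketch of this classification rule (illustrative code, not the spreadsheet model from Ragsdale (2018); the centroids, covariance matrix, and test scores below are hypothetical) computes each observation’s Mahalanobis distance to every group centroid and assigns the closest group:

import numpy as np

def mahalanobis(x, mu, S_inv):
    """Mahalanobis distance between observation x and a group centroid mu."""
    d = x - mu
    return np.sqrt(d @ S_inv @ d)

def classify(X, centroids, S):
    """Assign each row of X to the group with the nearest centroid."""
    S_inv = np.linalg.inv(S)          # inverse covariance of the independent variables
    labels = []
    for x in X:
        dists = [mahalanobis(x, mu, S_inv) for mu in centroids]
        labels.append(int(np.argmin(dists)))
    return np.array(labels)

# Hypothetical example: three group centroids, a pooled covariance matrix, two new applicants
centroids = [np.array([35.0, 20.0]), np.array([28.0, 25.0]), np.array([20.0, 30.0])]
S = np.array([[16.0, 4.0], [4.0, 25.0]])
new_scores = np.array([[33.0, 22.0], [22.0, 28.0]])
print(classify(new_scores, centroids, S))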

[1] Martinez, A. M., and A. C. Kak. 2001. “PCA versus LDA.” IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2): 228–33. https://doi.org/10.1109/34.908974.

[2] Ragsdale, Cliff T. Spreadsheet Modeling and Decision Analysis: a Practical Introduction to Business Analytics. Eighth edition. Boston, MA: Cengage Learning, 2018.


Introduction to Gaussian Processes


In this post, I will cover the very basics of Gaussian Processes following the presentations in Murphy (2012) and Rasmussen and Williams (2006) — though hopefully in a way that is easier to digest. I will approach the explanation in two ways: (1) derived from Bayesian Ordinary Linear Regression and (2) derived from the definition of Gaussian Processes in functional space (it is easier than it sounds). I am not sure why the mentioned books do not use “^” to denote estimation (if you do, please leave a comment), but I will stick to their notation, assuming you may want to look into them, despite the possibility of committing a statistical sin. Lastly, I would like to thank Jared Smith for reviewing this post and providing highly statistically significant insights.

If you are not familiar with Bayesian regression (see my previous post), skip the section below and begin reading from “Succinct derivation of Gaussian Processes from functional space.”

From Bayesian Ordinary Linear Regression to Gaussian Processes

Recapitulation of Bayesian Ordinary Regression

In my previous blog post, about Bayesian Ordinary Linear Regression (OLR), I showed that the predictive distribution for a point \boldsymbol{x}_* for which we are trying to estimate y_* with Bayesian OLR is

\begin{aligned}p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2) &= \int_{\boldsymbol{\beta}}\underbrace{p(y_*|\boldsymbol{x}_*,\mathcal{D}, \sigma_{\epsilon}^2,\boldsymbol{\beta})}_\text{Likelihood}\underbrace{p(\boldsymbol{\beta}|\mathcal{D}, \sigma_{\epsilon}^2)}_\text{\parbox{1.6cm}{Parameter Posterior}}d\boldsymbol{\beta}\\  &=\mathcal{N}(y_*|\boldsymbol{\beta}_N^T\boldsymbol{x}_*, \sigma_N^2(\boldsymbol{x}_*))\end{aligned}

where we are trying to regress a model over the data set \mathcal{D}=\{\boldsymbol{X},\boldsymbol{y}\}, \boldsymbol{X} being the independent variables and \boldsymbol{y} the dependent variable, \boldsymbol{\beta} is a vector of model parameters and \sigma_{\epsilon}^2 is the model error variance. The unusual notation for the normal distribution \mathcal{N} means “a normal distribution of y_* with (or given) the regression mean \boldsymbol{\beta}_N^T\boldsymbol{x}_* and variance \sigma_N^2(\boldsymbol{x}_*),” where N is the number of points in the data set. The estimated variance of \boldsymbol{y}_*, \sigma_N^2(\boldsymbol{x}_*) and the estimated regression parameters \boldsymbol{\beta}_N can be calculated as

\boldsymbol{V}_N = \sigma_{\epsilon}^2(\sigma_{\epsilon}^2\boldsymbol{V}_0^{-1}+\boldsymbol{X}^T\boldsymbol{X})^{-1}
\boldsymbol{\beta}_N=\boldsymbol{V}_N\boldsymbol{V}_0^{-1}\boldsymbol{\beta}_0 + \frac{1}{\sigma_{\epsilon}^2}\boldsymbol{V}_N\boldsymbol{X}^T\boldsymbol{y}
\hat{\sigma}_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \boldsymbol{x}_*^{T}\boldsymbol{V}_N\boldsymbol{x}_*

where \boldsymbol{V}_0 and \boldsymbol{\beta}_0 are the mean and the covariance of the parameter prior distribution. All the above can be used to estimate y_* as

y_* = f(\boldsymbol{x}_*) = \boldsymbol{\beta}_N^T\boldsymbol{x}_* with an error variance of \sigma_N^2(\boldsymbol{x}_*)

If we assume a prior mean of 0 (\boldsymbol{\beta}_0=0), replace \boldsymbol{V}_N and \boldsymbol{\beta}_N by their definitions, and apply a function \phi(\boldsymbol{x}), e.g. \phi(\boldsymbol{x})=(1, x_1, x_2, x_1x_2, x_1^2 ...)^T, to add features to \boldsymbol{X} so that we can use a linear regression approach to fit a non-linear model over our data set \mathcal{D}, we would instead have that

\boldsymbol{\beta}_N = \boldsymbol{\phi}_*^T( \sigma_{\epsilon}^2\boldsymbol{V}_0^{-1} + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_X\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = \sigma_{\epsilon}^2 + \sigma_{\epsilon}^2\boldsymbol{\phi}_*^{T}( \sigma_{\epsilon}^2\boldsymbol{V}_0^{-1} + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T)^{-1}\boldsymbol{\phi}_*

where \boldsymbol{\phi}_* = \phi(\boldsymbol{x}_*) and \boldsymbol{\phi}_X = \phi(\boldsymbol{X}). One problem with this approach is that as we add more terms (also called features) to function \phi(\cdot) to better capture non-linearities, the dimension of the matrix we have to invert increases. It is also not always obvious what features we should add to our data. Both problems can be handled with Gaussian Processes.

Now to Gaussian Processes

This is where we transition from Bayesian OLR to Gaussian Processes. What we want to accomplish in the rest of this derivation is to find a way of: (1) adding as many features to the data as we need to calculate \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) without increasing the size of the matrix \sigma_{\epsilon}^2\boldsymbol{V}_0 + \boldsymbol{\phi}_X\boldsymbol{\phi}_X^T we have to invert (remember that the dimensions of this matrix equal the number of features in \phi(\cdot)), and (2) implicitly adding features to \boldsymbol{X} and \boldsymbol{x}_* without having to do so manually — manually adding features to the data may take a long time, especially if we decide to add (literally) an infinite number of them to obtain an interpolator.

The first step is to make the dimensions of the matrix we want to invert equal to the number of data points instead of the number of features in \phi(\cdot). For this, we can re-arrange the equations for \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) as

\boldsymbol{\beta}_N =\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_X( \sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_X)^{-1}\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = \boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_*-\boldsymbol{\phi}_*^T\boldsymbol{V}_0\boldsymbol{\phi}_X(\sigma_{\epsilon}^2\boldsymbol{I} + \boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_X)^{-1}\boldsymbol{\phi}_X^T\boldsymbol{V}_0\boldsymbol{\phi}_*

We now have our feature-expanded data, and the points for which we want point estimates, always appearing in the form of \phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x}'), where \boldsymbol{x}' is just another point \boldsymbol{x}_i either in the data set or for which we want to estimate y_*. From now on, \kappa(\boldsymbol{x}, \boldsymbol{x}') = \phi(\boldsymbol{x})^T \boldsymbol{V}_0 \phi(\boldsymbol{x}') will be called a covariance or kernel function. Since the prior’s covariance \boldsymbol{V}_0 is positive semi-definite (as any covariance matrix), we can write

\kappa(\boldsymbol{x}, \boldsymbol{x}') = \phi(\boldsymbol{x})^T (\boldsymbol{V}_0^{1/2})^T(\boldsymbol{V}_0^{1/2})\phi(\boldsymbol{x}')=\psi(\boldsymbol{x})^T\psi(\boldsymbol{x}')

where \psi(\boldsymbol{x})=\boldsymbol{V}_0^{1/2}\phi(\boldsymbol{x}). Using \kappa(\boldsymbol{x}, \boldsymbol{x}') and assuming a prior covariance \boldsymbol{V}_0=\boldsymbol{I}, \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) then take a new shape in the predictive posterior of a Gaussian Process:

\boldsymbol{\beta}_N =\boldsymbol{k}_*( \sigma_{\epsilon}^2\boldsymbol{I} + K)^{-1}\boldsymbol{y}
\sigma_N^2(\boldsymbol{x}_*) = k_{**}-\boldsymbol{k}_*(\sigma_{\epsilon}^2\boldsymbol{I}+K)^{-1}\boldsymbol{k}_*^T

where K = \kappa(\boldsymbol{\phi}_X,\boldsymbol{\phi}_X), \boldsymbol{k}_* = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_X) and k_{**} = \kappa(\boldsymbol{\phi}_*,\boldsymbol{\phi}_*). Now the features of the data are absorbed within the inner products between observed \boldsymbol{x}'s and/or \boldsymbol{x}_*, so we can add as many features as we want without impacting the dimensions of the matrix we will have to invert. Also, after this transformation, instead of \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) representing the mean and covariance of the model parameters, they represent the mean and covariance of and among data points and/or points for which we want to get point estimates. The parameter posterior from Bayesian OLR would now be used to sample the values of y_* directly instead of model parameters. And now we have a Gaussian Process with which to estimate y_*. For plots of samples of \boldsymbol{y}_* from the prior and from the predictive posterior, of the mean plus or minus the error variances, and of models with different kernel parameters, see the figures at the end of the post.

Kernel (or Covariance) matrices

The convenience of having \boldsymbol{\beta}_N and \sigma_N^2(\boldsymbol{x}_*) written in terms of \kappa(\cdot, \cdot) is that \kappa does not have to represent simply a matrix-matrix or vector-matrix multiplication of \boldsymbol{X} and \boldsymbol{x}_*. In fact, \kappa can be any function that corresponds to an inner product after some transformation \boldsymbol{x}\rightarrow\phi(\boldsymbol{x}), which will necessarily return a positive semi-definite matrix and therefore be a valid covariance matrix. There are several kernels (or kernel functions, or kernel shapes) available, each accounting in a different way for the nearness or similarity (at least in Gaussian Processes) between values of x_i:

  • the linear kernel: \kappa(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T\boldsymbol{x}'. This kernel is used when trying to fit a line going through the origin.
  • the polynomial kernel: \kappa(\boldsymbol{x}, \boldsymbol{x}') = (1 + \boldsymbol{x}^T\boldsymbol{x}')^d, where d is the dimension of the polynomial function. This kernel is used when trying to fit a polynomial function (such as the fourth order polynomial in the blog post about Bayesian OLR), and
  • the Radial Basis Function, or RBF (aka Squared Exponential, or SE), \kappa(\boldsymbol{x}, \boldsymbol{x}') = e^{-\frac{||\boldsymbol{x}-\boldsymbol{x}'||^2}{2\sigma_{RBF}^2}}, where \sigma_{RBF} is a kernel parameter denoting the characteristic length scale of the function to be approximated. Using the RBF kernel is equivalent to adding an infinite (!) number of features to the data, meaning \phi(\boldsymbol{x})=(1, x_1, ..., x_d, x_1x_2, x_1x_3,..., x_{d-1}x_d, x_1x_2x_3, ...)^T. A short sketch of these three kernels follows this list.
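Here is that sketch — a minimal, illustrative implementation of the three kernel functions and of how a kernel (covariance) matrix could be assembled from them; the function names are mine, not part of the original post:

import numpy as np

def linear_kernel(x, x2):
    return x @ x2

def polynomial_kernel(x, x2, d=4):
    return (1.0 + x @ x2) ** d

def rbf_kernel(x, x2, sigma_rbf=2.0):
    return np.exp(-np.sum((x - x2) ** 2) / (2.0 * sigma_rbf ** 2))

def kernel_matrix(X1, X2, kernel):
    """Covariance (kernel) matrix with entries K[i, j] = kappa(X1[i], X2[j])."""
    return np.array([[kernel(x1, x2) for x2 in X2] for x1 in X1])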

But this is all quite algebraic. Luckily, there is a more straight-forward derivation shown next.

Succinct derivation of Gaussian Processes from functional space.

Most regression we study in school is a way of estimating the parameters \boldsymbol{\beta} of a model. In Bayesian OLR, for example, we have a distribution (the parameter posterior) from which to sample model parameters. What if we instead derived a distribution from which to sample functions themselves? The first (and second and third) time I heard about the idea of sampling functions I thought right away of opening a box and from it drawing exponential and sinusoid functional forms. However, what is meant here is a function of an arbitrary shape without functional form. How is this possible? The answer is: instead of sampling a functional form itself, we sample the values of y_*=f(\boldsymbol{x}_*) at discrete points, with a collection of \boldsymbol{y}_i for all \boldsymbol{x}_i in our domain called a function \boldsymbol{f}. After all, don’t we want a function more often than not solely for the purpose of getting point estimates and associated errors for given values of \boldsymbol{x}_*?

In Gaussian Process regression, we sample functions \boldsymbol{f} from a distribution by considering each value of x_i in a discretized range along the x-axis as a random variable. That means that if we discretize our x-axis range into 100 equally spaced points, we will have 100 random variables x_0,..,x_{99}. The mean of all possible functions at \boldsymbol{x}_i therefore represents the mean value of y_i at \boldsymbol{x}_i, and each term in the covariance matrix (kernel matrix, see the “Kernel (or Covariance) matrices” section above) represents how similar the values of y_i and y_j will be for each pair \boldsymbol{x}_i and \boldsymbol{x}_j based on the value of a distance metric between \boldsymbol{x}_i and \boldsymbol{x}_j: if \boldsymbol{x}_i and \boldsymbol{x}_j are close to each other, \boldsymbol{y}_i and \boldsymbol{y}_j tend to be similar, and vice-versa. We can therefore turn the sampled functions y = f(\boldsymbol{x}) into whatever functional form we want by choosing the appropriate covariance matrix (kernel) — e.g. a linear kernel will return a linear model, a polynomial kernel will return a polynomial function, and an RBF kernel will return a function with all linear terms and an infinite number of non-linear terms. A Gaussian Process can then be written as

f(\boldsymbol{x}) \sim \mathcal{GP}(\mu(\boldsymbol{x}), \kappa(\boldsymbol{x}, \boldsymbol{x}'))

where \mu(\boldsymbol{x}) is the mean function (e.g. a horizontal line along the x-axis at y = 0 would mean that \mu(\boldsymbol{x}) = [0,..,0]^T) and \kappa(\boldsymbol{x}, \boldsymbol{x}') is the covariance matrix, which here expresses how similar the values of y_i will be based on the distance between two values of x_i.

To illustrate this point, assume we want to sample functions over 200 discretizations \boldsymbol{X}_* = x_{*0}, ..., x_{*199} along the x-axis from x=-5 to x=5 (\Delta x \approx 0.05) using a Gaussian Process. Assume the parameters of our Gaussian Process, from which we will sample our functions, are a mean of 0 and an RBF kernel for its covariance matrix with parameter \sigma_{RBF}=2 (not to be confused with a standard deviation), meaning that k_{i,i}=1.0 ,\:\forall i \in [0, 199] and k_{i,j}=e^{-\frac{||x_i-x_j||^2}{2\cdot 2^2}},\: \forall i,j \in [0, 199], or

\begin{aligned} p(\boldsymbol{f}_*|\boldsymbol{X_*})&=\mathcal{N}(\boldsymbol{f}|\boldsymbol{\mu},\boldsymbol{K_{**}})\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[\mu_{*0},..,\mu_{*199}], \begin{bmatrix} k_{*0,0} & k_{*0,1} & \cdots & k_{*0,199} \\ k_{*1,0} & k_{*1, 1} & \cdots & k_{*1,199}\\ \vdots & \vdots & \ddots & \vdots \\ k_{*199,0} & k_{*199,1} & \cdots & k_{*199,199} \end{bmatrix} \right)\\ &=\mathcal{N}\left(\boldsymbol{f}_*\Bigg|[0,..,0], \begin{bmatrix} 1 & 0.9997 & \cdots & 4.2\cdot 10^{-6} \\ 0.9997 & 1 & \cdots & 4.8\cdot 10^{-6}\\ \vdots & \vdots & \ddots & \vdots \\ 4.2\cdot 10^{-6} & 4.8\cdot 10^{-6} & \cdots & 1 \end{bmatrix} \right) \end{aligned}

Each sample from the normal distribution above will return 200 values: \boldsymbol{f}_* = [y_0, ..., y_{199}]. A draw of 3 such sample functions would look generally like the following:

Figure_1

The functions above look incredibly smooth because of the RBF kernel, which adds an infinite number of non-linear features to the data. The shaded region represents the prior distribution of functions. The grey line in the middle represents the mean of an infinite number of function samples, while the upper and lower bounds represent the prior mean plus or minus the corresponding standard deviation.
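Prior samples like these could be drawn with a few lines of NumPy — an illustrative sketch with variable names of my choosing, not the code used to produce the figures:

import numpy as np

sigma_rbf = 2.0
X_star = np.linspace(-5, 5, 200)              # 200 discretization points along the x-axis
# RBF covariance matrix K_** between all pairs of discretization points
K_ss = np.exp(-(X_star[:, None] - X_star[None, :]) ** 2 / (2.0 * sigma_rbf ** 2))
mu = np.zeros(len(X_star))                    # zero prior mean

# Draw 3 sample functions from the GP prior (small jitter added for numerical stability)
rng = np.random.default_rng(0)
f_samples = rng.multivariate_normal(mu, K_ss + 1e-10 * np.eye(len(X_star)), size=3)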

These functions look quite nice but are pretty useless. We are actually interested in regressing over our data assuming a prior — namely, in the posterior distribution, not the prior by itself. Another way of formulating this is to say that we are interested only in the functions that go through or close to certain known values of \boldsymbol{x} and y. All we have to do is add random variables corresponding to our data and condition the resulting distribution on them, which means accounting only for sampled functions that exactly go through (or are conditioned on) our data points. The resulting distribution would then look like p(\boldsymbol{f}_*|\mathcal{D}, \boldsymbol{X}_*). After adding our data \boldsymbol{X}, \boldsymbol{f}, or \boldsymbol{X}, \boldsymbol{y}, our distribution would look like

\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}K(\boldsymbol{X},\boldsymbol{X}) & K(\boldsymbol{X},\boldsymbol{X}_*) \\ K(\boldsymbol{X}_*,\boldsymbol{X}) & K(\boldsymbol{X}_*,\boldsymbol{X}_*)\end{bmatrix}\right)

In this distribution we have random variables representing the \boldsymbol{X} from our data set \mathcal{D} and the \boldsymbol{X}_* from which we want to get point estimates and corresponding error standard deviations. The covariance matrix above accounts for correlations of all observations with all other observations, and correlations of all observations with all points to be predicted. We now just have to condition the multivariate normal distribution above over the points in our data set, so that the distribution accounts for the infinite number of functions that go through our \boldsymbol{y} at the corresponding \boldsymbol{X}. This yields

p(\boldsymbol{f}_*|\boldsymbol{X}_*,\boldsymbol{X},\boldsymbol{y})=\mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}})

\bar{\boldsymbol{\mu}}=\boldsymbol{K}_*^T\boldsymbol{K}^{-1}\boldsymbol{y}

\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T\boldsymbol{K}^{-1}\boldsymbol{K}_*

\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)

where \boldsymbol{K} = K(\boldsymbol{X},\boldsymbol{X}), \boldsymbol{K}_* = K(\boldsymbol{X},\boldsymbol{X}_*), \boldsymbol{K}_{**} = K(\boldsymbol{X_*},\boldsymbol{X_*}), and diag\left(\bar{\boldsymbol{\Sigma}}\right) is a vector containing the elements in the diagonal of the covariance matrix \bar{\boldsymbol{\Sigma}} — the variances \bar{\sigma}^2_i of each random variable x_i. If you do not want to use a prior with \boldsymbol{\mu} = \boldsymbol{\mu}_* = [0, ... , 0]^T, the mean of the Gaussian Process would instead be \bar{\boldsymbol{\mu}}=\boldsymbol{\mu}_* + \boldsymbol{K}_*^T\boldsymbol{K}^{-1}(\boldsymbol{f} - \boldsymbol{\mu}). A plot of \boldsymbol{f}_* = \bar{\boldsymbol{\mu}} estimated for 200 values of x_i — so that the resulting curve looks smooth — plus or minus the associated uncertainty (\pm\bar{\boldsymbol{\sigma}}^2), regressed on 5 random data points \boldsymbol{X}, \boldsymbol{y}, should look generally like

Figure_1-1_2


A plot of three sampled functions (colored lines below, the grey line is the mean) \boldsymbol{f}_* \sim \mathcal{N}(\boldsymbol{f}_*|\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}}) should look generally like

Figure_1-1-sampling


Several kernels have parameters that can strongly influence the regressed model and that can be estimated — see Murphy (2012) for a succinct introduction to kernel parameter estimation. One example is the parameter \sigma_{RBF}^2 of the RBF kernel, which determines the correlation length scale of the fitted model and therefore how fast, with increasing separation distance between data points, the model will default to the prior (here, with \boldsymbol{\mu}=0). The figure below exemplifies this effect.

Figure_kparams

All figures have the same data points but the resulting models look very different. Reasonable values for \sigma_{RBF}^2 depend on the spacing between the data points and are therefore data-specific. Also, if, when you saw this plot, you thought of fitting RBF functions as interpolators, that was not a coincidence: the equations solved when fitting RBFs to data as interpolators are the same as those solved when using Gaussian Processes with RBF kernels.

Lastly, in the case of noisy observations of y in \mathcal{D}, the Gaussian Process distribution and the expressions for the mean, covariance, and standard deviation become

\begin{bmatrix}\boldsymbol{y}\\ \boldsymbol{f}_*\end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}_*\end{bmatrix}, \begin{bmatrix}\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I} & \boldsymbol{K}_* \\ \boldsymbol{K}_*^T & \boldsymbol{K}_{**}\end{bmatrix}\right)

\bar{\boldsymbol{\mu}}=\boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{y}

\bar{\boldsymbol{\Sigma}}=\boldsymbol{K}_{**} - \boldsymbol{K}_*^T(\boldsymbol{K} + \sigma_{\epsilon}^2\boldsymbol{I})^{-1}\boldsymbol{K}_*

\bar{\boldsymbol{\sigma}}^2 = diag\left(\bar{\boldsymbol{\Sigma}}\right)

where \sigma_{\epsilon}^2 is the error variance and \boldsymbol{I} is an identity matrix. The mean and error predictions would look like the following

Figure_1-1
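A minimal NumPy sketch of the noisy-observation equations above (illustrative only, assuming an RBF kernel and hypothetical data; not the code behind the figures, and using a direct matrix inverse rather than a numerically more stable formulation):

import numpy as np

def rbf(A, B, sigma_rbf=2.0):
    """RBF kernel matrix between the points in A and the points in B."""
    sq = (A[:, None] - B[None, :]) ** 2
    return np.exp(-sq / (2.0 * sigma_rbf ** 2))

def gp_predict(X, y, X_star, sigma_eps=0.1, sigma_rbf=2.0):
    """Posterior mean and variance of f_* given noisy observations (X, y)."""
    K = rbf(X, X, sigma_rbf) + sigma_eps ** 2 * np.eye(len(X))   # K + sigma_eps^2 I
    K_s = rbf(X, X_star, sigma_rbf)                              # K_*
    K_ss = rbf(X_star, X_star, sigma_rbf)                        # K_**
    K_inv = np.linalg.inv(K)
    mu_bar = K_s.T @ K_inv @ y                                   # posterior mean
    Sigma_bar = K_ss - K_s.T @ K_inv @ K_s                       # posterior covariance
    return mu_bar, np.diag(Sigma_bar)                            # variances on the diagonal

# Hypothetical data: 5 noisy observations, predictions on a fine grid
X = np.array([-4.0, -2.5, 0.0, 1.5, 4.0])
y = np.sin(X) + 0.1 * np.random.default_rng(1).standard_normal(len(X))
mu_bar, var_bar = gp_predict(X, y, np.linspace(-5, 5, 200))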

Concluding Remarks

What I presented here are the very basics of Gaussian Processes, leaving several important topics that were not covered for the interested reader to look into. Examples are numerically more stable formulations, computationally more efficient formulations, kernel parameter estimation, more kernels, using a linear instead of a zero-mean prior (semi-parametric Gaussian Processes), Gaussian Processes for classification and Poisson regression, and the connection between Gaussian Processes and other methods, like kriging and kernel methods for nonstationary stochastic processes. Some of these are covered in great detail in Rasmussen and Williams (2006), and Murphy (2012) has a succinct presentation of all of the listed topics.

I was really confused about Gaussian Processes the first time I studied them, and I have tried to make this blog post as accessible as possible. Please leave a comment if you think the post would benefit from any particular clarification.


References

Murphy, Kevin P., 2012. Machine Learning: A Probabilistic Perspective. The MIT Press, ISBN 978-0262018029

Rasmussen, Carl Edward and Williams, Christopher K. I., 2006. Gaussian Processes for Machine Learning. The MIT Press, ISBN 978-0262182539.