Introducing the R Package “biascorrection”

For a variety of reasons, we need hydrological models for our short- and long-term predictions and planning. However, it is no secret that these models always suffer from some degree of bias. This bias can stem from many different and often interacting sources, such as biases in underlying model assumptions, missing processes, model parameters, calibration parameters, and imperfections in input data (Beven and Binley, 1992).

The question of how to use models, given all these uncertainties, has been an active area of research for at least 50 years and will probably remain so for the foreseeable future, but going through that is not the focus of this blog post.

In this post, I explain a technique called bias correction that is frequently used in an attempt to improve model predictions. I also introduce an R package for bias correction that I recently developed; the package is called “biascorrection.” Although most of the examples in this post are about hydrological models, the arguments and the R package might be useful in other disciplines, for example with atmospheric models, which have been one of the hotspots of bias-correction applications (for example, here, here and here). The reason is that the algorithm follows a series of simple mathematical procedures that can be applied to other questions and research areas.

What Do I Mean by Bias Correction?

Bias correction in the context of hydrological modeling usually means “manual” improvement of model simulations to ensure that they match the observed data. Here’s a simple example: Let’s say we know that our model tends to underestimate summer flow by 30%. Then we might reason that to improve our predictions, we could add 30% to our simulated summer flows. In this blog post, I use a more sophisticated method, but this example should give you an idea of what I’m talking about.

Bias Correction using Quantile Mapping

Quantile mapping is a popular bias correction method that has been used in various applications. In this method, we first create quantiles of observed and simulated data. After that, whenever we have a simulated value we can find its simulated quantile and replace it with the value of the closest observed quantile.

Figure 1 – A simplified workflow of the quantile mapping technique
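
Before walking through the full workflow step by step, here is a minimal sketch of the core quantile-mapping idea in base R. This is my own illustration with toy data, not the package's internal code:

# Core idea of quantile mapping (illustrative only, not the package's code):
# replace each simulated value with the observed value at the same empirical quantile.
quantile_map <- function(sim_values, obs, sim) {
  q <- ecdf(sim)(sim_values)               # empirical quantile of each simulated value
  quantile(obs, probs = q, names = FALSE)  # observed value at (roughly) the same quantile
}

# Toy example: a simulation that is biased low by 30%
set.seed(42)
obs <- rgamma(1000, shape = 2, scale = 50)
sim <- 0.7 * rgamma(1000, shape = 2, scale = 50)
corrected <- quantile_map(sim, obs, sim)
c(mean(obs), mean(sim), mean(corrected))   # the corrected mean should be close to the observed mean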

We generally follow these steps to do the bias correction (see here):

  • Monthly Quantiles

We need to have two monthly time series of observed and simulated streamflow. If both series use daily time steps, we must aggregate the daily values to monthly values first, to create the monthly quantiles. We then sort the observed and simulated streamflows and assign each value to a quantile. This process generates streamflow quantiles for each month (January, February, etc.).

  • Monthly Bias Correction

When the quantiles are ready, we can start from the first month of the simulated results and determine which quantile its value belongs to. Then we can go to the same quantile of the observed values and use that value instead of the simulated one. This creates a monthly bias-corrected streamflow time series.

  • Annual Adjustment

Hamlet and Lettenmaier (1999; also here) argue that monthly bias correction can dramatically change the magnitude of annual streamflow predictions. Although hydrologic models usually perform poorly at monthly time steps, they are pretty good at capturing annual variations, so we tend to rely on their annual totals. We therefore calculate the annual difference between the bias-corrected and simulated flows and apply it to each individual month. This way, the monthly variations stay consistent with the recorded streamflow while the model still determines what the average annual flow looks like. (A rough sketch of this step and of the daily disaggregation follows this list.)

  • Disaggregation to Daily

After we apply the annual adjustments, we can use simulated or historical observed values to disaggregate the monthly time series to a daily one. The “biascorrection” package provides two methods for doing that: (1) rescaling the raw simulated daily time series to match the monthly bias-corrected values; for example, if the total simulated streamflow for a month is half of the bias-corrected value for that month, the disaggregation module multiplies the raw daily simulated values by two for that specific month; and (2) sampling from the daily observed historical time series, in which case the model uses KNN to find some of the closest months in the historical period (in terms of average monthly values) and randomly selects one of them.
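
To make the last two steps more concrete, here is a rough sketch of the underlying arithmetic in R. These are my own illustrative helper functions, not the package's code, and they assume simple multiplicative adjustments:

# Illustrative only (not the package's code); assumes multiplicative adjustments.

# Annual adjustment: rescale the bias-corrected monthly flows so that each year's
# total matches the simulated annual total.
adjust_annual <- function(bc_monthly, sim_monthly, year) {
  ratio <- tapply(sim_monthly, year, sum) / tapply(bc_monthly, year, sum)
  bc_monthly * ratio[as.character(year)]
}

# Scaling disaggregation: multiply each raw simulated daily flow by the ratio of the
# bias-corrected monthly total to the raw simulated monthly total for its month.
disaggregate_scaling <- function(sim_daily, bc_monthly_total, month_id) {
  # bc_monthly_total must be named by the same month identifiers used in month_id
  sim_monthly_total <- tapply(sim_daily, month_id, sum)
  scale <- bc_monthly_total[names(sim_monthly_total)] / sim_monthly_total
  sim_daily * scale[as.character(month_id)]
}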

In some situations in large river basins, several upstream stations contribute to a downstream station. Bias correction can add or remove water from the system, which can cause spatial inconsistencies and water-budget problems. To ensure that the basin-wide water balance isn’t violated in these cases, we start from the station furthest downstream and move upstream, making sure that the total water generated upstream of each station and the incremental flow between stations sum to the same total downstream flow. This procedure is not included in the package.

In climate change studies, if the historical, simulated, and observed quantiles do not differ widely from the projected future scenarios, we can use the historical quantiles. However, if the future values are fundamentally different from the historical time series, this might not be justifiable. In such a case, synthetic generation of future data may be the way to go. The R package doesn’t include this either.

There are just two quick disclaimers:

  1. The R package has been tested and seems to perform well, but its complete soundness is not guaranteed.
  2. This blog post and the R package only introduce bias correction as a common practice; I do not endorse it as a remedy for all the problems of hydrological models. Readers should keep in mind that there are serious criticisms of bias correction (for example, here); I will discuss some of them in the following sections.

Arguments Against Bias Correction

One of the main advantages of hydrological models is that they can simultaneously and, arguably, consistently simulate different water and energy balance processes. However, bias correction manually perturbs streamflow without taking into account its effects on other components of the system. Therefore, it takes away one of the main advantages of hydrological models. In some cases, this can even distort climate change signals.

The other problem is that bias correction tries only to match the overall, aggregate statistics of the simulated flow with the observed flow, although streamflow has many more attributes. For example, current bias-correction algorithms can systematically ignore extremes that occur at daily to weekly time scales.

The “biascorrection” Package

I recently developed an R package that can be used for bias correction in simulated streamflow. The package has four main functions. Its workflow is consistent with the four bias-correction steps described above: (1) quantile creation, (2) monthly quantile mapping, (3) annual adjustment, (4) disaggregation to daily. The package doesn’t have a prescribed unit, and it can be used in different applications that require bias correction.

How to Install the “biascorrection” Package

The package is available on GitHub (here) and it can be installed using the following command:

devtools::install_github("keyvan-malek/biascorrection")
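
Note that this command relies on the devtools package; if you don’t already have it, install it first:

install.packages("devtools")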

You can also go to my “blog” folder on GitHub to download simulated and observed datasets of streamflow at the inlet of the Arrow dam in British Columbia (a.k.a. the Keenleyside Dam). The simulated flow was generated using the Variable Infiltration Capacity model (VIC), and the observed flow comes from the Bonneville Power Administration’s No Regulation-No Irrigation streamflow datasets.

observed_input<-read.table("sample_data/Arrow_observed.txt", header = T)
simulated_input<-read.table("sample_data/Arrow_simulated.txt", header = T)

Note that the two datasets have different start and end dates. I intentionally used them to show how the biascorrection package handles these types of datasets.

I plotted the overlapping period of the two datasets to demonstrate the difference between them. The simulated dataset tends to underestimate streamflow during low-flow periods and overestimate it during high-flow periods.

Figure 2 – Observed vs simulated inflow to Arrow dam

Monthly Bias Correction

To run the monthly bias-correction function, you first need to define the starting and ending dates of your observed and simulated time series:

start_date_observed<-"1929-01-01"
end_date_observed<-"2007-12-31"
start_date_simulated<-"1979-01-01"
end_date_simulated<-"2015-12-31"

You also need to define the following two conditions:

time_step<-"day"
date_type<-"JY" ## Water year (WY) or Julian Year (JY)

Finally, we can use the following command to calculate the monthly bias-corrected flow:

observed_flow<-observed_input$ARROW_obs
simulated_flow<-simulated_input$ARROW_sim

df_bc_month<-bias_correct_monthly(observed_flow, simulated_flow, start_date_observed, end_date_observed, start_date_simulated, end_date_simulated, time_step, date_type)

Note that the observed and simulated inputs to the “bias_correct_monthly” function are not data frames; here they are the columns extracted from the input data frames. Here is how bias correction changes the simulated streamflow. As you can see in the following figure, the bias-corrected flow no longer seems to have the underestimation problem during low-flow periods.

Figure 3- Monthly VIC simulated flow vs. bias-corrected flow
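
If you want to reproduce a comparison like the one in Figure 3 yourself, a quick base-R plot along the following lines works; the column names are the ones used in the annual-adjustment call below and may need adjusting for your own output:

# Quick visual check of the monthly bias correction (column names assumed from the call below)
plot(df_bc_month$simulated, type = "l", col = "red",
     xlab = "Month index", ylab = "Streamflow")
lines(df_bc_month$bias_corrected, col = "blue")
legend("topright", legend = c("Simulated", "Bias-corrected"),
       col = c("red", "blue"), lty = 1)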

Annual Adjustment

You can use the following command to return the average annual flow back to what the model originally simulated. Note that the inputs to this function are outputs of the monthly function.

df_bc_annual<-bias_correct_annual(df_bc_month$simulated, df_bc_month$bias_corrected, start_date_simulated, end_date_simulated)

Daily Disaggregation

The following function first uses the monthly function, then applies the annual adjustment, and finally disaggregates the monthly streamflow to daily. The package has two options for converting monthly flow to daily: (1) it can multiply the simulated streamflow of each month by its bias-correction coefficient, or (2) it can use the K-nearest neighbors (KNN) method to find the closest month in the observed record.

disaggregation_method<-"scaling_coeff" # "scaling_coeff" or "knn"

df_bc_daily<-bias_correct_daily(observed_flow, simulated_flow, start_date_observed, end_date_observed, start_date_simulated, end_date_simulated, time_step, date_type, disaggregation_method)

Here is how the entire bias-correction procedure affects the simulated streamflow:

Figure 4 – Simulated vs. bias-corrected flow after annual adjustment and daily disaggregation

Known Issues and Future Plans

  • Currently the biascorrection package only accepts complete years. For example, if your year starts in January it has to end in December; it cannot continue to, say, the following February.
  • I am thinking about adding more functionality to the biascorrection package in a few months. Some built-in post-processing options, built-in example datasets, and at least one more bias-correction technique are some of the options.
  • For the next version, I am also thinking about releasing the model through CRAN.

Variable Infiltration Capacity (VIC) Model – Part 2

As promised, I am back with the second part of my blog post for variable infiltration capacity (VIC) model training. Before we start talking about how you can run the model, take a look at my first blog post on VIC; that post will hopefully give you a high-level understanding of the model as well as its application, theoretical background, and main processes.

In this blog post, we’ll go over model input and output files and things that you need to do to run the model. We’ll use a “popular among first-timers” real-world example provided by the VIC development team. I found the instructions provided by the VIC website clear and very helpful, and I recommend referring to the site for more information beyond this post. However, the goal of this blog post is to provide easy-to-follow hands-on instructions on how to run the VIC model. Lastly, before I forget, although many of these details are applicable to all VIC versions, these instructions are specifically for the VIC-4.x and VIC-5 (classic mode) versions.

Input Files

The most important input to the VIC model is likely the global parameter file. The global parameter file has two main responsibilities: (1) setting the model’s main options and (2) providing the paths to other input files. The other input files include soil parameters, meteorological data, vegetation library, vegetation parameter file, and snow band file. The following sections start by providing more information about these input files and then discuss the global parameter file.

Soil File

Broadly speaking, the soil file provides two categories of information to the model. (1) Calibration parameters include the coefficient that adjusts the rainfall-runoff formulation (bi) and additional parameters that control baseflow generation (Ws, Ds, Dsmax, C). Calibrating the depths of the middle and last soil layers of VIC is also a standard practice. However, keep in mind that although the snow model of VIC can be calibrated, its parameters are not in the soil file. (2) Soil textural information includes the parameters of the Brooks-Corey/Campbell relationship, soil depth, soil bulk density, field capacity, permanent wilting point, and more. The soil file usually has multiple lines, and each line corresponds to a specific grid cell. The soil file also tells the model whether or not a specific grid cell should be part of the simulation; the first number in each line carries this flag, and if it is zero (0), the cell will not be part of the simulation.

Met Data

Met data in VIC also has two options. (1) Users can only provide precipitation, maximum temperature, minimum temperature, and wind speed; VIC’s internal weather generator (MTCLIM; Bohn et al. 2013) calculates the rest of the parameters such as shortwave and longwave radiation, atmospheric pressure, and vapor pressure. (2) Users can provide a complete time series of input data.

Vegetation Parameter File

The vegetation parameter file tells the model what percentage of each grid cell is occupied by each vegetation cover type. If you look at a vegetation parameter file (see the test case for an example) and sum up all the fractions in each grid cell, you will probably notice that the total is almost always less than one. This is because the rest of the grid cell is bare soil with no vegetation cover; keep in mind that bare soil is still part of the simulation. The vegetation parameter file also includes information about root depth, root fraction, and other vegetation-related parameters.

Vegetation Library

The vegetation library provides the model with the characteristics of each vegetation type (for example, albedo, LAI, canopy coverage, roughness, and other parameters that the model needs to calculate Penman-Monteith evapotranspiration). The original vegetation library comes with twelve vegetation classes, and users usually don’t need to modify this file unless they want to add a new vegetation class. Keep in mind that the original vegetation library does not do a good job of representing different crop types; that’s one of the reasons that VIC-CropSyst exists.

Snow Band (aka Elevation Band) File

The snow band file divides each grid cell into different elevations. The model simulates each elevation band separately while lapsing temperature and precipitation for each elevation. VIC does this to improve the accuracy of the simulation. Remember that VIC is a large-scale model; a 50 km by 50 km grid cell might contain mountains and valleys with various climates. Although using the snow band file is optional (as specified in the global parameter file), doing so is usually recommended—especially over snow-dominant regions—because snow is very sensitive to elevation.

Global Parameter File

The global parameter file provides two things:

(1) The model’s main options, such as the output parameters of interest; whether or not the model should simulate frozen soil and full energy balance; and information related to the start and end dates of the simulation, the start date of the met data file, the parameters included in the met data file, the number of soil layers, and the maximum number of snow bands. (2) The paths to the other input files.

How to Download VIC

VIC works in the Linux/Unix environment. The VIC website has recently been updated and provides everything that you need to know about the model. To download the model, go to its GitHub page. The model comes with all necessary codes. You should explore different folders in your “VIC-master” folder. However, in this example, we are only interested in the “/VIC-master/vic/drivers/classic” directory, which provides the VIC code and executable files.

How to Download the Test Dataset

VIC has a test dataset for the Stehekin River basin in the Pacific Northwest. You can download it from here. This provides you with all the input files needed for the test case.

How to Adjust the Global Parameter File

The global parameter file needs to be adjusted based on what you want from the model. For example, if you need a specific output parameter, you must include it in the list of outputs and modify the number of output parameters accordingly. For this test case, let’s stick to the model’s original settings; you just need to point the global parameter file to your own directories. To do so, open “VIC_sample_data-master/classic/Stehekin/parameters/global_param.STEHE.txt” and edit the file paths so they match the locations of the input files and your desired output directory on your machine.

Create the Executable File

To run VIC, you need an executable file. To create it, go to “/VIC-master/vic/drivers/classic” and use Linux’s make command:

make clean
make

Run VIC

Finally, you’re ready to run the model. It’s as simple as typing the following command into your Linux terminal:

/VIC-master/vic/drivers/classic/vic_classic.exe -g VIC_sample_data-master/classic/Stehekin/parameters/global_param.STEHE.txt

I think that’s enough for the VIC model. As I mentioned in my last blog post, there is a coupled agro-hydrologic model called VIC-CropSyst that simulates agricultural processes on top of hydrology. If time allows, I will post about VIC-CropSyst in the future.

Nondimensionalization of differential equations – an example using the Lotka-Volterra system of equations

I decided to write about nondimensionalization today since it’s something I only came across recently and found very exciting. It’s apparently a trivial process for a lot of people, but it wasn’t something I was taught during my education, so I thought I’d put a short guide out on the interwebs for other people. Nondimensionalization (also referred to as rescaling) refers to the process of transforming an equation to a dimensionless form by rescaling its variables. There are a couple of benefits to this:

  • All variables and parameters in the new system are unitless, i.e., scales and units are not an issue in the new system and the dynamics of systems operating on different time and/or spatial scales can be compared;
  • The number of model parameters is reduced to a smaller set of fundamental parameters that govern the dynamics of the system; which also means
  • The new model is simpler and easier to analyze, and
  • The computational time becomes shorter.

I will now present an example using a predator-prey system of equations. In my last blogpost, I used the Lotka-Volterra system of equations for describing predator-prey interactions. Towards the end of that post I talked about the logistic Lotka-Volterra system, which is in the following form:

\frac{dx}{dt} = bx\left(1-\frac{x}{K}\right) - axy

\frac{dy}{dt} = caxy - dy

where x is prey abundance, y is predator abundance, b is the prey growth rate, d is the predator death rate, c is the rate with which consumed prey is converted to predator abundance, a is the rate with which prey is killed by a predator per unit of time, and K is the carrying capacity of the prey given its environmental conditions.

The first step is to define the original model variables as products of new dimensionless variables (e.g. x*) and scaling parameters (e.g. X) carrying the same units as the original variables:

x = Xx^*, \quad y = Yy^*, \quad t = Tt^*

The rescaled variables are then substituted into the original model:

\frac{X}{T}\frac{dx^*}{dt^*} = bXx^*\left(1-\frac{Xx^*}{K}\right) - aXYx^*y^*

\frac{Y}{T}\frac{dy^*}{dt^*} = caXYx^*y^* - dYy^*

Carrying out all cancellations and obvious simplifications:

\frac{dx^*}{dt^*} = Tbx^*\left(1-\frac{Xx^*}{K}\right) - TaYx^*y^*

\frac{dy^*}{dt^*} = TcaXx^*y^* - Tdy^*

Our task now is to define the rescaling parameters X, Y, and T to simplify our model – remember they have to have the same units as our original variables.

 

Variable/parameter   Unit
x                    mass 1
y                    mass 1
t                    time
b                    1/time
d                    1/time
a                    1/(mass∙time) 2
c                    mass/mass 3
K                    mass

There’s no single correct way of going about doing this, but using the units for guidance and trying to be smart, we can simplify the structure of our model. For example, setting X=K will eliminate the carrying capacity from the prey equation (notice that this way X has the same units as our original x variable).

\frac{dx^*}{dt^*} = Tbx^*(1-x^*) - TaYx^*y^*

\frac{dy^*}{dt^*} = TcaKx^*y^* - Tdy^*

The choice of Y is not very obvious, so let’s look at T first. We could go with either T=1/b or T=1/d. Unit-wise they both work, but one would serve to eliminate a parameter from the first equation and the other from the second. The decision here depends on what dynamics we’re most interested in, so for the purposes of demonstration, let’s go with T=1/b.

\frac{dx^*}{dt^*} = x^*(1-x^*) - \frac{aY}{b}x^*y^*

\frac{dy^*}{dt^*} = \frac{caK}{b}x^*y^* - \frac{d}{b}y^*

We’re now left with defining Y, which only appears in the second term of the first equation. Looking at that term, the obvious substitution is Y=b/a, resulting in this set of equations:

\frac{dx^*}{dt^*} = x^*(1-x^*) - x^*y^*

\frac{dy^*}{dt^*} = \frac{caK}{b}x^*y^* - \frac{d}{b}y^*

Our system of equations is still not dimensionless, as we still have the model parameters to worry about. We can now define aggregate parameters using the original parameters in such a way that they will not carry any units and they will further simplify our model.

By setting p_1=\frac{caK}{b} and p_2=\frac{d}{b} we can transform our system to:

\frac{dx^*}{dt^*} = x^*(1-x^*) - x^*y^*

\frac{dy^*}{dt^*} = p_1x^*y^* - p_2y^*

a system of equations with no units and just two parameters.
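
As a sanity check, we can verify the rescaling numerically. The sketch below assumes the deSolve package is installed and uses made-up parameter values; the rescaled solution, mapped back to the original units, should match the solution of the full model up to integration error:

# Numerical check of the rescaling with deSolve (made-up parameter values)
library(deSolve)

lv_full <- function(t, state, parms) {          # original (dimensional) system
  with(as.list(c(state, parms)), {
    dx <- b * x * (1 - x / K) - a * x * y
    dy <- c_conv * a * x * y - d * y
    list(c(dx, dy))
  })
}

lv_scaled <- function(t, state, parms) {        # nondimensional system, two parameters
  with(as.list(c(state, parms)), {
    dxs <- xs * (1 - xs) - xs * ys
    dys <- p1 * xs * ys - p2 * ys
    list(c(dxs, dys))
  })
}

pars  <- c(b = 0.5, d = 0.2, a = 0.01, c_conv = 0.1, K = 1000)
p1    <- 0.1 * 0.01 * 1000 / 0.5               # caK/b
p2    <- 0.2 / 0.5                             # d/b
times <- seq(0, 100, by = 0.1)

out_full   <- ode(y = c(x = 100, y = 10), times = times, func = lv_full, parms = pars)
out_scaled <- ode(y = c(xs = 100 / 1000, ys = 0.01 * 10 / 0.5),  # x* = x/K, y* = ay/b
                  times = 0.5 * times,                            # t* = bt
                  func = lv_scaled, parms = c(p1 = p1, p2 = p2))

max(abs(out_full[, "x"] - 1000 * out_scaled[, "xs"]))  # ~0 up to numerical error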

 

1 Prey and predator abundance don’t necessarily have to be measured in mass units; they could be measured in volume, density, or something else. The units for parameters a, c, and K would change accordingly, and the rescaling still holds.

2 This is the death rate per encounter with predator per time t.

3 This is the converted predator (mass) per prey (mass) consumed.

 

Dealing With Multicollinearity: A Brief Overview and Introduction to Tolerant Methods

This semester I’m taking a Multivariate statistics course taught by Professor Scott Steinschneider in the BEE department at Cornell. I’ve been really enjoying the course thus far and thought I would share some of what we’ve covered in the class with a blog post. The material below on multicollinearity is from Dr. Steinschneider’s class, presented in my own words.

What is Multicollinearity?

Multicollinearity is the condition where two or more predictor variables in a statistical model are linearly related (Dormann et al. 2013). The existence of multicollinearity in your data set can inflate the variance of regression coefficients, leading to unstable estimation of parameter values. This in turn can lead to erroneous identification of relevant predictors within a regression and detracts from a model’s ability to extrapolate beyond the range of the sample it was constructed with. In this post, I’ll explain how multicollinearity causes problems for linear regression by Ordinary Least Squares (OLS), introduce three metrics for detecting multicollinearity, and detail two “Tolerant Methods” for dealing with multicollinearity within a data set.

How does multicollinearity cause problems in OLS regression?

To illustrate the problems caused by multicollinearity, let’s start with a linear regression:

y=x\beta +\epsilon

Where:

y = a \hspace{.1 cm} vector \hspace{.1 cm} of \hspace{.1 cm} observations \hspace{.1 cm} of \hspace{.1 cm} the \hspace{.1 cm} predictand

x = a \hspace{.1 cm} vector \hspace{.1 cm} of \hspace{.1 cm} predictor \hspace{.1 cm} variables

\beta = a \hspace{.1 cm} vector \hspace{.1 cm} of \hspace{.1 cm} coefficients

\epsilon =  a \hspace{.1 cm} vector \hspace{.1 cm} of \hspace{.1 cm} residuals

The Gauss-Markov theorem states that the Best Linear Unbiased Estimator (BLUE) for each  coefficient can be found using OLS:

\hat{\beta}_{OLS} = (x^Tx)^{-1}x^Ty

This  estimate will have a variance defined as:

var(\hat{\beta}_{OLS}) =\sigma^2 (x^Tx)^{-1}

Where:

\sigma^2 = the \hspace{.1 cm} variance\hspace{.1 cm} of \hspace{.1 cm} the\hspace{.1 cm} residuals

If you dive into the matrix algebra, you will find that the term (x^Tx) is equal to a matrix with ones on the diagonal and the pairwise Pearson’s correlation coefficients (ρ) on the off-diagonals (assuming the predictors have been standardized):

(x^Tx) =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}

As the correlation values increase, the values within (x^Tx)^{-1} also increase. Even with a low residual variance, multicollinearity can cause large increases in estimator variance. Here are a few examples of the effect of multicollinearity using a hypothetical regression with two predictors:

 \rho = .3 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1.10 & -0.33 \\ -0.33 & 1.10 \end{bmatrix}

 \rho = .9 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 5.26 & -4.74 \\ -4.74 & 5.26 \end{bmatrix}

 \rho = .999 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 500.25 & -499.75 \\ -499.75 & 500.25\end{bmatrix}
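
These numbers are easy to reproduce in R, which also makes it convenient to experiment with other correlation values:

# Reproducing the inverse matrices above for a given pairwise correlation
inv_xtx <- function(rho) solve(matrix(c(1, rho, rho, 1), nrow = 2))
inv_xtx(0.3)    # diagonal terms ~1.10
inv_xtx(0.9)    # diagonal terms ~5.26
inv_xtx(0.999)  # diagonal terms ~500.25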

So why should you care about the variance of your coefficient estimators? The answer depends on the purpose of your model. If your only goal is to obtain an accurate prediction of the predictand, the presence of multicollinearity in your predictors might not be such a problem. If, however, you are trying to identify the key predictors that affect the predictand, multicollinearity is a big problem.

OLS estimators with large variances are highly unstable, meaning that if you construct estimators from different data samples you will potentially get wildly different estimates of your coefficient values (Dormann et al. 2013). Large estimator variance also undermines the trustworthiness of hypothesis testing of the significance of coefficients. Recall that the t value used in hypothesis testing for an OLS regression coefficient is a function of the sample standard deviation (the square root of the variance) of the  OLS estimator.

t_{n-2} =\frac{\hat{\beta_j}-0}{s_{\beta_j}}

An estimator with an inflated standard deviation, s_{\beta_j}, will thus yield a lower t value, which could lead to the false rejection of a significant predictor (i.e., a type II error). See Ohlemüller et al. (2008) for some examples where hypothesis testing results are undermined by multicollinearity.

Detecting Multicollinearity within a data set

Now we know how multicollinearity causes problems in our regression, but how can we tell if there is multicollinearity within a data set? There are several commonly used metrics for which basic guidelines have been developed to determine whether multicollinearity is present.

The most basic metric is the pairwise Pearson Correlation Coefficient between predictors, r. Recall from your intro statistics course that the Pearson Correlation Coefficient is a measure of the linear relationship between two variables, defined as:

r_{x_1,x_2}=\frac{cov(x_1,x_2)}{\sigma_{x_1}\sigma_{x_2}}

A common rule of thumb is that multicollinearity may be a problem in a data set if any pairwise |r| > 0.7 (Dormann et al. 2013).

Another common metric is known as the Variance Inflation Factor (VIF). This measure is calculated by regressing each predictor on all others being used in the regression.

VIF(\beta_j) = \frac{1}{1-R^2_j}

Where R^2_j is the R^2 value generated by regressing predictor x_j on all other predictors. Multicollinearity is thought to be a problem if VIF > 10 for any given predictor (Dormann et al. 2013).

A third metric for detecting multicollinearity in a data set is the Condition Number (CN) of the predictor matrix defined as the square root of the ratio of the largest and smallest eigenvalues in the predictor matrix:

CN=\sqrt{\frac{\lambda_{max}}{\lambda_{min}}}

A CN > 15 indicates the possible presence of multicollinearity, while a CN > 30 indicates serious multicollinearity problems (Dormann et al. 2013).
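
As a quick illustration, all three diagnostics can be computed with a few lines of base R on a deliberately collinear, simulated data set. The helper function below is hand-rolled for clarity (packages such as car provide ready-made VIFs), and the condition number is computed from the predictor correlation matrix, which is one common convention:

# Simulated predictors: x1 and x2 strongly correlated, x3 independent
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

# 1. Pairwise Pearson correlations (rule of thumb: flag |r| > 0.7)
cor(X)

# 2. Variance inflation factors (rule of thumb: flag VIF > 10)
vif_manual <- function(X) {
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)
  })
}
vif_manual(X)

# 3. Condition number of the standardized predictors (rule of thumb: flag CN > 30)
ev <- eigen(cor(X), only.values = TRUE)$values
sqrt(max(ev) / min(ev))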

Dealing with Multicollinearity using Tolerant Methods

In a statistical sense, there is no way to “fix” multicollinearity. However, methods have been developed to mitigate its effects. Perhaps the most effective way to remedy multicollinearity is to make a priori judgements about the relationship between predictors and remove or consolidate predictors that have known correlations. This is not always possible however, especially when the true functional forms of relationships are not known (which is often why regression is done in the first place). In this section I will explain two “Tolerant Methods” for dealing with multicollinearity.

The purpose of Tolerant Methods is to reduce the sensitivity of regression parameters to multicollinearity. This is accomplished through penalized regression. Since multicollinearity can result in large, opposite-signed estimator values for correlated predictors, a penalty function is imposed to keep the magnitude of the coefficients below a pre-specified value.

\sum_{j=1}^{p}|\beta_j|^l \leq c

Where c is the predetermined value representing model complexity, p is the number of predictors and l is either 1 or 2 depending on the type of tolerant method employed (more on this below).

Ridge Regression

Ridge regression uses the L2 norm, or Euclidean distance, to constrain model coefficients (ie. l = 2 in the equation above). The coefficients created using ridge regression are defined as:

\hat{\beta}_{r} = (x^Tx+\lambda I)^{-1}x^Ty

Ridge regression adds a constant, λ, to the diagonal of the term x^Tx to construct the estimator. It should be noted that both x and y should be standardized before this estimator is constructed. The ridge regression coefficient is the result of a constrained version of the ordinary least squares optimization problem: the objective is to minimize the sum of squared errors for the regression while meeting the complexity constraint.

\hat{\beta_r} \begin{cases} argmin(\beta) \hspace{.1cm}\sum_{i=1}^{N} \epsilon_i^2  \\  \sum_{j=1}^{p}|\beta_j|^2 \leq c \end{cases}

To solve the constrained optimization, Lagrange multipliers can be employed. Let z equal the Residual Sum of Squares (RSS) to be minimized:

argmin(\beta) \hspace{.3cm}  z= (y-x\beta)^T(y-x\beta)+\lambda(\sum_{j=1}^{p}|\beta_j|^2-c)

This can be rewritten in terms of the L2 norm of β:

z = (y-x\beta)^T(y-x\beta)+\lambda||\beta||^2_2

Taking the derivative with respect to β and solving:

0 = \frac{\partial z}{\partial \beta} = -2x^T(y-x\beta)+2\lambda\beta

x^Ty = x^Tx\beta+\lambda\beta=(x^Tx+\lambda I)\beta

\hat{\beta}_{r} = (x^Tx+\lambda I)^{-1}x^Ty
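
For completeness, here is what that estimator looks like as a few lines of R; x and y are assumed to be standardized, and in practice you would typically use a package such as glmnet rather than rolling your own:

# Hand-rolled ridge estimator for a given lambda
# (x: standardized predictor matrix, y: centered/standardized response)
ridge_beta <- function(x, y, lambda) {
  solve(t(x) %*% x + lambda * diag(ncol(x))) %*% t(x) %*% y
}
# Note: lambda = 0 recovers the OLS estimate.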

Remember that the Gauss-Markov theorem states that the OLS estimate for regression coefficients is the BLUE, so by using ridge regression, we are sacrificing some benefits of OLS estimators in order to constrain estimator variance. Estimators constructed using ridge regression are in fact biased; this can be proven by calculating the expected value of the ridge regression coefficients.

E[\hat{\beta}_r]=(I+\lambda(x^Tx)^{-1})^{-1}\beta \neq \beta

For a scenario with two predictors, the tradeoff between reduced model complexity and increased bias in the estimators can be visualized graphically by plotting the estimators of the two beta values against each other. The vector of beta values estimated by regression is represented as a point on this plot (\hat{\beta}=[\beta_1, \beta_2]). In Figure 1, \hat{\beta}_{OLS} is plotted in the upper right quadrant and represents the estimator that produces the smallest RSS possible for the model. The ellipses centered around \hat{\beta}_{OLS} represent increasing values of RSS resulting from different combinations of β1 and β2; which RSS contour the ridge solution lands on depends on the λ value added to the regression. The circle centered around the origin represents the chosen level of model complexity that constrains the ridge regression. The ridge estimator is the point where this circle intersects an RSS ellipse. Notice that as the value of c increases, the error introduced into the estimators decreases, and vice versa.

Figure 1: Geometric interpretation of a ridge regression estimator. The blue dot indicates the OLS estimate of beta; the ellipses centered around the OLS estimate represent RSS contours for each beta 1, beta 2 combination (denoted here as z in the optimization equation above). The model complexity is constrained by the distance c from the origin. The ridge regression estimator of beta is shown as the red dot, where the RSS contour meets the circle defined by c.

The c value displayed in Figure 1 is only presented to explain the theoretical underpinnings of ridge regression. In practice, c is never specified; rather, a value for λ is chosen prior to model construction. Lambda is usually chosen through a process known as k-fold cross validation, which is accomplished through the following steps:

  1. Partition the data set into K separate sets of equal size
  2. For each k = 1, …, K, fit the model excluding the kth set
  3. Predict for the kth set
  4. Calculate the cross-validation error (CVerror) for the kth set: CV^{\lambda_0}_k = E[\sum(y-\hat{y})^2]
  5. Repeat for different values of \lambda_0, and choose the \lambda_0 that minimizes CV^{\lambda_0} = \frac{1}{K}\sum_{k=1}^{K}CV^{\lambda_0}_k
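
A minimal version of this procedure, applied to the hand-rolled ridge estimator from above on simulated data, might look like the following sketch (in practice glmnet::cv.glmnet does all of this for you):

# k-fold cross validation to choose lambda (simulated data, illustrative only)
set.seed(1)
n <- 100; p <- 2
x <- scale(matrix(rnorm(n * p), n, p))
y <- as.vector(scale(x %*% c(1, -0.5) + rnorm(n)))

k       <- 10
folds   <- sample(rep(1:k, length.out = n))
lambdas <- 10^seq(-3, 2, length.out = 50)

cv_error <- sapply(lambdas, function(lambda) {
  mean(sapply(1:k, function(i) {
    train <- folds != i
    beta  <- ridge_beta(x[train, ], y[train], lambda)
    mean((y[!train] - x[!train, ] %*% beta)^2)   # squared error on the held-out fold
  }))
})
lambdas[which.min(cv_error)]   # lambda with the smallest average cross-validation error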

Lasso Regression

Another Tolerant Method for dealing with multicollinearity, known as Least Absolute Shrinkage and Selection Operator (LASSO) regression, solves the same constrained optimization problem as ridge regression, but uses the L1 norm rather than the L2 norm as a measure of complexity.

\hat{\beta}_{Lasso} \begin{cases} argmin(\beta) \hspace{.1cm}\sum_{i=1}^{N} \epsilon_i^2 \\ \sum_{j=1}^{p}|\beta_j|^1 \leq c \end{cases}

LASSO regression can be visualized similarly to ridge regression, but since c is defined by the sum of absolute values of beta, rather than the sum of squares, the area it constrains is diamond shaped rather than circular. Figure 2 shows the selection of the beta estimator from LASSO regression. As you can see, the use of the L1 norm means LASSO regression selects one of the predictors and drops the other (weights it as zero). This has been argued to provide more interpretable estimators (Tibshirani 1996).

Figure 2: Geometric interpretation of a LASSO regression estimator. The blue dot indicates the OLS estimate of beta; the ellipses centered around the OLS estimate represent RSS contours for each beta 1, beta 2 combination (denoted as z in the optimization equation). The model complexity is constrained by the diamond defined by the L1 norm and c. The LASSO estimator of beta is shown as the red dot, the location where the RSS contour intersects the diamond defined by c.
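
In practice, you rarely need to code either method from scratch; the glmnet package fits both, with alpha = 0 giving ridge and alpha = 1 giving LASSO. A quick sketch, reusing the simulated x and y from the cross-validation example above:

# Ridge and LASSO via glmnet (lambda chosen internally by k-fold cross validation)
library(glmnet)
ridge_fit <- cv.glmnet(x, y, alpha = 0)
lasso_fit <- cv.glmnet(x, y, alpha = 1)
coef(lasso_fit, s = "lambda.min")   # coefficients at the lambda that minimizes CV error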

Final thoughts

If you’re creating a model with multiple predictors, it’s important to be cognizant of the potential for multicollinearity within your data set. Tolerant methods are only one of many possible remedies for multicollinearity (other notable techniques include data clustering and Principal Component Analysis), but it’s important to remember that no known technique can truly “solve” the problem of multicollinearity. The method used to deal with multicollinearity should be chosen on a case-by-case basis, and multiple methods should be employed if possible to help identify the underlying structure within the predictor data set (Dormann et al. 2013).

Citations

Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D. and Lautenbach, S. 2013, “Collinearity: a review of methods to deal with it and a simulation study evaluating their performance.” Ecography, 36: 27–46. doi:10.1111/j.1600-0587.2012.07348.x

Ohlemüller, R. et al. 2008. “The coincidence of climatic and species rarity: high risk to small-range species from climate change.” Biology Letters. 4: 568 – 572.

Tibshirani, Robert 1996. “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society. Series B (Methodological): 267-288.