Guide to Your First Year in the Reed Research Group

I’m finishing up my first year as a MS/PhD student in Reed Research Group and I would like to use this blog post to formally list resources within the blog that I found especially useful and relevant to my  first year of training. We are also at the point where many of the senior students in the group are moving on to new positions, so I would also like to use this blog post to consolidate tips and tricks that I learned from them that will hopefully be helpful to future students as well.

Blog Posts

There are 315 blog posts on this Water Programming Blog. Chances are, if you have a question, it has already been answered in one of these posts. However, when I first joined the group, it was sometimes hard for me to know what I was even supposed to be searching for. Here are some blog posts that I found particularly useful when I started out or ones that I continue to regularly refer to.

Getting Oriented with the Cube

What even is a cluster? I had no idea when I first arrived but this post brought me up to speed.

Understanding the Terminal

1. Using MobaXterm as a terminal is incredibly intuitive, especially for someone like me who had rarely touched a terminal in undergrad. MobaXterm allows you to drag and drop files from your computer directly into your directory on the Cube. Furthermore, with the MobaXterm graphical SFTP browser you can navigate through your directories similarly to a Windows environment. I found that it was easier to use other terminal environments like Cygwin after I had gotten used to the terminal through MobaXterm. See Dave’s post here.
2. Once you are oriented with how the terminal works, the best thing to do is practice navigating using Linux commands. Linux commands can also be very helpful for file manipulation and processing. When I first started training, I was much more comfortable opening text files, for example, in Excel, and making the necessary changes. However, very quickly, I was confronted with manipulating hundreds of text files or set files at a time, which forced me to learn Linux commands. After I learned how to properly used these commands, I wished I had started using them a long time ago. You will work much more efficiently if start practicing the Linux commands listed in Bernardo’s blog post.

Using Borg and the MOEA Framework

Most of my second semester was spent reproducing Julie Quinn’s Lake Problem paper, which is when I first started to understand how to use Borg. It took me entirely too long to realize that the commands in Jazmin’s tutorials here and here are completely generalizable for any application requiring the MOEA framework or Borg. Since these tutorials are done so early in training, it is very easy to forget that they may be useful later and applied to problems other than DTLZ. I found myself referring back many times to these posts to remember the commands needed to generate a reference set from multiple seeds and how to execute Borg using the correct flags.

Using GitHub, Bitbucket, and Git commands

I had heard GitHub tossed around by CS majors in undergrad but it never occurred to me that I would be using it one day. Now, I have realized what a great tool it is for code version control. If used correctly, it makes sharing code with collaborators so much more clean and organized. However, before you can “clone” the contents of anyone’s repository to your own computer, you need an SSH key, which was not obvious to me as newbie to both Github and Bitbucket. You also need a different SSH key for every computer that you use. To generate an SSH key, refer to 2) of this post. Then you can add the generated keys in your profile settings on your Github and Bitbucket accounts.

Once you have keys, you can start cloning directories and pushing changes from your local version to the repository that you cloned from using Git commands outlined in this blog post.

Pro Tips

A consolidation of notes that I wrote down from interactions with senior students in the group that have proven to be useful:

1.  If you can’t get your set files to merge, make sure there is a # sign at the end of each set file.
2. If a file is too big to view, use the head or tail command to see the first few lines or last lines of a file to get an idea of what the contents of the file look like.
3. Every time you submit a job, a file with the name of the job script and job number will appear in your directory. If your code crashes and you aren’t sure where to start, this file is a good place to see what might be going on. I was using Borg and couldn’t figure out why it was crashing after just 10 minutes of running because no errors were being returned. When I looked at this file, hundreds of outputs had been printed that I had forgotten to comment out. This had overloaded the system and caused it to crash.
4. If you want to compile a file or series of files, use the command make. If you have multiple make files in one folder, then you’ll need to use the command make -f . If you get odd errors when using the make command, try make clean first and then recompile.
5. Most useful Cube commands:qsub to submit a job

qdel job number if you want to delete a job on the cube

qsub -I to start an interactive node. If you start an interactive node, you have one node all to yourself. If you want to run something that might take a while but not necessarily warrant submitting a job, then use an interactive node (don’t run anything large on the command line). However, be aware that you won’t be able to use your terminal until your job is done. If you exit out of your terminal, then you will be kicked out of your interactive node.

In retrospect, I see just how much I have learned in just one year of being in the research group. When you start, it can seem like a daunting task. However, it is important to realize that all of the other students in the group were in your position at one point. By making use of all the resources available to you and with time and a lot of practice, you’ll get the hang of it!

A Deeper Dive into Principal Component Analysis

This post is meant to be a continuation of Dave Gold’s introductory post on Principal Component Analysis, which is an excellent explanation on how to conduct a PCA and visualize the principal components. The goal of this post is to elaborate on how to proceed after you have conducted a PCA and to address some common questions and concerns associated with the method.

Performing a PCA in R

Often times, you will perform a PCA on large datasets that contain many variables and/or many observations. One such dataset that will be used as an example is the Living Blended Drought Atlas (LBDA) which is a reconstruction of the Palmer Drought Severity Index (PDSI) over the contiguous United States from 1473-2005. This dataset contains 4968 columns, or variables, each of which is a grid cell over the U.S., and 533 rows, each of which is a yearly observation. We will call this dataset, X. In matrix notation, we will denote the PCA analysis formula as:

U=XW    (1)

where X is the dataset, W is the weighting matrix, whose columns are the key patterns in the data, and U is the matrix whose columns are the resulting principal components (PCs). You can perform a PCA on this dataset with a single function in R, prcomp.

 PCA=prcomp(X, scale=TRUE/FALSE)

The first input into the function is your data matrix and the second input is used to declare if your dataset should be scaled to have a unit variance before the PCA is conducted. There are various other inputs into the function, listed here, that can be included if necessary.

prcomp returns three sets of results in a list that we have called “PCA”:

1. sdev: the standard deviations of the principal components (if you square them, you get the eigenvalues of the covariance/correlation matrix)
2. rotation: the loading matrix whose columns are the eigenvectors (W in the equation above)
3. x: the rotated data or your PCs (The columns of U in the equation above)

And that’s it! You have the results of the PCA. Now comes the more difficult part: interpreting them.

How do I choose how many PCs to keep?

The dimensions associated with equation (1) are as follows:

U=XW

(nxk)=(nxk)(kxk)

If the number of observations is much larger than the number of variables in the dataset, i.e. n>>k, then the PCA will return k distinct eigenvectors. In our case, since n is smaller than our number of variables, the most non-zero eigenvectors that the PCA will return is n. Either way, we have many supposedly distinct patterns. How do we decide how many of those patterns to keep?

The answer is not always clear and most often subjective and case-dependent. One common tool used is a scree plot or an eigenvalue spectrum.

Scree Plot

Each column of our W matrix is a distinct, independent pattern, also called an empirical orthogonal function (EOF). Each EOF is responsible for explaining some amount of variance in the dataset. A scree plot, shown in Figure 1, allows you visualize this variance breakdown.

Figure 1: Scree Plot

On the x-axis of the scree plot is the EOF (we’ve chosen to keep 10) and on the y-axis is the total variance explained by that EOF. Each variance is equivalent to the eigenvalue associated with its respective eigenvector. You can find the eigenvalues/variances by squaring of the results of sdev that is returned by prcomp.

Generally speaking, one can look for the “elbow” of the scree plot to determine at what point to truncate the EOFs and retain up to the EOF before the elbow. At about the 5th EOF, the graph starts to level off and all subsequent EOFs start to contribute about the same amount of variance. Therefore, the elbow of the graph is located at about the 5th EOF and you will retain the first 4 EOFs.

North’s Rule of Thumb

North’s Rule of Thumb is more of a precise way of truncating and involves creating confidence intervals around your estimates of the variance. The rule states that you should truncated EOFs only when the confidence intervals of the variances start to intersect. At this point, the eigenvectors are considered too close to be interpretable and spacing might be due to sampling error rather than a clear distinction between the variances [1].

Rotated EOFs

At some point, your EOFs might start to exhibit patterns that can be hard to interpret or attribute to a physical phenomenon. It is not uncommon for these types of patterns to result from pure noise in your data, especially if you are analyzing latter EOFs that explain a very small amount of the variance [2].

Rotating EOFs is a practice that is done to simplify the patterns obtained in EOFs and make them more interpretable. A varimax orthogonal rotation can be used to determine an optimum rotation matrix that maximizes the variance in the columns of W. The variance of the columns is maximized by driving some of the loadings to zero and trying to maximize the values of other loadings. In R, this is done using the varimax function.

 my.varimax=varimax(PCA$rotation[,1:10])  In the above command, my input is the first 10 EOFs from the original weighting matrix. The result, my.varimax, is a list with the following components: 1. loadings: the resulting rotated loading matrix 2. rotmat: the rotation matrix The new loading matrix, Wrot , is still orthogonal after the rotation, and the eigenvectors are, therefore, still orthonormal. However, multiplication of the rotated loading matrix by the original dataset, X, to obtain a new U, results in principal components are not guaranteed to be independent. This can be seen through further inspection of the correlation matrix associated with U. Unfortunately, this is a tradeoff associated with obtaining EOFs that are simpler and easier to interpret. Sources: [1] http://yyy.rsmas.miami.edu/users/bmapes/teaching/MPO581_2011/EOF_chapter_DelSole.pdf [2] Hannachi, A., Jolliffe, I.T., and Stephenson, D.B. (2007), Empirical orthogonal functions and related techniques in atmospheric science: A review, International Journal of Climatology, 27, 1119-1152. *All information or figures not specifically cited came from class notes and homework from Dr. Scott Steinschneider’s class, BEE 6300: Environmental Statistics Transitioning from R to Python with Spyder I am currently in the midst of transitioning from R, my preferred programming language, over to Python. While the syntax of the languages is similar, I was quickly overwhelmed with the choices of Integrated Development Environments (IDEs) and text editors for Python. Choosing an IDE is a deeply personal choice and the one you choose depends on your skill level and programming needs. I have tried out many different IDEs, and, for the purpose of making a smooth, intuitive, transition from R to Python, the best one for me was Spyder (Scientific Python Development Environment). In this blog post, I will give an overview of the Spyder environment and step through some of its functionalities. Installation The easiest way to install Spyder is through a Python Scientific Distribution found here. There are three options, but I chose to install Anaconda which gives you the core Python language, over 100 main Python libraries, and Spyder. It is an incredibly efficient way to get everything you need in just one download and works for both Windows and Mac. Once this is installed, you can open Spyder immediately. Environment Figure 1: Spyder Environment The first aspect that I like about Spyder is how similar it looks to RStudio and Matlab, as shown in Figure 1. This made the transition very easy for me. As shown in Figure 2, the Spyder environment is comprised of a collection of panes which can be repositioned by dragging if a different format is more intuitive to the user. To see which panes are open, click View->Panes. The most useful panes will already be open by default. You can choose to keep either the console or the IPython console. This is a matter of preference and I chose to use the regular console. Figure 2: Choosing Panes At the top of the screen is your directory, which, by default, is set to the folder which contains Anaconda. You can change it to your preferred location on your computer by clicking the folder icon next to the drop down arrow. Editor The leftmost pane is the editor which is where code can be written. The Spyder editor has features such as syntax coloring and real-time code analysis. By default, a temporary script, temp.py, will be open. Go ahead and save this in your current directory. Make sure that the file shown in the gray bar matches your directory (shown in Figure 3). Figure 3: Setting a Directory Let’s write a simple script to test out the environment (shown in Figure 4). Figure 4: Sample Script and Run Settings Click the green arrow at the top of the screen to run the script. A box will pop up with Run Settings. Make sure the working directory is correct and click “Run.” If you just want to run a certain section of the script, you can highlight that section and click the second green arrow with the blue and orange box. Console The results from the script will appear in the console, which is my bottom right pane. The user can also execute a command directly in this console. Figure 5: Console Object Inspector/Variable Explorer/File Explorer The last major aspect of the environment is the top right pane, which is a comprised of three tabs. The first tab is the object inspector, which is analogous to RStudio’s “help” tab. You can search for information on libraries, functions, modules, and classes. Figure 6: Object Inspector The second tab is the variable explorer, which is the same as RStudio’s “Environment” tab. This tab conveniently shows the type, size, and value of your variables. The results from our test script are shown in Figure 7. Figure 7: Variable Explorer Finally, the last tab is a file explorer which lists out all of the files and folder in your the current directory. Debugging with Spyder The Python debugger, pdb, is partly integrated into Spyder. The debugging tools are located in blue, adjacent to the green “run” buttons. By double-clicking specific lines in the code, the user can set breakpoints where the debugger will stop and results from the debugger are displayed in the console. Figure 8: Debugging Tools Those are the main components of Spyder! As you can see, it is a fairly uncomplicated and intuitive IDE. Hopefully this overview will make the transition from R or Matlab to Python much easier. Go forth and conquer! Exploring the stability of systems of ordinary differential equations – an example using the Lotka-Volterra system of equations Stability when dealing with dynamical systems is important because we generally like the systems we make decisions on to be predictable. As such, we’d like to know whether a small change in initial conditions could lead to similar behavior. Do our solutions all tend to the same point? Would slightly different initial conditions lead to the same or to a completely different point for our systems. This blogpost will consider the stability of dynamical systems of the form: The equilibria of which are denoted by x* and y*, respectively. I will use the example of the Lotka-Volterra system of equations, which is the most widely known method of modelling many predator-prey/parasite-host interactions encountered in natural systems. The Lotka-Volterra predator-prey equations were discovered independently by both Alfred Lotka and Vito Volterra in 1925-26. Volterra got to these equations while trying to explain why, immediately after WWI, the number of predatory fish was much larger than before the war. The system is described by the following equations: Where a, b, c, d > 0 are the parameters describing the growth, death, and predation of the fish. In the absence of predators, the prey population (x) grows exponentially with an intrinsic rate of growth b. Total predation is proportional to the abundance of prey and the abundance of predators, at a constant predation rate a. New predator abundance is proportional to the total predation (axy) at a constant conversion rate c. In the absence of prey, the predator population decreases at a mortality rate d. The system demonstrates an oscillating behavior, as presented in the following figure for parameters a=1, b=1, c=2, d=1. Volterra’s explanation for the rise in the numbers of predatory fish was that fishing reduces the rate of increase of the prey numbers and thus increases the rate of decrease of the predator. Fishing does not change the interaction coefficients. So, the number of predators is decreased by fishing and the number of prey increases as a consequence. Without any fishing activity (during the war), the number of predators increased which also led to a decrease in the number of prey fish. To determine the stability of a system of this form, we first need to estimate its equilibria, i.e. the values of x and y for which: An obvious equilibrium exists at x=0 and y=0, which kinda means that everything’s dead. We’ll first look at a system that’s still alive, i.e x>0 and y>0: And Looking at these expressions for the equilibria we can also see that the isoclines for zero growth for each of the species are straight lines given by b/a for the prey and d/ca for the predator, one horizontal and one vertical in the (x,y) plane. In dynamical systems, the behavior of the system near an equilibrium relates to the eigenvalues of the Jacobian (J) of F(x,y) at the equilibrium. If the eigenvalues all have real parts that are negative, then the equilibrium is considered to be a stable node; if the eigenvalues all have real parts that are positive, then the equilibrium is considered to be an unstable node. In the case of complex eigenvalues, the equilibrium is considered a focus point and its stability is determined by the sign of the real part of the eigenvalue. I found the following graphic from scholarpedia to be a useful illustration of these categorizations. So we can now evaluate the stability of our equilibria. First we calculate the Jacobian of our system and then plug in our estimated equilibrium. To find the eigenvalues of this matrix we need to find the values of λ that satisfy: det⁡(J-λI)=0 where I is the identity matrix and det denotes the determinant. Our eigenvalues are therefore complex with their real parts equal to 0. The equilibrium is therefore a focus point, right between instability and asymptotic stability. What this means for the points that start out near the equilibrium is that they tend to both converge towards the equilibrium and away from it. The solutions of this system are therefore periodic, oscillating around the equilibrium point, with a period , with no trend either towards the equilibrium or away from it. One can arrive at the same conclusion by looking at the trace (τ) of the Jacobian and its determinant (Δ). The trace is exactly zero and the determinant is positive (both d,b>0) which puts the system right in between stability and instability. Now let’s look into the equilibrium where x*=0 and y*=0, aka the total death. Both b and d are positive real numbers which means that the eigenvalues will always have real values of different signs. This makes the (0,0) an unstable saddle point. This is important because if the equilibrium of total death were a stable point, initial low population levels would tend to converge towards their extinction. The fact that this equilibrium is unstable means that the dynamics of the system make it difficult to achieve total death and that prey and predator populations could be infinitesimally close to zero and still recover. Now consider a system where we’ve somehow killed all the predators (y=0). The prey would continue to grow exponentially with a growth rate b. This is generally unrealistic for real-life systems because it assumed infinite resources for the prey. A more realistic model would consider the prey to exhibit a logistic growth, with a carrying capacity K. The carrying capacity of a biological species is the maximum population size of the species that can be sustained indefinitely given the necessary resources. The model therefore becomes: Where a, b, c, d, K > 0. To check for this system’s stability we have to go through the same exercise. The predator equation has remained the same so: For zero prey growth: Calculating the eigenvalues becomes a tedious exercise at this point and the time of writing is 07:35PM on a Friday. I’d rather apply a small trick instead and use the isoclines to derive the stability of the system. The isocline for the predator zero-growth has remained the same (d/ca), which is a straight line (vertical on the (x,y) vector plane we draw before). The isocline for the prey’s zero-growth has changed to: Which is again a straight line with a slope of –b/aK, i.e., it’s decreasing when moving from left to right (when the prey is increasing). Now looking at the signs in the Jacobian of the first system: We see no self-dependence for each of the two species (the two 0), we see that as the predator increases the prey decreases (-) and that as the prey increases the predator increases too (+). For our logistic growth the signs in the Jacobian change to: Because now there’s a negative self-dependence for the prey-as its numbers increase its rate of growth decreases. This makes the trace (τ) of the Jacobian negative and the determinant positive, which implies that our system is now a stable system. Plotting the exact same dynamic system but now including a carrying capacity, we can see how the two populations converge to specific numbers. Let your Makefile make your life easier This semester I’m taking my first official CS class here at Cornell, CS 5220 Applications of Parallel Computers taught by Dave Bindel (for those of you in the Reed group or at Cornell, I would definitely recommend taking this class, which is offered every other year, once you have some experience coding in C or C++). In addition to the core material we’ve been learning in the class, I’ve been learning a lot by examining the structure and syntax of the code snippets written by the instructor and TA that are offered as starting points and examples for our class assignments. One element in particular that stood out to me from our first assignment was the many function calls made through the makefile. This post will first take a closer look into the typical composition of a makefile and then examine how we can harness the structure of a makefile to help improve workflow on complicated projects. Dissecting a typical makefile On the most basic level, a makefile simply consists of series of rules that each have an associated set of actions. Makefiles are how you use the “make” utility, a software package installed on all linux systems. Make has its own syntax similar to bash but with some distinct idiosyncrasies. For example, make allows you to store a snippet of code in whats called a “macro” (these are pretty much analogous to variables in most other languages). A macro to store the flags you would like to run with your compiler could be defined like this: CFLAGS = -g -Wall To referenced the CFLAGS macro, use a dollar sign and brackets, like this: $(CFLAGS)

There are a series of “special” predifined macros that can be used in any makefile which are fairly common, you can find them here.

Now that we’ve discussed makefile syntax, lets take a look at how rules are structured within a makefile. A rule specified by a makefile has the following shape:

target: prerequisites
recipe
...
...

The target is usually the name of the file that is generated by  a program, for example an executable or object file. A prerequisite is the specified input used to create the target (which can often depend on several files). The recipe is the action that make carries out for the intended target (note: this must be indented at every line).

For example, a rule to build an executable called myProg from a c file called myProg.c using the gcc compiler with flags defined in CFLAGS might look like this:

myProg: myProg.c
gcc $(CFLAGS)$<

Make the makefile do the work

The most common rules within makefiles call the compiler to build code (hence the name “makefile”) and many basic makefiles are used for this sole purpose. However, a rule simply sends a series commands specified by its recipe to the command line and a rule can actually specify any action or series of actions that you want. A ubiquitous example of a rule that specifies an action is “clean”, which may be implemented like this:

clean:
rm -rf *.o $(PROGRAM) Where PROGRAM is a macro containing the names of the executable compiled by the makefile. In this example, the rule’s target is an action rather than an output file. You can call “clean” by simply typing “make clean” into the command line and you will remove all .o files and the executable PROGRAM from your working directory. Utilizing this application of rules, we can now have our makefile do a lot of our command line work for us. For example, we could create a rule called “run” which submits a series of pbs jobs into a cluster. run: qsub job-$*.pbs

We can then enter “make run” into the command line to execute this rule, which will submit the .pbs jobs for us (note that this will not perform any of the other rules defined in the makefile). Using this rule may be handy if we have a large number of jobs to submit.

Alternatively we could  make a rule called “plot” which calls a plotting function from python:

plot:
python plotter.py \$(PLOTFILES)

Where PLOTFILES is a macro containing the name of files to be plotted and plotter.py is a python function that takes the file names as input.

Those are just two examples (loosely based on a makefile given in CS 5220) of how you can use a makefile to do your command line work for you, but the possibilities are endless!! Ok, maybe that’s getting a bit carried away, but I do find this functionality to be a simple and elegant way to improve the efficiency of your workflow on complex projects.

For some background on makefiles, I’d recommend taking a look a Billy’s post from last year. I also found the GNU make user manual helpful as well as this tutorial from Swarthmore that has some nice example makefiles.

Water Programming Blog Guide (Part 2)

Water Programming Blog Guide (Part 1)

This second part of the blog guide will cover the following topics:

1. Version control using git
2. Generating maps and working with spatial data in python
3. Reviews on synthetic streamflow and synthetic weather generation
4. Conceptual posts

1. Version Control using git

If you are developing code it’s worth the time to gain familiarity with git to maintain reliable and stable development.  Git allows a group of people to work together developing large projects minimizing the chaos when multiple people are editing the same files.   It is also valuable for individual projects as it allows you to have multiple versions of a project, show the changes that you have made over time and undo those changes if necessary.  For a quick introduction to git terminology and functionality, check out   The Intro to git Part 1: Local version control and  Intro to git Part 2: Remote Repositories  posts will guide you through your first git project (local or remote) while providing a set of useful commands.  Other specialized tips can be found in: Git branch in bash prompt and GitHub Pages. And if you are wondering how to use git with pycharm, you’ll find these couple of posts useful: A Guide to Using Git in PyCharm – Part 1A Guide to Using Git in PyCharm – Part 2

2. Generating maps and working with spatial data in python

To learn more about python’s capabilities on this subject,  this  lecture  summarizes key python libraries relevant for spatial analysis.  Also,  Julie and the Jons have documented their efforts when working with spatial data and with python’s basemap, leaving us with some valuable examples:

Working with raster data

Python – Extract raster data value at a point

Python – Clip raster data with a shapefile

Generating maps

Making Watershed Maps in Python

Plotting geographic data from geojson files using Python

Generating map animations

Python makes the world go ’round

Making Movies of Time-Evolving Global Maps with Python

3. Reviews on synthetic streamflow and weather generation

We are lucky to have thorough reviews on synthetic weather and synthetic streamflow generation written by our experts Julie and Jon L.  The series on synthetic weather generation consists of five parts. Part I and Part II cover parametric and non-parametric methods, respectively. Part III covers multi-site generation.  Part IV discusses how to modify both parametric and non-parametric methods to simulate weather with climate change projections and finally Part V covers how to simulate weather with seasonal climate forecasts:

The synthetic streamflow review provides a historical perspective while answering key questions on “Why do we care about synthetic streamflow generation?  “Why do we use it in water resources planning and management? and “What are the different methods available?

Synthetic streamflow generation

4.  Conceptual posts

Multi-objective evolutionary algorithms (MOEAs)

We frequently use multi-objective evolutionary algorithms due to their power and flexibility to solve multi-objective problems in water resources applications, so you’ll find sufficient documentation in the blog on basic concepts, applications and performance metrics:

You have a problem integrated into your MOEA, now what?

On constraints within MOEAs

MOEA Performance Metrics

Many Objective Robust Decision Making (MORDM) and Problem framing

The next post discusses the MORDM framework which combines many objective evolutionary optimization, robust decision making, and interactive visual analytics to frame and solve many objective problems under uncertainty.  This is a valuable reading along with the references within.  The second post listed provides a systematic way of thinking about problem formulation and defines the key components of a many-objective problem:

Many Objective Robust Decision Making (MORDM): Concepts and Methods

“The Problem” is the Problem Formulation! Definitions and Getting Started

Econometric analysis and handling multi-variate data

To close this second part of the blog guide, I leave you with a couple selected topics  from the Econometrics and Multivariate statistics courses at Cornell documented by Dave Gold:

A visual introduction to data compression through Principle Component Analysis

Dealing With Multicollinearity: A Brief Overview and Introduction to Tolerant Methods

Introduction To Econometrics: Part II- Violations of OLS Assumptions & Methods for Fixing them

Regression is the primary tool used in econometrics to infer relationships between a group of explanatory variables, X and a dependent variable, y. My previous post focused on the mechanics of Ordinary Least Squares (OLS) Regression and outlined key assumptions that, if true, make OLS estimates the Best Linear Unbiased Estimator (BLUE) for the coefficients in the regression:

$y = \beta X+\epsilon$

This post will discuss three common violations of OLS assumptions, and explain tools that have been developed for dealing with these violations. We’ll start with a violation of the assumption of a linear relationship between X and y, then discuss heteroskedasticity in the error terms and the issue of endogeniety.

Linearity

If the relationship between X and y is not linear, OLS can no longer be used to estimate beta. A nonlinear regression of y on X has the form:

$y = g(X\beta)+\epsilon$

Where  g(X\beta) is the functional form of the nonlinear relationship between X and y and epsilon is the error term. Beta can be estimated using Nonlinear Least Squares regression (NLS). Similar  to OLS regression, NLS seeks to minimize the sum of the square error term.

$\hat{\beta} = argmin(\beta) \epsilon'\epsilon = (y-g(x\beta))^2$

To solve for beta, we again take the derivative and set it equal to zero, but for the nonlinear system there is no closed form solution, so the estimators have to be found using numerical optimization techniques.

The variance of a NLS estimator is:

$\hat{Var}_{\hat{\beta}_{NLS}} = \hat{\sigma^2}(\hat{G}'\hat{G})^{-1}$

Where G is a matrix of partial derivatives of g with respect to each Beta.

Modern numerical optimization techniques can solve many NLS equations quite easily making NLS a common alternative to OLS regression especially when there is a hypothesized functional form for the relationship between X and y.

Heteroskedasticity

Heteroskedasticity arises within a data set when the errors do not have a constant variance with respect to X. In equation form, under heteroskedasticity:

$E(\epsilon_i^2|X ) \neq \sigma^2$

The presence of heteroskedasticity  increases the variance of Beta estimators found using OLS regression, reducing the efficiency of the estimator and causing it to no longer be the BLUE. As put by Allison (2012), OLS on heteroskedastic data puts “equal weight on all observations when, in fact, observations with larger disturbances contain less information”.

To fix this problem, econometric literature provides two options which both use a form of weighting to correct for differences in variance amongst the error terms:

1. Use the OLS estimate for beta, but calculate the variance of beta with a robust variance-covariance matrix .
2. Estimate Beta using Feasible Generalized Least Squares (FGLS)

Let’s begin with the first strategy, using OLS beta estimates with a robust variance-covariance matrix. The robust variance-covariance matrix can be derived using the Generalized Method of Moments (GMM) for the sake of brevity, I’ll omit the derivation here and skip to the final result:

$\hat{var}(\hat{\beta}) = (X'X)^{-1}(X'\hat{D}X)(X'X)^{-1}$

Where $\hat{D}$ is a matrix of square residuals from the OLS regression:

The second strategy, estimation using FGLS, requires a more involved process for estimating beta. FGLS can be accomplished through 3 steps:

1. Use OLS to find OLS estimate for beta and calculate the residuals:

$\hat{\epsilon}_i = y_i-x_i \hat{\beta}_{OLS}$

2. Regress the error term on a subset of X, which we will call Z, to get an estimate of a new parameter, theta (denoted with a tilde, but wordpress makes it difficult for me to add this in the middle of a paragraph). We then use this parameter to estimate the variance of the error term, sigma squared,  for each observation:

$\hat{\sigma}^2_i = z_i\tilde{\theta}$

A diagonal matrix, D (different than the D used for the robust variance-covariance matrix), is then constructed using these variance estimates.

3. Finally, we use the matrix D to find our FGLS estimator for beta:

$\hat{\beta}_{FGLS} = (X'\hat{D}^{-1}X)^{-1}(X'\hat{D}^{-1}y)$

The variance of of the FGLS beta etimate is then defined as:

$\hat{var}(\hat{\beta}_{FGLS} = (X'\hat{D}X)^{-1}$

Endogeneity

Endogeneity arises when explanatory variables are correlated with the error term in a regression. This may be a result of simultaneity, when errors and explanatory variables are effected by the same exogenous influences, omitted variable bias,  when an important variable is left out of a regression, causing the over- or underestimation of the effect of other explanatory variables and the error term, measurement error or a lag in the dependent variable. Endogeniety can be hard to detect and may cause regression large errors in regression results.

A common way of correcting for endogeniety is through Instrumental Variables (IVs). Instrumental variables are explanatory variables that are highly correlated with variables that cause endogeniety but are exogenous to the system. Examples include using proximity to cardiac care centers as an IV for heart surgery when modeling health or state cigarette taxes as an IV for maternal smoking rate when modeling infant birth weight (Angrist and Kruger, 2001). For an expansive but accessible overview of IVs and their many applications, see Angrist and Kruger (2001).

A common technique for conducting a regression using IVs is 2 Stage Least Squares (2SLS) regression. The two stages of 2SLS are as follows:

1. Define Z as a new set of explanatory variables, which omits the endogenous variables and includes the IVs (which are usually not included in the original OLS regression).
2. Project Z onto the column space of X.
3. Estimate the 2SLS using this projection:

$\hat{\beta}_{2SLS} = [X'Z(Z'Z)^{-1}Z'X]^{-1}[X'Z(Z'Z)^{-1}Z'y]$

Using 2SLS regression to correct for endogeneity is fairly simple, however identifying good IVs for an endogenous variable can be extremely difficult. Finding a good IV (or set of IVs) can be enough to get one published in an economics journal (at least that’s what my economist friend told me).

Concluding thoughts

These two posts have constituted an extremely brief introduction to the field of econometrics meant for engineers who may be interested in learning about common empirical tools employed by economists. We covered the above methods in much more detail in class and also covered other topics such as panel data, Generalize Method Of Moments estimation, Maximum Likelihood Estimation, systems of equations in regression and discrete choice modeling. Overall, I found the course (AEM 7100) to be a useful introduction to a field that I hope to learn more about over the course of my PhD.

References:

Allison, Paul D. (2012). “Multiple regression: a primer. Pine Forge. Thousand Oaks, CA: Press Print.

Angrist, J.; Krueger, A. (2001). “Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments”. Journal of Economic Perspectives. 15 (4): 69–85. doi:10.1257/jep.15.4.69.