Job scheduling on HPC resources

Architecture of an HPC Cluster

Modern High Performance Computing (HPC) resources are usually composed of a cluster of computing nodes that give the user the ability to parallelize tasks and greatly reduce the time it takes to perform complex operations. A node is usually defined as a discrete unit of a computer system that runs its own instance of an operating system. Modern nodes have multiple chips, often known as Central Processing Units (CPUs), each of which contains multiple cores, each capable of processing a separate stream of instructions (such as a single Monte Carlo run). An example cluster configuration is shown in Figure 1.


Figure 1. An example cluster configuration

To make efficient use of a cluster’s computational resources, it is essential to allow multiple users to use the resource at one time and to have an efficient and equitable way of allocating and scheduling computing resources on the cluster. This role is filled by job scheduling software. The scheduling software is accessed via a shell script called from the command line. A scheduling script does not actually run any code; rather, it provides a set of instructions for the cluster specifying what code to run and how the cluster should run it. Instructions called from a scheduling script may include but are not limited to:

  • What code the cluster should run
  • How the code should be parallelized (e.g., MPI, OpenMP, etc.)
  • How many nodes to run on
  • How many cores per node to use (normally the maximum allowable per node)
  • Where error and output files should be saved
  • Whether to send email notifications about the status of the job

This post will highlight two commonly used job schedulers, PBS and SLURM, and detail some simple example scripts for using them.

PBS

The Portable Batch System (PBS) was originally developed by NASA in the early 1990s [1] to facilitate access to computing resources. The intellectual property associated with the software is now owned by Altair Engineering. PBS is a fully open source system, and the source code can be found here. PBS is the job scheduler we use for the Cube cluster here at Cornell.

An annotated PBS submission script called “PBSexample.sh” that runs a C++ code called “triangleSimulation.cpp” on 128 cores can be found below:

#!/bin/bash
#PBS -l nodes=8:ppn=16    # how many nodes, how many cores per node (ppn)
#PBS -l walltime=5:00:00  # what is the maximum walltime for this job
#PBS -N SimpleScript      # Give the job this name.
#PBS -M email.cornell.edu # email address for notifications
#PBS -j oe                # combine error and output file
#PBS -o outputfolder/output.out # name output file

cd $PBS_O_WORKDIR # change working directory to current folder

#module load openmpi/intel # load MPI (Intel implementation)
time mpirun ./triangleSimulation -m batch -r 1000 -s 1 -c 5 -b 3

To submit this PBS script via the command line one would type:

qsub PBSexample.sh

Other helpful PBS commands for UNIX can be found here. For more on PBS flags and options, see this detailed post from 2012 and for more example PBS submission scripts see Jon Herman’s Github repository here.

SLURM

A second common job scheduler is known as SLURM. SLURM stands for “Simple Linux Utility for Resource Management” and is the scheduler used on many XSEDE resources such as Stampede2 and Comet.

An example SLURM submission script named “SLURMexample.sh” that runs “triangleSimulation.cpp” on 128 cores can be found below:


#!/bin/bash
#SBATCH --nodes=8             # specify number of nodes
#SBATCH --ntasks-per-node=16  # specify number of core per node
#SBATCH --export=ALL
#SBATCH -t 5:00:00            # set max wallclock time
#SBATCH --job-name="triangle" # name your job
#SBATCH --output="outputfolder/output.out" # name output file

# ibrun is the command used to launch MPI jobs on Stampede2 (analogous to mpirun)
ibrun -v ./triangleSimulation -m batch -r 1000 -s 1 -c 5 -b 3 -p 2841

To submit this SLURM script from the command line one would type:

sbatch SLURMexample.sh

The Cornell Center for Advanced Computing has an excellent SLURM training module within the introduction to Stampede2 workshop that goes into detail on how to most effectively make use of SLURM. More examples of SLURM submission scripts can be found on Jon Herman’s Github. Billy also wrote a blog post last year about debugging with SLURM.

References

  1. https://en.wikipedia.org/wiki/Portable_Batch_System

Creating shaded dial plots in python

I recently created code for plotting shaded dials (figures that look like gauges or speedometers) in Python, and I thought I’d share it here. The dials are well suited to plotting things such as risk or the probability of meeting a set of robustness criteria across a range of decision variables (shameless plug: if you’re at EWRI this week, come check out my talk, Conflicts in Coalitions, Wednesday morning at 8:30 in Northstar B, for which I created these figures).

As hinted at above, I originally created the plot to show bivariate data, with one variable plotted as the location on the dial and the other as the color. You could also plot the same variable as both color and location if you wanted to emphasize the meaning of increasing value on the dial. An example dial created with the code is shown below.

Example custom dial. The above figure consists of two images, a dial plot (originally constructed from a pie plot) and a color bar, made as a separate image but using the same data.

The color distribution, location of the arrow, and labeling of the gauge and colorbar are all fully customizable. I created the figure by first making a pie chart using matplotlib, inscribing a small white circle in the middle, and then cropping the image in half using the Python Imaging Library (PIL, also known as Pillow). The arrow is created using the matplotlib “arrow” function and will point to a specified location on the dial. The code accepts a color array of any length; the array does not have to be monotonic like the one shown above, but it must contain values between zero and one (if your data is not in this range, I’d suggest normalizing).

Annotated code is below:

import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import math
from PIL import Image

# set your color array and name of figure here:
dial_colors = np.linspace(0,1,1000) # using linspace here as an example
figname = 'myDial'

# specify which index you want your arrow to point to
arrow_index = 750

# create labels at desired locations
# note that the pie plot plots from right to left
labels = [' ']*len(dial_colors)*2
labels[25] = '100'
labels[250] = '75'
labels[500] = '50'
labels[750] = '25'
labels[975] = '0'

# function plotting a colored dial
def dial(color_array, arrow_index, labels, ax):
    # Create bins to plot (equally sized)
    size_of_groups=np.ones(len(color_array)*2)

    # Create a pieplot, half white, half colored by your color array
    white_half = np.ones(len(color_array))*.5
    color_half = color_array
    color_pallet = np.concatenate([color_half, white_half])

    cs=cm.RdYlBu(color_pallet)
    pie_wedge_collection = ax.pie(size_of_groups, colors=cs, labels=labels)

    i=0
    for pie_wedge in pie_wedge_collection[0]:
        pie_wedge.set_edgecolor(cm.RdYlBu(color_pallet[i]))
        i=i+1

    # create a white circle to make the pie chart a dial
    my_circle=plt.Circle( (0,0), 0.3, color='white')
    ax.add_artist(my_circle)

    # create the arrow, pointing at specified index
    arrow_angle = (arrow_index/float(len(color_array)))*3.14159
    arrow_x = 0.2*math.cos(arrow_angle)
    arrow_y = 0.2*math.sin(arrow_angle)
    ax.arrow(0,0,-arrow_x,arrow_y, width=.02, head_width=.05, \
        head_length=.1, fc='k', ec='k')

# create figure and specify figure name
fig, ax = plt.subplots()

# make dial plot and save figure
dial(dial_colors, arrow_index, labels, ax)
ax.set_aspect('equal')
plt.savefig(figname + '.png', bbox_inches='tight') 

# create a figure for the colorbar (crop so only colorbar is saved)
fig, ax2 = plt.subplots()
cmap = cm.ScalarMappable(cmap='RdYlBu')
cmap.set_array([min(dial_colors), max(dial_colors)])
cbar = plt.colorbar(cmap, orientation='horizontal')
cbar.ax.set_xlabel("Risk")
plt.savefig('cbar.png', bbox_inches='tight')
cbar = Image.open('cbar.png')
c_width, c_height = cbar.size
cbar = cbar.crop((0, .8*c_height, c_width, c_height)).save('cbar.png')

# open figure and crop bottom half
im = Image.open(figname + '.png')
width, height = im.size

# crop bottom half of figure
# function takes top corner and bottom corner coordinates
# of image to keep, (0,0) in python images is the top left corner
im = im.crop((0, 0, width+c_width, int(height/2.0))).save(figname + '.png')

Other ways of doing this from around the web

This code was my way of making a dial plot, and I think it works well for plotting gradients on the dial. In the course of writing this I came across a couple of similar codes, which I’m listing below. They both have advantages if you want to plot a small number of colors on your dial, but I had trouble getting them to scale.

Here’s an example that creates dials using matplotlib patches. This method looks useful for plotting a small amount of categorical data, and I like the customization of the labels: http://nicolasfauchereau.github.io/climatecode/posts/drawing-a-gauge-with-matplotlib/

Here’s another alternative using the plotly library. I like the aesthetics, but if you’re unfamiliar with plotly there’s a lot to learn before you can nicely customize the final product: https://plot.ly/python/gauge-charts/

Simple Bash shell scripts that have made my life easier

I’ve recently been using Bash shell scripts to improve the efficiency of my workflow when working on Linux systems, and I thought I would share some of them here. I’m fairly new to Linux, so this post is not meant to be a comprehensive guide on how to write shell scripts; rather, I hope the scripts in this post can serve as examples for those who may also be learning Linux and are unsure of where or how to start writing shell scripts. I didn’t write any of these from scratch; most of the scripts are based on files shared with me by group members Julie Quinn, Bernardo Trindade and Jazmin Zatarian Salazar. If you’re interested in learning more about any of the commands used in these scripts, I’ve put some references I found useful at the end of this post. If you’re more experienced in writing shell scripts, please feel free to put tips or suggestions in the comments.

1. A simple script for making directories

For my research I’m processing results of a Monte Carlo simulation for several solutions found through multi-objective search, and I needed to make folders in several locations to store the output from each solution. My first instinct was to make each directory separately using the mkdir command in the command line, but this quickly got tedious. Instead, I used a Bash script to loop through all the solution numbers and create a new directory for each. For more on using loops in Bash, check out this reference.

#!/bin/bash

# This script will create directories named "Solution_*" for
# a set of numbered solutions

# specify solution numbers
SOLUTIONS=('162' '1077' '1713' '1725' '1939' '2191' '2290' '2360')

# create a variable to store the string "Solution_"
DIRECTORY="Solution_" 

# loop over solution numbers
for i in ${SOLUTIONS[@]}
do
# create a separate directory for each solution
mkdir $DIRECTORY${i}
done

2. Calling a Java function and saving the output

The MOEA Framework is a tool written in Java with all sorts of cool functions. I used it to generate 1024 Latin hypercube samples across a given range for each of the 8 solutions mentioned above. Using a shell script allows you to easily set up the arguments needed for the MOEA Framework, call the Java function and save the output to your desired file format. The MOEA Framework’s tool spits out a .txt file, but this script uses the “sed” command to save it as a .csv file. More on “sed” can be found in the references at the end of this post.

#!/bin/bash
# this shell script will call the MOEA framework's Latin Hypercube
# Sampling tool to create 1024 samples from a set of
# prespecified ranges for each of 8 solutions

# create variables to store Java arguments
JAVA_ARGS="-Xmx1g -classpath MOEAFramework-1.16-Executable.jar"
NUM_SAMPLES=1024
METHOD=latin

# these are the solutions we will create samples from
SOLUTIONS=('162' '1077' '1713' '1725' '1939' '2191' '2290' '2360')

# loop through solutions
for i in ${SOLUTIONS[@]}
do
    # define names for input (ranges) and output file names
    RANGES_FILENAME=${i}ranges.txt
    OUTPUT_FILENAME=Solution${i}_Samples.txt
    CSV_FILENAME=Solution${i}_Samples.csv

    # Call MOEA framework from JAVA using specified arguments to
    # create LHS Samples, specify OUTPUT_FILENAME as output
    java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.SampleGenerator -m ${METHOD} -n ${NUM_SAMPLES} -p ${RANGES_FILENAME} -o ${OUTPUT_FILENAME}

    # Use the sed command to create a new comma separated values file
    # from original output .txt file
    sed 's/ /,/g' ${OUTPUT_FILENAME} > ${CSV_FILENAME} 

    # remove .txt files
    rm $OUTPUT_FILENAME
done

3. A piping example

Piping allows you to link together programs by making the output from one program or function the input to another. The script below was originally written by my friend Shrutarshi Basu for a class project we were working on together. This script processes the output from the Borg MOEA for 9 random seeds of the DTLZ2 benchmarking problem across several different algorithmic configurations, seen in the code as “masters” (for more on this see Jazmin’s post here). In addition to calling Java tools from the MOEA Framework, Basu uses piping to link the Linux commands “tac”, “sed”, “grep” and “cut”. For more on each of these commands, see the links at the bottom of this post.


# loop over each of 9 seeds
for i in {0..9}
do
obj=DTLZ2_S${i}.obj
output=dtlz2.volume

# loop over masters
for m in $(seq 0 $1)
do
runtime=DTLZ2_S${i}_M${m}.runtime
mobj=DTLZ2_S${i}_M${m}.obj

# extract objectives from output
echo "Extracting objectives"
tac ${runtime} | sed -n '1,/\/\// p' | grep -v "//" | cut -d' ' -f15-19 | tac > ${mobj};
done

# combine objectives into one file
echo "Combining objectives"
java -cp ../../moea.jar org.moeaframework.analysis.sensitivity.ResultFileSeedMerger \
-d 5 -e 0.01,0.01,0.01,0.01,0.01 \
-o ${obj} DTLZ2_S${i}_M*.obj

# calculate the hypervolume
echo "Finding final hypervolume"
hvol=$(java -cp ../../moea.jar HypervolumeEval ${obj})

printf "%s %s\n" "$i" "$hvol" >> ${output}
echo "Done with seed $i"
done

Additional References and Links

 

An Introduction to Copulas

Modeling multivariate probability distributions can be difficult when the marginal probability density functions of the component random variables are different. Copulas are a useful tool to model dependence between random variables with any marginal distributions. This post will introduce the idea of a copula, run through the basic math that underlies its composition and discuss some common copulas in use. Through researching for this post I found several comprehensive coding examples in Matlab, Python and R, so instead of creating my own I’m focusing on a theoretical introduction to copulas in this post and will link to the coding tutorials at the end.

What is a copula?

The word copula is derived from the Latin noun for a link or a tie (as is the English word “couple”) and its purpose is to describe the dependence structure between two variables. Sklar’s theorem states that “Any multivariate joint distribution can be written in terms of univariate marginal distribution functions and a copula which describes the dependence structure between the two variables” [1].

To understand how a copula can describe the dependence structure between random variables, it’s helpful to first review some simple statistics.

For a random variable X_i with a continuous CDF F_i, define U_i = F_i(X_i). Then U_i is uniformly distributed on the interval [0,1]:

P(U_i \leq u) = P(F_i(X_i) \leq u) = P(X_i \leq F_i^{-1}(u)) = F_i(F_i^{-1}(u)) = u

The above statements can be summarized as saying that the values of the CDF of any marginal distribution are uniformly distributed on the interval [0,1]; i.e., if you make a random draw from any distribution, you have the same probability of drawing the largest value (U = 1) of that distribution as the smallest possible value (U = 0) or the median value (U = 0.5).
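
This property (the probability integral transform) is easy to check numerically. Below is a minimal sketch using scipy; the gamma marginal is just an arbitrary choice for the example:

import numpy as np
from scipy import stats

# draw from an arbitrary marginal distribution (gamma is just an example)
x = stats.gamma(a=2.0).rvs(size=100000, random_state=1)

# evaluate that distribution's CDF at each draw
u = stats.gamma(a=2.0).cdf(x)

# u should be approximately uniform on [0, 1]
print(np.percentile(u, [10, 50, 90]))  # roughly [0.1, 0.5, 0.9]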

So what does this have to do with copulas? Great question: a copula is actually a joint distribution of the CDF values of the random variables it is modeling. Put formally:

A k-dimensional copula is a function C: [0,1]^k \rightarrow [0,1] and is a CDF with uniform marginals [1].

C(u_1, \ldots, u_k) = P(U_1 \leq u_1, \ldots, U_k \leq u_k), \quad \text{where } U_i = F_i(X_i)

So now that we’ve defined what a copula is, let’s take a look at the form of some commonly used ones.

The Gaussian Copula

The Gaussian copula takes the form:

C^{Gauss}_R(u_1, \ldots, u_k) = \Phi_R\left(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_k)\right)

Where:

\Phi_R is the joint standard normal CDF with mean vector \mathbf{0} and covariance matrix equal to the correlation matrix R, whose entries \rho_{i,j} are the correlations between random variables X_i and X_j.

\Phi^{-1} is the inverse standard normal CDF.

It’s important to note that the Pearson correlation coefficient is a bad choice for ρ here; rank-based correlations such as Spearman’s ρ or Kendall’s τ are better options since they are scale invariant and do not require linearity.
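
As a concrete illustration, here is a minimal sketch of sampling from a bivariate Gaussian copula with scipy; the correlation value and the gamma and lognormal marginals are assumptions made only for this example:

import numpy as np
from scipy import stats

rho = 0.7                                # assumed correlation for this example
R = np.array([[1.0, rho], [rho, 1.0]])   # correlation matrix

# draw correlated standard normals and push them through the standard normal CDF
z = stats.multivariate_normal(mean=[0.0, 0.0], cov=R).rvs(size=1000, random_state=1)
u = stats.norm.cdf(z)                    # each column is uniform on [0, 1]

# apply inverse marginal CDFs to impose whatever marginals you like
x1 = stats.gamma(a=2.0).ppf(u[:, 0])     # example gamma marginal
x2 = stats.lognorm(s=0.5).ppf(u[:, 1])   # example lognormal marginal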

Issues with Tail Dependency

The Gaussian copula is a helpful tool and relatively easy to fit, even for relatively large numbers of RVs with different marginal distributions. Gaussian copulas, however, do not do a good job capturing tail dependence and can cause one to underestimate the risk of simultaneously being in the tails of each distribution. The failure of the Gaussian copula to capture tail dependence has been blamed for contributing to the 2008 financial crisis, after it was widely used by investment firms on Wall Street (this is actually a really interesting story; for more details check out this article from the Financial Times).

Tail dependence can be quantified by the coefficients of upper and lower tail dependence (λ_u and λ_l), defined as:

\lambda_u = \lim_{q \rightarrow 1^-} P\left(X_2 > F_2^{-1}(q) \mid X_1 > F_1^{-1}(q)\right)

\lambda_l = \lim_{q \rightarrow 0^+} P\left(X_2 \leq F_2^{-1}(q) \mid X_1 \leq F_1^{-1}(q)\right)
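
These coefficients can be estimated crudely from samples by counting joint exceedances of a high (or low) quantile; a rough sketch, where the threshold q = 0.99 is an arbitrary choice:

import numpy as np

def empirical_tail_dependence(x1, x2, q=0.99):
    """Rough empirical estimates of upper and lower tail dependence."""
    upper = np.mean((x1 > np.quantile(x1, q)) & (x2 > np.quantile(x2, q))) / (1 - q)
    lower = np.mean((x1 <= np.quantile(x1, 1 - q)) & (x2 <= np.quantile(x2, 1 - q))) / (1 - q)
    return upper, lower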

The Student t Copula

Like the Student t distribution, the Student t copula has a shape similar to that of the Gaussian copula but with fatter tails; thus it can do a slightly better job of capturing tail dependence.

C^t_{\nu, \Sigma}(u_1, \ldots, u_k) = t_{\nu, \Sigma}\left(t_\nu^{-1}(u_1), \ldots, t_\nu^{-1}(u_k)\right)

Where:

t_{ν,Σ} is the joint Student t CDF, Σ is the covariance matrix (again, don’t use the Pearson correlation coefficient), ν is the degrees of freedom, and t_ν^{-1} is the inverse Student t CDF.

Archimedean Copulas

Archimedean copulas are a family of copulas with the following form:

C(u_1, \ldots, u_k \mid \theta) = \psi^{-1}\left(\psi(u_1 \mid \theta) + \cdots + \psi(u_k \mid \theta)\right)

where ψ(u|θ) is called the generator function and θ is the parameter (or vector of parameters) of the copula.

Three common Archimedean copulas are:

  • Gumbel: which is good at modeling upper tail dependence
  • Clayton: which is good at modeling lower tail dependence
  • Frank: has lighter tails and more density in the middle

It’s important to note that these copulas are usually employed in bivariate cases; for more than two variables, the Gaussian or Student t copulas are usually used.
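
As one example, samples from a bivariate Clayton copula (with θ > 0) can be drawn using the standard gamma-frailty construction; a minimal sketch:

import numpy as np

def sample_clayton(n, theta, seed=1):
    """Draw n pairs (u1, u2) from a bivariate Clayton copula with theta > 0."""
    rng = np.random.default_rng(seed)
    w = rng.gamma(shape=1.0 / theta, scale=1.0, size=n)   # shared gamma frailty
    e = rng.exponential(scale=1.0, size=(n, 2))           # independent exponentials
    return (1.0 + e / w[:, None]) ** (-1.0 / theta)       # uniforms with Clayton dependence

u = sample_clayton(1000, theta=2.0)  # columns u[:, 0] and u[:, 1] are the copula samples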

A comparison of the shape of the copulas above can be found in Figure 1.


Image source: wikipedia.org

Coding Copulas

There are numerous packages for modeling copulas in Matlab, Python and R.

In Matlab, the Statistics and Machine Learning Toolbox has some helpful functions. You can find some well-narrated examples of copulas here. There’s also the Multivariate Copula Analysis Toolbox from UC Irvine.

In Python, the copulalib package can be used to model the Clayton, Frank and Gumbel copulas. The statsmodels package also has copulas built in. I found this post on the copulalib package; it has an attached Jupyter notebook with nice coding examples and figures. Here’s a post on the statsmodels copula implementation, along with an example Jupyter notebook.

Finally, here’s an example of coding copulas in R using the copulas library.

References and Acknowledgments

  1. Rüschendorf, L. (2013). Mathematical risk analysis dependence, risk bounds, optimal allocations and portfolios. Berlin: Springer.

I’d like to note that the majority of the content in this post came from Scott Steinschneider’s excellent course, BEE 6940: Multivariate Analysis, at Cornell.

Let your Makefile make your life easier

This semester I’m taking my first official CS class here at Cornell, CS 5220 Applications of Parallel Computers taught by Dave Bindel (for those of you in the Reed group or at Cornell, I would definitely recommend taking this class, which is offered every other year, once you have some experience coding in C or C++). In addition to the core material we’ve been learning in the class, I’ve been learning a lot by examining the structure and syntax of the code snippets written by the instructor and TA that are offered as starting points and examples for our class assignments. One element in particular that stood out to me from our first assignment was the many function calls made through the makefile. This post will first take a closer look into the typical composition of a makefile and then examine how we can harness the structure of a makefile to help improve workflow on complicated projects.

Dissecting a typical makefile

On the most basic level, a makefile simply consists of a series of rules, each of which has an associated set of actions. Makefiles are how you use the “make” utility, a software package installed on all Linux systems. Make has its own syntax, similar to Bash but with some distinct idiosyncrasies. For example, make allows you to store a snippet of code in what’s called a “macro” (these are pretty much analogous to variables in most other languages). A macro to store the flags you would like to pass to your compiler could be defined like this:

CFLAGS  = -g -Wall

To reference the CFLAGS macro, use a dollar sign and parentheses, like this:

 $(CFLAGS)

There are a series of “special” predefined macros that can be used in any makefile and are fairly common; you can find them here.

Now that we’ve discussed makefile syntax, let’s take a look at how rules are structured within a makefile. A rule specified by a makefile has the following shape:

target: prerequisites
    recipe
    ...
    ...

The target is usually the name of the file that is generated by a program, for example an executable or object file. A prerequisite is an input used to create the target (a target can often depend on several files). The recipe is the action that make carries out for the intended target (note: every line of the recipe must be indented with a tab).

For example, a rule to build an executable called myProg from a c file called myProg.c using the gcc compiler with flags defined in CFLAGS might look like this:

myProg: myProg.c
    gcc $(CFLAGS) -o $@ $<

Here $@ and $< are special macros that refer to the target (myProg) and the first prerequisite (myProg.c), respectively.

Make the makefile do the work

The most common rules within makefiles call the compiler to build code (hence the name “makefile”), and many basic makefiles are used for this sole purpose. However, a rule simply sends the series of commands specified by its recipe to the command line, and a rule can actually specify any action or series of actions that you want. A ubiquitous example of a rule that specifies an action is “clean”, which may be implemented like this:

clean:
    rm -rf *.o $(PROGRAM)

Where PROGRAM is a macro containing the name of the executable compiled by the makefile.

In this example, the rule’s target is an action rather than an output file. You can call “clean” by simply typing “make clean” into the command line and you will remove all .o files and the executable PROGRAM from your working directory.

Utilizing this application of rules, we can now have our makefile do a lot of our command line work for us. For example, we could create a rule called “run” which submits a series of PBS jobs to a cluster.

run:
    for job in job-*.pbs ; do qsub $$job ; done

We can then enter “make run” into the command line to execute this rule, which will submit the .pbs jobs for us (note that this will not perform any of the other rules defined in the makefile). Using this rule may be handy if we have a large number of jobs to submit.

Alternatively, we could make a rule called “plot” which calls a plotting function written in Python:

plot: 
    python plotter.py $(PLOTFILES)

Where PLOTFILES is a macro containing the names of the files to be plotted and plotter.py is a Python script that takes the file names as input.
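
For completeness, here is a hypothetical, minimal version of plotter.py that would work with the rule above; it simply reads each file passed on the command line and plots its first column (the file format is an assumption for the example):

# plotter.py -- hypothetical plotting script called by the "plot" rule above
import sys
import numpy as np
import matplotlib.pyplot as plt

for fname in sys.argv[1:]:               # file names come from the PLOTFILES macro
    data = np.loadtxt(fname)             # assumes whitespace-delimited numeric data
    series = data if data.ndim == 1 else data[:, 0]
    plt.plot(series, label=fname)

plt.legend()
plt.savefig('results.png', bbox_inches='tight')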

Those are just two examples (loosely based on a makefile given in CS 5220) of how you can use a makefile to do your command line work for you, but the possibilities are endless!! Ok, maybe that’s getting a bit carried away, but I do find this functionality to be a simple and elegant way to improve the efficiency of your workflow on complex projects.

For some background on makefiles, I’d recommend taking a look at Billy’s post from last year. I also found the GNU make user manual helpful, as well as this tutorial from Swarthmore that has some nice example makefiles.

Enhance your (Windows) remote terminal experience with MobaXterm

Jazmin and Julie recently introduced me to a helpful program for Windows called “MobaXterm” that has significantly sped up my workflow when running remotely on the Cube (our cluster here at Cornell). MobaXterm bills itself as an “all in one” toolbox for remote computing. The program’s interface includes a terminal window as well as a graphical SFTP browser. You can link the terminal to the SFTP browser so that as you move through folders on the terminal the browser follows you. The SFTP browser allows you to view and edit files using your text editor of choice on your Windows desktop, a feature that I find quite helpful for making quick edits to shell scripts or pieces of code as I go.


A screenshot of the MobaXterm interface. The graphical SFTP browser is on the left, while the terminal is on the right (note the checked box in the center of the left panel that links the browser to the terminal window).

 

You can set up a remote Cube session using MobaXterm with the following steps:

  1. Download MobaXterm using this link
  2.  Follow the installation instructions
  3. Open MobaXterm and select the “Session” icon in the upper left corner.
  4. In the session popup window, select a new SSH session in the upper left, enter “thecube.cac.cornell.edu” as the name of the remote host, and enter your username.
  5. When the session opens, check the box below the SFTP browser on the left to link the browser to your terminal
  6. Run your stuff!

Note that for a Linux system, you can simply link your file browser window to your terminal window and get the same functionality as MobaXterm. MobaXterm is not available for Mac, but Cyberduck and Filezilla are decent alternatives. An alternative graphical SFTP browser for Windows is WinSCP, though I prefer MobaXterm because of its linked terminal/SFTP interface.

For those new to remote computing, ssh or UNIX commands in general, I’d recommend checking out the following posts to get familiar with running on a remote cluster:

 

 

 

Introduction To Econometrics: Part II- Violations of OLS Assumptions & Methods for Fixing them

Regression is the primary tool used in econometrics to infer relationships between a group of explanatory variables, X, and a dependent variable, y. My previous post focused on the mechanics of Ordinary Least Squares (OLS) regression and outlined key assumptions that, if true, make OLS estimates the Best Linear Unbiased Estimator (BLUE) for the coefficients in the regression:

y = \beta X+\epsilon

This post will discuss three common violations of OLS assumptions and explain tools that have been developed for dealing with these violations. We’ll start with a violation of the assumption of a linear relationship between X and y, then discuss heteroskedasticity in the error terms and the issue of endogeneity.

Linearity

If the relationship between X and y is not linear, OLS can no longer be used to estimate beta. A nonlinear regression of y on X has the form:

y = g(X\beta)+\epsilon

Where g(X\beta) is the functional form of the nonlinear relationship between X and y and \epsilon is the error term. Beta can be estimated using Nonlinear Least Squares (NLS) regression. Similar to OLS regression, NLS seeks to minimize the sum of squared errors:

\hat{\beta}_{NLS} = \underset{\beta}{\operatorname{argmin}} \; \epsilon'\epsilon = \underset{\beta}{\operatorname{argmin}} \; \left(y - g(X\beta)\right)'\left(y - g(X\beta)\right)

To solve for beta, we again take the derivative and set it equal to zero, but for the nonlinear system there is no closed form solution, so the estimators have to be found using numerical optimization techniques.

The variance of a NLS estimator is:

\hat{Var}_{\hat{\beta}_{NLS}} = \hat{\sigma^2}(\hat{G}'\hat{G})^{-1}

Where G is a matrix of partial derivatives of g with respect to each Beta.

Modern numerical optimization techniques can solve many NLS equations quite easily, making NLS a common alternative to OLS regression, especially when there is a hypothesized functional form for the relationship between X and y.
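
As a quick illustration, NLS can be carried out in Python with scipy's curve_fit; the exponential functional form and the synthetic data below are assumptions made only for the example:

import numpy as np
from scipy.optimize import curve_fit

# assume an exponential functional form g(x, beta) = b0 * exp(b1 * x)
def g(x, b0, b1):
    return b0 * np.exp(b1 * x)

rng = np.random.default_rng(1)
x = np.linspace(0, 2, 100)
y = g(x, 2.0, 1.5) + rng.normal(scale=0.5, size=x.size)   # synthetic data

beta_hat, beta_cov = curve_fit(g, x, y, p0=[1.0, 1.0])    # NLS estimates and their covariance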

Heteroskedasticity

Heteroskedasticity arises within a data set when the errors do not have a constant variance with respect to X. In equation form, under heteroskedasticity:

E(\epsilon_i^2|X ) \neq \sigma^2

The presence of heteroskedasticity increases the variance of beta estimators found using OLS regression, reducing the efficiency of the estimator and causing it to no longer be the BLUE. As put by Allison (2012), OLS on heteroskedastic data puts “equal weight on all observations when, in fact, observations with larger disturbances contain less information”.

To fix this problem, the econometric literature provides two options, both of which use a form of weighting to correct for differences in variance among the error terms:

  1. Use the OLS estimate for beta, but calculate the variance of beta with a robust variance-covariance matrix .
  2. Estimate Beta using Feasible Generalized Least Squares (FGLS)

Let’s begin with the first strategy: using OLS beta estimates with a robust variance-covariance matrix. The robust variance-covariance matrix can be derived using the Generalized Method of Moments (GMM); for the sake of brevity, I’ll omit the derivation here and skip to the final result:

\hat{var}(\hat{\beta}) = (X'X)^{-1}(X'\hat{D}X)(X'X)^{-1}

Where \hat{D} is a diagonal matrix of squared residuals from the OLS regression:

\hat{D} = \mathrm{diag}\left(\hat{\epsilon}_1^2, \hat{\epsilon}_2^2, \ldots, \hat{\epsilon}_n^2\right)
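
A minimal numpy sketch of this sandwich estimator, assuming X is an n-by-k design matrix and y an n-vector:

import numpy as np

def ols_with_robust_variance(X, y):
    """OLS coefficients with a heteroskedasticity-robust (sandwich) covariance matrix."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)          # OLS estimate
    resid = y - X @ beta
    D = np.diag(resid ** 2)                           # squared residuals on the diagonal
    XtX_inv = np.linalg.inv(X.T @ X)
    var_beta = XtX_inv @ (X.T @ D @ X) @ XtX_inv      # (X'X)^-1 (X'DX) (X'X)^-1
    return beta, var_beta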

The second strategy, estimation using FGLS, requires a more involved process for estimating beta. FGLS can be accomplished in three steps:

  1. Use OLS to find OLS estimate for beta and calculate the residuals:

\hat{\epsilon}_i = y_i-x_i \hat{\beta}_{OLS}

2. Regress the squared residuals on a subset of X, which we will call Z, to get an estimate of a new parameter, theta (denoted with a tilde, which WordPress makes difficult to render in the middle of a paragraph). We then use this parameter to estimate the variance of the error term, sigma squared, for each observation:

\hat{\sigma}^2_i = z_i\tilde{\theta}

A diagonal matrix, D (different than the D used for the robust variance-covariance matrix), is then constructed using these variance estimates.

3. Finally, we use the matrix D to find our FGLS estimator for beta:

\hat{\beta}_{FGLS} = (X'\hat{D}^{-1}X)^{-1}(X'\hat{D}^{-1}y)

The variance of the FGLS beta estimate is then defined as:

\hat{var}(\hat{\beta}_{FGLS}) = (X'\hat{D}^{-1}X)^{-1}
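
The same three steps written out in numpy; a sketch that assumes the error variance is modeled as a linear function of a matrix Z of variance regressors (in practice the variance model is often specified in logs to keep the fitted variances positive):

import numpy as np

def fgls(X, y, Z):
    """Feasible GLS: model the squared residuals with Z, then reweight the regression."""
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)              # step 1: OLS and residuals
    resid = y - X @ beta_ols

    theta = np.linalg.lstsq(Z, resid ** 2, rcond=None)[0]     # step 2: variance model
    sigma2 = Z @ theta                                        # fitted error variances
    D_inv = np.diag(1.0 / sigma2)

    beta_fgls = np.linalg.solve(X.T @ D_inv @ X, X.T @ D_inv @ y)   # step 3: FGLS estimate
    var_fgls = np.linalg.inv(X.T @ D_inv @ X)
    return beta_fgls, var_fgls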

Endogeneity

Endogeneity arises when explanatory variables are correlated with the error term in a regression. This may be a result of simultaneity (when errors and explanatory variables are affected by the same exogenous influences), omitted variable bias (when an important variable is left out of a regression, causing over- or underestimation of the effect of the other explanatory variables and the error term), measurement error, or a lag in the dependent variable. Endogeneity can be hard to detect and may cause large errors in regression results.

A common way of correcting for endogeneity is through Instrumental Variables (IVs). Instrumental variables are explanatory variables that are highly correlated with the variables causing endogeneity but are exogenous to the system. Examples include using proximity to cardiac care centers as an IV for heart surgery when modeling health, or state cigarette taxes as an IV for maternal smoking rate when modeling infant birth weight (Angrist and Krueger, 2001). For an expansive but accessible overview of IVs and their many applications, see Angrist and Krueger (2001).

A common technique for conducting a regression using IVs is Two-Stage Least Squares (2SLS) regression. 2SLS proceeds as follows:

  1. Define Z as a new set of explanatory variables, which omits the endogenous variables and includes the IVs (which are usually not included in the original OLS regression).
  2. Project X onto the column space of Z.
  3. Estimate the 2SLS coefficients using this projection:

\hat{\beta}_{2SLS} = [X'Z(Z'Z)^{-1}Z'X]^{-1}[X'Z(Z'Z)^{-1}Z'y]
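
A direct numpy translation of this estimator; a sketch assuming X, Z, and y have already been assembled as matrices:

import numpy as np

def two_stage_least_squares(X, Z, y):
    """2SLS: project X onto the column space of Z, then regress y on the projection."""
    P_Z = Z @ np.linalg.inv(Z.T @ Z) @ Z.T             # projection matrix onto col(Z)
    X_hat = P_Z @ X                                    # first stage: fitted values of X
    beta = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)   # second stage
    return beta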

Using 2SLS regression to correct for endogeneity is fairly simple; however, identifying good IVs for an endogenous variable can be extremely difficult. Finding a good IV (or set of IVs) can be enough to get one published in an economics journal (at least that’s what my economist friend told me).

Concluding thoughts

These two posts have constituted an extremely brief introduction to the field of econometrics, meant for engineers who may be interested in learning about common empirical tools employed by economists. We covered the above methods in much more detail in class and also covered other topics such as panel data, Generalized Method of Moments estimation, Maximum Likelihood Estimation, systems of equations in regression, and discrete choice modeling. Overall, I found the course (AEM 7100) to be a useful introduction to a field that I hope to learn more about over the course of my PhD.

References:

Allison, Paul D. (2012). Multiple Regression: A Primer. Thousand Oaks, CA: Pine Forge Press.

Angrist, J.; Krueger, A. (2001). “Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments”. Journal of Economic Perspectives. 15 (4): 69–85. doi:10.1257/jep.15.4.69.