# Welcome to our blog!

Welcome to Water Programming! This blog is a collaborative effort by Pat Reed’s group at Cornell, Joe Kasprzyk’s group at CU Boulder, Jon Herman’s group at UC Davis, and others who use computer programs to solve problems — Multiobjective Evolutionary Algorithms (MOEAs), simulation models, visualization, and other techniques. Use the search feature and categories on the right panel to find topics of interest. Feel free to comment, and contact us if you want to contribute posts.

To find software:  Please consult the Pat Reed group website, MOEAFramework.org, and BorgMOEA.org.

The MOEAFramework Setup Guide: A detailed guide is now available. The focus of the document is connecting an optimization problem written in C/C++ to MOEAFramework, which is written in Java.

The Borg MOEA Guide: We are currently writing a tutorial on how to use the C version of the Borg MOEA, which is being released to researchers here. To gain access please email joseph.kasprzyk “at” colorado.edu.

Call for contributors: We want this to be a community resource to share tips and tricks. Are you interested in contributing? Please email joseph.kasprzyk “at” colorado.edu. You’ll need a WordPress.com account.

# An Introduction To Econometrics: Part 1- Ordinary Least Squares Regression

I took a PhD level econometrics course this semester in the Applied Economics and Management department here at Cornell and I thought I’d share some of what I learned. Overall, I enjoyed the course and learned a great deal. It was very math and theory heavy, but the Professor Shanjun Li did a nice job keeping the class lively and interesting. I would recommend the class to future EWRS students who may be looking for some grounding in econometrics, provided they’ve taken some basic statics and linear algebra courses.

So lets start with the basics, what does the term “econometrics” even mean? Hansen (2010) defined econometrics as “the unified study of economic models, mathematical statistics and economic data”. After taking this introductory course, I’m inclined to add my own definition: econometrics is “a study of the problems with regression using Ordinary Least Squares (OLS) and how to solve them”. This is obviously a gross oversimplification of the field, however, regression through OLS was the primary tool used for finding insights and patterns within data, and we spent the vast majority of the course examining it. In this post I’ll briefly summarize OLS mechanics and classical OLS assumptions. In my next post, I’ll detail methods for dealing with violations of OLS assumptions. My hope is that reading this may help you understand some key terminology and the reasoning behind why certain econometric tools are employed.

## OLS mechanics

Our primary interest when creating an econometric model is to estimate some dependent variable, y, using a observations from a set of independent variables, X. Usually y is a vector of length n, where n is the number of observations, and X is a matrix of size (n x k) where k is the number of explanatory variables (you can think of X as a table of observations, where each column contains a different variable and each row represents an observation of that variable). The goal of OLS regression is to estimate the coefficients, beta, for the model:

$y = X\beta+\epsilon$

Where beta is a k by 1 vector of coefficients on X and epsilon is a k by 1 vector of error terms.

OLS regression estimates beta by minimizing the sum of the square error term (hence the name “least squares”). Put in matrix notation, OLS estimates beta using the equation:

$\hat{\beta} = argmin_{\beta} SSE_N(\beta) = \epsilon ' \epsilon$

The optimal beta estimate can be found through the following equations:

$\epsilon = y-X\hat{\beta}$

$\epsilon ' \epsilon = (y-X\hat{\beta})'(y-X\hat{\beta})$

Taking the derivative and setting it equal to zero:

$2X'y+2X'X\hat{\beta} = 0$

Then solving for the beta estimate:

$\hat{\beta} = (X'X)^{-1}X'y$

Estimation of y using OLS regression can be visualized as the orthogonal projection of the vector y onto the column space of X. The estimated error term, epsilon, is the orthogonal distance between the projection and the true vector y.  Figure 1 shows this projection for a y that is regressed on two explanatory variables, X1 and X2.

Figure 1: OLS regression as an orthogonal projection of vector y onto the column space of matrix X. The error term, $\hat{\epsilon}$, is the orthogonal distance between y and $X\hat{\beta}$. (Image source: Wikipedia commons)

## Assumptions and properties of OLS regression

The Gauss-Markov Theorem states that under a certain set of assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE) for vector y.

To understand the full meaning of the Gauss-Markov theorem, it’s important to define two fundamental properties that can be used to describe estimators, consistency and efficiency. An estimator is consistent if its value will converge to the true parameter value as the number of observations goes to infinity. An estimator is efficient if its asymptotic variance is no larger than the asymptotic variance of any other possible consistent estimator for the parameter. In light of these definitions, the Gauss-Markov Theorem can be restated as: estimators found using OLS will be the most efficient consistent estimator for beta as long as the classical OLS assumptions hold. The remainder of this post will be devoted to describing the necessary assumptions for the OLS estimator to be the BLUE and detailing fixes for when these assumptions are violated.

The four classical assumptions for OLS to be the BLUE are:

1. Linearity: The relationship between X and y is linear, following the functional form:

$y = X\beta+\epsilon$.

2. Strict exogeneity: The error $\epsilon$ terms should be independent of the value of the explanatory variables, X. Put in equation form, this assumption requires:

$E(\epsilon_i|X) = 0$

$E(\epsilon_i) =0$

3.  No perfect multicollinearity: columns of X should not be correlated with each other (see my earlier post on dealing with mulitcollinearity for fixes for violations of this assumption).

4. Spherical Error: Error terms should be homoskedastic, meaning they are evenly distributed around the X values. Put in equation form:

$E(\epsilon_i^2|X) =\sigma^2$

Where $\sigma^2$ is a constant value.

$E(\epsilon_i \epsilon_j|X)=0$

Using assumption 4, we can define the variance of $\hat{\beta}$ as:

$var(\hat{\beta}_{OLS}) = \sigma^2(X'X)^{-1}$

If assumptions 1-4 hold, then the OLS estimate for beta is the BLUE, if however, any of the assumptions are broken, we must employ other methods for estimating our regression coefficients.

In my next post I’ll detail the methods econometricians use when these assumptions are violated.

### References:

Hansen, Bruce. “Econometrics”. 2010. University of Wisconsin

http://www.ssc.wisc.edu/~bhansen/econometrics/Econometrics2010.pdf

# Map making in Matlab

Greetings,

This weeks post will cover the basics of generating maps in Matlab.  Julie’s recent post showed how to do some of this in Python, but, Matlab is also widely used by the community.  You can get a lot done with Matlab, but in this post we’ll just cover a few of the basics.

We’ll start off by plotting a map of the continental United States, with the states.  We used three  this with three commands: usamap, shaperead, and geoshow.  usamap creates an empty map axes having the Lambert Projection covering the area of the US, or any state or collection of states.  shaperead reads shapefiles (duh) and returns a Matlab geographic data structure, composed of both geographic data and attributes.  This Matlab data structure then interfaces really well with various Matlab functions (duh).  Finally, geoshow plots geographic data, in our case on the map axes we defined.  Here’s some code putting it all together.

hold on
figure1 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
tightmap
hold off


Note that ‘usastatehi’ is a shapefile containing the US states (duh) that’s distributed with Matlab. The above code generates this figure:

Now, suppose we wanted to plot some data, say a precipitation forecast, on our CONUS map.  Let’s assume our forecast is being made at many points (lat,long).  To interpolate between the points for plotting we’ll use Matlab’s griddata function.  Once we’ve done this, we use the Matlab’s contourm command.  This works exactly like the normal contour function, but the ‘m’ indicates it plots map data.

xi = min(x):0.5:max(x);
yi = min(y):0.5:max(y);
[XI, YI] = meshgrid(xi,yi);
ZI = griddata(x,y,V,XI,YI);

hold on
figure2 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])

contourm(YI,-1*XI,ZI)
tightmap
hold off


Here x, y, and V are vectors of long, lat, and foretasted precipitation respectively.  This code generates the following figure:

Wow!  Louisiana is really getting hammered!  Let’s take a closer look.  We can do this by changing the entry to usamap to indicate we want to consider only Louisiana.  Note, usamap accepts US postal code abbreviations.

ax = usamap('LA');


Making that change results in this figure:

Neat!  We can also look at two states and add annotations.  Suppose, for no reason in particular, you’re interested in the location of Tufts University relative to Cornell.  We can make a map to look at this with the textm and scatterm functions.  As before, the ‘m’ indicates the functions  plot on a map axes.

hold on
figure4 = figure;
ax = usamap({'MA','NY'});

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
scatterm(42.4075,-71.1190,100,'k','filled')
textm(42.4075+0.2,-71.1190+0.2,'Tufts','FontSize',30)

scatterm(42.4491,-76.4842,100,'k','filled')
textm(42.4491+0.2,-76.4842+0.2,'Cornell','FontSize',30)
tightmap
hold off


This code generates the following figure.

Cool! Now back to forecasts.  NOAA distributes short term Quantitative Precipitation Forecasts (QPFs) for different durations every six hours.  You can download these forecasts in the form of shapefiles from a NOAA server.  Here’s an example of a 24-hour rainfall forecast made at 8:22 AM UTC on April 29.

Wow, that’s a lot of rain!  Can we plot our own version of this map using Matlab!  You bet!  Again we’ll use usamap, shaperead, and geoshow.  The for loop, (0,1) scaling, and log transform are simply to make the color map more visually appealing for the post.  There’s probably a cleaner way to do this, but this got the job done!

figure5 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
p = colormap(jet);

N = max(size(S));
d = zeros(N,1);
for i = 1:N
d(i) = log(S(i).QPF);
end

y=floor(((d-min(d))/range(d))*63)+1;
col = p(y,:);
for i = 1:N
geoshow(S(i),'FaceColor',col(i,:),'FaceAlpha',0.5)%,'SymbolSpec', faceColors)
end


This code generates the following figure:

If you are not plotting in the US, Matlab also has a worldmap command.  This works exactly the same as usamap, but now for the world (duh).  Matlab is distibuted with a shapefile ‘landareas.shp’ which contains all of the land areas in the world (duh).  Generating a global map is then trivial:

figure6 = figure;

worldmap('World')
geoshow(land, 'FaceColor', [0.15 0.5 0.15])


Which generates this figure.

Matlab also comes with a number of other included that might be of interest.  For instance, shapefiles detailing the locations of major world cities, lakes, and rivers.  We can plot those with the following code:

figure7 = figure;

worldmap('World')
geoshow(land, 'FaceColor', [0.15 0.5 0.15])
geoshow(lakes, 'FaceColor', 'blue')
geoshow(rivers, 'Color', 'blue')
geoshow(cities, 'Marker', '.', 'Color', 'red')


Which generates the figure:

But suppose we’re interested in one country or a group of countries.  worldmap works in the same usamap does.  Also, you can plot continents, for instance Europe.

worldmap('Europe')


Those are the basics, but there are many other capabilities, including 3-D projections. I can cover this in a later post if there is interest.

That’s it for now!

# Water Programming Blog Guide (Part I)

The Water Programming blog continues to expand collaboratively through contributors’ learning experiences and their willingness to share their knowledge in this blog.  It now covers a wide variety of topics ranging from quick programming tips to comprehensive literature reviews pertinent to water resources research and multi-objective optimization.  This post intends to provide guidance to new, and probably to current users by bringing to light what’s available in the blog and by developing a categorization of topics.

This first post will cover:

Software requirements

1.Programming languages and IDEs

2.Frameworks of optimization, sensitivity analysis and decision support

3.The Borg MOEA

4.Parallel computing

Part II)  will focus  on version control, spatial data and maps, conceptual posts and literature reviews.  And finally Part III)  will cover visualization and figure editing, LaTex and literature management, tutorials and miscellaneous research and programming tricks.

*Note to my fellow bloggers: Feel free to disagree and make suggestions on the categorization of your posts, also your thoughts on facilitating an easier navigation through the blog are very welcomed.  For current contributors,  its always suggested to tag and categorize your post, you can also follow the guidelines in Some WordPress Tips to enable a better organization of our blog.  Also, if you see a 💡, it means that a blog post idea has been identified.

# Software Requirements

If you are new to the group and would like to know what kind of software you require to get started with research.  Joe summed it up pretty well in his  New Windows install? Here’s how to get everything set up post, where he points out all the installations that you should have on your desktop.  Additionally, you can find some guidance on:  Software to Install on Personal Computers and Software to Install on Personal Computers – For Mac Users!. You may also want to check out  What do you need to learn?  if you are entering the group.  These posts are subject to update 💡 but they are a good starting point.

# 1. Programming languages and Integrated Development Environments (IDEs)

Dave Hadka’s Programming Language Overview provides a summary of key differences between the C, C++, Python and Java.  The programming tips found in the blog cover Python, C, C++,  R and Matlab, there are also some specific instances were Java is used which will be discussed in section 2.  I’ll give some guidance on how to get started on each of these programming languages and point out some useful resources in the blog.

## 1.1. Python

Python is  a very popular programming language in our group so there’s sufficient guidance and resources available in the blog.  Download is available here, also some online tutorials that I really recommend to get you up to speed with Python are:  learn python the hard waypython for everybody and codeacademy.   Additionally,  stackoverflow is a useful resource for specific tasks.  The python resources available in our blog are outlined as follows:

### Data analysis and organization

Data analysis libraries that you may want to check out are  scipy,   numpy   and pandas. Here are some related posts:

Comparing Data Sets: Are Two Data Files the Same?

### Using Python IDEs

The use of an integrated development environment (IDE) can enable code development and  make the debugging process easier.  Tom has done a good amount of development in PyCharm, so he has generated a  sequence of posts that provide guidance on how to take better advantage of PyCharm:

PyCharm as a Python IDE for Generating UML Diagrams

Josh also provides instructions to setup PyDev in eclipse in his Setting up Python and Eclipse post,  another Python IDE that you may want to check out is Spyder.

### Plotting

The plotting library for python is matplotlib.  Some of the example found in the blog will provide some guidance on importing and using the library.  Matt put together a github repository with several  Matlab and Matplotlib Plotting Examples, you can also find guidance on generating more specialized plots:

Customizing color matrices in matplotlib

Easy vectorized parallel plots for multiple data sets

Interactive plotting basics in matplotlib

### Miscellaneous  Python tips and tricks

Other applications in Python that my fellow bloggers have found useful are related to machine learning:  Basic Machine Learning in Python with Scikit-learn,  Solving systems of equations: Root finding in MATLAB, R, Python and C++  and using  Python’s template class.

## 1.2. Matlab

Matlab with its powerful toolkit, easy-to-use IDE and high-level language can be used for quick development as long as you are not concerned about speed.  A major disadvantage of this software is that it is not free … fortunately I have a generous boss paying for it.  Here are examples of Matlab applications available in the blog:

A simple command for plotting autocorrelation functions in Matlab

Plotting Probability Ellipses for Bivariate Normal Distributions

Solving Analytical Algebra/Calculus Expressions with Matlab

Generating .gifs from Matlab Figures

Code Sample: Stacked Bars and Lines in Matlab

## 1.3. C++

I have heard C++ receive extremely unflattering nicknames lately, it is a low-level language which means that you need to worry about everything, even memory allocation, but the fact is that it is extremely fast and powerful and is widely used in the group for modeling, simulation and optimization purposes which would take forever in other  languages.

### Getting started

If you are getting started with C++,there are some online  tutorials , and you may want to check out the following material available in the blog:

Setting up Eclipse for C/C++

Getting started with C and C++

Matt’s Thoughts on C++

### Training

Here is some training material that Joe put together:

### Debugging

If you are developing code in C++ is probably a good idea to install an IDE,  I recently started using CLion, following Bernardo’s and Dave’s recommendation, and I am not displeased with it.  Here are other posts available within this topic:

Quick testing of code online

Debugging the NWS model: lessons learned

### Sample code

If you are looking for sample code of commonly used processes in C++, such as defining vectors and arrays, generating output files and timing functions, here are some examples:

C++: Vectors vs. Arrays

A quick example code to write data to a csv file in C++

Code Sample: Timing Functions for C++

## 1.4. R

R is another free open source environment widely used for statistics.  Joe recommends a reading in his Programming language R is gaining prominence in the scientific community post.  Downloads are available here.  If you happen to use an R package for you research, here’s some guidance on How to cite packages in R.  R also supports a very nice graphics package and the following posts provide plotting examples:

Survival Function Plots in R

Easy labels for multi-panel plots in R

R plotting examples

Parallel plots in R

## 1.5. Command line/ Linux:

Getting familiar with the command line and linux environment is essential to perform many of the examples and tutorials available in the blog.  Please check out the   Terminal basics for the truly newbies if you want an introduction to the terminal basics and requirements, also take a look at Using gdb, and notes from the book “Beginning Linux Programming”.   Also check out some useful commands:

Useful Linux commands to handle text files and speed up work

Using Linux input redirection in a C++ code test

Emacs in Cygwin

# 2. Frameworks for optimization, sensitivity analysis, and decision support

We use a variety of free open source libraries to perform commonly used analysis in our research.  Most of the libraries that I outline here were developed by our very own contributors.

## 2.2. MOEAFramework

I have personally used this framework for most of my research.  It has great functionality and speed. It is an open source Java library that supports several multi-objective evolutionary algorithms (MOEAs) and provides  tools to statistically test their performance.  It has other powerful capabilities for sensitivity and data analysis.   Download and documentation material are available here.  In addition to the documentation and examples provided on the MOEAFramework site, other useful resources and guidance can be found in the following posts:

### Setup guidance

MOEAframework on Windows

How to specify constraints in MOEAframework (External Problem)

Additional information on MOEAframework Setup Guide

### Extracting data

Extracting Data from Borg Runtime Files

Runtime metrics for MOEAFramework algorithms, extracting metadata from Borg runtime, and handling infinities

Parameter Description Files for the MOEAFramework algorithms and for the Borg MOEA

### Other uses

Running Sobol Sensitivity Analysis using MOEAFramework

Speeding up algorithm diagnosis by epsilon-sorting runtime files

## 2.2. Project Platypus

This is the newest python framework developed by Dave Hadka that support a collection of libraries for optimization, sensitivity analysis, data analysis and decision making.  It’s free to download in the Project Platypus github repository .  The repository comes with its own documentation and examples.  We are barely beginning to document our experiences with this platform 💡, but it is very intuitive and user friendly.  Here is the documentation available in the blog so far:

A simple allocation optimization problem in Platypus

Rhodium – Open Source Python Library for (MO)RDM

Using Rhodium for RDM Analysis of External Dataset

## 2.3. OpenMORDM

This is an open source library in R for Many Objective robust decision making (MORDM), for more details and documentation on both MORDM and the library use, check out the following post:

Introducing OpenMORDM

## 2.4. SALib

SALib is a python library developed by Jon Herman that supports commonly used methods to perform sensitivity analysis.  It is available here, aside from the documentation available in the github repository,  you can also find  guidance on some of the available methods in the following posts:

Method of Morris (Elementary Effects) using SALib

Extensions of SALib for more complex sensitivity analyses

Running Sobol Sensitivity Analysis using SALib

SALib v0.7.1: Group Sampling & Nonuniform Distributions

There’s also an R Package for sentitivity analysis:  Starting out with the R Sensitivity Package.  Since we are on the subject, Jan Kwakkel provides guidance on Scenario discovery in Python as well.

## 2.5. Pareto sorting function in python (pareto.py)

This is a non-dominated sorting function for multi-objective problems in python available in Matt’s github repository.  You can find more information about it in the following posts:

Announcing version 1.0 of pareto.py

Announcing pareto.py: a free, open-source nondominated sorting utility

# 3.  Borg MOEA

The Borg Multi-objective Evolutionary Algorithm (MOEA) developed by Dave Hadka and Pat Reed, is widely used in our group due to its ability to tackle complex many-objective problems.   We have plenty of documentation and references in our blog so you can get familiar with it.

## 3.1. Basic Implementation

You can find a brief introduction and basic use in Basic Borg MOEA use for the truly newbies (Part 1/2) and (Part 2/2).  If you want to link your own simulation model to the optimization algorithm, you may want to check: Compiling, running, and linking a simulation model to Borg: LRGV Example.  Here are other Borg-related posts in the blog:

Basic implementation of the parallel Borg MOEA

Simple Python script to create a command to call the Borg executable

Compiling shared libraries on Windows (32 bit and 64 bit systems)

Collecting Borg’s operator dynamics

## 3.2. Borg MOEA Wrappers

There are Borg MOEA wrappers available for a number of languages.   Currently the Python, Matlab and Perl wrappers are  documented in the blog.  I believe an updated version of the Borg Matlab wrapper for OSX documentation is required at the moment 💡.

Using Borg in Parallel and Serial with a Python Wrapper – Part 1

Using Borg in Parallel and Serial with a Python Wrapper – Part 2

Setting Borg parameters from the Matlab wrapper

Compiling the Borg Matlab Wrapper (Windows)

Compiling the Borg Matlab Wrapper (OSX/Linux)

Code Sample: Perl wrapper to run Borg with the Variable Infiltration Capacity (VIC) model

# 4. High performance computing (HPC)

With HPC we can handle and analyse massive amounts of data at high speed. Tasks that would normally take months can be done in days or even minutes and it can help us tackle very complex problems.  In addition, here are some Thoughts on using models with a long evaluation time within a Parallel MOEA framework from Joe.

In the group we have a healthy availability of HPC resources; however, there are some logistics involved when working with computing clusters.  Luckily, most of our contributors have experience using HPC and have documented it in the blog.   Also, I am currently using the  MobaXterm interface to facilitate file transfer between my local and remote directories, it also enables to easily navigate and edit files in your remote directory.  It is used by our collaborators in Politecnico di Milano who recommended it to Julie who then recommended it to me.   Moving on, here are some practical resources when working with remote clusters:

## 4.1. Getting started with clusters and key commands

The Cluster and Basic UNIX Commands

Using a local copy of Boost on your cluster account

## 4.2. Submission scripts in  Bash

Some ideas for your Bash submission scripts

Key bindings for Bash history-search

## 4.3. Making bash sessions more enjoyable

Speed up your Bash sessions with “alias”

Get more screens … with screen

Running tmux on the cluster

Making ssh better

## 4.4. Portable Batch System (PBS)

Job dependency in PBS submission

PBS Job Submission with Python

PBS job chaining

Common PBS Batch Options

## 4.5. Python parallelization and speedup

Introduction to mpi4py

Re-evaluating solutions using Python subprocesses

Speed up your Python code with basic shared memory parallelization

Connecting to an iPython HTML Notebook on the Cluster Using an SSH Tunnel

NumPy vectorized correlation coefficient

## 4.6. Debugging

Debug in Real-time on SLURM

## 4.7. File transfer

Globus Connect for Transferring Files Between Clusters and Your Computer

# Using Borg in Parallel and Serial with a Python Wrapper – Part 2

This blog post is Part 2 of a two-part series that will demonstrate how I have coupled a pure Python simulation model with the Borg multi-objective evolutionary algorithm (MOEA). I recommend reading Part 1 of this series before you read Part 2. In Part 1, I explain how to get Borg and provide sample code showing how you can access Borg’s serial and/or parallelized (master-slave) implementations through a Python wrapper (borg.py). In Part 2, I provide details for more advanced simulation-optimization setups that require you pass additional information from the borg wrapper into the simulation model (the “function evaluation”) other than just decision variable values.

In Part 1, the example simulation model I use (PySedSim) is called through a function handle “Simulation_Caller” in the example_sim_opt.py file. Borg needs only this function handle to properly call the simulation model in each “function evaluation”. Borg’s only interaction with the simulation model is to pass the simulation model’s function handle (e.g., “Simulation_Caller”) the decision variables, and nothing else. In many circumstances, this is all you need.

However, as your simulation-optimization setup becomes more complex, in order for your simulation model (i.e., the function evaluation) to execute properly, you may need to pass additional arguments to the simulation model from Borg other than just the decision variables. For example, in my own use of Borg in a simulation-optimization setting, in order to do optimization I first import a variety of assumptions and preferences to set up a Borg-PySedSim run. Some of those assumptions and preferences are helpful to the simulation model (PySedSim) in determining how to make use of the decision variable values Borg feeds it. So, I would like to pass those relevant assumptions and preferences directly into the Borg wrapper (borg.py), so the wrapper can in turn pass them directly into the simulation model along with the decision variable values.

Before I show how to do this, let me provide a more concrete example of how/why I am doing this in my own research. In my current work, decision variable values represent parameters for a reservoir operating policy that is being optimized. The simulation model needs to know how to take the decision variable values and turn them into a useful operating policy that can be simulated. Some of this information gets imported in order to run Borg, so I might as well pass that information directly into the simulation model while I have it on hand, rather than importing it once again in the simulation model.

To do what I describe above, we just need to modify the two functions in the example_sim_opt.py module so that a new argument “additional_inputs” is passed from borg to the simulation handle.  Using my python code from blog post 1, I provide code below that is modified in the Simulation_Caller() function on lines 5, 21, 22 and 27; and in the Optimization() function on lines 55, 56 and 70. After that code, I then indicate how I modify the borg.py wrapper so it can accept this information.

import numpy as np
import pysedsim # This is your simulation model
import platform  # helps identify directory locations on different types of OS

'''
Purpose: Borg calls this function to run the simulation model and return multi-objective performance.

Note: You could also just put your simulation/function evaluation code here.

Args:
vars: A list of decision variable values from Borg
additional_inputs: A list of python data structures you want to pass from Borg into the simulation model.
Returns:
performance: policy's simulated objective values. A list of objective values, one value each of the objectives.
'''

borg_vars = vars  # Decision variable values from Borg

# Unpack lists of additional inputs from Borg (example assumes additional inputs is a python list with two items)

# Reformat decision variable values as necessary (.e.g., cast borg output parameters as array for use in simulation)
op_policy_params = np.asarray(borg_vars)
# Call/run simulation model with decision vars and additional relevant inputs, return multi-objective performance:
return performance

def Optimization():

'''

Purpose: Call this method from command line to initiate simulation-optimization experiment

Returns:
--pareto approximate set file (.set) for each random seed
--Borg runtime file (.runtime) for each random seed

'''

import borg as bg  # Import borg wrapper

parallel = 1  # 1= master-slave (parallel), 0=serial

# The following are just examples of relevant MOEA specifications. Select your own values.
nSeeds = 25  # Number of random seeds (Borg MOEA)
num_dec_vars = 10  # Number of decision variables
n_objs = 6  # Number of objectives
n_constrs = 0  # Number of constraints
num_func_evals = 30000  # Number of total simulations to run per random seed. Each simulation may be a monte carlo.
runtime_freq = 1000  # Interval at which to print runtime details for each random seed
decision_var_range = [[0, 1], [4, 6], [-1,4], [1,2], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1]]
epsilon_list = [50000, 1000, 0.025, 10, 13, 4]  # Borg epsilon values for each objective
borg_dict_1 = {'simulation_preferences_1': [1,2]}  # reflects data you want Borg to pass to simulation model
borg_dict_2 = {'simulation_preferences_2': [3,4]}  # reflects data you want Borg to pass to simulation model

# Where to save seed and runtime files
main_output_file_dir = 'E:\output_directory'  # Specify location of output files for different seeds
os_fold = Op_Sys_Folder_Operator()  # Folder operator for operating system
output_location = main_output_file_dir + os_fold + 'sets'

# If using master-slave, start MPI. Only do once.
if parallel == 1:
bg.Configuration.startMPI()  # start parallelization with MPI

# Loop through seeds, calling borg.solve (serial) or borg.solveMPI (parallel) each time
for j in range(nSeeds):
# Instantiate borg class, then set bounds, epsilon values, and file output locations
borg = bg.Borg(num_dec_vars, n_objs, n_constrs, Simulation_Caller, add_sim_inputs = [borg_dict_1, borg_dict_2])
borg.setBounds(*decision_var_range)  # Set decision variable bounds
borg.setEpsilons(*epsilon_list)  # Set epsilon values
# Runtime file path for each seed:
runtime_filename = main_output_file_dir + os_fold + 'runtime_file_seed_' + str(j+1) + '.runtime'
if parallel == 1:
# Run parallel Borg
result = borg.solveMPI(maxEvaluations='num_func_evals', runtime=runtime_filename, frequency=runtime_freq)

if parallel == 0:
# Run serial Borg
result = borg.solve({"maxEvaluations": num_func_evals, "runtimeformat": 'borg', "frequency": runtime_freq,
"runtimefile": runtime_filename})

if result:
# This particular seed is now finished being run in parallel. The result will only be returned from
# one node in case running Master-Slave Borg.
result.display()

# Create/write objective values and decision variable values to files in folder "sets", 1 file per seed.
f = open(output_location + os_fold + 'Borg_DPS_PySedSim' + str(j+1) + '.set', 'w')
f.write('#Borg Optimization Results\n')
f.write('#First ' + str(num_dec_vars) + ' are the decision variables, ' + 'last ' + str(n_objs) +
' are the ' + 'objective values\n')
for solution in result:
line = ''
for i in range(len(solution.getVariables())):
line = line + (str(solution.getVariables()[i])) + ' '

for i in range(len(solution.getObjectives())):
line = line + (str(solution.getObjectives()[i])) + ' '

f.write(line[0:-1]+'\n')
f.write("#")
f.close()

# Create/write only objective values to files in folder "sets", 1 file per seed. Purpose is so that
# the file can be processed in MOEAFramework, where performance metrics may be evaluated across seeds.
f2 = open(output_location + os_fold + 'Borg_DPS_PySedSim_no_vars' + str(j+1) + '.set', 'w')
for solution in result:
line = ''
for i in range(len(solution.getObjectives())):
line = line + (str(solution.getObjectives()[i])) + ' '

f2.write(line[0:-1]+'\n')
f2.write("#")
f2.close()

print("Seed %s complete") %j

if parallel == 1:
bg.Configuration.stopMPI()  # stop parallel function evaluation process

def Op_Sys_Folder_Operator():
'''
Function to determine whether operating system is (1) Windows, or (2) Linux

Returns folder operator for use in specifying directories (file locations) for reading/writing data pre- and
post-simulation.
'''

if platform.system() == 'Windows':
os_fold_op = '\\'
elif platform.system() == 'Linux':
os_fold_op = '/'
else:
os_fold_op = '/'  # Assume unix OS if it can't be identified

return os_fold_op


Next, you will need to acquire the Borg wrapper using the instructions I specified in my previous blog post. You will need to make only two modifications: (1) modify the Borg class in borg.py so it accepts the inputs you want to pass to the simulation; and (2) some additional internal accounting in borg.py to ensure those inputs are passed to the borg.py methods that deal with your function handle. I will address these two in order.

First, modify the Borg class in borg.py so it now accepts an additional input (I only show some of the borg.py code here, just to indicate where changes are being made):


class Borg:
def __init__(self, numberOfVariables, numberOfObjectives, numberOfConstraints, function, epsilons = None, bounds = None, directions = None, add_sim_inputs=None):

# add_sim_inputs is the new input you will pass to borg



Then, modify the portion of the borg.py wrapper where self.function is called, so it can accommodate any simulation inputs you have specified.


self.function = _functionWrapper(function, numberOfVariables, numberOfObjectives, numberOfConstraints, directions)
else:
# More simulation inputs are specified and can be passed to the simulation handle



After the above, the last step is to modify the _functionWrapper method in borg.py:


def _functionWrapper(function, numberOfVariables, numberOfObjectives, numberOfConstraints, directions=None, addl_inputs=None):
# addl_inputs will be passed into the simulation model
def innerFunction(v,o,c):
global terminate
try:
result = function(*[v[i] for i in range(numberOfVariables)])
else:
result = function([v[i] for i in range(numberOfVariables)], addl_inputs)



# Making Watershed Maps in Python

This post builds off of earlier posts by Jon Lamontagne and Jon Herman on making global maps in Python using matplotlib and basemap. However rather than making a global map, I’ll show how to zoom into a particular region, here the Red River basin in East Asia. To make these maps, you’ll need to have basemap installed (from github here, or using a Windows installer here).

The first step is to create a basemap. Both Jons used the ‘robin’ global projection to do this in their posts. Since I’m only interested in a particular region, I just specify the bounding box using the lower and upper latitudes and longitudes of the region I’d like to plot. As Jon H points out, you can also specify the resolution (‘f’ = full, ‘h’ =high, ‘i’ = intermediate, ‘l’ = low, ‘c’ = crude), and you can even use different ArcGIS images for the background (see here). I use ‘World_Shaded_Relief’. It’s also possible to add a lot of features such as rivers, countries, coastlines, counties, etc. I plot countries and rivers. The argument ‘zorder’ specifies the order of the layering from 1 to n, where 1 is the bottom layer and n the top.


from mpl_toolkits.basemap import Basemap
from matplotlib import pyplot as plt

fig = plt.figure()
fig.set_size_inches([17.05,8.15])

# plot basemap, rivers and countries
m = basemap(llcrnrlat=19.5, urcrnrlat=26.0, llcrnrlon=99.6, urcrnr=107.5, resolution='h')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)



The above code makes the following image (it takes some time, since I’m using high resolution):

Now let’s add a shaded outline of the Red River basin. To do this, you need a shapefile of the basin. The FAO provides a shapefile of major watersheds in the world, from which you can extract the watershed you’re interested in using ArcGIS (see instructions here). In this shapefile, the Red River is labeled by its name in Vietnamese, ‘Song Hong.’ I chose not to draw the bounds of the basin in my map because it would be too busy with the country borders. Instead, I shaded the region gray (facecolor=’0.33′) with a slightly darker border (edgecolor=’0.5′) and slight transparency (alpha=0.5). To do that, I had to collect all of the patches associated with the shapefile (which I called ‘Basin’ when reading it in) that needed to be shaded.


from matplotlib.patches import Polygon
from matplotlib.collections import Patch Collection

# plot Red River basin
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
if info['OBJECTID'] == 1: # attribute in attribute table of shapefile
patches.append(Polygon(np.array(shape), True))



This creates the following image:

Now let’s add the locations of major dams and cities in the basin using ‘scatter‘. You could again do this by adding a shapefile, but I’m just going to add their locations manually, either by uploading their latitude and longitude coordinates from a .csv file or by passing them directly.


import numpy as np

# plot dams
damsLatLong = np.loadtxt('DamLocations.csv', delimiter=',', skiprows=1, usecols=[1,2])
x, y = m(damsLatLong[:,1], damLatLong[:,0]) # m(longitude, latitude)
m.scatter(x, y, c='k', s = 150, marker = '^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)



This makes the following image:

If we want to label the dams and cities, we can add text specifying where on the map we’d like them to be located. This may require some guess-and-check work to determine the best place (comment if you know a better way!). I temporarily added gridlines to the map to aid in this process using ‘drawparallels‘ and ‘drawmeridians‘.


# label dams and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center', va='center', color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')



Now our map looks like this:

That looks nice, but it would be helpful to add some context as to where in the world the Red River basin is located. To illustrate this, we can create an inset of the greater geographical area by adding another set of axes with its own basemap. This one can be at a lower resolution.


from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1) # locations
axins.set_xlim(90, 115) # longitude boundaries of inset map
axins.set_ylim(8, 28) # latitude boundaries of inset map

# remove tick marks from inset axes
plt.xticks(visible=False)
plt.yticks(visible=False)

# add basemap to inset map
m2 = Basemap(llcrnrlat=8.0, urcrnclat=28.0, llcrnr=90.0, urcrnrlon=115.0, resolution='l', ax=axins)
m2.drawcountries(color='k', linewidth=0.5)



This image looks like this:

Now let’s highlight a country of interest (Vietnam) in green and also add the Red River basin in light gray again.


# plot Vietnam green in inset
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
if info['Joiner'] == 1:
patches2.append(Polygon(np.array(shape), True))

# shade Red River basin gray in inset



Now our map looks like this:

Finally, let’s label the countries in the inset. Some of the countries are too small to fit their name inside, so we’ll have to create arrows pointing to them using ‘annotate‘. In this function, ‘xy’ specifies where the arrow points to and ‘xytext’ where the text is written relative to where the arrow points.


# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(102.5, 20.2, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'China', fontsize=11, ha='center', va='center', color='k')
plt.text(9.5, 21.0, 'China', fontsize=11, ha='center', va='center', color='k')

# add arrows to label Vietnam and Cambodia
plt.annotate('Vietnam', xy=(108.0, 14.0), xycoords='data', xytext=(5.0, 20.0), textcoords='offset points', \
color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)
plt.annotate('Cambodia', xy=(104.5, 12.0), xycoords='data', xytext=(-60.0, -25.0), textcoords='offset points', \
color='k', arrowprops=dict(arrowstyle='-'), fontsize=11)



Now our map looks like this:

I think that’s pretty good, so let’s save it ;). See below for all the code used to make this map, with all the import statements at the beginning rather than sporadically inserted throughout the code!

If you’re looking for any other tips on how to make different types of maps using basemap, I recommend browsing through the basemap toolkit documentation and this basemap tutorial, where I learned how to do most of what I showed here.


from mpl_toolkits.basemap import Basemap
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes
from matplotlib import pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
import numpy as np

# set-up Vietnam basemap
fig = plt.figure()
fig.set_size_inches([17.05, 8.15])

# plot basemap, rivers and countries
m = Basemap(llcrnrlat=19.5,urcrnrlat=26.0,llcrnrlon=99.6,urcrnrlon=107.5,resolution='h')
m.drawrivers(color='dodgerblue',linewidth=1.0,zorder=1)
m.drawcountries(color='k',linewidth=1.25)

# plot Red River basin
patches = []
for info, shape in zip(m.Basin_info, m.Basin):
if info['OBJECTID'] == 1:
patches.append(Polygon(np.array(shape), True))

# plot dams
x, y = m(damsLatLong[:,1], damsLatLong[:,0])
m.scatter(x, y, c='k', s=150, marker='^')

# plot Hanoi
x, y = m(105.8342, 21.0278)
m.scatter(x, y, facecolor='darkred', edgecolor='darkred', s=150)

# label reservoirs and Hanoi
plt.text(104.8, 21.0, 'Hoa Binh', fontsize=18, ha='center',va='center',color='k')
plt.text(104.0, 21.7, 'Son La', fontsize=18, ha='center', va='center', color='k')
plt.text(105.0, 21.95, 'Thac Ba', fontsize=18, ha='center', va='center', color='k')
plt.text(105.4, 22.55, 'Tuyen Quang', fontsize=18, ha='center', va='center', color='k')
plt.text(105.8, 21.2, 'Hanoi', fontsize=18, ha='center', va='center', color='k')

# plot inset of greater geographic area
axins = zoomed_inset_axes(ax, 0.1, loc=1)
axins.set_xlim(90, 115)
axins.set_ylim(8,28)

plt.xticks(visible=False)
plt.yticks(visible=False)

m2 = Basemap(llcrnrlat=8.0,urcrnrlat=28.0,llcrnrlon=90.0,urcrnrlon=115.0,resolution='l',ax=axins)
m2.drawcountries(color='k',linewidth=0.5)

# plot Vietnam green in inset
patches2 = []
for info, shape in zip(m2.Vietnam_info, m2.Vietnam):
if info['Joiner'] == 1:
patches2.append(Polygon(np.array(shape), True))

# shade Red River basin gray in inset

# label countries
plt.text(107.5, 25.5, 'China', fontsize=11, ha='center',va='center',color='k')
plt.text(102.5, 20.2, 'Laos', fontsize=11, ha='center', va='center', color='k')
plt.text(101.9, 15.5, 'Thailand', fontsize=11, ha='center', va='center', color='k')
plt.text(96.5, 21.0, 'Myanmar', fontsize=11, ha='center', va='center', color='k')

plt.annotate('Vietnam', xy=(108.0,14.0), xycoords='data', xytext=(5.0,20.0), textcoords='offset points', \
color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)
plt.annotate('Cambodia', xy=(104.5,12.0), xycoords='data', xytext=(-60.0,-25.0), textcoords='offset points', \
color='k',arrowprops=dict(arrowstyle='-'),fontsize=11)

fig.savefig('RedRiverMap.png')
fig.clf()



# 9 basic skills for editing and creating vector graphics in Illustrator

This post intends to provide guidance for editing and creating vector graphics using Adobe Illustrator.     The goal is to learn some of the commonly used features to help you get started with your vectorized journey.  Let it be a conceptual diagram, or a logos or cropping people out of your photo, these 9 features (and a fair amount googling) will help you do the job.   Before we begin, it may be worthwhile to distinguish some of the main differences between a raster and a vector graphic.  A raster image is comprised of a collection of squares or pixels, and vector graphics are  based on mathematical formulas that define geometric forms (i.e.  polygons, lines, curves, circles and rectangles), which makes them independent of resolution.

The three main advantages of using vector graphics over raster images are illustrated below:

1. Scalability: Vector graphics scale infinitely without losing any image quality. Raster images guess the colors of missing pixels when sizing up, whereas vector graphics simply use the original mathematical equation to create a consistent shape every time.

2. Edibility: Vector files are not flattened, that is, the original shapes of an image exist separately on different layers; this provides flexibility on modifying different elements without impacting the entire image.

3. Reduced file size: A vector file only requires four data points to recreate a square ,whereas a raster image needs to store many small squares.

# 1. Starting a project

You can start a new project simply by clicking File> New, and the following window will appear.  You can provide a number of specifications for your document before starting, but you can also customize your document at any stage by clicking File> Document setup (shortcut Alt+Ctrl+P).

# 2. Creating basic shapes

## Lines & Arrows

Simply use the line segment tool ( A ) and remember to press the shift button to create perfectly straight lines.  Arrows can be added by using the stroke window (Window> Stroke) and (B) will appear, there’s a variety of arrow styles that you can select from and scale (C).  Finally,  in the line segment tool you can provide the exact length of your line.

## Polygons

Some shapes are already specified in Illustrator (e.g. rectangles , stars and circles (A), but many others such as triangles, need to be specified through the polygon tool.  To draw a triangle I need to specify the number of sides =3 as shown in  (B).

## Curvatures

To generate curvatures, you can use the pen tool (A).  Specify two points with the pen, hold the click in the second point and a handle will appear, this handle allows you to shape the curve.  If you want to add more curvatures, draw another point (B) and drag the handle in the opposite direction of the curve.  You can then select the color (C) and the width (D) of your wave.

# 3. Matching colors

If you need to match the color of an image (A) there are a couple of alternatives:

i) Using the “Eyedrop” tool ( B).  Select the component of the image that you want to match, then select the Eyedrop tool and click on the desired color (C).

ii) Using the color picker panel.  Select the image component with the desired color, then double click on the color picker (highlighted in red) and the following panel should appear.  You can see the exact color code and you can copy and paste it on the image that you wish to edit.

# 4. Extracting exact position and dimensions

In the following example, I want the windows of my house to be perfectly aligned.  First, in (A), I click on one of the windows of my house and the control panel automatically provides its x and y coordinates, as well its width and height.  Since I want to align both of the windows horizontally, I investigate the  Y coordinates  of the first window and copy it onto the y coordinate of he second window as shown in (B).  The same procedure would apply if you want to copy the dimensions from one figure to another.

# 5. Free-style drawing and editing

The pencil tool (A) is one of my favorite tools in Illustrator, since it corrects my shaky strokes, and allows me to  paint free style.  Once I added color and filled the blob that I drew, it started resembling more like a tree top (B).  You can edit your figure by right clicking it.  A menu will appear enabling you to rotate, reflect, shear and  scale, among other options.  I only wanted to tilt my tree so I specify a mild rotation  (C).

# 6. Cropping:

Cropping in Illustrator requires clipping masks and I will show a couple of examples using  Bone Bone, a fluffy celebrity cat.  Once a .png image is imported into illustrator, it can be cropped using  one of the following three methods:

Method 1.  Using the direct selection tool

Method 2. Using shapes

Method 3. Using the pen tool for a more detailed crop

To reverse  to the original image Select Object> Clipping mask> Release or Alt+Ctrl+7

# 7. Customize the art-board size

If you want your image to be saved without extra white space (B), you can adapt the size of the canvas with  the Art-board tool (or Shft+8) ( A).

# 8. Using layers:

Layers can help you organize artwork, specially when working with multiple components in an image.  If the Layers panel is not already in your tools, you can access it through Window>  Layers or through the F7 shortcut.  A panel like the  one below should appear.   You can name the layers by double clicking on them, so you can give them a descriptive name.  Note that you can toggle the visibility of each layer on or off. You can also lock a layer if you want to protect it from further change, like the house layer in the example below.  Note that each layer is color-coded, my current working layer is coded in red, so when I select an element in that layer it will be highlighted in red.  The layers can also have sub-layers to store individual shapes, like in the house layer, which is comprised of a collection of rectangles and a triangle.

# 9.  Saving vector and exporting raster

Adobe Illustrator, naturally allows you to save images in many vector formats, But you can also export raster images such as .png, .jpeg, .bmp, etc.. To export raster images do File> Export and something like the blurry panel below should show up.  You can specify the  resolution and the color of the background. I usually like a transparent background  since it allows you more flexibility when using your image in programs such as Power Point.

There are many more features available in Illustrator, but this are the ones that I find myself using quite often.  Also, you probably won’t have to generate images from scratch, there are many available resources online. You can download svg images for free which you an later customize in Illustrator.  You can also complement this post by reading Jon Herman’s Scientific figures in Illustrator.

# A visual introduction to data compression through Principle Component Analysis

Principle Component Analysis (PCA) is a powerful tool that can be used to create parsimonious representations of a multivariate data set. In this post I’ll code up an example from Dan Wilks’ book Statistical Methods in the Atmospheric Sciences to visually illustrate the PCA process. All code can be found at the bottom of this post.

As with many of the examples in Dr. Wilks’ excellent textbook, we’ll be looking at minimum temperature data from Ithaca and Canandaigua, New York  (if anyone is interested, here is the distance between the two towns).  Figure 1 is a scatter plot of the minimum temperature anomalies at each location for the month of January 1987.

Figure 1: Minimum temperature anomalies in Ithaca and Canandaigua, New York in January 1987

As you can observe from Figure 1, the two data sets are highly correlated, in fact, they have a Pearson correlation coefficient of 0.924. Through PCA, we can identify the primary mode of variability within this data set (its largest “principle component”) and use it to create a single variable which describes the majority of variation in both locations. Let x define the matrix of our minimum temperature anomalies in both locations. The eigenvectors (E) of the covariance matrix of x describe the primary modes variability within the data set. PCA uses these eigenvectors to  create a new matrix, u,  whose columns contain the principle components of the variability in x.

$u = xE$

Each element in u is a linear combination of the original data, with eigenvectors in E serving as a kind of weighting for each data point. The first column of u corresponds to the eigenvector associated with the largest eigenvalue of the covariance matrix. Each successive column of u represents a different level of variability within the data set, with u1 describing the direction of highest variability, u2 describing the direction of the second highest variability and so on and so forth. The transformation resulting from PCA can be visualized as a rotation of the coordinate system (or change of basis) for the data set, this rotation is shown in Figure 2.

Figure 2: Geometric interpretation of PCA

As can be observed in Figure 2, each data point can now be described by its location along the newly rotated axes which correspond to its corresponding value in the newly created matrix u. The point (16, 17.8), highlighted in Figure 2, can now be described as (23, 6.6) meaning that it is 23 units away from the origin in the direction of highest variability and 6.6 in the direction of second highest variability. As shown in Figure 2, the question of “how different from the mean” each data point is can mostly be answered by looking at its  corresponding u1 value.

Once transformed, the original data can be recovered through a process known as synthesis. Synthesis  can be thought of as PCA in reverse. The elements in the original data set x, can be approximated using the eigenvalues of the covariance matrix and the first principle component, u1.

$\tilde{x} = \tilde{u}\tilde{E}^T$

Where:

$\tilde{x}\hspace{.1cm} is\hspace{.1cm} the\hspace{.1cm} reconstructed\hspace{.1cm} data\hspace{.1cm} set$

$\tilde{u}\hspace{.1cm} is\hspace{.1cm} the\hspace{.1cm} PCs\hspace{.1cm} used \hspace{.1cm} for \hspace{.1cm} reconstruction\hspace{.1cm} (in\hspace{.1cm} our\hspace{.1cm} case\hspace{.1cm} the\hspace{.1cm} first\hspace{.1cm} column)$

$\tilde{E}\hspace{.1cm} is \hspace{.1cm} the \hspace{.1cm} eigenvector\hspace{.1cm} of \hspace{.1cm} the \hspace{.1cm} PCs \hspace{.1cm} used$

For our data set, these reconstructions seem to work quite well, as can be observed in Figure 3.

Data compression through PCA can be a useful alternative tolerant methods for dealing with multicollinearity, which I discussed in my previous post. Rather than running a constrained regression, one can simply compress the data set to eliminate sources of multicollinearity. PCA can also be a helpful tool for identifying patterns within your data set or simply creating more parsimonious representations of a complex set of data. Matlab code used to create the above plots can be found below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Ithaca_Canandagua_PCA
% By: D. Gold
% Created: 3/20/17
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% This script will perform Principle Component analysis on minimum
% temperature data from Ithaca and Canadaigua in January, 1987 provided in
% Appendix A of Wilks (2011). It will then estimate minimum temperature
% values of both locations using the first principle component.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% create data sets
clear all

% data from appendix Wilks (2011) Appendix A.1
Ith = [19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0,...
2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23]';

Can = [28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23,...
13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23]';

%% center the data, plot temperature anomalies, calculate covariance and eigs

% center the data
x(:,1) = Ith - mean(Ith);
x(:,2) = Can - mean(Can);

% plot anomalies
figure
scatter(x(:,1),x(:,2),'Filled')
xlabel('Ithaca min temp anomaly ({\circ}F)')
ylabel('Canandagua min temp anomaly ({\circ}F)')

% calculate covariance matrix and it's corresponding Eigenvalues & vectors
S = cov(x(:,1),x(:,2));
[E, Lambda]=eigs(S);

% Identify maximum eigenvalue, it's column will be the first eigenvector
max_lambda = find(max(Lambda)); % index of maximum eigenvalue in Lambda
idx = max_lambda(2); % column of max eigenvalue

%% PCA
U = x*E(:,idx);

%% synthesis
x_syn = E(:,idx)*U'; % reconstructed values of x

% plot the reconstructed values against the original data
figure
subplot(2,1,1)
plot(1:31,x(:,1) ,1:31, x_syn(1,:),'--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Ithaca min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')
subplot(2,1,2)
plot(1:31, x(:,2),1:31, x_syn(2,:)','--')
xlim([1 31])
xlabel('Time (days)')
legend('Original', 'Reconstruction')