Profiling C++ code with Callgrind

Oftentimes we have to write code to perform tasks whose complexity varies from the mundane, such as retrieving and organizing data, to the highly complex, such as the CFD simulations comprising the spine of a project. In either case, depending on the complexity of the task and the amount of data to be processed, the newborn code may leave us staring at an underscore blinking gracefully on a command prompt for hours until the results are ready, leading to project schedule delays and shortages of patience. Two standard and preferred approaches to the problem of time-intensive code are to simplify the algorithm and to make the code more efficient. In order to better select the parts of the code to work on, it is often useful to first find the parts in which the most time is spent, which is done by profiling the code.

In this post, I will show how to use Callgrind, part of Valgrind, and KCachegrind to profile C/C++ code on Linux. Unfortunately, Valgrind is not available for Windows or Mac, although it can be run on a cluster, from which the results can be downloaded and visualized on Windows with QCachegrind. The first step is to install Valgrind and KCachegrind by typing the following commands in the terminal of a Debian-based distribution, such as Ubuntu (equivalent yum commands are available for Red Hat-based distributions):

$ sudo apt-get install valgrind
$ sudo apt-get install kcachegrind

Now that the required tools are installed, the next step is to compile your code with GCC/G++ (with a makefile, CMake, an IDE, or by running the compiler directly from the terminal).
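For the most useful results, it helps to compile with debugging symbols enabled (g++'s -g flag) so that Callgrind can annotate costs down to individual source lines; invoking the compiler directly, that might look like the line below (the file and program names here are just placeholders):

$ g++ -g -O2 -o your_program your_source.cpp

With the program compiled, run the following command in a terminal (type ctrl+shift+T to open the terminal):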

$ valgrind --tool=callgrind path/to/your/compiled/program program_arguments

Callgrind will then run your program with some instrumentation added to its execution to measure time expenditures and cache use by each function in your code. Because of the instrumentation, your code will take considerably longer to run under Callgrind than it typically would, so be sure to run a representative task that is as small as possible when profiling your code. During its execution, Callgrind will print a report similar to the one below to the terminal itself:

==12345== Callgrind, a call-graph generating cache profiler
==12345== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==12345== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==12345== Command: path/to/your/compiled/program program_arguments
==12345==
==12345== For interactive control, run 'callgrind_control -h'.
IF YOUR CODE OUTPUTS TO THE TERMINAL, THE OUTPUT WILL BE SHOWN HERE.
==12345== 
==12345== Events    : Ir
==12345== Collected : 4171789731
==12345== 
==12345== I   refs:      4,171,789,731

The report above shows that Callgrind collected over 4 billion events in order to generate the comprehensive report saved in the file callgrind.out.12345 (12345 here is your process id, shown throughout the report above). Instead of submerging your soul into a sea of despair by trying to read the output file in a text editor, you should load the file into KCachegrind by typing:

$ kcachegrind callgrind.out.12345

You should now see a screen like the one below:

kcachegrind_initial.png

The screenshot above shows the profiling results for my code. The left panel shows the functions called by my code, sorted by the total time spent inside each function. Because functions call each other, KCachegrind shows two cost metrics as proxies for the time spent in each function: “Incl.”, the total cost of a function including the functions it calls, and “Self”, the time spent in the function itself, discounting its callees. By clicking on “Self” to order the functions by the cost of the function itself, we sort the functions by the cost of their own code, as shown below:

Untitled_sorted

Callgrind includes functions that are native to C/C++ in its analysis. If one of them appears in the highest positions of the left panel, it may be worth trying a different function or data structure that performs a similar task in a more efficient way. Most of the time, however, our own functions occupy most of the top positions in the list. In the example above, we can see that a possible first step to improve the time performance of my code is to make the function “ContinuityModelROF::shiftStorage” more efficient. A few weeks ago, however, the function “ContinuityModel::continuityStep” was ranked first with over 30% of the cost, followed by a C++ map-related function. I then replaced a map inside that function with a vector of pointers, which dropped that function’s cost to less than 5% of the total cost of the code.

In case KCachegrind shows that a given function called from multiple places in the code is costly, you may want to know which caller is the main culprit behind the costly calls. To do this, click on the function of interest (in this case, “__memcpy_sse2_unaligned”) in the left panel, then click on “Callers” in the upper right panel and on “Call Graph” in the lower right panel. This will show, in list and graph forms, the calls made to the function by other functions and the associated percent costs. In my case, only the function “ContinuityModelROF::calculateROF” calls “__memcpy_sse2_unaligned,” hence the simple graph, but the graph would be more complex if multiple functions made calls to it.
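As a side note, if you are profiling on a remote cluster and cannot open KCachegrind there, Valgrind also ships a text-based tool that prints a per-function summary of the same output file, which you can use before (or instead of) downloading the file for visualization:

$ callgrind_annotate callgrind.out.12345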

I hope this saves you at least the time spent reading this post!

A visual introduction to data compression through Principal Component Analysis

Principal Component Analysis (PCA) is a powerful tool that can be used to create parsimonious representations of a multivariate data set. In this post I’ll code up an example from Dan Wilks’ book Statistical Methods in the Atmospheric Sciences to visually illustrate the PCA process. All code can be found at the bottom of this post.

As with many of the examples in Dr. Wilks’ excellent textbook, we’ll be looking at minimum temperature data from Ithaca and Canandaigua, New York  (if anyone is interested, here is the distance between the two towns).  Figure 1 is a scatter plot of the minimum temperature anomalies at each location for the month of January 1987.

raw_data

Figure 1: Minimum temperature anomalies in Ithaca and Canandaigua, New York in January 1987

As you can observe from Figure 1, the two data sets are highly correlated; in fact, they have a Pearson correlation coefficient of 0.924. Through PCA, we can identify the primary mode of variability within this data set (its largest “principal component”) and use it to create a single variable which describes the majority of the variation in both locations. Let x define the matrix of our minimum temperature anomalies in both locations. The eigenvectors (E) of the covariance matrix of x describe the primary modes of variability within the data set. PCA uses these eigenvectors to create a new matrix, u, whose columns contain the principal components of the variability in x.

u = xE

Each element in u is a linear combination of the original data, with the eigenvectors in E serving as a kind of weighting for each data point. The first column of u corresponds to the eigenvector associated with the largest eigenvalue of the covariance matrix. Each successive column of u represents a different level of variability within the data set, with u1 describing the direction of highest variability, u2 describing the direction of second highest variability, and so on. The transformation resulting from PCA can be visualized as a rotation of the coordinate system (or change of basis) for the data set; this rotation is shown in Figure 2.

PCA visualization

Figure 2: Geometric interpretation of PCA

As can be observed in Figure 2, each data point can now be described by its location along the newly rotated axes, which corresponds to its value in the newly created matrix u. The point (16, 17.8), highlighted in Figure 2, can now be described as (23, 6.6), meaning that it is 23 units away from the origin in the direction of highest variability and 6.6 units away in the direction of second highest variability. As shown in Figure 2, the question of how different each data point is from the mean can mostly be answered by looking at its corresponding u1 value.

Once transformed, the original data can be recovered through a process known as synthesis. Synthesis can be thought of as PCA in reverse. The elements in the original data set x can be approximated using the eigenvectors of the covariance matrix and the first principal component, u1.

\tilde{x} = \tilde{u}\tilde{E}^T

Where:

\tilde{x} = the reconstructed data set

\tilde{u} = the principal components used for reconstruction (in our case, the first column of u)

\tilde{E} = the eigenvector(s) associated with the principal components used
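For readers who prefer Python, the sketch below reproduces the same transform and single-PC synthesis with numpy; the anomaly matrix is generated synthetically here just to keep the snippet self-contained (the Matlab code at the bottom of the post uses the actual Wilks data).

import numpy as np

# synthetic stand-in for the 31 x 2 anomaly matrix (rows = days, columns = locations)
rng = np.random.default_rng(0)
x = rng.multivariate_normal([0, 0], [[80, 65], [65, 75]], size=31)
x = x - x.mean(axis=0)                   # center the data

S = np.cov(x, rowvar=False)              # covariance matrix
eigvals, E = np.linalg.eigh(S)           # eigh returns eigenvalues in ascending order
E = E[:, ::-1]                           # reorder so the first column is the leading eigenvector

u = x @ E                                # principal components: u = xE
x_tilde = np.outer(u[:, 0], E[:, 0])     # synthesis with the first PC only: x~ = u1 * e1^T
print(np.corrcoef(x[:, 0], x_tilde[:, 0])[0, 1])  # how well the reconstruction tracks location 1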

For our data set, these reconstructions seem to work quite well, as can be observed in Figure 3.

final_synth.png

Data compression through PCA can be a useful alternative to the tolerant methods for dealing with multicollinearity that I discussed in my previous post. Rather than running a constrained regression, one can simply compress the data set to eliminate sources of multicollinearity. PCA can also be a helpful tool for identifying patterns within your data set or simply creating more parsimonious representations of a complex set of data. The Matlab code used to create the above plots can be found below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Ithaca_Canandaigua_PCA
% By: D. Gold
% Created: 3/20/17
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% This script will perform Principal Component Analysis on minimum
% temperature data from Ithaca and Canandaigua in January 1987, provided in
% Appendix A of Wilks (2011). It will then estimate minimum temperature
% values at both locations using the first principal component.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% create data sets
clear all

% data from Wilks (2011), Appendix A.1
Ith = [19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0,... 
 2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23]';

Can = [28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23,... 
 13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23]';

%% center the data, plot temperature anomalies, calculate covariance and eigs

% center the data
x(:,1) = Ith - mean(Ith);
x(:,2) = Can - mean(Can);

% plot anomalies
figure
scatter(x(:,1),x(:,2),'Filled')
xlabel('Ithaca min temp anomaly ({\circ}F)')
ylabel('Canandaigua min temp anomaly ({\circ}F)')

% calculate covariance matrix and its corresponding eigenvalues & eigenvectors
S = cov(x(:,1),x(:,2));
[E, Lambda]=eigs(S);

% Identify maximum eigenvalue; its column will be the first eigenvector
max_lambda = find(max(Lambda)); % index of maximum eigenvalue in Lambda
idx = max_lambda(2); % column of max eigenvalue

%% PCA
U = x*E(:,idx);

%% synthesis
x_syn = E(:,idx)*U'; % reconstructed values of x

% plot the reconstructed values against the original data
figure
subplot(2,1,1)
plot(1:31,x(:,1) ,1:31, x_syn(1,:),'--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Ithaca min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')
subplot(2,1,2)
plot(1:31, x(:,2),1:31, x_syn(2,:)','--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Canandaigua min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')

Sources:

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences. Amsterdam: Elsevier Academic Press.

Solving non-linear problems using linear programming

This week’s post comes from recent conversations we’ve had around the Reed group concerning tools to quickly solve (approximately) non-linear programming problems.  First, some context.

As part of a simulation model our group is building, a drinking water allocation sub-problem must be solved.   Figure 1 is a simplified example of the sort of problem we are solving.

network

Figure 1: Mock water distribution network

There are three utilities that each have a demand (d_{1}, d_{2}, and d_{3}). The utilities are connected via some infrastructure, as shown in Figure 1.  When our total available water (R) is in excess of the demand (d_{1}+d_{2}+d_{3}), no rationing is needed.  When we do need to ration, we want to allocate the water to minimize the percent supply deficits across the three utilities:

equation 1

Equation 1

Subject to:

equation 2

The last constraint here describes a limitation of the distribution network.  The real problem is much more complicated, but we needn’t detail that here.

This problem needs to be solved thousands, or hundreds of thousands of times in each simulation, so we want any solution technique to be fast.  The natural solution is linear programming (LP), which can solve problems with tens of thousands of variables and constraints nearly instantaneously.

We won’t discuss LP in great detail here, except to say that LP requires an objective and constraints that are linear with respect to the decision variables.  These restrictive requirements significantly reduce the number of potential optimal solutions that must be searched.  By systematically testing and pivoting between these potential optimal solutions, the popular Simplex Algorithm quickly converges to the optimal solution.

As stated in equation 1, our rationing scheme is indifferent between imposing small deficits across all three utilities and imposing one large deficit on a single utility.  For example, the objective value in equation 1 is the same whether each utility has a deficit of 5%, or utility 1 has a deficit of 15% and utilities 2 and 3 have no deficit.  In reality, many small deficits are likely preferable to one large one.  So what are we to do?

We could square our deficits.  In that case, our rationing scheme will prefer small distributed deficits over one large deficit:

equation 3

Equation 2

BUT, we can’t use LP to solve this problem, as our objective is now non-linear! There are non-linear programming algorithms that are relatively fast, but perhaps not fast enough.  Instead we could linearize our non-linear objective, as shown in Figure 2.

The strategy here is to divide a single allocation,  x_{1} for instance, into many decision variables, representing different ranges of the actual allocation x_{1}.  In each range, a linear segment approximates the actual quadratic objective function.  Any actual release x_{1} can be achieved by assigning the appropriate values to the new decision variables (k_{1}, k_{2}, and k_{3}), and the contribution to the objective function from that release can be approximated by:

equation 4

Equation 3

Subject to:

equation 5

If a more accurate description is needed, the range of x_{1} can be divided into more segments.  For our purposes just a few segments are probably sufficient.  A similar strategy can be adopted for x_{2} and x_{3}.  Of course the constraints from the original optimization problem would need to be translated into terms of the new decision variables.

Now we are adding many more decision variables and constraints, but this is unlikely to slow a modern LP algorithm too much; we are still solving a relatively simple problem.  BUT, how does the LP algorithm know to increase k_{1} to its maximum threshold before applying k_{2}?  Do we need to add a number of conditional constraints to ensure this is done properly?

It turns out we don’t!  Because our squared deficit curve in Figure 2 is monotonic and convex, we know that the slopes of the linear segments making up the approximation are increasing (becoming less negative).  Thus, in a minimization problem, the marginal improvement in the objective is highest for the k_{1} segment, followed by the k_{2} segment, followed by the k_{3} segment, and so on.  In other words, a < b < c.  For this reason, the algorithm will increase k_{1} to its maximum threshold before assigning a non-zero value to k_{2}, and so forth.  No need for complicated constraints!

Now this is not always the case.  If the function were not monotonic, or if it were convex for a maximization, or concave for a minimization, this would not work.  But, this trick works for a surprising number of applications in water resources systems analysis!
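To make the idea concrete, here is a minimal sketch of the trick using scipy.optimize.linprog (assuming a reasonably recent SciPy). The demands, supply, and three-segment approximation are made up for illustration, and the segments here are applied to each utility's fractional deficit rather than to the allocations themselves, which is an equivalent way of writing the same approximation.

import numpy as np
from scipy.optimize import linprog

# hypothetical data: three demands and a total supply R that falls short of them
d = np.array([30.0, 20.0, 10.0])
R = 45.0

n_seg = 3                                   # linear segments per utility
breaks = np.linspace(0.0, 1.0, n_seg + 1)   # breakpoints of the fractional deficit
seg_len = np.diff(breaks)
# slope of f(z) = z^2 on each segment; the slopes increase, so cheap segments get filled first
slopes = (breaks[1:]**2 - breaks[:-1]**2) / seg_len

# decision variables k[i, s]: the portion of utility i's fractional deficit placed in segment s
c = np.tile(slopes, len(d))                 # objective: sum_i sum_s slope_s * k[i, s]

# supply constraint sum_i x_i <= R, with x_i = d_i * (1 - sum_s k[i, s]),
# rewritten as -sum_i d_i * sum_s k[i, s] <= R - sum_i d_i
A_ub = -np.repeat(d, n_seg)[None, :]
b_ub = [R - d.sum()]

bounds = [(0.0, w) for w in np.tile(seg_len, len(d))]   # each segment has a limited width

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
deficits = res.x.reshape(len(d), n_seg).sum(axis=1)
print("fractional deficits:", deficits)
print("allocations:", d * (1 - deficits))

Because the segment slopes increase, the solver fills the cheap segments first, exactly as argued above, without any extra constraints.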

If nothing else this simple example serves as a reminder that a little bit of thought in formulating problems can save a lot of time later!

A simple allocation optimization problem in Platypus

For those of you who are not familiar with Project Platypus, it’s a repository that supports a collection of Python libraries for multi-objective optimization, decision making and data analysis. All the libraries are written in a very intuitive way, and it is just so slick.

In this post I will focus exclusively on the Platypus library, which supports a variety of multi-objective evolutionary algorithms (MOEAs), such as NSGA-II, NSGA-III, MOEA/D, IBEA, EpsMOEA, SPEA2, GDE3, OMOPSO and SMPSO, along with a number of analysis tools and performance metrics (which we will be discussing in subsequent posts).

First you can install the entire framework by typing the following commands on your terminal:

git clone https://github.com/Project-Platypus/Platypus.git
cd Platypus
python setup.py develop

If you have trouble with this first step, please feel free to report it.  The library is still under development and it might be infested with some minor bugs.

I wanted to start this post with a classical allocation problem to illustrate how you would implement your own function and optimize it using Platypus.  This example, by the way, was inspired by Professor Loucks’ public systems modeling class.

We have the following problem:

  1. We want to allocate coconuts to three different markets.
  2. The goal is to find the allocation that maximizes the total benefits.
  3. We have only 6 coconut trucks to distribute to the three markets.

So, let’s find the best allocations using the NSGAII algorithm supported by Platypus:


from platypus.algorithms import NSGAII
from platypus.core import Problem
from platypus.types import Real

class allocation(Problem):

    def __init__(self):
        super(allocation, self).__init__(3, 3, 1)
        self.types[:] = [Real(-10, 10), Real(-10, 10), Real(-10,10)]
        self.constraints[:] = "<=0"

    def evaluate(self, solution):
        x = solution.variables[0]
        y = solution.variables[1]
        z = solution.variables[2]
        solution.objectives[:] = [6*x/(1+x), 7*y/(1+1.5*y), 8*z/(1+0.5*z)]
        solution.constraints[:] = [x+y+z-6]

algorithm = NSGAII(allocation())
algorithm.run(10000)

for solution in algorithm.result:
   print(solution.objectives)

Assuming that you already have Platypus installed, you should be able to import the classes specified in lines 1-3 of the code above.  The first line, as you can tell, is where you import the MOEA of your choosing; make sure it is supported by Platypus.  In the second line, we import the Problem class, which allows you to define your own function.  You also need to import the type class to specify your decision variables’ type; the library supports Real, Binary, Integer, etc.  I define the allocation class in line 5.  In line 8, I specify the number of decision variables (3), the number of objectives (3) and the number of constraints (1).  The bounds of my decision variables are defined in line 9.  The evaluate function defined in line 12 evaluates each solution and stores its objective and constraint values.

Note that for solution.objectives, the objective functions are specified in a vector, separated by commas.  You can set the constraints in line 17; if you have more than one constraint, they go in the same vector, separated by commas.  In line 19, the problem is handed to the NSGAII algorithm, and in line 20 you set the number of function evaluations; I have gone up to 100,000, but it really depends on the difficulty of your problem and your choice of algorithm.  Finally, I print the objective values in line 23, but you can print the decision variables as well if you wish (see the note below).  As you can see, the setup of your problem can be extremely easy.  The names of the classes and functions in the library are very intuitive, and you can focus entirely on your problem formulation.
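One caveat worth flagging (my observation about the library, not something stated in the original example): Platypus treats objectives as minimization problems by default, so to actually maximize the total benefits described above, the __init__ method would likely also need to set the objective directions. A sketch of the modified __init__, plus a final loop that also prints the allocations themselves:

def __init__(self):
    super(allocation, self).__init__(3, 3, 1)
    self.types[:] = [Real(-10, 10), Real(-10, 10), Real(-10, 10)]
    self.directions[:] = Problem.MAXIMIZE  # Platypus minimizes by default
    self.constraints[:] = "<=0"

# and, at the end of the script, print the allocations alongside their benefits
for solution in algorithm.result:
    print(solution.variables, solution.objectives)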

Dealing With Multicollinearity: A Brief Overview and Introduction to Tolerant Methods

This semester I’m taking a Multivariate statistics course taught by Professor Scott Steinschneider in the BEE department at Cornell. I’ve been really enjoying the course thus far and thought I would share some of what we’ve covered in the class with a blog post. The material below on multicollinearity is from Dr. Steinschneider’s class, presented in my own words.

What is Multicollinearity?

Multicollinearity is the condition where two or more predictor variables in a statistical model are linearly related (Dormann et al. 2013). The existence of multicollinearity in your data set can result in an increase in the variance of regression coefficients, leading to unstable estimates of parameter values. This in turn can lead to erroneous identification of relevant predictors within a regression, and detracts from a model’s ability to extrapolate beyond the range of the sample it was constructed with. In this post, I’ll explain how multicollinearity causes problems for linear regression by Ordinary Least Squares (OLS), introduce three metrics for detecting multicollinearity and detail two “Tolerant Methods” for dealing with multicollinearity within a data set.

How does multicollinearity cause problems in OLS regression?

To illustrate the problems caused by multicollinearity, let’s start with a linear regression:

y=x\beta +\epsilon

Where:

y = a vector of observations of the predictand

x = a matrix of predictor variables (one column per predictor)

\beta = a vector of coefficients

\epsilon = a vector of residuals

The Gauss-Markov theorem states that the Best Linear Unbiased Estimator (BLUE) for each  coefficient can be found using OLS:

\hat{\beta}_{OLS} = (x^Tx)^{-1}x^Ty

This  estimate will have a variance defined as:

var(\hat{\beta}_{OLS}) =\sigma^2 (x^Tx)^{-1}

Where:

\sigma^2 = the variance of the residuals

If you dive into the matrix algebra, you will find that the term (x^Tx) is equal to a matrix with ones on the diagonal and the pairwise Pearson’s correlation coefficients (ρ) on the off-diagonals (assuming the predictors have been standardized):

(x^Tx) =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}

As the correlation values increase, the values within (x^Tx)^{-1} also increase. Even with a low residual variance, multicollinearity can cause large increases in estimator variance. Here are a few examples of the effect of multicollinearity using a hypothetical regression with two predictors:

 \rho = .3 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1.09 & -0.33 \\ -0.33 & 1.09 \end{bmatrix}

 \rho = .9 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 5.26 & -4.73 \\ -4.73 & 5.26 \end{bmatrix}

 \rho = .999 \rightarrow (x^Tx)^{-1} =\begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 500.25 & -499.75 \\ -499.75 & 500.25\end{bmatrix}

So why should you care about the variance of your coefficient estimators? The answer depends on what the purpose of your model is. If your only goal is to obtain an accurate prediction of the predictand, the presence of multicollinearity in your predictors might not be such a problem. If, however, you are trying to identify the key predictors that affect the predictand, multicollinearity is a big problem.

OLS estimators with large variances are highly unstable, meaning that if you construct estimators from different data samples you will potentially get wildly different estimates of your coefficient values (Dormann et al. 2013). Large estimator variance also undermines the trustworthiness of hypothesis testing of the significance of coefficients. Recall that the t value used in hypothesis testing for an OLS regression coefficient is a function of the sample standard deviation (the square root of the variance) of the  OLS estimator.

t_{n-2} =\frac{\hat{\beta_j}-0}{s_{\beta_j}}

An estimator with an inflated standard deviation, s_{\beta_j}, will thus yield a lower t value, which could lead to a significant predictor being incorrectly dismissed as insignificant (i.e., a Type II error). See Ohlemüller et al. (2008) for some examples where hypothesis testing results are undermined by multicollinearity.

Detecting Multicollinearity within a data set

Now we know how multicollinearity causes problems in our regression, but how can we tell if there is multicollinearity within a data set? There are several commonly used metrics for which basic guidelines have been developed to determine whether multicollinearity is present.

The most basic metric is the pairwise Pearson Correlation Coefficient between predictors, r. Recall from your intro statistics course that the Pearson Correlation Coefficient is a measure of the linear relationship between two variables, defined as:

r_{x_1,x_2}=\frac{cov(x_1,x_2)}{\sigma_{x_1}\sigma_{x_2}}

A common rule of thumb is that multicollinearity may be a problem in a data set if any pairwise |r| > 0.7 (Dormann et al. 2013).

Another common metric is known as the Variance Inflation Factor (VIF). This measure is calculated by regressing each predictor on all others being used in the regression.

VIF(\beta_j) = \frac{1}{1-R^2_j}

Where R_j^2 is the R^2 value generated by regressing predictor x_j on all other predictors. Multicollinearity is thought to be a problem if VIF > 10 for any given predictor (Dormann et al. 2013).

A third metric for detecting multicollinearity in a data set is the Condition Number (CN) of the predictor matrix, defined as the square root of the ratio of the largest and smallest eigenvalues of x^Tx:

CN=\sqrt{\frac{\lambda_{max}}{\lambda_{min}}}

CN> 15 indicates the possible presence of multicollinearity, while a CN > 30 indicates serious multicollinearity problems (Dormann et al. 2013).
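To see these diagnostics in action, here is a small numpy-only sketch on a synthetic pair of correlated predictors (the data are made up; nothing here comes from a real data set):

import numpy as np

# two standardized predictors with a built-in correlation of about 0.9
rng = np.random.default_rng(1)
n = 200
x1 = rng.standard_normal(n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# 1) pairwise Pearson correlation coefficient
r = np.corrcoef(X, rowvar=False)[0, 1]

# 2) variance inflation factor: regress each predictor on all of the others
def vif(X, j):
    others = np.delete(X, j, axis=1)
    beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

# 3) condition number: square root of the ratio of extreme eigenvalues of X^T X
lam = np.linalg.eigvalsh(X.T @ X)
cn = np.sqrt(lam.max() / lam.min())

print(f"|r| = {abs(r):.2f}, VIF = {vif(X, 0):.1f}, CN = {cn:.1f}")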

Dealing with Multicollinearity using Tolerant Methods

In a statistical sense, there is no way to “fix” multicollinearity. However, methods have been developed to mitigate its effects. Perhaps the most effective way to remedy multicollinearity is to make a priori judgements about the relationship between predictors and remove or consolidate predictors that have known correlations. This is not always possible however, especially when the true functional forms of relationships are not known (which is often why regression is done in the first place). In this section I will explain two “Tolerant Methods” for dealing with multicollinearity.

The purpose of Tolerant Methods is to reduce the sensitivity of regression parameters to multicollinearity. This is accomplished through penalized regression. Since multicollinearity can result in large, opposite-signed estimator values for correlated predictors, a penalty function is imposed to keep the overall size of the coefficients below a pre-specified value.

\sum_{j=1}^{p}|\beta_j|^l \leq c

Where c is the predetermined value representing model complexity, p is the number of predictors and l is either 1 or 2 depending on the type of tolerant method employed (more on this below).

Ridge Regression

Ridge regression uses the L2 norm, or Euclidean distance, to constrain model coefficients (i.e., l = 2 in the equation above). The coefficients created using ridge regression are defined as:

\hat{\beta}_{r} = (x^Tx+\lambda I)^{-1}x^Ty

Ridge regression adds a constant, λ, to the diagonal of the term x^Tx to construct the estimator. It should be noted that both x and y should be standardized before this estimator is constructed. The ridge regression coefficient is the result of a constrained version of the ordinary least squares optimization problem: the objective is to minimize the sum of squared errors for the regression while meeting the complexity constraint.

\hat{\beta_r} \begin{cases} argmin(\beta) \hspace{.1cm}\sum_{i=1}^{N} \epsilon_i^2  \\  \sum_{j=1}^{p}|\beta_j|^2 \leq c \end{cases}

To solve the constrained optimization, Lagrange multipliers can be employed. Let z equal the Residual Sum of Squares (RSS) plus the penalty term to be minimized:

argmin(\beta) \hspace{.3cm}  z= (y-x\beta)^T(y-x\beta)+\lambda(\sum_{j=1}^{p}|\beta_j|^2-c)

This can be rewritten in terms of the L2 norm of β:

z = (y-x\beta)^T(y-x\beta)+\lambda||\beta||^2_2

Taking the derivative with respect to β and solving:

0 = \frac{\partial z}{\partial \beta} = -2x^T(y-x\beta)+2\lambda\beta

x^Ty = x^Tx\beta+\lambda\beta=(x^Tx+\lambda I)\beta

\hat{\beta}_{r} = (x^Tx+\lambda I)^{-1}x^Ty

Remember that the Gauss-Markov theorem states that the OLS estimate for regression coefficients is the BLUE, so by using ridge regression we are sacrificing some benefits of OLS estimators in order to constrain estimator variance. Estimators constructed using ridge regression are in fact biased; this can be proven by calculating the expected value of the ridge regression coefficients:

E[\hat{\beta_r}]=(I+\lambda(x^Tx)^{-1})^{-1}\beta \neq \beta
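To see the closed-form estimator and this bias-variance tradeoff numerically, here is a small numpy sketch; the correlated predictors, the true coefficients, and the value of λ are all made up for illustration:

import numpy as np

rng = np.random.default_rng(2)
n = 50
# two highly correlated, standardized predictors and a response built from beta = [1, 1]
x1 = rng.standard_normal(n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, 1.0]) + rng.standard_normal(n)
y = y - y.mean()

lam = 5.0
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)                      # (x'x)^-1 x'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)  # (x'x + lambda*I)^-1 x'y
print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)   # shrunk (biased), but far more stable across re-sampled data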

For a scenario with two predictors, the tradeoff between reduced model complexity and increased bias in the estimators can be visualized graphically by plotting the estimators of the two beta values against each other. The vector of beta values estimated by regression is represented as a point on this plot (\hat{\beta}=[\beta_1, \beta_2]). In Figure 1, \hat{\beta}_{OLS} is plotted in the upper right quadrant and represents the estimator that produces the smallest RSS possible for the model. The ellipses centered around \hat{\beta}_{OLS} are contours of increasing RSS produced by other combinations of β1 and β2. The circle centered around the origin represents the chosen level of model complexity that constrains the ridge regression. The ridge estimator is the point where this circle intersects an RSS contour. Notice that as the value of c increases, the error introduced into the estimators decreases, and vice versa.

ridge_regression

Figure 1: Geometric interpretation of a ridge regression estimator. The blue dot indicates the OLS estimate of Beta; ellipses centered around the OLS estimate represent RSS contours for each Beta 1, Beta 2 combination (denoted here as z from the optimization equation above). The model complexity is constrained by the distance c from the origin. The ridge regression estimator of Beta is shown as the red dot, where the RSS contour meets the circle defined by c.

The c value displayed in Figure 1 is only presented to explain the theoretical underpinnings of ridge regression. In practice, c is never specified; rather, a value for λ is chosen prior to model construction. Lambda is usually chosen through a process known as k-fold cross validation, which is accomplished through the following steps (a small sketch in Python follows the list):

  1. Partition the data set into K separate sets of equal size
  2. For each k = 1 … K, fit the model with the candidate \lambda_0, excluding the kth set.
  3. Predict for the kth set
  4. Calculate the cross validation error (CVerror) for the kth set: CV^{\lambda_0}_k = E[\sum(y-\hat{y})^2]
  5. Repeat for different values of \lambda_0 and choose the \lambda_0 that minimizes the average error CV^{\lambda_0} = \frac{1}{K}\sum_{k=1}^{K}CV^{\lambda_0}_k
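Below is a hedged scikit-learn sketch of this procedure on synthetic data (none of the numbers come from a real application; Ridge and KFold are just one convenient way to implement the steps above):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# toy data: two correlated, standardized predictors and a centered response
rng = np.random.default_rng(3)
x1 = rng.standard_normal(100)
X = np.column_stack([x1, 0.9 * x1 + 0.44 * rng.standard_normal(100)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, 1.0]) + rng.standard_normal(100)
y = y - y.mean()

def cv_error(lam, k=5):
    """Average k-fold cross-validation error for a given ridge penalty."""
    errs = []
    for train, test in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = Ridge(alpha=lam, fit_intercept=False).fit(X[train], y[train])
        errs.append(np.mean((y[test] - model.predict(X[test]))**2))
    return np.mean(errs)

lambdas = np.logspace(-3, 3, 25)
best_lam = min(lambdas, key=cv_error)
print("lambda chosen by 5-fold CV:", best_lam)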

Lasso Regression

Another Tolerant Method for dealing with multicollinearity, known as Least Absolute Shrinkage and Selection Operator (LASSO) regression, solves the same constrained optimization problem as ridge regression, but uses the L1 norm rather than the L2 norm as the measure of complexity.

\hat{\beta}_{Lasso} \begin{cases} argmin(\beta) \hspace{.1cm}\sum_{i=1}^{N} \epsilon_i^2 \\ \sum_{j=1}^{p}|\beta_j|^1 \leq c \end{cases}

LASSO regression can be visualized similarly to ridge regression, but since c is defined by the sum of the absolute values of beta, rather than the sum of squares, the area it constrains is diamond shaped rather than circular. Figure 2 shows the selection of the beta estimator from LASSO regression. As you can see, the use of the L1 norm means LASSO regression selects one of the predictors and drops the other (weights it as zero). This has been argued to provide more interpretable estimators (Tibshirani 1996).

lasso_regression

Figure 2: Geometric interpretation of the LASSO regression estimator. The blue dot indicates the OLS estimate of Beta; ellipses centered around the OLS estimate represent RSS contours for each Beta 1, Beta 2 combination (denoted as z from the optimization equation). The model complexity is constrained by the diamond defined by the L1 norm. The LASSO estimator of Beta is shown as the red dot, the location where the RSS contour intersects the diamond defined by c.
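As a quick numerical illustration of this selection behavior, here is a scikit-learn sketch on synthetic, nearly collinear predictors (again, made-up data; the penalty values are arbitrary):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# toy data: x2 is nearly collinear with x1 and carries no extra information (true beta = [1, 0])
rng = np.random.default_rng(4)
n = 1000
x1 = rng.standard_normal(n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.standard_normal(n)
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, 0] + rng.standard_normal(n)
y = y - y.mean()

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.8, fit_intercept=False).fit(X, y)
print("ridge coefficients:", ridge.coef_)   # both predictors retained, coefficients shrunk
print("lasso coefficients:", lasso.coef_)   # the redundant predictor is typically driven to (near) zero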

Final thoughts

If you’re creating a model with multiple predictors, it’s important to be cognizant of the potential for multicollinearity within your data set. Tolerant methods are only one of many possible remedies for multicollinearity (other notable techniques include data clustering and Principal Component Analysis), but it’s important to remember that no known technique can truly “solve” the problem of multicollinearity. The method used to deal with multicollinearity should be chosen on a case-by-case basis, and multiple methods should be employed if possible to help identify the underlying structure within the predictor data set (Dormann et al. 2013).

Citations

Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D. and Lautenbach, S. 2013, “Collinearity: a review of methods to deal with it and a simulation study evaluating their performance.” Ecography, 36: 27–46. doi:10.1111/j.1600-0587.2012.07348.x

Ohlemüller, R. et al. 2008. “The coincidence of climatic and species rarity: high risk to small-range species from climate change.” Biology Letters. 4: 568 – 572.

Tibshirani, Robert 1996. “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society. Series B (Methodological): 267-288.

Alluvial Plots

We all love parallel coordinates plots and use them all the time to display our high-dimensional data and tell our audience a good story. But sometimes we may have large numbers of data points whose tradeoffs’ existence, or lack thereof, cannot be clearly verified, or the data to be plotted is categorical and therefore awkwardly displayed in a parallel coordinates plot.

One possible solution to both issues is the use of alluvial plots. Alluvial plots work similarly to parallel coordinates plots, but instead of having ranges of values on the axes, each axis contains bins whose sizes depend on how many data points belong to that bin. Data points that fall within the same categories in all axes are grouped into alluvia (stripes), whose thicknesses reflect the number of data points in each alluvium.

Next are two examples of alluvial plots, the first displaying categorical data and the second displaying continuous data that would normally be plotted in a parallel coordinates plot. After the examples, there is code available to generate alluvial plots in R (I know, I don’t like using R, but creating alluvial plots in R is easier than you think).

Categorical data

The first example (Figure 1) comes from the CRAN page for the alluvial package. It uses alluvial plots to display data about all of the Titanic’s passengers and crew, grouping them into categories according to class, sex, age, and survival status.

alluvial_cran

Figure 1 – Titanic passenger/crew data. Yellow alluvia correspond to survivors and gray to the deceased. The size of each bin represents how many data points (people) belong to that category in a given axis, while the thickness of each alluvium represents how many people fall within the same categories in all axes. Source: https://cran.r-project.org/web/packages/alluvial/vignettes/alluvial.html.

Figure 1 shows that most of the passengers were male and adults, that the crew represented a substantial share of the people on board, and that, unfortunately, there were more deceased than survivors. We can also see that a substantial number of the people on the boat were male adult crew members who did not survive, which can be inferred from the thickness of the grey alluvium that goes through all of these categories. The lack of alluvia connecting the Crew and Child bins also shows that (obviously) there were no children among the crew. It can also be seen that 1st class female passengers were the group with the greatest survival rate (100%, according to the plot), while 3rd class males had the lowest (roughly 15%, comparing the yellow and gray alluvia for 3rd class males).

Continuous data

The following example shows the results of policy modeling for a fictitious water utility using three different policy formulations. Each data point represents the modeled performance of a given candidate policy in six objectives, one in each axis. Given the uncertainties associated with the models used to generate this data, the client utility company is more concerned about whether or not a candidate policy would meet certain performance criteria according to the model (Reliability > 99%, Restriction Frequency < 20%, and Financial Risk < 10%) than about the actual objective values. The utility also wants to have a general idea of the tradeoffs between objectives.

Figure 2 was created to present the modeling results to the client water utility. The colored alluvia represent candidate policies that meet the utility’s criteria, and the grey alluvia represent those that do not. The continuous raw data used to generate this plot was categorized into ranges whose values are meaningful to the client utility, with the best performing bin always placed at the bottom of the plot. It is important to notice that the height of a bin represents the number of policies that belong to that bin, meaning that the position of the gap between two stacked bins does not represent a value on an axis, but rather the fraction of the policies that belong to each bin. It can be noticed from Figure 2 that it is relatively difficult for any of the formulations to meet the Reliability > 99% criterion established by the utility. It is also striking that a remarkably small number of policies from the first two formulations, and none of the policies from the third formulation, meet all of the criteria established by the utility. It can also be easily seen, by following the alluvia across the plot, that the vast majority of the solutions with smaller net present costs of infrastructure investment obtained with all three formulations perform poorly in the reliability and restriction frequency objectives, which denotes a strong tradeoff. The fact that such tradeoffs can be seen even when the former axis is on the opposite side of the plot from the latter two is a remarkable feature of alluvial plots.

tradeoffs_all3.svg.png

Figure 2 – Alluvial plot displaying modeled performance of candidate long-term planning policies. The different subplots show different formulations (1 in the top, 3 in the bottom).

The parallel coordinates plots in Figure 3 display the same information as the alluvial plot in Figure 2. It can be readily seen that the analysis performed above, especially when it comes to the tradeoffs, is more easily done with Figure 2 than with Figure 3. However, if the actual objective values were important for the analysis, Figure 3 would be needed, either by itself or in addition to Figure 2, with the latter likely used as a pre-screening step or for a higher-level analysis of the results.

tradeoffs_all-svg

Figure 3 – Parallel coordinates plot displaying modeled performance of candidate long-term planning policies. The different subplots show different formulations (1 in the top, 3 in the bottom).

The R code used to create Figure 1 can be found here. The code below was used to create Figure 2. The packages “alluvial” and “dplyr” need to be installed before attempting to use the provided code, for example using the R command install.packages(package_name). Also, the user needs to convert their continuous data into categorical data, so that each row corresponds to a possible combination of bins in all axes (one column per axis) plus a column (freqs) representing the frequency with which each combination of bins is seen in the data.

# Example datafile: snippet of file "infra_tradeoffs_strong_freqs.csv"
Reliability, Net Present Cost of Inf. Investment, Peak Financial Costs, Financial Risk, Restriction Frequency, Jordan Lake Allocation, freqs
2<99,0<60,0<25,0<10,2>20,0<70,229
0>99,2>60,0<25,0<10,2>20,0<70,0
2<99,2>60,0<25,0<10,2>20,0<70,168
0>99,0<60,2>25,0<10,2>20,0<70,0
2<99,0<60,2>25,0<10,2>20,0<70,3
0>99,2>60,2>25,0<10,2>20,0<70,2
2<99,2>60,2>25,0<10,2>20,0<70,45
0>99,0<60,0<25,2>10,2>20,0<70,0
2<99,0<60,0<25,2>10,2>20,0<70,317
0>99,2>60,0<25,2>10,2>20,0<70,0
2<99,2>60,0<25,2>10,2>20,0<70,114
# load packages and prepare data
library(alluvial)
library(dplyr)

itss <- read.csv('infra_tradeoffs_strong_freqs.csv')
itsw <- read.csv('infra_tradeoffs_weak_freqs.csv')
itsn <- read.csv('infra_tradeoffs_no_freqs.csv')

# preprocess the data (convert to data frames)
itss %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
summarise(n = sum(freqs)) -> its_strong
itsw %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
summarise(n = sum(freqs)) -> its_weak
itsn %>% group_by(Reliability, Restriction.Frequency, Financial.Risk, Peak.Financial.Costs, Net.Present.Cost.of.Inf..Investment, Jordan.Lake.Allocation) %>%
summarise(n = sum(freqs)) -> its_no

# setup output file
svg(filename="tradeoffs_3_formulations.svg",
width=8,
height=8,
pointsize=18)
p <- par(mfrow=c(3,1))
par(bg = 'white')

# create the plots
alluvial(
its_strong[,1:6],
freq=its_strong$n,
col = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_strong$Financial.Risk == "0<10", "blue", "grey"),
border = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_strong$Financial.Risk == "0<10", "blue", "grey"),
# border = "grey",
alpha = 0.5,
hide=its_strong$n < 1
)
alluvial(
its_weak[,1:6],
freq=its_weak$n,
col = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_weak$Financial.Risk == "0<10", "chartreuse2", "grey"),
border = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_weak$Financial.Risk == "0<10", "chartreuse2", "grey"),
# border = "grey",
alpha = 0.5,
hide=its_weak$n < 1
)
alluvial(
its_no[,1:6],
freq=its_no$n,
col = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_no$Financial.Risk == "0<10", "red", "grey"),
border = ifelse(its_strong$Reliability == "0>99" &
its_strong$Restriction.Frequency == "0<20" &
its_no$Financial.Risk == "0<10", "red", "grey"),
# border = "grey",
alpha = 0.5,
hide=its_no$n < 1
)
dev.off()

Customizing color matrices in matplotlib

In this post I intend to pass on some tricks for matplotlib color matrix customization. I am guilty of having beautified some of my color matrices with Adobe Illustrator in the past (re-arranging labels, titles, colormaps, etc.). However, this time I had to generate way too many of them, and I could see the beautifying process becoming extremely painful. I will simply demonstrate how to generate the following three plots simultaneously with relatively few lines of code, in the hopes of providing useful elements for your own plot customization.

plot1a.png

plo2.png

plt3.png

Plots 1-3 were generated with the following script, which I will explain in detail later in this post:

import glob
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


#Listing your files
files = glob.glob('./attainment_matrices/*.out')

#Organizing your files
data_plot1=[np.genfromtxt(f) for f in files[8:12]]
data_plot2=[np.genfromtxt(f) for f in files[0:4]]
data_plot3=[np.genfromtxt(f) for f in files[16:20]]
data=[data_plot1,data_plot2,data_plot3]

#Organizing titles and labels
plot_titles=['Plot 1','Plot 2', 'Plot 3']
subplot_titles= ['Subplot 1','Subplot 2', 'Subplot 3','Subplot 4']
labels= ['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
y_labels= ['Y Title a$\longrightarrow$','Y Title b $\longrightarrow$','Y Title c $\longrightarrow$']
cmap_labels=['Colormap label a$\longrightarrow$', 'Colormap label b$\longrightarrow$', 'Colormap label c$\longrightarrow$']

# Some variables to adjust subplots if necessary
left = 0.125 # the left side of the subplots of the figure
right = 0.9 # the right side of the subplots of the figure
bottom = 0.3 # the bottom of the subplots of the figure
top = 0.82 # the top of the subplots of the figure
wspace = 0.2 # the amount of width reserved for blank space between subplots
hspace = 0.5 # the amount of height reserved for white space between subplots

#Font sizes
plot_fontsize=40
subplot_fontsize=32
tick_label_fontsize=22 # Ticks, colormap, x and y labels use this fontsize

#x-label adjustments
rotation= 45 # rotation of labels
adjust=0 #if you want the x labels to be displayed right at the middle then adjust=0.5

x=np.arange(0,5.5)
y=np.linspace(0,100,1001)

#colormaps
colormap=['Set3_r', 'YlGnBu','Paired']

# the j is the iteration variable for each subplot, and the l is the iteration variable
# for each plot.
for l in range(len(plot_titles)):
    fig, ax=plt.subplots(1,len(subplot_titles),sharey=True)
    plt.subplots_adjust(left=left, bottom=bottom, right=right, top=top, wspace=wspace, hspace=hspace)
    #setting the titles wrapped by a transparent grey box at position=(x,y)
    fig.suptitle(plot_titles[l], fontsize=plot_fontsize,
                 bbox={'facecolor':'grey', 'alpha':0.1, 'pad':12}, position=(0.1827, .95))

    for j in range(len(subplot_titles)):
        a= ax[j].pcolor(x,y,data[l][j], cmap=colormap[l])
        ax[j].set_title(subplot_titles[j], fontsize= subplot_fontsize, y=1.03)
        #Set the y-label only in the first subplot
        ax[0].set_ylabel(y_labels[l], fontsize=tick_label_fontsize)
        ax[j].set_xticks(x + adjust, minor=False)
        #ax[j].set_xlim(left=0, right=5)
        #ax[j].set_ylim(0,100)
        ax[j].set_xticklabels(labels[:], rotation=rotation)
        ax[j].tick_params(labelsize=tick_label_fontsize)

    #colorbar settings:
    leftc= 0.12504
    bottomc=.13
    width_c=.775
    height_c=0.04
    cbar_ax= fig.add_axes([leftc,bottomc,width_c,height_c])
    #cbar= fig.colorbar(a, cax=cbar_ax, orientation='horizontal')
    cbar = fig.colorbar(a,cax=cbar_ax, ticks=[0, 0.5, 1], orientation='horizontal')
    cbar.ax.set_xticklabels(['Low', 'Medium', 'High'])
    cbar.set_label(cmap_labels[l], fontsize=tick_label_fontsize, labelpad=25)
    cbar.ax.tick_params(labelsize=tick_label_fontsize)

plt.show()

First, in lines 1 through 4 I specify the required libraries. I use glob.glob to list the files for the analysis, with their full paths, in line 8. Then, if you want to see the order in which the files are listed, you can simply run the print command as follows:

print(files[:])

And you should be able to see the order of the files like so:

[‘./data_directory/file1.out’, ‘./data_directory/file10.out’, … ‘./data_directory/file24.out’]

I used the numpy genfromtxt function in lines 11-13 to load the data from the specified files while organizing the data that would be used in plot 1, plot 2 and plot 3.   I then made an array of the previous data on line 14 so I could use it in a loop later on.

I organized the titles of the main plots, the subplots, the x and y labels, as well as the colormap labels in lines 17-21. All the parameters required to adjust the aspect ratio of the subplots are listed in lines 24 to 29. If you simply want all of your subplots to be square, you can add the aspect='equal' parameter directly in the plt.subplot() function.

The font sizes for the plots, subplots, ticks and labels are specified in lines 32 to 34. The x-labels can be adjusted in multiple ways. In line 37 I set the rotation of the x-labels to 45 degrees; if you want the labels to be completely vertical, you would do rotation=90, and if you want horizontal labels, you don’t need to specify a rotation parameter at all. Then, I used the adjust variable to specify the position of the x-labels: adjust=0 specifies that each label will be written starting at the left corner of its bar, while adjust=0.5 centers the label.

In line 44, I list the different colormaps to be used by each plot. The outer loop in line 48 iterates through the 3 plots, while the inner loop in line 55 iterates through the 4 subplots generated in each plot. In line 49 we specify the number of rows and columns of subplots that will be generated; I want them to share the y axis, hence sharey=True. If you want your subplots to also share the x axis, you would simply add sharex=True in line 49. The plt.subplots_adjust function in line 50 allows you to specify the exact aspect ratio of your subplots, including the white space between them and their location in the figure canvas; this is detailed in lines 24 to 29. In line 52, I specify the title of the plot as a whole; since I have three different plots, I loop through each of the different titles. The title is shown in a grey transparent box in the upper left corner of the canvas, at the location specified by position=(x,y).

Lines 56 to 64 show the subplots’ code. I use the pcolor function to generate the color matrices; however, there are other methods to create them, such as pcolormesh, imshow, contour, etc. In line 57 I loop through the subplot titles and assign their font size. Here, y=1.03 specifies the distance from the subplot title to the plot: the more distance you want, the larger this y value should be. In line 59 I set the y-label; since I only want the y-label to be shown in the leftmost plot, I fix ax[0].set_ylabel(…). If you want each subplot to have its own y-label, you can loop through each of them with the subplot iteration variable j, as in ax[j].set_ylabel(…). Lines 61 to 62 (commented out) show how you could set the x and y axis limits. In line 63, I set the x tick labels; similarly, you could set the y tick labels if necessary. The font size across all the ticks is set in line 64.

The colorbar settings are shown in lines 67 through 76. Observe how you can specify the position of the bottom left corner of the colorbar, and from there you can assign its width and height. Note that there are a couple of ways to specify the colorbar: the first one is shown in line 72 and will generate a colorbar with the default ticks. However, if you want to customize or add text to your colorbar, you would have to do so as shown in lines 73-74. The ticks parameter in line 73 specifies the positions where the labels written in line 74 are displayed. You can set the colorbar label with .set_label. I loop through the colormap labels for each plot and assign their font size in line 75. The labelpad allows you to specify the distance between the colorbar and its label. Finally, the font size of the colormap ticks is specified in line 76.

I hope you can find some of the previous elements useful when designing your own color matrices ;).