A visual introduction to data compression through Principal Component Analysis

Principal Component Analysis (PCA) is a powerful tool for creating parsimonious representations of a multivariate data set. In this post I'll code up an example from Dan Wilks' book Statistical Methods in the Atmospheric Sciences to visually illustrate the PCA process. All code can be found at the bottom of this post.

As with many of the examples in Dr. Wilks' excellent textbook, we'll be looking at minimum temperature data from Ithaca and Canandaigua, New York. Figure 1 is a scatter plot of the minimum temperature anomalies at each location for the month of January 1987.


Figure 1: Minimum temperature anomalies in Ithaca and Canandaigua, New York in January 1987

As you can observe from Figure 1, the two data sets are highly correlated; in fact, they have a Pearson correlation coefficient of 0.924. Through PCA, we can identify the primary mode of variability within this data set (its largest "principal component") and use it to create a single variable that describes the majority of the variation at both locations. Let x be the matrix of minimum temperature anomalies at the two locations. The eigenvectors (E) of the covariance matrix of x describe the primary modes of variability within the data set. PCA uses these eigenvectors to create a new matrix, u, whose columns contain the principal components of the variability in x.

u = xE

Each element of u is a linear combination of the original data, with the eigenvectors in E serving as weights. The first column of u corresponds to the eigenvector associated with the largest eigenvalue of the covariance matrix. Each successive column of u represents a different level of variability within the data set: u1 describes the direction of highest variability, u2 the direction of second-highest variability, and so on. The transformation resulting from PCA can be visualized as a rotation of the coordinate system (or change of basis) for the data set; this rotation is shown in Figure 2.
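To make the projection concrete, here is a minimal NumPy sketch of the same computation (the MATLAB script used for the actual figures is at the bottom of the post). The anomaly values are made up for illustration, and x is assumed to already be centered:

import numpy as np

# x: n-by-2 array of centered minimum temperature anomalies (Ithaca, Canandaigua);
# the rows below are made-up values for illustration only
x = np.array([[6.0, 7.8], [12.0, 8.8], [-14.0, -6.2], [3.0, 1.8]])

S = np.cov(x, rowvar=False)            # 2x2 covariance matrix
evals, E = np.linalg.eigh(S)           # eigenvalues ascending, eigenvectors in columns
E = E[:, np.argsort(evals)[::-1]]      # reorder so the first column is the leading eigenvector

u = np.dot(x, E)                       # principal components: u = xE
u1 = u[:, 0]                           # first PC, the direction of highest variability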


Figure 2: Geometric interpretation of PCA

As can be observed in Figure 2, each data point can now be described by its location along the newly rotated axes, which corresponds to its values in the newly created matrix u. The point (16, 17.8), highlighted in Figure 2, can now be described as (23, 6.6), meaning that it lies 23 units from the origin in the direction of highest variability and 6.6 units in the direction of second-highest variability. As shown in Figure 2, the question of how different each data point is from the mean can mostly be answered by looking at its u1 value.

Once transformed, the original data can be recovered through a process known as synthesis. Synthesis can be thought of as PCA in reverse: the elements of the original data set x can be approximated using the eigenvectors of the covariance matrix and the first principal component, u1.

\tilde{x} = \tilde{u}\tilde{E}^T

Where:

\tilde{x} is the reconstructed data set

\tilde{u} contains the principal components used for the reconstruction (in our case, the first column of u)

\tilde{E} contains the eigenvectors corresponding to the principal components used
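Continuing the made-up example above, a self-contained NumPy sketch of the single-PC synthesis step might look like this:

import numpy as np

x = np.array([[6.0, 7.8], [12.0, 8.8], [-14.0, -6.2], [3.0, 1.8]])  # made-up centered anomalies
evals, E = np.linalg.eigh(np.cov(x, rowvar=False))
E = E[:, np.argsort(evals)[::-1]]          # leading eigenvector in the first column
u1 = np.dot(x, E[:, 0])                    # first principal component

x_tilde = np.outer(u1, E[:, 0])            # rank-one synthesis: each row approximates an (Ithaca, Canandaigua) pair
print(np.abs(x - x_tilde).mean())          # average absolute reconstruction error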

For our data set, these reconstructions seem to work quite well, as can be observed in Figure 3.

Figure 3: Original and PCA-reconstructed minimum temperature anomalies at Ithaca and Canandaigua

Data compression through PCA can be a useful alternative to tolerant methods for dealing with multicollinearity, which I discussed in my previous post. Rather than running a constrained regression, one can simply compress the data set to eliminate sources of multicollinearity. PCA can also be a helpful tool for identifying patterns within a data set, or simply for creating more parsimonious representations of a complex set of data. The MATLAB code used to create the above plots can be found below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Ithaca_Canandaigua_PCA
% By: D. Gold
% Created: 3/20/17
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% This script will perform Principal Component Analysis on minimum
% temperature data from Ithaca and Canandaigua in January 1987, provided in 
% Appendix A of Wilks (2011). It will then estimate minimum temperature
% anomalies at both locations using the first principal component.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% create data sets
clear all

% data from Wilks (2011), Appendix A.1
Ith = [19, 25, 22, -1, 4, 14, 21, 22, 23, 27, 29, 25, 29, 15, 29, 24, 0,... 
 2, 26, 17, 19, 9, 20, -6, -13, -13, -11, -4, -4, 11, 23]';

Can = [28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23,... 
 13, 14, 28, 19, 19, 17, 22, 2, 4, 5, 7, 8, 14, 14, 23]';

%% center the data, plot temperature anomalies, calculate covariance and eigs

% center the data
x(:,1) = Ith - mean(Ith);
x(:,2) = Can - mean(Can);

% plot anomalies
figure
scatter(x(:,1),x(:,2),'Filled')
xlabel('Ithaca min temp anomaly ({\circ}F)')
ylabel('Canandaigua min temp anomaly ({\circ}F)')

% calculate covariance matrix and its corresponding eigenvalues & eigenvectors
S = cov(x(:,1),x(:,2));
[E, Lambda] = eig(S); % for a 2x2 symmetric matrix, eig returns all eigenpairs

% identify the eigenvector associated with the largest eigenvalue
[~, idx] = max(diag(Lambda)); % column of E holding the leading eigenvector

%% PCA
U = x*E(:,idx); % project the anomalies onto the leading eigenvector (first PC scores)

%% synthesis
x_syn = E(:,idx)*U'; % reconstructed values of x

% plot the reconstructed values against the original data
figure
subplot(2,1,1)
plot(1:31,x(:,1) ,1:31, x_syn(1,:),'--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Ithaca min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')
subplot(2,1,2)
plot(1:31, x(:,2),1:31, x_syn(2,:),'--')
xlim([1 31])
xlabel('Time (days)')
ylabel('Canandaigua min. temp. anomalies ({\circ}F)')
legend('Original', 'Reconstruction')

 

Sources:

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences. Amsterdam: Elsevier Academic Press.

Solving non-linear problems using linear programming

This week’s post comes from recent conversations we’ve had around the Reed group concerning tools to quickly solve (approximately) non-linear programming problems.  First, some context.

As part of a simulation model our group is building, a drinking water allocation sub-problem must be solved.   Figure 1 is a simplified example of the sort of problem we are solving.

network

Figure 1: Mock water distribution network

There are three utilities that each have a demand (d_{1}, d_{2}, and d_{3}). The utilities are connected via some infrastructure, as shown in Figure 1. When our total available water (R) is in excess of the demand (d_{1}+d_{2}+d_{3}), no rationing is needed. When we do need to ration, we want to allocate the water to minimize the percent supply deficits across the three utilities, where x_{1}, x_{2}, and x_{3} denote the allocations to each utility:

\min_{x_{1}, x_{2}, x_{3}} \; \frac{d_{1}-x_{1}}{d_{1}} + \frac{d_{2}-x_{2}}{d_{2}} + \frac{d_{3}-x_{3}}{d_{3}}

Equation 1

Subject to:

x_{1} + x_{2} + x_{3} \leq R

0 \leq x_{i} \leq d_{i}, \quad i = 1, 2, 3

(plus a capacity limit imposed by the distribution network)

The last constraint here describes a limitation of the distribution network. The real problem is much more complicated, but we needn't detail that here.

This problem needs to be solved thousands, or hundreds of thousands of times in each simulation, so we want any solution technique to be fast.  The natural solution is linear programming (LP), which can solve problems with tens of thousands of variables and constraints nearly instantaneously.

We won’t discuss LP in great detail here, except to say that LP requires an objective and constraints that are linear with respect to the decision variables.  These restrictive requirements significantly reduce the number of potential optimal solutions that must be searched.  By systematically testing and pivoting between these potential optimal solutions, the popular Simplex Algorithm quickly converges to the optimal solution.
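As a concrete illustration, here is a minimal sketch of the rationing LP in Equation 1 using scipy.optimize.linprog (not part of our simulation model). The demands, available water, and pipeline capacity are made-up numbers, and the network constraint is simply assumed to be a limit on the pipeline serving utilities 2 and 3:

import numpy as np
from scipy.optimize import linprog

# hypothetical numbers for illustration only
d = np.array([30.0, 20.0, 10.0])    # demands d1, d2, d3
R = 45.0                            # total available water (less than total demand, so we must ration)
C = 25.0                            # assumed capacity of the pipeline serving utilities 2 and 3

# objective: minimize sum_i (d_i - x_i)/d_i; dropping the constant, minimize -sum_i x_i/d_i
c = -1.0 / d

A_ub = np.array([[1.0, 1.0, 1.0],   # total allocation cannot exceed available water
                 [0.0, 1.0, 1.0]])  # assumed limit on the shared pipeline
b_ub = np.array([R, C])
bounds = [(0.0, d_i) for d_i in d]  # 0 <= x_i <= d_i

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x, (d - res.x) / d)       # allocations and the resulting percent deficits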

As stated in Equation 1, our rationing scheme is indifferent between imposing small deficits across all three utilities and imposing one large deficit on a single utility. For example, the objective value in Equation 1 is the same whether each utility has a deficit of 5%, or utility 1 has a deficit of 15% and utilities 2 and 3 have no deficit. In reality, many small deficits are likely preferable to one large one. So what are we to do?

We could square our deficits.  In that case, our rationing scheme will prefer small distributed deficits over one large deficit:

\min_{x_{1}, x_{2}, x_{3}} \; \left(\frac{d_{1}-x_{1}}{d_{1}}\right)^{2} + \left(\frac{d_{2}-x_{2}}{d_{2}}\right)^{2} + \left(\frac{d_{3}-x_{3}}{d_{3}}\right)^{2}

Equation 2

BUT, we can’t use LP to solve this problem, as our objective is now non-linear! There are non-linear programming algorithms that are relatively fast, but perhaps not fast enough.  Instead we could linearize our non-linear objective, as shown in Figure 2.

The strategy here is to divide a single allocation,  x_{1} for instance, into many decision variables, representing different ranges of the actual allocation x_{1}.  In each range, a linear segment approximates the actual quadratic objective function.  Any actual release x_{1} can be achieved by assigning the appropriate values to the new decision variables (k_{1}, k_{2}, and k_{3}), and the contribution to the objective function from that release can be approximated by:

f(x_{1}) \approx a k_{1} + b k_{2} + c k_{3}

Equation 3

Subject to:

x_{1} = k_{1} + k_{2} + k_{3}

0 \leq k_{i} \leq (\text{width of segment } i), \quad i = 1, 2, 3

If a more accurate description is needed, the range of x_{1} can be divided into more segments.  For our purposes just a few segments are probably sufficient.  A similar strategy can be adopted for x_{2} and x_{3}.  Of course the constraints from the original optimization problem would need to be translated into terms of the new decision variables.

Now we are adding many more decision variables and constraints, but this is unlikely to slow a modern LP algorithm too much; we are still solving a relatively simple problem.  BUT, how does the LP algorithm know to increase k_{1} to its maximum threshold before applying k_{2}?  Do we need to add a number of conditional constraints to ensure this is done properly?

It turns out we don't! Because our squared deficit curve in Figure 2 is monotonic and convex, we know the slopes of the linear segments making up the approximation are increasing (becoming less negative). Thus, in a minimization problem, the marginal improvement in the objective is highest for the k_{1} segment, followed by the k_{2} segment, followed by the k_{3} segment, and so on. In other words, a < b < c. For this reason, the algorithm will increase k_{1} to its maximum threshold before assigning a non-zero value to k_{2}, and so forth. No need for complicated constraints!
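To see this in action, here is a rough sketch, again with made-up demands and an assumed pipeline limit, that linearizes the squared percent-deficit objective for all three utilities and hands the resulting LP to scipy.optimize.linprog. Because the chord slopes of the convex curve become less negative from one segment to the next, the solver fills each utility's first segment before touching the next, with no ordering constraints needed:

import numpy as np
from scipy.optimize import linprog

# hypothetical numbers, as in the earlier sketch
d = np.array([30.0, 20.0, 10.0])    # demands
R = 45.0                            # available water
C = 25.0                            # assumed capacity of the pipeline serving utilities 2 and 3
m = 4                               # number of linear segments per allocation

def segment_slopes(d_i, m):
    # chord slopes of g(x) = ((d_i - x)/d_i)**2 over m equal segments of [0, d_i];
    # convexity makes them increase (become less negative) from one segment to the next
    edges = np.linspace(0.0, d_i, m + 1)
    g = ((d_i - edges) / d_i) ** 2
    return np.diff(g) / np.diff(edges), d_i / m

# objective coefficients: one chord slope per segment variable k_ij (constant terms dropped)
c, widths = [], []
for d_i in d:
    slopes, w = segment_slopes(d_i, m)
    c.extend(slopes)
    widths.extend([w] * m)
c = np.array(c)

n = len(c)
total = np.ones(n)                  # sum of all segments (the total allocation) <= R
pipe = np.zeros(n)
pipe[m:] = 1.0                      # segments belonging to utilities 2 and 3 share the pipeline
A_ub = np.vstack([total, pipe])
b_ub = np.array([R, C])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0.0, w) for w in widths])
k = res.x.reshape(len(d), m)        # segment values, one row per utility
x = k.sum(axis=1)                   # recovered allocations x_i = sum_j k_ij
print(x, (d - x) / d)               # allocations and the resulting percent deficits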

Now this is not always the case.  If the function were not monotonic, or if it were convex for a maximization, or concave for a minimization, this would not work.  But, this trick works for a surprising number of applications in water resources systems analysis!

If nothing else this simple example serves as a reminder that a little bit of thought in formulating problems can save a lot of time later!

Using Borg in Parallel and Serial with a Python Wrapper – Part 1

Simulation and optimization are frequently used to solve complex water resources and environmental systems problems. By itself, a simulation model begs the question “what to simulate?” Similarly, by itself, an optimization model begs the question “is the solution really best?” For this reason, simulation and optimization models are frequently coupled.

This blog post is part 1 of a multi-part series that will demonstrate how I have coupled a pure Python simulation model with the multi-objective evolutionary optimization algorithm Borg. In this post, I will show how you can access Borg’s serial and/or parallelized (master-slave) implementations through a Python wrapper (borg.py).

Please see this previous blog post for some background about Borg, and how to obtain it. My instructions below assume you have access to the Borg files.

In the setup I will describe below, Borg parameterizes and iteratively refines solutions (e.g., reservoir operating policies) to a problem, optimizing them in response to their simulated performance with respect to multiple objectives.

You will need the following Borg files (see link above for how to download these):

  • Serial (i.e., borg.c, libborg.so, etc.) and/or master-slave (i.e., borgms.c, libborgms.so, etc.) implementations of Borg, depending upon your ability to parallelize.
  • Python wrapper for Borg (borg.py), which will allow you to access Borg easily in Python.

You will need to create the following files yourself (I provide sample code below for these files):

  • example_sim_opt.py—A python module that should contain two main functions:
    1. A simulation caller, which takes decision variables and returns multi-objective performance. This function is called “Simulation_Caller” in the example_sim_opt.py code below.
    2. An optimization function, which calls the Borg MOEA through its python wrapper borg.py. This borg.py wrapper provides extensive docstring documentation regarding required arguments, returned values, etc., so I do suggest reading through the wrapper if you have questions (e.g., about the python data types of arguments and returns).

Note that the file and function names above are just example names. You can name the above files whatever you want. Just be sure to modify the code I provide below to reflect the new names.

A sample of code for example_sim_opt.py is as follows:

import numpy as np
import pysedsim # This is your simulation model
import platform  # helps identify directory locations on different types of OS

def Simulation_Caller(vars):
    '''
    Purpose: Borg calls this function to run the simulation model and return multi-objective performance.

    Note: You could also just put your simulation/function evaluation code here.

    Args:
        vars: A list of decision variable values from Borg

    Returns:
        performance: the policy's simulated objective values; a list containing one value for each objective.
    '''

    borg_vars = vars  # Decision variable values from Borg

    # Reformat decision variable values as necessary (e.g., cast borg output parameters as array for use in simulation)
    op_policy_params = np.asarray(borg_vars)
    # Call/run simulation model, return multi-objective performance:
    performance = pysedsim.PySedSim(decision_vars = op_policy_params)
    return performance

def Optimization():

    '''

    Purpose: Call this method from command line to initiate simulation-optimization experiment

    Returns:
        --pareto approximate set file (.set) for each random seed
        --Borg runtime file (.runtime) for each random seed

    '''

    import borg as bg  # Import borg wrapper

    parallel = 1  # 1= master-slave (parallel), 0=serial

    # The following are just examples of relevant MOEA specifications. Select your own values.
    nSeeds = 25  # Number of random seeds (Borg MOEA)
    num_dec_vars = 10  # Number of decision variables
    n_objs = 6  # Number of objectives
    n_constrs = 0  # Number of constraints
    num_func_evals = 30000  # Number of total simulations to run per random seed. Each simulation may be a Monte Carlo.
    runtime_freq = 1000  # Interval at which to print runtime details for each random seed
    decision_var_range = [[0, 1], [4, 6], [-1,4], [1,2], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1]]
    epsilon_list = [50000, 1000, 0.025, 10, 13, 4]  # Borg epsilon values for each objective

    # Where to save seed and runtime files
    main_output_file_dir = 'E:\output_directory'  # Specify location of output files for different seeds
    os_fold = Op_Sys_Folder_Operator()  # Folder operator for operating system
    output_location = main_output_file_dir + os_fold + 'sets'

    # If using master-slave, start MPI. Only do once.
    if parallel == 1:
        bg.Configuration.startMPI()  # start parallelization with MPI

    # Loop through seeds, calling borg.solve (serial) or borg.solveMPI (parallel) each time
    for j in range(nSeeds):
        # Instantiate borg class, then set bounds, epsilon values, and file output locations
        borg = bg.Borg(num_dec_vars, n_objs, n_constrs, Simulation_Caller)
        borg.setBounds(*decision_var_range)  # Set decision variable bounds
        borg.setEpsilons(*epsilon_list)  # Set epsilon values
        # Runtime file path for each seed:
        runtime_filename = main_output_file_dir + os_fold + 'runtime_file_seed_' + str(j+1) + '.runtime'
        if parallel == 1:
            # Run parallel Borg
            result = borg.solveMPI(maxEvaluations=num_func_evals, runtime=runtime_filename, frequency=runtime_freq)

        if parallel == 0:
            # Run serial Borg
            result = borg.solve({"maxEvaluations": num_func_evals, "runtimeformat": 'borg', "frequency": runtime_freq,
                                 "runtimefile": runtime_filename})

        if result:
            # This particular seed is now finished being run in parallel. The result will only be returned from
            # one node in case running Master-Slave Borg.
            result.display()

            # Create/write objective values and decision variable values to files in folder "sets", 1 file per seed.
            f = open(output_location + os_fold + 'Borg_DPS_PySedSim' + str(j+1) + '.set', 'w')
            f.write('#Borg Optimization Results\n')
            f.write('#First ' + str(num_dec_vars) + ' are the decision variables, ' + 'last ' + str(n_objs) +
                    ' are the ' + 'objective values\n')
            for solution in result:
                line = ''
                for i in range(len(solution.getVariables())):
                    line = line + (str(solution.getVariables()[i])) + ' '

                for i in range(len(solution.getObjectives())):
                    line = line + (str(solution.getObjectives()[i])) + ' '

                f.write(line[0:-1]+'\n')
            f.write("#")
            f.close()

            # Create/write only objective values to files in folder "sets", 1 file per seed. Purpose is so that
            # the file can be processed in MOEAFramework, where performance metrics may be evaluated across seeds.
            f2 = open(output_location + os_fold + 'Borg_DPS_PySedSim_no_vars' + str(j+1) + '.set', 'w')
            for solution in result:
                line = ''
                for i in range(len(solution.getObjectives())):
                    line = line + (str(solution.getObjectives()[i])) + ' '

                f2.write(line[0:-1]+'\n')
            f2.write("#")
            f2.close()

            print("Seed %s complete" % j)

    if parallel == 1:
        bg.Configuration.stopMPI()  # stop parallel function evaluation process

def Op_Sys_Folder_Operator():
    '''
    Function to determine whether operating system is (1) Windows, or (2) Linux

    Returns folder operator for use in specifying directories (file locations) for reading/writing data pre- and
    post-simulation.
    '''

    if platform.system() == 'Windows':
        os_fold_op = '\\'
    elif platform.system() == 'Linux':
        os_fold_op = '/'
    else:
        os_fold_op = '/'  # Assume unix OS if it can't be identified

    return os_fold_op
 

The following is an example of how you would submit a batch script on a Linux cluster to run a parallelized simulation-optimization experiment using the example_sim_opt.py and borg.py files. Note that in the parallelized version, simulations (i.e., “function evaluations”) are being run in parallel by separate processors.

You would need the following two files:

  1. example_sim_opt_caller.py. This is a python file that is used to call example_sim_opt.py
  2. example_sim_opt_batch_script.pbs. This is a batch script that runs example_sim_opt_caller.py in parallel on a cluster using Open MPI.

Example code for example_sim_opt_caller.py:


'''
Purpose: To initiate the optimization process, which will iteratively call the simulation model.
'''

import example_sim_opt  # Import main optimization module that uses borg python wrapper

# Call the Optimization() function from the example_sim_opt module
example_sim_opt.Optimization()

Example code for example_sim_opt_batch_script.pbs:


#PBS -l nodes=8:ppn=16
#PBS -l walltime=24:00:00
#PBS -j oe
#PBS -o pysedsim_output.out

cd $PBS_O_WORKDIR
source /etc/profile.d/modules.sh
module load openmpi-1.6.5-intel-x86_64/gnu
module load python-2.7.5
mpirun python example_sim_opt_caller.py

You could then run the above batch script with a command such as:

qsub example_sim_opt_batch_script.pbs

A simple allocation optimization problem in Platypus

For those of you who are not familiar with Project Platypus, it's a repository that supports a collection of Python libraries for multi-objective optimization, decision making, and data analysis. All the libraries are written in a very intuitive way and it is just so slick.

In this post I will focus exclusively on the platypus library, which supports a variety of multi-objective evolutionary algorithms (MOEAs), such as NSGA-II, NSGA-III, MOEA/D, IBEA, EpsMOEA, SPEA2, GDE3, OMOPSO, and SMPSO, along with a number of analysis tools and performance metrics (which we will be discussing in subsequent posts).

First you can install the entire framework by typing the following commands on your terminal:

git clone https://github.com/Project-Platypus/Platypus.git
cd Platypus
python setup.py develop

If you have trouble with this first step, please feel free to report it.  The library is still under development and it might be infested with some minor bugs.

I wanted to start this post with a classical allocation problem to illustrate how you would implement your own function and optimize it using platypus. This example, by the way, was inspired by Professor Loucks' public systems modeling class.

We have the following problem:

  1. We want to allocate coconuts to three different markets.
  2. The goal is to find the allocation that maximizes the total benefits.
  3. We have only 6 coconut trucks to distribute to the three markets.

So, let's find the best allocations using the NSGAII algorithm supported by platypus:


from platypus.algorithms import NSGAII
from platypus.core import Problem
from platypus.types import Real

class allocation(Problem):

    def __init__(self):
        super(allocation, self).__init__(3, 3, 1)
        self.types[:] = [Real(-10, 10), Real(-10, 10), Real(-10, 10)]
        self.constraints[:] = "<=0"

    def evaluate(self, solution):
        x = solution.variables[0]
        y = solution.variables[1]
        z = solution.variables[2]
        solution.objectives[:] = [6*x/(1+x), 7*y/(1+1.5*y), 8*z/(1+0.5*z)]
        solution.constraints[:] = [x+y+z-6]

algorithm = NSGAII(allocation())
algorithm.run(10000)

for solution in algorithm.result:
   print(solution.objectives)

Assuming that you already have Platypus installed, you should be able to import the classes specified in lines 1-3 of the code above. The first line, as you can tell, is where you import the MOEA of your choosing; make sure it's supported by platypus. In the second line, we import the Problem class that allows you to define your own function. You also need to import the type class to specify your decision variables' type; the library supports Real, Binary, Integer, etc. I define the allocation class in line 5. In line 8, I specify the number of decision variables (3), the number of objectives (3), and the number of constraints (1). The bounds of my decision variables are defined in line 9. The function in line 12 evaluates and stores the solutions.

Note that for solution.objectives, the objective functions are specified in a single vector, separated by commas. You can set constraints in line 17; if you have more than one constraint, they go in the same vector, separated by commas. In line 19, the problem is handed to the NSGAII algorithm, and in line 20 you set the number of function evaluations for the run; I used 10,000, but the right number really depends on the difficulty of your problem and your choice of algorithm. Finally, I print the objective values in line 23, but you can print the decision variables as well if you wish. As you can see, setting up your problem can be extremely easy. The names of the classes and functions in the library are very intuitive, and you can focus entirely on your problem formulation.