Setting up Eclipse for C/C++

IDEs are tools that make code development a lot easier, especially if your project has multiple files, classes, and functions. However, setting up the IDE can sometimes be as painful as developing complex code without one. This post presents a short tutorial on how to install and configure Eclipse for C/C++ on Windows 7 in a (hopefully) fairly painless manner. This tutorial is sequenced as follows:

  1. Installation
    1. Downloading the Java Runtime Environment.
    2. Downloading the GCC compiler.
    3. Downloading Eclipse.
  2. First steps with Eclipse
    1. Setting up a template (optional)
    2. Creating a new project
    3. Including libraries in your project

INSTALLATION

Downloading the Java Runtime Environment

To check if you have the Java Runtime Environment installed, go to java.com with either Internet Explorer or Firefox (Chrome will block the plugin) and click on “Do I have Java?”. Accept running all the plugins and, if the website tells you that you do not have Java, download and install it from the link displayed on the website.

Downloading the GCC compiler

After the check is done, you will have to download the GCC compiler, which can be done from http://www.equation.com. On the side menu, there is a link to Programming Tools, which, when expanded, shows a link to Fortran, C, C++. Click on this link and download the right GCC version for your system (32/64 bit), as shown in the following screenshot.

DownloadGCC

After downloading it, double click on the executable, accept the license, and type “c:\MinGW” as the installation directory. This is important because it is the first folder where Eclipse will look for the compiler on your computer. Proceed with the installation.

Downloading Eclipse

Now it is time to download and install Eclipse. Go to the Eclipse download website and download Eclipse IDE for C/C++ Developers. Be sure to select the right option for your computer (Windows, 32-bit/64-bit); otherwise Eclipse may not install, and even if it does, it will not run afterwards. If you are unsure about which version to download, this information can be found at Control Panel -> System by looking at System type.

Download

After downloading it, extract the file contents to “C:\Program Files\eclipse” (“C:\Program Files (x86)\eclipse” if installing the 32-bit version) so that everything is organized. Note that for this you will need to start WinRAR or any other file compression program with administrative privileges. This can be done by right clicking the name of the program on the start menu and clicking on Run as Administrator.

Now, go to C:\Program Files\eclipse and double click on eclipse.exe to open eclipse. In case you get an error message saying, among other things:

Java was started but returned exit code=13
...
...
-os win32
-ws win32
...

then delete the whole eclipse folder, go back to the Eclipse download page, download the 32-bit version of Eclipse, and extract it as previously described. You should not see the same error again when trying to run eclipse.exe.

Now that Eclipse is up and running, it is time to use it.

FIRST STEPS WITH ECLIPSE

The first thing eclipse will do is ask you to choose a workspace folder. This is the folder where all your code projects will be stored. It should not matter too much which folder you choose, so using the default is probably a good idea.

Setting up templates (optional)

It is helpful to create a code template in order to avoid retyping the same standard piece of code every time you create a new file or project. Many scientific codes have similar includes (such as math.h and stdio.h), and all of them must have a main function (as any C++ program must). If we create a code template with a few common includes and the int main function, we can simply tell Eclipse to add these to a new .cpp file when creating a new project.

In order to create the mentioned template, go to Window -> Preferences. There, under C/C++ -> Code Style on the left panel, click on Code Templates. Under Configure generated code and comments, expand Files -> C++ Source File, and then click on New. Choose a meaningful name for your template (I chose “Cpp with main”) and type a short description. After that, copy and paste the template below under “Pattern”.

/*
File: ${file_name}

Author: ${user}
Date: ${date}
*/

#include <iostream>
#include <string>
#include <math.h>
#include <stdio.h>
#include <string.h>

using namespace std;

int main()
{
    // Your code here.

    return 0;
}

Note that ${file_name}, ${date}, and ${user} are variables, which means that they will be replaced by your file’s actual data. To see a list of the other variables that can be inserted in your template, click on Insert Variable…. Click OK and OK again and your template will be ready to be used!

Configuring_template

Creating a new project

Click on File -> New -> C++ Project. Under Project type choose Empty Project, then under Toolchains choose MinGW GCC, and, finally, type “project1” as your project name and click on Finish.

New_project

After your project is created, click on File -> New -> Source File. Type “say_something.cpp” (no quotes and do not forget the .cpp after the file name) as the name of your source file and choose the template you created as the template. The window should then look like this:

New_file

Click on Finish. If you used the template, replace the comment “// Your code here.” with “cout << "Yay, it worked!" << endl;”. Your code should look like the snippet below. If you have not created the template, just type the following code into your file.

/*
File: say_something.cpp

Author: bct52
Date: Jun 26, 2015
*/

#include <iostream>
#include <string>
#include <math.h>
#include <stdio.h>
#include <string.h>

using namespace std;

int main()
{
    cout << "Yay, it worked!" << endl;

    return 0;
}

Now, build the code by clicking on the small hammer above the code window and, after the project is built, click on the run button (a green circle with a white play sign in the center). If everything went well, your window should look like the screenshot below, which means your code compiled and runs as expected.

Project1_run

Including libraries in your project

When developing code, oftentimes other people have already developed pieces of code that perform some of the intermediate steps we want our code to perform. These pieces of code are often publicly available in the form of libraries. Therefore, instead of reinventing the wheel, it may be better to simply use a library.

Some libraries are comprised of only one or a few files and can be included in a project simply by dragging the files into the Eclipse project. Others, however, are more complex and must be installed on the computer and then called from the code. The procedure for the latter case will be described here, as it is the more general one. The process of installing and using the Boost library with MinGW (GCC) will be used as a case study.

The first step is downloading the library. Download the Boost library from here and extract it anywhere in your computer, say in C:\Users\my_username\Downloads (it really doesn’t matter where because these files will not be used after installation is complete).

Now it is time to install it. For this:

    1. Hold the Windows keyboard button and press R, type “cmd”, and press enter.
    2. On the command prompt, type “cd C:\Users\bct52\Downloads\boost_1_58_0” (or the directory where you extracted boost to) and press enter.
    3. There should be a file called bootstrap.bat in this folder. If that is the case, run the command:
      bootstrap.bat mingw
    4. In order to compile Boost to be used with MinGW, compile Boost with the gcc toolset. You will have to choose an installation directory for Boost, which WILL NOT be the same directory where you extracted the files earlier. In my case, I used C:\boost. For this, run the command:
      b2 install --prefix=C:\boost toolset=gcc

      Now go read a book or work on something else because this will take a while.

Now, if the installation completed with just warnings, it is time to run a code example from Boost’s website that, of course, uses the Boost library. Create a new project called “reveillon” and add a source file to it called “days_between_new_years.cpp” following the steps from the “Creating a new project” section. There is no need to use the template this time.

You should now have a blank source file in front of you. If not, delete any text/comments/codes in the file so that the file is blank. Now, copy and paste the following code, from Boost’s example, into your file.

 /* Provides a simple example of using a date_generator, and simple
   * mathematical operatorations, to calculate the days since
   * New Years day of this year, and days until next New Years day.
   *
   * Expected results:
   * Adding together both durations will produce 366 (365 in a leap year).
   */
  #include <iostream>
  #include "boost/date_time/gregorian/gregorian.hpp"

  int
  main()
  {
    
    using namespace boost::gregorian;

    date today = day_clock::local_day();
    partial_date new_years_day(1,Jan);
    //Subtract two dates to get a duration
    days days_since_year_start = today - new_years_day.get_date(today.year());
    std::cout << "Days since Jan 1: " << days_since_year_start.days()
              << std::endl;
    
    days days_until_year_start = new_years_day.get_date(today.year()+1) - today;
    std::cout << "Days until next Jan 1: " << days_until_year_start.days()
              << std::endl;
    return 0;
  };

Note that line 9 (“#include "boost/date_time/gregorian/gregorian.hpp"”) is what tells your code exactly what is being used from Boost. Line 15 (“using namespace boost::gregorian;”) saves you from having to type boost::gregorian every time you want to use one of its functions.

However, the project will still not compile in Eclipse because Eclipse still does not know where to look for the Boost library. This will require a couple of simple steps:

  1. Right click on the project (reveillon), under the Project Explorer side window, then click on Properties. Under C/C++ Build->Settings, click on Includes under GCC C++ Compiler. On the right there should be two blank boxes, the top one called Include paths (-I) and the other called Include files (-include). Under Include paths (top one), add the path “C:\boost\include\boost-1_58” (note that this path must reflect the path where you installed Boost as well as which version of Boost you have). This is where the compiler will look for the header file specified in the code with the #include statement.
  2. The compiled library files themselves must be included through the linker. This step is necessary only if you are using a compiled library. For this, on the same window, click on Libraries under MinGW C++ Linker. Add the path to the Boost libraries folder to the Library search path (-L) (bottom box). This path will be “C:\boost\lib” (again, if you installed Boost in a different folder your path will be slightly different). Now the actual compiled library must be added to the Libraries (-l) (top box). First, we need to figure out the name of the compiled library file used in the code. In this case, it is the file “libboost_date_time-mgw51-mt-d-1_58.a”. Therefore, add boost_date_time-mgw51-mt-d-1_58 (no lib prefix, no .a suffix, and be sure to match the name of your file) to Libraries (-l). Click OK and OK again.

Now compile the code by clicking on the hammer button and run the code by clicking on the play button. Below is a screenshot reflecting both steps above as well as the expected output after running the program.

configuring_library

That’s it. After your model is in good shape and it is time to run it with Borg (or another optimization algorithm), just change your “int main()” to a function with your model’s name and the arguments Borg expects, add the standard Borg main, and change the makefile accordingly. Details on how to do all this for Borg will be explained in a future post.

On constraints within MOEAs

When thinking about setting up a problem formulation for an MOEA like Borg, it’s helpful to spend a minute (or 15!) talking about our friend the constraint.  I’ve written about problem formulations in the past, and we even have a video about problem formulations, but there’s a subtlety that I thought would be helpful to discuss again here!

A comment about constraints in classical mathematical programming formulations

Classical mathematical programming formulations are chock-full of constraints.  Hundreds of constraints, even!  This is because, when you are setting up a linear program (LP) or something similar, the set of constraints is what gives the program some amount of realism.  The classical linear programming water allocation problem (for example, something you would find in the classic Revelle et al. textbook) is something like:

Decisions: water allocations to different actors in different periods of time

Objective: minimize cost of the allocation plan, or maximize the net benefits of the plan

Constraints:

  1. Ensure that demands are met

  2. Ensure that, at each supply node, inflow equals outflow

  3. Ensure that the capacity of the conveyance system is not exceeded

  4. Ensure that all decision variables are greater than or equal to zero

  5. Keep the water quality values above or below some threshold

We have proposed 5 categories of constraints, and there may be even more.  But mathematically, this simple formulation above could have hundreds of constraints, if you have to solve each equation for a particular point in time, or at a particular spatial location.
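
To make this concrete, here is a minimal sketch of what such a formulation might look like in symbols. The notation (allocations x, costs c, demands d, flows q, capacities u, and a water quality threshold) is an illustrative placeholder of mine, not a formulation taken from any particular textbook:

\begin{align*}
\min_{x} \quad & \sum_{i}\sum_{t} c_{i}\, x_{i,t} && \text{(cost of the allocation plan)} \\
\text{s.t.} \quad & x_{i,t} \ge d_{i,t} && \text{(demands are met)} \\
& \sum_{\text{inflows}} q_{n,t} = \sum_{\text{outflows}} q_{n,t} && \text{(mass balance at each supply node } n\text{)} \\
& x_{i,t} \le u_{i} && \text{(conveyance capacity not exceeded)} \\
& x_{i,t} \ge 0 && \text{(non-negativity)} \\
& w_{n,t} \le \bar{w} && \text{(water quality kept within its threshold)}
\end{align*}

Each constraint line above multiplies out over every user i, node n, and time step t, which is exactly how a "simple" formulation balloons into hundreds of constraints.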

This is perfectly fine!  If you are solving a LP, I encourage you to put more constraints in there because constraints are really the only way that you can make your solutions look more physically accurate.  LP theory will tell you that, if there is one or more optimal solutions, at least one of them will be at the intersection of two or more of the constraints.  So, the constraints are important (more on this later).

How the simulation-optimization of MOEAs is different

The MOEA approach we talk about on this blog is a simulation-optimization approach.  What that means, is that instead of having a set of a few hundred constraints to model your problem, you are actually using a simulation model of the system to model it.  This is going to fundamentally change the way that your decisions and constraints are defined, and it may even change your objectives.  Let’s talk about each in turn, using our simple water allocation example as, well, as an example.

Decisions: It would probably make more sense to optimize rules about how the water is allocated, instead of actually treating each decision variable as a separate volume of water over time.  There are a number of reasons why this is.  In the formulation above, if each decision variable is an allocation of water in a particular period of time, once you start to plan over long planning horizons, the problem will get unwieldy, with thousands of variables.  Plus, if you say that decision variable x_83 is the allocation of water to user 5, at t = 3, then how are you to plan what happens if you don’t have accurate or deterministic data for what happens at t = 3?  But that’s really the purpose of another post.  We’re here to talk about:

Constraints: So, in the above formulation, constraints were designed to create physical reality to your system.  For example, if the decision variables are volumes of water, and you have a basic estimation of the concentration of some pollutant, you can create a constraint that says the number of kilograms of a pollutant should be less than a particular value.  In that fashion, you can think of hundreds of constraints, and we just talked about this, right?

In the simulation-optimization approach, though, the simulation model provides the physical reality of the system, not the constraint set.  So it is better to treat constraints as limits on acceptable performance.  For example, if you are calculating reliability, you can say that all solutions should have at least some lower threshold of reliability (say, 98%).  Or don’t have constraints at all and instead just handle the value as an objective!  The point of doing tradeoff analysis with an MOEA is to find a diversity of solutions, after all.  If you find that the performance values are too poor, or that you have too many solutions, you can impose constraints next time.  Or you can change the epsilon resolution within Borg in order to change the number of points in the tradeoff analysis; see this paper by Kollat and Reed, where they showed the same tradeoff set at different resolutions, for example.

Constraints can also help with the representation of decisions within your formulation.  After all, the most basic constraint in any LP is that xi >= 0, right?  But think about creative ways to change the representation of decision variables such that all of the algorithm’s solutions are feasible.  This will greatly help the search process!

For example, consider a problem where you have to create drought restriction plans at different levels.  In other words, imagine that the value of restriction at level 2 needs to be greater than the value at level 1.  Mathematically, you could set this up as a constraint (x2 >= x1), but it will be very difficult to meet, especially with an initially random population within an MOEA, and especially when you have, say, 5 or 10 or 100 of these levels that each need to be greater as you go up.  So, instead, you can do something like this.  Let xi′ be the decision variable in the algorithm, and xi (without a prime) be the actual decision variable used in your model.  Then, let:

x1 = x1′

x2 = x1 + x2′

The algorithm actually searches on x1′ and x2′.  Now, if you set your bounds up correctly, the algorithm can never generate an infeasible solution, and you are spending all your computational resources within search actually finding good, feasible solutions instead of just trying to find a single set of feasible solutions with which to work.  Something similar to this was used in Zeff et al. (2014).
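
As a minimal sketch of this idea (an illustration only, not the actual implementation from Zeff et al.), the cumulative transform might look like the following in Python, where the increments array plays the role of the primed variables the algorithm searches on:

import numpy as np

def to_restriction_levels(increments):
    """Map the algorithm's decision variables (the primed xi') to the
    monotonically non-decreasing levels the model actually uses,
    i.e. x_i = x1' + x2' + ... + xi'."""
    return np.cumsum(increments)

# The algorithm proposes non-negative increments within its bounds...
increments = np.array([0.10, 0.05, 0.20])
# ...and the model always sees levels that satisfy x1 <= x2 <= x3.
levels = to_restriction_levels(increments)  # approximately [0.10, 0.15, 0.30]

The design choice is that feasibility of the ordering is guaranteed by construction, so the algorithm never wastes evaluations on solutions that violate the monotonicity requirement.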

Some comments on how constraint handling works within Borg

The concepts of non-domination, and epsilon non-domination, are used to determine whether solutions survive within the search process.  Informally, one solution dominates another if its objective function values are at least as good in every objective and strictly better in at least one; a non-dominated solution is one that no other solution dominates.  But that’s not what we’re here to talk about!  We are here to talk about constraints, remember? 🙂

Well, remember that the definition of a feasible and infeasible solution is the same in LP-land as it is for us.  Feasible solutions meet all constraints; infeasible solutions violate one or more constraints.  An infeasible solution is defined to be unacceptable to the decision maker.  This is an extremely important point to remember, so it is worth repeating!  An infeasible solution is unacceptable.  Because of this, it’s up to you to set up the constraints to keep acceptable solutions and throw away unacceptable (infeasible) solutions.

However, there is an important distinction between, say, an LP and what we do.  Remember the simplex algorithm for solving LPs?  It starts from a corner of the feasible space (defined by your constraint set) and then walks from corner to corner, moving whenever an adjacent corner improves the objective, until it reaches an optimal one.  Notice what’s going on here, though.  Most of the time, at least for a simple problem, it is easy to find a feasible solution in the simplex method, and then you are off to the races.  If you can’t find a feasible solution, you’re in a lot of trouble!

Similarly, you also need to find one feasible solution to get started with MOEA search…preferably you’ll have many many feasible solutions!  But what if it is difficult to find a feasible solution? This is going to impact your ability to find solutions, and if your function evaluation count, or search duration, is not long enough you can have a failed search.

So let’s get to the point of this section, which is explaining how constraint handling works within Borg.  As explained in Reed et al., most of the algorithms we’ve typically tested within this blogosphere use restricted tournament selection. There are three conditions under which a solution a can dominate a solution b:

  1. Both solutions are feasible, and based on objective function performance, a dominates b

  2. Solution a is feasible, and solution b is infeasible.

  3. Both a and b are infeasible, and the sum of a’s constraint violations is less than the sum of b’s constraint violations

So as you can see, item 3 is very interesting!  What item 3 means is that Borg or these other algorithms will actually keep some infeasible solutions in the population or archive while all solutions are infeasible.  This is important because it helps move the population toward eventually finding a feasible region of the decision space!  But, according to item 2, any solution that is feasible will beat an infeasible solution, which is consistent with the fact that infeasible solutions are unacceptable, and oh gosh I already said that a few times didn’t I!
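
To make those three rules concrete, here is a minimal sketch of the comparison logic in Python (just an illustration of the conditions listed above, not Borg’s actual source code):

def dominates(a_objs, b_objs):
    """Standard Pareto dominance for minimization: a is at least as good as b
    in every objective and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a_objs, b_objs)) and
            any(x < y for x, y in zip(a_objs, b_objs)))

def constrained_better(a_objs, a_violation, b_objs, b_violation):
    """Return True if solution a beats solution b under the three rules above.
    a_violation and b_violation are summed constraint violations (0 = feasible)."""
    if a_violation == 0 and b_violation == 0:
        # 1. Both feasible: compare objective function performance.
        return dominates(a_objs, b_objs)
    if a_violation == 0 and b_violation > 0:
        # 2. a is feasible and b is infeasible: a wins.
        return True
    if a_violation > 0 and b_violation > 0:
        # 3. Both infeasible: the smaller total violation wins.
        return a_violation < b_violation
    return False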

By the way, you can also use penalty functions to handle constraints, instead of using this built in restricted tournament selection feature.  An example of that is in the Kollat and Reed paper that I already linked to.  Plus there are other ways to do this too!  By no means is this an exhaustive discussion, but I hope what we’ve talked about is helpful.  Anyway:

Some informal recommendations on how to use constraints within Borg if you choose to use them

  1. If you don’t have to use constraints, don’t use them!  Optimize without them and see how you do (see above).

  2. Make the constraint violations that you’re reporting to Borg meaningful.  Notice that item 3 in the restricted tournament selection actually does use the values that you report from constraint handling.  I’ve seen some folks just say: if infeasible, c = 100000000.  But that won’t actually help the restricted tournament mechanism move the set toward meaningful, feasible solutions.  For example, let’s say that r gives you a value of reliability for a system, and reliability must be above some required value R.  You could set your constraint violation to c = R – r, which will give you the difference between your performance and what you are required to have.  If all of your solutions are infeasible initially, the algorithm will favor a solution that has, say, 1% reliability lower than R over a solution that has 20% reliability lower than R.

  3. If you have an expensive simulation, skip the objective function calculations if the solution is infeasible.  The objective function values for an infeasible solution are not really used for anything in restricted tournament selection.  So you can report the constraint violation, but don’t spend time calculating objective functions that will never be used.  (A short sketch combining recommendations 2 and 3 follows below.)
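
The sketch below illustrates recommendations 2 and 3 together in the style of a Python evaluation function; the simulation call, the reliability threshold, the objective names, and the return convention are all illustrative assumptions of mine rather than part of the Borg interface:

R_MIN = 0.98          # required reliability (illustrative threshold)
PLACEHOLDER = 1.0e9   # dummy objective value reported for infeasible solutions

def evaluate(decision_vars):
    """Report a meaningful constraint violation and skip the expensive
    objective calculations when the solution is infeasible."""
    results = run_simulation(decision_vars)   # hypothetical (expensive) model call
    r = results.reliability

    if r < R_MIN:
        # Meaningful violation: how far we fall short of the requirement, so the
        # tournament can prefer "almost feasible" over "badly infeasible" solutions.
        constraint_violation = R_MIN - r
        # Skip the objective calculations, but still report placeholder values
        # so the optimization procedure keeps running.
        return [PLACEHOLDER, PLACEHOLDER], [constraint_violation]

    # Feasible: report the real objectives and zero violation.
    return [results.cost, -results.net_benefit], [0.0]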

Two quick (important) caveats here:  One, remember that infeasible solutions are unacceptable.  And the reason why I keep bringing this up is the fact that you can technically shoot yourself in the foot here, and impose constraints that are too strict, where a real decision maker may actually want solutions that violate your imposed constraint.  So be careful when you define them!  Two, if you do skip the objective function calculation be sure to still report a value of your objectives to Borg, even if that value is just a placeholder.  This is important, to make sure the Borg procedure keeps running.  If you have questions about these caveats please leave them in the comment box below.

To conclude

This has been quite a long post so thanks for sticking with me til the end.  Hopefully you’ve learned something, I know I did!  We can continue the discussion on this blog (for the contributors) and on the comment boxes below (for everyone)!  Have a great day.

Basic Borg MOEA use for the truly newbies (Part 2/2)

This post will walk you through a simple example of using the Borg MOEA to solve the 3-objective instance of DTLZ2.  I suggest reading Part 1 of this post first.  Once you have compiled the Borg MOEA, you are ready to follow this example.

The DTLZ2 problem, named after its creators Deb, Thiele, Laumanns, and Zitzler, is a very popular function for testing MOEA performance, since its Pareto-optimal solutions are analytically known and it is easily scalable to any number of objectives.

Step 1. Set the DTLZ2, 3 objective instance

In your borg folder, open the dtlz2.py script. You should see the following:


from sys import *
from math import *

nvars = 11
nobjs = 2
k = nvars - nobjs + 1

while True:
    # Read the next line from standard input
    line = raw_input()

    # Stop if the Borg MOEA is finished
    if line == "":
        break

    # Parse the decision variables from the input
    vars = map(float, line.split())

    # Evaluate the DTLZ2 problem
    g = 0

    for i in range(nvars-k, nvars):
        g = g + (vars[i] - 0.5)**2

    objs = [1.0 + g]*nobjs

    for i in range(nobjs):
        for j in range(nobjs-i-1):
            objs[i] = objs[i] * cos(0.5 * pi * vars[j])
        if i != 0:
            objs[i] = objs[i] * sin(0.5 * pi * vars[nobjs-i-1])

    # Print objectives to standard output, flush to write immediately
    print " ".join(["%0.17f" % obj for obj in objs])
    stdout.flush()

You can easily scale the number of objectives in the DTLZ2 function by modifying lines 4 and 5 of the code.  Since we want to solve the 3-objective configuration of the problem, we change nobjs and set it equal to 3.  In general, number of variables = number of objectives + 9; in this case, nvars = 12. Alternatively, if we wanted to work with the 5-objective instance, we would have nobjs = 5 and nvars = 14.


from sys import *
from math import *

nvars = 12
nobjs = 3
k = nvars - nobjs + 1

Once you have updated the file, save it and go back to your terminal.

Step 2. Run the DTLZ2 3 objective instance from your terminal

From your terminal  make sure you are in the borg folder and type the following command:

./borg.exe -v 12 -o 3 -e 0.01,0.01,0.01 -l 0,0,0,0,0,0,0,0,0,0,0,0 -u 1,1,1,1,1,1,1,1,1,1,1,1 -n 10000 python dtlz2.py > dtlz2.set

The -v and -o flags, as you probably guessed, are the number of decision variables and objectives, respectively. The -e flag stands for the epsilon precision values; here the same epsilon value is specified for all three objectives for simplicity, but remember that they don’t necessarily need to be the same across objectives.  The -l and -u flags are the decision variables’ lower and upper bounds.  These bounds can vary according to the problem.  The -n flag stands for the number of iterations; here we specify 10,000, and you can certainly increase this number.  In the last part of the command you call the dtlz2.py function; finally, the optimization results are saved in a separate file using the > “greater than” symbol.  Once you have run the previous command, you should see a new file in your borg folder called dtlz2.set.  Before we go on, type the following command in your terminal and see what happens:

./borg.exe -h

The -h flag provides information about the different options available.

Step 3. Select the objectives

Now, open your newly generated dtlz2.set file. You will see something like this:

# BORG version: 1.8
# Current time: Thu Jun 25 17:12:00 2015
# Problem: python dtlz2.py
# Number of variables: 12
# Number of objectives: 3
# Number of constraints: 0
# Lower bounds: 0 0 0 0 0 0 0 0 0 0 0 0
# Upper bounds: 1 1 1 1 1 1 1 1 1 1 1 1
# Epsilons: 0.01 0.01 0.01
# Directions: 0 0 0
# Constraint mode: E
# Seed: (time)
1 0.814090298840556947 0.409932057219915436 0.453114658489247812 0.602037485943714312 0.612211413314472264 0.63507815287255176 0.284561154879054812 0.319971228064652446 0.510268965514402262 0.565795181158015859 0.352339424454035932 2.00000000000000014e-17 7.00000000000000035e-17 1.1566219845629353
0.200664515777928404 0.658663937901916241 0.459011341616803015 0.456423615588035791 0.72948415840455394 0.502347551242693147 0.55253340411468177 0.523000325718170678 0.424118159653203652 0.418071947834685986 0.307354991375789255 0.667924369576233357 0.55237113861310283 0.929550654944727661 0.352579197090774399
0.600637603847835599 0.635510585462702893 0.537202278296969316 0.480501397259140872 0.538845435998899225 0.565887992485444302 0.540269661215032948 0.32823362137901485 0.548470138166351484 0.485323707827721162 0.661267336189765076 0.603296309515394924 0.342802362666870974 0.531842633344500104 0.872739730448630957
0.200664515777928404 0.962019489314716703 0.478360997918643505 0.456423615588035791 0.72948415840455394 0.508917943579420218 0.434628101329525951 0.523000325718170678 0.416896640814134301 0.418071947834685986 0.307354991375789255 0.667924369576233357 0.0645572417675578242 1.08080813229192496 0.353051662882838901
0.580252259871913978 0.626070509625966998 0.537503391458319713 0.47945571905554607 0.540634044436276051 0.565466312655102277 0.5375236610793378 0.336204078423532282 0.541224805744518811 0.481480141441438469 0.645345342922852394 0.597006852306095737 0.362763876068478097 0.544895812217087494 0.844603875086156197

This is only a snippet of what you should see; here only five solutions are shown (one solution per row).  For an epsilon of 0.01 you should see at least 100 solutions.  The first 12 columns in the file represent the 12 decision variables and the last 3 columns are the objectives.  In order to visualize the Pareto front, we want to select only the objectives.  To do so, type the following command:

awk 'BEGIN {FS=" "}; /^#/ {print $0}; /^[-]?[0-9]/ {printf("%s %s %s\n", $13,$14,$15)}' dtlz2.set > dtlz2.obj

The previous command is fully described in Joe’s post.  Once you run it you should see a new file called dtlz2.obj, with three columns, one for each of the objectives.

Step 4. Visualize the Pareto front

The Pareto front of the DTLZ2 problem is the portion of the unit sphere centered at the origin that lies in the non-negative octant (for three objectives, f1^2 + f2^2 + f3^2 = 1), obtained when the last k decision variables (the ones that appear in g) are equal to 0.5.  For the 3-objective case that we solved in this example, the Pareto-approximate front generated by the Borg MOEA is depicted in the figure below:

Featured image

With your results, attempt to generate a similar plot in the language of your choice.  Once you have visualized the Pareto Front, you will have completed this exercise!
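
If you would like a starting point, here is a minimal Python/matplotlib sketch that reads the dtlz2.obj file produced above (the header lines starting with # are skipped automatically) and draws a 3D scatter plot; treat it as a sketch to adapt rather than a polished script:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib versions

# Load the three objective columns; lines beginning with '#' are treated as comments.
objs = np.loadtxt("dtlz2.obj", comments="#")

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(objs[:, 0], objs[:, 1], objs[:, 2], s=10)
ax.set_xlabel("Objective 1")
ax.set_ylabel("Objective 2")
ax.set_zlabel("Objective 3")
plt.show()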

Click here to learn more about the DTLZ2 problem.


			
Basic Borg MOEA use for the truly newbies (Part 1/2)

The Truly Newbie Series brings you Basic Borg MOEA use.  Part 1 provides a quick overview of the Borg MOEA and helps you obtain the Borg MOEA source code and compile it.  Part 2 will take you by the hand through a basic Borg MOEA example, demonstrated with the 3-objective DTLZ2 test problem.

If you need  basic background on multi-objective evolutionary algorithms,  I strongly recommend reading Jon H’s post first.

1. What is the Borg Multiobjective Evolutionary Algorithm (MOEA)?

The Borg MOEA, developed by Dave Hadka and Pat Reed, is an iterative search algorithm for many-objective optimization.  Some currently available MOEAs are highly dependent on their search parameters, that is, their crossover, mutation, and selection operators, which are usually pre-specified and do not change throughout the run.  The Borg MOEA avoids sensitivity to search parameters by assimilating a suite of operators from existing MOEAs and adaptively using them based on their success throughout the search.  For instance, if an operator has a large contribution to the archive, it has a larger probability of being selected (the archive is basically where the top solutions are stored).

MOEAs also struggle with problems that have multiple false optima; that is, they are looking for the mountain and they get stuck on a hill.  The Borg MOEA avoids this problem by checking for progress in the archive over a fixed period of time: if no improvement is detected, this is a good indication that the MOEA is stuck in a local optimum, and a new set of solutions is injected to reinvigorate the search.  Borg can also shrink or enlarge its population to maintain proportionality to the archive size.  If the population size starts becoming smaller than the archive size, then it’s time to bring some new solutions into the mix to keep diversity.  In brief, checking for progress, adapting the population size, and adding new solutions help the algorithm continue to progress and prevent it from going around in circles for the entire runtime.

Finally, the sorting mechanism in some MOEAs checks all of the solutions against all of the other solutions, which can be extremely time-consuming, and these MOEAs may have trouble sorting when the number of solutions is exceedingly large, as is usually the case when dealing with a large objective count.  Alternatively, Borg allows the user to specify the precision for each of the objectives.  For instance, if revenue is one of our objectives, we might only care about measuring it to the nearest hundred dollars, instead of measuring it to the nearest dime.  This sorting procedure enhances diversity since it avoids storing solutions that are not significantly different from each other.  It also guarantees convergence to the best known set of tradeoffs and avoids wasting time when comparing solutions.

In summary, the Borg MOEA is a highly auto-adaptive optimization tool that assimilates characteristics from existing MOEAs while adding several innovative components that help it deal with a broad range of challenging multi-objective problems.  (You can probably infer at this point why it gets its scary Star Trek-referenced name.)
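
As a rough illustration of the precision idea mentioned above (a sketch of the epsilon-box concept only, not Borg’s implementation), each solution can be mapped to an "epsilon box", and two solutions falling in the same box are treated as not significantly different:

import math

def epsilon_box(objectives, epsilons):
    """Map a (minimization) solution to the index of its epsilon box."""
    return tuple(math.floor(f / eps) for f, eps in zip(objectives, epsilons))

# With a revenue-like objective measured to the nearest hundred dollars and a
# second objective measured to a precision of 0.01, these two solutions land
# in the same box, so only one of them would need to be kept.
print(epsilon_box([1234.0, 0.523], [100.0, 0.01]))   # same box
print(epsilon_box([1260.0, 0.527], [100.0, 0.01]))   # same box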

Note: I did not mean to murder the Borg MOEA with this overly-simplified explanation, please do read the original paper to learn more about its cool features and full details:

Hadka, D., and Reed, P.M. “Borg: An Auto-Adaptive Many-Objective Evolutionary Computing Framework.” Evolutionary Computation, 21(2):231-259, 2013.

2. How to get the Borg MOEA?

The Borg MOEA is freely available for academic use; you can request it at the following link:

borgmoea.org

Next, request access by completing the Get it! section.  The providers should contact you within three business days; in my experience, it takes them about three business minutes to get back to you.  You will receive a Bitbucket invitation by e-mail, which means that you will need to open a Bitbucket account first.  Accept the invitation, and then you should see the following window.  In the left menu, click Downloads; once you are in the Downloads folder, click Download Repository.

Featured image

Next, go to your downloads folder, look for the dmh309-serial-borg-moea-eb8f1ca9d731 folder, extract the files and select a destination.  In the folder, you should see the following files:

Featured image

3. Compiling Borg MOEA

To compile the Borg MOEA, you will need a compatible C compiler installed; gcc is highly recommended.  You will also need access to a Linux-like terminal.  If you don’t already have access to one, please refer to the terminal basics post for some guidance.  Next, navigate to the borg folder in your terminal.  You can see that a makefile is provided to help build the Borg MOEA code automatically: you just need to type the command make, and the three distributed executables will be generated.  In your terminal you can type the ls command to verify that borg.exe, dtlz2_advanced.exe, and dtlz2_serial.exe were created.  Once you have these three executables, this step is completed!

Featured image

Go to Basic Borg MOEA use for the truly newbies (Part 2/2)

Google Earth as a Visualization Tool for Water Resources Systems Planning (Part 2)

This post is part 2 of a multi-part series of posts that aims to describe ways in which Google Earth Pro (GEP) is useful as a visualization tool for water resources systems planning. In my previous blog post on this topic, I showed how GEP can be used to create visualizations of planned reservoir sites. (You can read the post here.) After you create a GEP visualization of interest, there are several ways to share that visualization with others. In this post I will show how you can create a tour of yourself exploring a visualization, and create and save a high quality video of that tour that you can send to others (or post in a blog, like I have done below). The following steps are required.

  1. Create a visualization (such as I did here), or pick a site you want to explore in GEP.

  2. Record a tour. Use the toolbar at the top of the GEP window to locate the “Record a tour” button (in the same place where you found the “Add a placemark” and “Add a polygon” buttons in the previous GEP post). A new little panel should appear in the lower left of the GEP window (see circled areas in image below), which will show you options to record a tour on your screen, including video (red button icon) and audio (microphone icon).

Pic1

When you are ready to begin creating your tour, click on the record button, and GEP will now record the places you are visiting. While GEP is recording, try visiting a few locations on your screen, or flying over your reservoir visualization. Note that if you want to capture a video of actions you are taking in GEP (such as clicking on menus) to create a tutorial for others, then it would be better to use a video screen capture tool, as GEP does not record such actions as part of a tour. When you are done creating your tour, press the record button again to stop recording. Replay the tour to make sure it looks OK before saving it. Click on the “save” button that now appears in the menu on the lower left, and GEP will give you options to save your video as a “Tour”. Save it, and it should now appear in the “places” menu on the left hand side of the GEP window.

  3. Create and save a high quality video of your tour. First, and this is important, close the little tour recorder menu on the lower left of your GEP window. Doing this will allow you to access GEP’s “movie maker” features. Go to Tools -> Movie Maker. In the window that appears, select a supported compression format. (I don’t know much about this, so I’ve just been using their default setting on this.) Also select the resolution you prefer. (The video I post below is the 1920×1080 (HD) option for resolution.) In the “Record from…” category, select “A saved tour”, and select the tour you just created from your dropdown menu. Browse for a location where you want to save the video, and then click the “Create Movie” button. This will create and save a video on your desktop. Your screen will appear something like this before you create your movie:

Pic2

  4. Post your video for others to see. I have done this below as an embedded YouTube video. If you’re wondering about different options for creating videos and posting them on a blog, I found Joe Kasprzyk’s previous post on this topic to be very helpful.

Google Earth as a Visualization Tool for Water Resources Systems Planning (Part 1)

This post is part 1 of a multi-part series of posts that aims to describe numerous ways in which Google Earth Pro (GEP) is useful as a visualization tool for water resources systems planning. (GEP is a virtual globe that lets you create and analyze geographic information). You can find Part 2 here.

Specifically, this post is designed to teach you how to create a visualization of a planned reservoir’s inundated surface area, given some basic information about a planned dam site. Dam development is on the rise in river basins around the world, particularly in Asia, Africa and South America. In conducting feasibility (or pre-feasibility) studies of dam development plans, it can be useful to prepare quick, simple visualizations of reservoirs.

Interestingly, Google recently eliminated what used to be a $400/year fee for GEP, so I highly suggest you use GEP instead of regular Google Earth (GE). GEP has several helpful features not available in GE, such as the ability to compute areas of polygons.

Before you try these exercises, I suggest you download/install the latest version of GEP from: https://www.google.com/earth/explore/products/desktop.html. On this website, scroll to the bottom of the page and click on “Download Google Earth Pro”. Agree to the terms, download and install. (At some point you may be asked for a license key. Assuming you do not have a key, use your email address and the key GEPFREE to sign in when prompted). I prepared these tutorials using version 7.1.2.2041 in Windows 7. Note that you can run both GE and GEP on your computer, so if you already have GE, I suggest you try this tutorial with GEP.

Suppose we have the following information about a planned dam (MRC 2014, see references section for data source):

a.) Dam coordinates:

Latitude: 15°58’30.00″N; Longitude: 106°55’50.16″E

b) Reservoir Full Supply Level (FSL) elevation (meters above sea level): 490 masl. Let’s assume the FSL represents the upper elevation of the impounded reservoir’s active storage zone.

Follow these steps to create a placemark for the location:

  1. First, create a placemark by clicking on the “Add placemark” button (a yellow pushpin graphic) in the toolbar at the top of the screen. A “new placemark” window will appear. Enter a name for the dam (in the screenshot below, I call it “MyDam”), the dam’s latitude and longitude coordinates as they appear above, and any description you wish to add. It doesn’t matter where you are in the world on your screen, GEP will locate the placemark wherever you tell it to.

(Hint: before you create the placemark, you may want to zoom in a bunch using the slider on the right, as the placemark’s default view will be the one you have when you create the placemark, which makes it harder to see the pin if you’re zoomed out too much).

Figure 1

Figure 2

  2. In the same “New placemark” window, switch to the “View” tab, and enter the dam’s coordinates again, then Click “OK”.

Figure 3

  3. The placemark should appear in your “Places” list on the left. Double click on your Dam icon to zoom in on the location of the dam. (See the Hint from Step 1 if GEP doesn’t zoom in at all when you click on your placemark). You can zoom in on the placemark manually using the slider on the right of the screen. Your screen should appear something like:

Figure 4

Next, you will need to create a polygon that will ultimately represent the reservoir’s surface area. Follow these steps to do so:

  1. Create a new polygon by clicking on the “Add polygon” button in the toolbar at the top of the screen. (Right next to the placemark button).

  2. In the box titled “Google Earth – New Polygon”, give your reservoir a name, such as “MyDam Reservoir”, and in the “Description” tab, add any description you wish to add.

Figure 5

  3. Switch to the “Style, Color” tab in the “Area” category to select a distinguishable color for the inundated area, such as blue.

Figure 6

  4. Use the cursor to drag over/highlight the area you want the polygon to cover (left click and hold down until you’re done covering the desired area), then Click OK. As is demonstrated in the image below, you will want to cover a fairly large area, because the reservoir you ultimately create will be a slice of your polygon at the FSL elevation. If your polygon is not large enough, your reservoir graphic may get cut off.

(Given that you don’t know what the reservoir looks like ahead of time, this can be a trial-and-error process). Note that if you hold the mouse over the GEP imagery, it lists elevation information in the lower right corner, so you can check elevations at upstream locations in the visible tributary and sub-tributary channels to see how far upstream a 490 m elevation might extend, to be sure your polygon covers that area.

Figure 7

If you make a mistake in drawing your polygon and wish to get rid of it and try again, you can right click on it and cut or delete it.

  5. Right click on your polygon, and select “properties”. Switch to the “Altitude” tab, and in the drop-down menu select the “Absolute” option. In the altitude box, enter the reservoir’s full supply level elevation (490 m).

Figure 8

  6. An approximate reservoir shape should now appear, and the polygon “MyDam Reservoir” should now appear in your “Places” list on the left.

Figure 9

If you zoom in on the reservoir, its appearance will take on a clearer form, as is shown below.

Figure 10

You may notice that the reservoir extends below the location of the dam. You can remedy this issue by editing the polygon’s control points. Right click on the polygon, and click “properties”. You will see the points appear in red (see image below). You can move some of the points up closer to the dam, as is shown below, to eliminate the blue polygon portion that extends below the dam.

Figure 11

Your polygon should now appear like this:

Figure 12

If you have managed to successfully follow all of these steps, you have created a simple visualization of what turns out to be the Xe Kong 5 dam/reservoir, planned to be constructed on the Xe Kong River in Laos, a tributary river to the Mekong River (MRC 2014).

Note that GEP may perform the process I’ve outlined here quite poorly in flat areas such as floodplains, where slight changes in elevation can create very large discrepancies in reservoir inundated area. This is why I selected a location characterized by relatively steep terrain.

References:

Mekong River Commission (MRC) (2014), Hydropower Project Database, Basin Development Plan Programme, Mekong River Comm., Vientiane, Lao PDR.

Using arcpy to calculate area-weighted averages of gridded spatial data over political units (Part 2)

Weather Data Processing

This is the second entry in a two-part blog on using arcpy, the Python package for performing ArcGIS spatial analysis. In the first part, I showed how to calculate area-weighted averages of gridded spatial data over political units using ZonalStatisticsAsTable. This method is only applicable to raw data in the form of an ArcGIS raster. Here I will show another method for when the data does not come in this form, like our weather data. Again, for this discussion, I will assume the data is being processed at the woreda level.

The temperature data used in our study was stored in a netCDF file and the precipitation data in several binary files. While the data is not stored in a GIS raster, it is gridded satellite data, so the first step was to create a polygon grid using arcpy and determine which grid cells corresponded to which locations in the raw data file. Next, this grid was intersected with the woreda shapefile to compute the percentage of each woreda’s area covered by each grid cell. Then, area-weighted average temperature and precipitation values could be computed for each woreda.

All of the temperature and precipitation processing can be run from makeTempDBF.py and processRainData.sh, respectively. The temperature and precipitation data are downloaded using downloadTempData.py and downloadRainData.py. The temperature data is stored in the netCDF file air.mon.mean.nc, which can be read in Python if you have the netCDF4 package.  The precipitation data is stored in separate binary files, one for each day of the year. The FORTRAN code convert_rfe_bin2asc.f provided by NOAA is used to convert the binary files to human-readable text files. This is done in the shell script with the following commands:

f95 convert_rfe_bin2asc.f -o convert_rfe_bin2asc

BINFILES=$(find . -name all_products.bin\*)

for BINFILE in ${BINFILES}
do
	./convert_rfe_bin2asc ${BINFILE} ${BINFILE}.txt
	rm ${BINFILE}
done

The temperature data is for the entire globe, and the precipitation data for all of Africa, so the first step in the processing is to determine which elements of the temp variable matrix in the netCDF file, and which rows of the daily precipitation text files, correspond to points in Ethiopia. I did this through arcpy with the function intersectGrid.py, shown below.

import arcpy as ap
import os
from convertDBFtoCSV import convertDBFtoCSV

def intersectGrid(AggLevel, workingDir, variable):
    '''Intersects the GHCN temperature data grid with the AggLevel = "Woreda" or "Kebele" shapefile'''

    #create grid shapefile
    Grid = workingDir + "\\All" + variable + "Grid.shp"
    if(os.path.exists(Grid)==False):
        if variable == "Temp":
            origin_coord = "-180 -90"
            nrows = "360"
            ncols = "720"
            polygon_width = "0.5 degrees"
        else:
            origin_coord = "-20.05 -40.05"
            nrows = "801"
            ncols = "751"
            polygon_width = "0.1 degrees"

        polygon_height = polygon_width
        ap.GridIndexFeatures_cartography(Grid, "", "", "", "", polygon_width, polygon_height, origin_coord, nrows, ncols)
        ap.DefineProjection_management(Grid,coor_system="GEOGCS['GCS_WGS_1984',DATUM['D_WGS_1984',\
        SPHEROID['WGS_1984',6378137.0,298.257223563]],PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]]")

        #add 3 or 4 fields to grid shapefile: latitude (LAT), longitude (LONG) and
        #for precipitation, row (row) of text file corresponding to each grid in the shapefile;
        #for temperature, row (row) and column (col) of netCDF file corresponding to each grid in the shapefile
        ap.AddField_management(Grid, "LAT", "DOUBLE", 7, 2, "", "", "", "", "")
        ap.AddField_management(Grid, "LONG", "DOUBLE", 7, 2, "", "", "", "", "")
        ap.AddField_management(Grid, "row", "SHORT", 6, "", "", "", "", "", "")
        if variable == "Temp":
            ap.AddField_management(Grid, "col", "SHORT", 5, "", "", "", "", "", "")

        #calculate lat and long fields
        expression1 = "float(!SHAPE.CENTROID!.split()[0])"
        expression2 = "float(!SHAPE.CENTROID!.split()[1])"
        ap.CalculateField_management(Grid, "LONG", expression1, "PYTHON")
        ap.CalculateField_management(Grid, "LAT", expression2, "PYTHON")

        #calculate row and col fields
        if variable == "Temp":
            Grid = calcTempFields(Grid)
        else:
            Grid = calcRainFields(Grid)

    #clip the grid to Ethiopia and convert its .dbf to a .csv for later use
    GridClip = workingDir + "\\" + variable + "GridClip" + AggLevel + ".shp"
    if AggLevel == 'Woreda':
        EthiopiaBorders = os.path.dirname(workingDir) + "\\Shapefiles\\WoredasAdindan.shp"
    elif AggLevel == 'Kebele':
        EthiopiaBorders = os.path.dirname(workingDir) + "\\Shapefiles\\Ethiopia Kebeles without Somali region.shp"

    ap.Clip_analysis(Grid, EthiopiaBorders, GridClip)
    dbf = GridClip[0:-4] + ".dbf"
    GridCSV = convertDBFtoCSV(dbf)

    #intersect the clipped grid with the woreda or kebele shapefile and project to Adindan
    GridIntersect = workingDir + "\\" + variable + AggLevel + "Intersect.shp"
    ap.Intersect_analysis([GridClip, EthiopiaBorders],GridIntersect)
    GridIntersectProject = GridIntersect[0:-4] + "Project.shp"
    ap.Project_management(GridIntersect,GridIntersectProject,out_coor_system="PROJCS['Adindan_UTM_Zone_37N',GEOGCS['GCS_Adindan',\
    DATUM['D_Adindan',SPHEROID['Clarke_1880_RGS',6378249.145,293.465]],PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]],\
    PROJECTION['Transverse_Mercator'],PARAMETER['False_Easting',500000.0],PARAMETER['False_Northing',0.0],\
    PARAMETER['Central_Meridian',39.0],PARAMETER['Scale_Factor',0.9996],PARAMETER['Latitude_Of_Origin',0.0],\
    UNIT['Meter',1.0]]",transform_method="Adindan_To_WGS_1984_1",in_coor_system="GEOGCS['GCS_WGS_1984',DATUM['D_WGS_1984',\
    SPHEROID['WGS_1984',6378137.0,298.257223563]],PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]]")

    #calculate area of intersection between grid and woreda or kebele shapefile after adding a field to store it
    ap.AddField_management(GridIntersectProject, "PartArea", "DOUBLE", 12, 6, "", "", "", "", "")
    expression = "float(!SHAPE.AREA@SQUAREKILOMETERS!)"
    ap.CalculateField_management(GridIntersectProject, "PartArea", expression, "PYTHON")

    #convert GridIntersect's .dbf to a .csv for later use
    dbf = GridIntersectProject[0:-4] + ".dbf"
    intersectCSV = convertDBFtoCSV(dbf)

    return intersectCSV, GridCSV

def calcTempFields(Grid):
    expression1 = "getCol(!LONG!)"
    code_block1 = """def getCol(long):
    if long > 0:
        return 2*(long+0.25)-1
    elif long < 0:
        return 2*(long-0.25)+720"""
    ap.CalculateField_management(Grid, "col", expression1, "PYTHON", code_block1)

    expression2 = "180-2*(!LAT!-0.25)-1"
    ap.CalculateField_management(Grid, "row", expression2, "PYTHON")

    return Grid

def calcRainFields(Grid):
    expression = "750*(10*(!LAT!+40))+10*(!LONG!+20)"
    ap.CalculateField_management(Grid, "row", expression, "PYTHON")

    return Grid

Lines 11-22 of this function define the location of the grid cell boundaries for the variable of interest (“Temp” or “Rain”). These are used to make a polygon shapefile called AllTempGrid.shp or AllRainGrid.shp in lines 23-25 using the functions ap.GridIndexFeatures_cartography and ap.DefineProjection_management. The first creates the grid, and the second defines the projection. Since the grid was formed from latitude and longitude coordinates, a geographic coordinate system was chosen for the projection.

Confession: I didn’t know what the proper arcpy command was for this projection, so I first did it in the interface. I used “ArcToolbox\Data Management Tools\Projections and Transformations\Define Projection,” chose AllTempGrid as my “Input Dataset or Feature Class” and then selected “Geographic Coordinate System\World\WGS1984” for the Coordinate System. After executing a command in the interface, you can find the corresponding Python code by opening “Results,” right-clicking on the command and selecting “Copy as Python Snippet.” See the screenshots below.

Figure 7, Figure 8, Figure 9, Figure 10

After defining the grid’s projections, the function ap.AddField_management is used in lines 30-34 to add columns to the attribute table of this shapefile for the latitude and longitude, and for the temperature grid, row and column of the temp variable matrix in the netCDF file corresponding to that grid point. For the precipitation data, just the row of the text files corresponding to that grid cell is added (all of the data is stored in the same column of the text files, but each row is a different point). The latitude and longitude values are calculated and populated in lines 37-40 using the function ap.CalculateField_management. The row (and for temperature, column) values are calculated in lines 43-46 using the same function, but with conditional statements.

The gridded shapefile is then clipped to the woreda shapefile in lines 49-55 to form TempGridClipWoreda.shp or RainGridClipWoreda.shp using ap.Clip_analysis. Clip_analysis operates on shapefiles, whereas Clip_management (which I used to process the soil data) operates on rasters. The database file of TempGridClipWoreda.shp/RainGridClipWoreda.shp is converted to a csv called GridCSV in lines 56-57 for later use. Finally, ap.Intersect_analysis is used to intersect the clipped shapefiles with the woreda shapefile to form TempWoredaIntersect.shp or RainWoredaIntersect.shp in lines 60-61. The intersections look like this:

Figure11

The last step is to determine the area of intersection between the woredas and the grid cells. This is done by adding a field for the “partial area,” or as I called it on line 71, “PartArea,” using ap.AddField_management. The woreda area given by the field “WorAreaKm2” is in square kilometers, so I wanted to calculate the partial area in square kilometers as well. In order to calculate areas, the projection must be a projected coordinate system, not a geographic coordinate system. The coordinate system of TempWoredaIntersect.shp and RainWoredaIntersect.shp is the same as the geographic coordinate system of the grid, so on line 63 it is converted to the projected coordinate system of the woreda shapefile, Adindan_UTM_ZONE_37N using the function ap.Project_management. This command was again found using the interface. The Output Coordinate System was found under “Projected Coordinate System\UTM\Africa\Adindan UTM Zone 37N.”

[Figure 12: the Project tool in the ArcGIS interface]
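
For reference, here is a hedged sketch of the same projection and area calculation in arcpy. The output name is illustrative, and 20137 is assumed to be the EPSG code for Adindan / UTM Zone 37N.

import arcpy as ap

# Project the intersection to a projected CRS so areas come out in metric units.
sr = ap.SpatialReference(20137)   # assumed EPSG code for Adindan / UTM Zone 37N
ap.Project_management("TempWoredaIntersect.shp",
                      "TempWoredaIntersectUTM.shp", sr)

# Add the partial-area field and fill it with each polygon's area in square kilometers.
ap.AddField_management("TempWoredaIntersectUTM.shp", "PartArea", "DOUBLE")
ap.CalculateField_management("TempWoredaIntersectUTM.shp", "PartArea",
                             "!shape.area@squarekilometers!", "PYTHON_9.3")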

Once the intersection shapefile is projected, its area is then calculated on lines 72-73. Finally, the database file of the intersected shapefile is converted to a csv in lines 76-77. The intersected shapefile’s csv gives the area of intersection of each grid cell with each woreda. This, along with GridCSV, is then used by the function findTempWeightMatrix.py (called on line 26 of makeTempDBF.py) or findRainWeightMatrix.py (called on line 22 of makeRainDBF.py) to determine the percentage of each woreda’s area that is covered by each temperature/precipitation grid cell. These percentages are stored in a matrix called WeightMatrix. The code for makeTempDBF.py is shown below.

import os
import numpy as np
from downloadTempData import downloadTempData
from intersectGrid import intersectGrid
from readTempData import readTempData
from findTempWeightMatrix import findTempWeightMatrix

def makeTempDBF(AggLevel):
    '''Calculates weighted average monthly temperature at AggLevel = "Woreda" or "Kebele"'''

    #Set the working directory
    workingDir = os.getcwd()

    #Download the monthly temperature data from NOAA GHCN CAMS
    DataFile = downloadTempData(workingDir)

    #Intersect the GHCN grid with the Woreda or Kebele shapefile and convert the .dbf to a .csv
    intersectCSV, GridCSV = intersectGrid(AggLevel, workingDir, "Temp")

    #Read in raw monthly temperature data at every GHCN grid cell and store it in the matrix 'allData'
    #dimensions are a x b x c where a = # of gridcells, b = 5 (lat, long, yr, mo, temp),
    #and c = # of months
    allData, gridCells = readTempData(DataFile, GridCSV)

    #Read in weights of each data point to each woreda or kebele
    WeightMatrix, ID = findTempWeightMatrix(AggLevel, intersectCSV, gridCells)
    Year = allData[0,2,:]
    Month = allData[0,3,:]

    #Calculate area-weighted average temperature data in each woreda or kebele
    Temp = np.dot(np.transpose(WeightMatrix),allData[:,4,:])

    #Write the data to 'WoredaTempDBF.csv' or 'KebeleTempDBF.csv'
    writeFiles(AggLevel, Temp, ID, Year, Month) 

    return None

def writeFiles(AggLevel, Temp, ID, Year, Month):
    if AggLevel == 'Kebele':
        f = open('KebeleTempDBF.csv','w')
        f.write('AggLevel,RK_CODE,Year,Month,Temp\n')
    elif AggLevel == 'Woreda':
        f = open('WoredaTempDBF.csv','w')
        f.write('AggLevel,WID,Year,Month,Temp\n')

    for i in range(np.shape(Temp)[0]):
        for j in range(np.shape(Temp)[1]):
            if AggLevel == 'Woreda':
                f.write('W,'+ str(ID[i]) + ',' + str(int(Year[j])) + ',' + str(int(Month[j])) + ',')
            elif AggLevel == 'Kebele':
                f.write('K,'+ str(ID[i]) + ',' + str(int(Year[j])) + ',' + str(int(Month[j])) + ',')
            f.write(str(Temp[i,j]))
            f.write('\n')

    f.close()

    return None

All of the raw data is read in by readTempData.py on line 23 of makeTempDBF.py and stored in a matrix allData. The area-weighted average temperature data can then be calculated from the dot product of allData and WeightMatrix, as shown on line 31. Finally, all of the area-weighted average woreda data is written to WoredaTempDBF.csv on line 34. A snippet of this CSV is shown below. The process is analogous for precipitation.

[Figure 13: a snippet of WoredaTempDBF.csv]
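
To make the dimensions of the weighting step on line 31 concrete, here is a toy version of that dot product with made-up numbers (three grid cells, two woredas, four months):

import numpy as np

# Each column of WeightMatrix is the fraction of one woreda's area covered by
# each grid cell, so every column sums to 1.
WeightMatrix = np.array([[0.7, 0.0],
                         [0.3, 0.4],
                         [0.0, 0.6]])
cellTemps = np.array([[20.0, 21.0, 19.5, 18.0],   # stands in for allData[:, 4, :]
                      [22.0, 23.0, 21.5, 20.0],
                      [25.0, 26.0, 24.5, 23.0]])
woredaTemps = np.dot(np.transpose(WeightMatrix), cellTemps)   # shape (2, 4)
print(woredaTemps)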

As you can see, data processing with GIS is much easier if the data come in the form of a GIS raster or shapefile like our soil data! If the soil data had been at a lower resolution, though, we probably would have had to use the same method as for the weather data, since the area-weighting is more accurate.

Using arcpy to calculate area-weighted averages of gridded spatial data over political units (Part 1)

Note: In order to run arcpy, you need an installation of ArcGIS

This is a two-part blog on using arcpy, the Python package for performing ArcGIS spatial analysis. Specifically, I will detail two methods of calculating area-weighted averages of gridded spatial data over political units.

Many of you know I’ve been working with several Cornell development economists and the Ethiopian Agricultural Transformation Agency (ATA) to develop a fertilizer profitability tool for targeting agricultural interventions (see more info here). For this project, we built a statistical crop yield model using publicly available soil and weather data, as well as household survey data from the Ethiopian Central Statistical Agency (CSA) Agricultural Sample Survey.  Because the Ag Sample Survey only reports the political unit in which a surveyed household is located, not its GPS coordinates, we needed to calculate area-weighted soil and weather data over these units.

The smallest political unit in Ethiopia is a “kebele,” and the next smallest, a “woreda” (see maps below; the kebele map does not include the Somali region, which is too arid for agriculture). I allowed the user to choose either of these to process the data over, to see which resulted in a more predictive statistical crop yield model. For this discussion, I’ve assumed averages are being calculated at the woreda level.

[Figure 1: maps of Ethiopia’s woredas and kebeles]

We obtained soil data from the African Soil Information Service (AfSIS) at 250 m resolution on more than 30 different soil characteristics, many of them at several different depths (see more info here). I retrieved the daily rainfall data at 0.1° resolution from NOAA here (with some help from Jon Herman!), and monthly temperature data at 0.5° resolution from NOAA here.

Because the soil and weather datasets were in different formats, they needed to be processed differently. In this part, I’ll discuss the soil data, which came in the form of tiffs covering all of Africa. These were the easiest to process since they came in an ArcGIS-readable format.

Soil Data Processing

The first step in processing the soil data is to import arcpy as ap and then turn on the Spatial Analyst extension with the command ap.CheckOutExtension("Spatial"). The data can be clipped to Ethiopia using the function ap.Clip_management. The average soil data in each political unit can then be calculated with the function ap.sa.ZonalStatisticsAsTable. The “.sa” means it is using the Spatial Analyst extension, which is why this needs to be turned on. More info on the Clip_management and ZonalStatisticsAsTable arcpy methods can be found on ESRI’s ArcGIS Resources site, as well as through their application in my code below.

It should be noted that, technically, Zonal Statistics as Table does not calculate an area-weighted average; it converts the polygon shapefile to a raster and then finds the average cell value of all raster cells in each polygon. This is very close to the weighted average if the cell size of the data being averaged is much smaller than the political unit it is being averaged over. Since the soil data are at such a high resolution (250 m), this method was deemed sufficient. In part because the weather data were at a much lower resolution, a different method was used to calculate area-weighted averages in that case.
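
For reference, a single zonal statistics call looks roughly like the following sketch; the raster and table names are placeholders, while the zone data and zone field match the woreda settings in the code below. The optional last argument can restrict the output to the mean; the code below omits it, so all statistics (including the mean) are written to the .dbf.

import arcpy as ap

ap.CheckOutExtension("Spatial")

# Placeholder file names; in processSoilData.py below these are built from the
# chosen AggLevel and the clipped soil tiffs.
ap.sa.ZonalStatisticsAsTable("WoredasWGS1984.shp", "WOREDANO_",
                             "soilProperty_250m_Clip.tif",
                             "soilProperty_250mWoreda.dbf",
                             "DATA", "MEAN")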

All of my codes can be found on GitHub, but the bulk of the processing is done in the function processSoilData.py, shown below. As you can see, I imported arcpy on line 2, turned on Spatial Analyst on line 17, clipped the data to Ethiopia on line 53, and found the average value in each political unit on line 58, but clearly there’s more to the code than that.

import os
import arcpy as ap
from convertDBFtoCSV import convertDBFtoCSV
from joinSoilCSVs import joinSoilCSVs
from downloadSoilData import downloadSoilData

def processSoilData(AggLevel):
    '''Calculates average soil characteristics at AggLevel = "Woreda" or "Kebele" and outputs them to WoredaSoilData.csv or KebeleSoilData.csv'''

    #set the working directory
    workingDir = os.getcwd()

    #Download all of the soil data
    downloadSoilData(AggLevel, workingDir)

    #Turn on the Spatial Analyst extension and define the field over which ZonalStatisticsAsTable will be calculated (Woreda or Kebele ID)
    ap.CheckOutExtension("Spatial")
    if AggLevel == 'Kebele':
        in_zone_data = os.path.dirname(workingDir) + "\\Shapefiles\\Ethiopia Kebeles without Somali region.shp"
        in_template_dataset = in_zone_data
        zone_field = "KebeleID"
    elif AggLevel == 'Woreda':
        in_zone_data = os.path.dirname(workingDir) + "\\Shapefiles\\WoredasWGS1984.shp"
        in_template_dataset = in_zone_data
        zone_field = "WOREDANO_"

    #Define the projection and change the working directory to the directory with all of the soil data folders
    latLongRef = "Coordinate Systems\Geographic Coordinate Systems\World\WGS 1984.prj"
    os.chdir(workingDir)
    directories = [f for f in os.listdir(os.getcwd()) if os.path.isfile(f) == False]

    for i in range(len(directories)):
        #Find all the tiffs with soil data in each soil characteristic folder
        os.chdir(workingDir + "\\" + directories[i])
        filelist = os.listdir(os.getcwd())
        tiffs = []
        clipTiffs = []
        for j in range(len(filelist)):
            name = filelist[j]
            if name[-8::] == '250m.tif':
                tiffs.append(name)
            elif name[-9::] == '_Clip.tif':
                clipTiffs.append(name)

        for j in range(len(clipTiffs)):
            clipTiffs[j] = os.getcwd() + "\\" + clipTiffs[j]

        for j in range(len(tiffs)):
            in_raster = os.getcwd() + "\\" + tiffs[j]
            out_raster = os.getcwd() + "\\" + tiffs[j][0:-4] + "_Clip.tif"
            #Clip the tiffs to Ethiopia if they haven't been already
            if out_raster not in clipTiffs:
                ap.Clip_management(in_raster, "#", out_raster, in_template_dataset, "#", 1)

            #Calculate Zonal Statistics of soil data at AggLevel
            in_value_raster = out_raster
            out_table = os.getcwd() + "\\" + tiffs[j][0:-4] + AggLevel + ".dbf"
            ap.sa.ZonalStatisticsAsTable(in_zone_data, zone_field, in_value_raster, out_table)

    #Convert the DBFs with all the AggLevel soil data to CSVs
    #Change the working directory to the directory with all the soil data folders
    os.chdir(workingDir)
    directories = [f for f in os.listdir(os.getcwd()) if os.path.isfile(f) == False]

    for i in range(len(directories)):
        #Find all the DBFs with soil data in each soil characteristic folder
        os.chdir(workingDir + "\\" + directories[i])
        filelist = os.listdir(os.getcwd())
        DBFs = []
        for j in range(len(filelist)):
            name = filelist[j]
            if name[-10::] == AggLevel + '.dbf':
                DBFs.append(name)

        #Convert the DBF to a CSV
        for j in range(len(DBFs)):
            convertDBFtoCSV(os.getcwd() + "\\" + DBFs[j])

    #Join the CSVs with all the AggLevel soil data into one CSV titled "WoredaSoilData.csv" or "KebeleSoilData.csv" depending on the AggLevel
    joinSoilCSVs(AggLevel, workingDir)

    return None

First, the data is downloaded from the ftp site using the function downloadSoilData.py, called on line 14. This function also unzips the data and stores the tiffs for each soil characteristic in a different directory (see code here for more details). Each directory can have multiple tiffs, since some of the soil characteristics are measured at several different depths. The tiffs are clipped to Ethiopia in lines 49-53 of processSoilData.py. The loop beginning on line 32 simply goes through all of the different directories, and the loop beginning on line 38 finds all of the clipped and unclipped tiffs in those directories.

[Figure 3]

The zonal statistics are calculated as described above and shown in lines 56-58 of processSoilData.py. Unfortunately, while the data is completely processed at this point, it’s stored in database files (.dbf), which can only be read by humans through ArcGIS or a database program like Access. Each characteristic and layer is also stored in a separate .dbf, which is not very convenient. I converted all of these database files to comma-separated files using the function convertDBFtoCSV.py (which still requires arcpy) on line 77 and then joined all of the CSVs of different soil characteristics and layers into one file using the function joinSoilCSVs.py on line 80 (which doesn’t require arcpy!). The loop beginning on line 65 simply goes through all of the soil directories, and the loop on line 70 finds all of the DBFs in those directories.
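
The DBF-to-CSV conversion itself is short. Here is a minimal sketch of such a conversion (the actual convertDBFtoCSV.py on GitHub may differ), using arcpy’s legacy search cursor and the Python 2.7 csv module that ships with ArcGIS:

import csv
import arcpy as ap

def convertDBFtoCSV(dbfPath):
    '''Sketch: dump a .dbf table to a .csv of the same name.'''
    csvPath = dbfPath[:-4] + ".csv"
    fields = [f.name for f in ap.ListFields(dbfPath)]
    with open(csvPath, "wb") as outFile:   # "wb" for the Python 2.7 csv module
        writer = csv.writer(outFile)
        writer.writerow(fields)
        rows = ap.SearchCursor(dbfPath)
        for row in rows:
            writer.writerow([row.getValue(f) for f in fields])
        del rows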

The final result is a table where the first column stores the woreda or kebele ID, and the remaining columns store the average values of each soil characteristic at each measured depth. Here is a snippet:

[Figure 4: a snippet of the final soil data table]

Click here for Part 2 on processing the weather data.