Water Programming Blog Guide (Part 2)

This guide is a continuation of the Water Programming Blog Guide (Part 1).

This second part of the blog guide will cover the following topics:

  1. Version control using git
  2. Generating maps and working with spatial data in python
  3. Reviews on synthetic streamflow and synthetic weather generation
  4. Conceptual posts

1. Version Control using git

If you are developing code, it’s worth the time to gain familiarity with git to maintain reliable and stable development.  Git allows a group of people to work together on large projects while minimizing the chaos when multiple people edit the same files.  It is also valuable for individual projects, as it allows you to keep multiple versions of a project, show the changes you have made over time, and undo those changes if necessary.  For a quick introduction to git terminology and functionality, check out Getting Started: Git and GitHub.  The Intro to git Part 1: Local version control and Intro to git Part 2: Remote Repositories posts will guide you through your first git project (local or remote) while providing a set of useful commands.  Other specialized tips can be found in Git branch in bash prompt and GitHub Pages.  And if you are wondering how to use git with PyCharm, you’ll find this couple of posts useful: A Guide to Using Git in PyCharm – Part 1 and A Guide to Using Git in PyCharm – Part 2.

2. Generating maps and working with spatial data in python

To learn more about python’s capabilities on this subject,  this  lecture  summarizes key python libraries relevant for spatial analysis.  Also,  Julie and the Jons have documented their efforts when working with spatial data and with python’s basemap, leaving us with some valuable examples:

Working with raster data

Python – Extract raster data value at a point

Python – Clip raster data with a shapefile

Using arcpy to calculate area-weighted averages of gridded spatial data over political units (Part 1) , (Part 2)

Generating maps

Making Watershed Maps in Python

Plotting geographic data from geojson files using Python

Generating map animations

Python makes the world go ’round

Making Movies of Time-Evolving Global Maps with Python

3. Reviews on synthetic streamflow and weather generation

We are lucky to have thorough reviews on synthetic weather and synthetic streamflow generation written by our experts Julie and Jon L.  The series on synthetic weather generation consists of five parts. Part I and Part II cover parametric and non-parametric methods, respectively. Part III covers multi-site generation.  Part IV discusses how to modify both parametric and non-parametric methods to simulate weather with climate change projections and finally Part V covers how to simulate weather with seasonal climate forecasts:

Synthetic Weather Generation: Part I , Part II , Part III , Part IV , Part V

The synthetic streamflow review provides a historical perspective while answering key questions: “Why do we care about synthetic streamflow generation?”, “Why do we use it in water resources planning and management?”, and “What are the different methods available?”

Synthetic streamflow generation

4.  Conceptual posts

Multi-objective evolutionary algorithms (MOEAs)

We frequently use multi-objective evolutionary algorithms due to their power and flexibility to solve multi-objective problems in water resources applications, so you’ll find sufficient documentation in the blog on basic concepts, applications and performance metrics:

MOEAs: Basic Concepts and Reading

You have a problem integrated into your MOEA, now what?

On constraints within MOEAs

MOEA Performance Metrics

Many Objective Robust Decision Making (MORDM) and Problem framing

The next post discusses the MORDM framework, which combines many-objective evolutionary optimization, robust decision making, and interactive visual analytics to frame and solve many-objective problems under uncertainty.  It is a valuable read, along with the references within.  The second post listed provides a systematic way of thinking about problem formulation and defines the key components of a many-objective problem:

Many Objective Robust Decision Making (MORDM): Concepts and Methods

“The Problem” is the Problem Formulation! Definitions and Getting Started

Econometric analysis and handling multi-variate data

To close this second part of the blog guide, I leave you with a couple of selected topics from the Econometrics and Multivariate Statistics courses at Cornell, documented by Dave Gold:

A visual introduction to data compression through Principle Component Analysis

Dealing With Multicollinearity: A Brief Overview and Introduction to Tolerant Methods


Water Programming Blog Guide (Part I)

The Water Programming blog continues to expand collaboratively through contributors’ learning experiences and their willingness to share their knowledge.  It now covers a wide variety of topics, ranging from quick programming tips to comprehensive literature reviews pertinent to water resources research and multi-objective optimization.  This post intends to provide guidance to new, and probably current, users by bringing to light what’s available in the blog and by developing a categorization of topics.

This first post will cover:

Software requirements

1. Programming languages and IDEs

2. Frameworks of optimization, sensitivity analysis and decision support

3. The Borg MOEA

4. Parallel computing

Part II will focus on version control, spatial data and maps, conceptual posts and literature reviews.  Finally, Part III will cover visualization and figure editing, LaTeX and literature management, tutorials, and miscellaneous research and programming tricks.

*Note to my fellow bloggers: Feel free to disagree and make suggestions on the categorization of your posts; your thoughts on facilitating easier navigation through the blog are also very welcome.  For current contributors, it’s always suggested to tag and categorize your posts; you can also follow the guidelines in Some WordPress Tips to enable a better organization of our blog.  Also, if you see a 💡, it means that a blog post idea has been identified.

Software Requirements

If you are new to the group and would like to know what kind of software you need to get started with research, Joe summed it up pretty well in his New Windows install? Here’s how to get everything set up post, where he points out all the installations that you should have on your desktop.  Additionally, you can find some guidance in Software to Install on Personal Computers and Software to Install on Personal Computers – For Mac Users!.  You may also want to check out What do you need to learn? if you are entering the group.  These posts are subject to update 💡 but they are a good starting point.

1. Programming languages and Integrated Development Environments (IDEs)

Dave Hadka’s Programming Language Overview provides a summary of key differences between C, C++, Python and Java.  The programming tips found in the blog cover Python, C, C++, R and Matlab; there are also some specific instances where Java is used, which will be discussed in section 2.  I’ll give some guidance on how to get started with each of these programming languages and point out some useful resources in the blog.

1.1. Python

Python is a very popular programming language in our group, so there’s plenty of guidance and resources available in the blog.  Download is available here; some online tutorials that I really recommend to get you up to speed with Python are Learn Python the Hard Way, Python for Everybody and Codecademy.  Additionally, Stack Overflow is a useful resource for specific tasks.  The Python resources available in our blog are outlined as follows:

Data analysis and organization

Data analysis libraries that you may want to check out are scipy, numpy and pandas; a minimal pandas sketch follows the posts below.  Here are some related posts:

Importing, Exporting and Organizing Time Series Data in Python – Part 1 and  Part 2

Comparing Data Sets: Are Two Data Files the Same?
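If you have never used pandas before, here is a minimal sketch of reading and summarizing a time series to give you a feel for the library; the file and column names are hypothetical, so adapt them to your own data:

import pandas as pd

# Hypothetical daily streamflow file with 'date' and 'flow' columns
df = pd.read_csv('streamflow.csv', parse_dates=['date'], index_col='date')

# Resample the daily series to monthly means and inspect the result
monthly = df['flow'].resample('M').mean()
print(monthly.head())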

Using Python IDEs

The use of an integrated development environment (IDE) can streamline code development and make debugging easier.  Tom has done a good amount of development in PyCharm, so he has put together a sequence of posts that provide guidance on how to take better advantage of PyCharm:

A Guide to Using Git in PyCharm – Part 1 , Part 2

Debugging in Python (using PyCharm) – Part 1 , Part 2 and Part 3

PyCharm as a Python IDE for Generating UML Diagrams

Josh also provides instructions to set up PyDev in Eclipse in his Setting up Python and Eclipse post; another Python IDE that you may want to check out is Spyder.

Plotting

The plotting library for python is matplotlib.  Some of the examples found in the blog will provide guidance on importing and using the library, and a small sketch is also included after the list below.  Matt put together a github repository with several Matlab and Matplotlib Plotting Examples; you can also find guidance on generating more specialized plots:

Customizing color matrices in matplotlib

Easy vectorized parallel plots for multiple data sets

Interactive plotting basics in matplotlib

Python Data Analysis Part 1a: Borg Runtime Metrics Plots (Preparing the Data) , Part 1b: Setting up Matplotlib and Pandas , Part 2: Pandas / Matplotlib Live Demo.
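If you have never used matplotlib, here is a minimal sketch, with made-up data, of the kind of figure it produces in just a few lines:

import numpy as np
import matplotlib.pyplot as plt

# Made-up data: two decaying sinusoids to compare
t = np.linspace(0, 10, 200)
fig, ax = plt.subplots()
ax.plot(t, np.exp(-0.2*t)*np.sin(t), label='series 1')
ax.plot(t, np.exp(-0.2*t)*np.cos(t), label='series 2')
ax.set_xlabel('Time')
ax.set_ylabel('Amplitude')
ax.legend()
plt.show()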

Miscellaneous  Python tips and tricks

Other applications in Python that my fellow bloggers have found useful relate to machine learning and scientific computing: Basic Machine Learning in Python with Scikit-learn, Solving systems of equations: Root finding in MATLAB, R, Python and C++, and using Python’s template class.
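To give you a flavor of scikit-learn, here is a minimal sketch, on made-up data, of fitting and inspecting a linear regression; it is only an illustration, not taken from the posts above:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: a noisy line y = 2.5x + noise
X = np.arange(10).reshape(-1, 1)
y = 2.5 * X.ravel() + np.random.normal(0, 1, 10)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)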

1.2. Matlab

Matlab, with its powerful toolkit, easy-to-use IDE and high-level language, can be used for quick development as long as you are not concerned about speed.  A major disadvantage of this software is that it is not free … fortunately I have a generous boss paying for it.  Here are examples of Matlab applications available in the blog:

A simple command for plotting autocorrelation functions in Matlab

Plotting Probability Ellipses for Bivariate Normal Distributions

Solving Analytical Algebra/Calculus Expressions with Matlab

Generating .gifs from Matlab Figures

Code Sample: Stacked Bars and Lines in Matlab

1.3. C++

I have heard C++ receive extremely unflattering nicknames lately.  It is a low-level language, which means that you need to worry about everything, even memory allocation; but the fact is that it is extremely fast and powerful, and it is widely used in the group for modeling, simulation and optimization purposes that would take forever in other languages.

Getting started

If you are getting started with C++, there are some online tutorials, and you may want to check out the following material available in the blog:

Setting up Eclipse for C/C++

Getting started with C and C++

Matt’s Thoughts on C++

Training

Here is some training material that Joe put together:

C++ Training: Libraries , C++ Training: Valgrind , C++ Training: Makefiles , C++ Training: Exercise 1 , C++ Training: Using gprof , Compiling Code using Makefiles

Debugging

If you are developing code in C++, it is probably a good idea to install an IDE.  I recently started using CLion, following Bernardo’s and Dave’s recommendation, and I am not displeased with it.  Here are other posts available within this topic:

Quick testing of code online

Debugging the NWS model: lessons learned

Sample code

If you are looking for sample code of commonly used processes in C++, such as defining vectors and arrays, generating output files and timing functions, here are some examples:

C++: Vectors vs. Arrays

A quick example code to write data to a csv file in C++

Code Sample: Timing Functions for C++

1.4. R

R is another free, open source environment widely used for statistics.  Joe recommends a reading in his Programming language R is gaining prominence in the scientific community post.  Downloads are available here.  If you happen to use an R package for your research, here’s some guidance on How to cite packages in R.  R also supports a very nice graphics package, and the following posts provide plotting examples:

Survival Function Plots in R

Easy labels for multi-panel plots in R

R plotting examples

Parallel plots in R

1.5. Command line / Linux

Getting familiar with the command line and Linux environment is essential to perform many of the examples and tutorials available in the blog.  Check out Terminal basics for the truly newbies if you want an introduction to the terminal basics and requirements; also take a look at Using gdb, and notes from the book “Beginning Linux Programming”.  Here are some useful commands:

Using linux “cut” , Using linux “split” , Using linux “grep”

Useful Linux commands to handle text files and speed up work

Using Linux input redirection in a C++ code test

Emacs in Cygwin

2. Frameworks for optimization, sensitivity analysis, and decision support

We use a variety of free, open source libraries to perform commonly used analyses in our research.  Most of the libraries that I outline here were developed by our very own contributors.

2.1. MOEAFramework

I have personally used this framework for most of my research.  It has great functionality and speed. It is an open source Java library that supports several multi-objective evolutionary algorithms (MOEAs) and provides  tools to statistically test their performance.  It has other powerful capabilities for sensitivity and data analysis.   Download and documentation material are available here.  In addition to the documentation and examples provided on the MOEAFramework site, other useful resources and guidance can be found in the following posts:

Setup guidance

MOEAframework on Windows

How to specify constraints in MOEAframework (External Problem)

Additional information on MOEAframework Setup Guide

Extracting data

Extracting Data from Borg Runtime Files

Runtime metrics for MOEAFramework algorithms, extracting metadata from Borg runtime, and handling infinities

Parameter Description Files for the MOEAFramework algorithms and for the Borg MOEA

Other uses

Running Sobol Sensitivity Analysis using MOEAFramework

Speeding up algorithm diagnosis by epsilon-sorting runtime files

2.2. Project Platypus

This is the newest python framework, developed by Dave Hadka, that supports a collection of libraries for optimization, sensitivity analysis, data analysis and decision making.  It’s free to download in the Project Platypus github repository.  The repository comes with its own documentation and examples.  We are barely beginning to document our experiences with this platform 💡, but it is very intuitive and user friendly.  Here is the documentation available in the blog so far:

A simple allocation optimization problem in Platypus

Rhodium – Open Source Python Library for (MO)RDM

Using Rhodium for RDM Analysis of External Dataset


2.3. OpenMORDM

This is an open source library in R for Many-Objective Robust Decision Making (MORDM).  For more details and documentation on both MORDM and the library’s use, check out the following post:

Introducing OpenMORDM

2.4. SALib

SALib is a python library developed by Jon Herman that supports commonly used methods to perform sensitivity analysis.  It is available here.  Aside from the documentation available in the github repository, you can also find guidance on some of the available methods in the following posts (a minimal usage sketch follows the list):

Method of Morris (Elementary Effects) using SALib

Extensions of SALib for more complex sensitivity analyses

Running Sobol Sensitivity Analysis using SALib

SALib v0.7.1: Group Sampling & Nonuniform Distributions
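To give a flavor of the library’s workflow, here is a minimal sketch of a Sobol analysis on SALib’s built-in Ishigami test function; the exact API may vary slightly between SALib versions, so check the repository’s documentation:

import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol
from SALib.test_functions import Ishigami

# Define the model inputs: three parameters, each on [-pi, pi]
problem = {
    'num_vars': 3,
    'names': ['x1', 'x2', 'x3'],
    'bounds': [[-np.pi, np.pi]] * 3
}

# Generate samples, evaluate the model, and compute the Sobol indices
param_values = saltelli.sample(problem, 1000)
Y = Ishigami.evaluate(param_values)
Si = sobol.analyze(problem, Y)
print(Si['S1'])  # first-order sensitivity indices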

There’s also an R package for sensitivity analysis: Starting out with the R Sensitivity Package.  Since we are on the subject, Jan Kwakkel provides guidance on Scenario discovery in Python as well.

2.5. Pareto sorting function in python (pareto.py)

This is a non-dominated sorting function for multi-objective problems in python available in Matt’s github repository.  You can find more information about it in the following posts:

Announcing version 1.0 of pareto.py

Announcing pareto.py: a free, open-source nondominated sorting utility

3.  Borg MOEA

The Borg Multi-objective Evolutionary Algorithm (MOEA), developed by Dave Hadka and Pat Reed, is widely used in our group due to its ability to tackle complex many-objective problems.  We have plenty of documentation and references in our blog to help you get familiar with it.

3.1. Basic Implementation

You can find a brief introduction and basic use in Basic Borg MOEA use for the truly newbies (Part 1/2) and (Part 2/2).  If you want to link your own simulation model to the optimization algorithm, you may want to check: Compiling, running, and linking a simulation model to Borg: LRGV Example.  Here are other Borg-related posts in the blog:

Basic implementation of the parallel Borg MOEA

Simple Python script to create a command to call the Borg executable

Compiling shared libraries on Windows (32 bit and 64 bit systems)

Collecting Borg’s operator dynamics

3.2. Borg MOEA Wrappers

There are Borg MOEA wrappers available for a number of languages.  Currently the Python, Matlab and Perl wrappers are documented in the blog.  I believe the documentation for the Borg Matlab wrapper on OSX needs an update at the moment 💡.

Using Borg in Parallel and Serial with a Python Wrapper – Part 1

Using Borg in Parallel and Serial with a Python Wrapper – Part 2

Setting Borg parameters from the Matlab wrapper

Compiling the Borg Matlab Wrapper (Windows)

Compiling the Borg Matlab Wrapper (OSX/Linux)

Code Sample: Perl wrapper to run Borg with the Variable Infiltration Capacity (VIC) model

4. High performance computing (HPC)

With HPC we can handle and analyze massive amounts of data at high speed.  Tasks that would normally take months can be done in days or even minutes, and it can help us tackle very complex problems.  In addition, here are some Thoughts on using models with a long evaluation time within a Parallel MOEA framework from Joe.

In the group we have a healthy availability of HPC resources; however, there are some logistics involved when working with computing clusters.  Luckily, most of our contributors have experience using HPC and have documented it in the blog.  I am currently using the MobaXterm interface to facilitate file transfer between my local and remote directories; it also lets you easily navigate and edit files in your remote directory.  It is used by our collaborators in Politecnico di Milano, who recommended it to Julie, who then recommended it to me.  Moving on, here are some practical resources for working with remote clusters:

4.1. Getting started with clusters and key commands

Python for automating cluster tasks: Part 1, Getting started and Part 2, More advanced commands

The Cluster and Basic UNIX Commands

Using a local copy of Boost on your cluster account

4.2. Submission scripts in  Bash

Some ideas for your Bash submission scripts

Key bindings for Bash history-search

4.3. Making bash sessions more enjoyable

Speed up your Bash sessions with “alias”

Get more screens … with screen

Running tmux on the cluster

Making ssh better

4.4. Portable Batch System (PBS)

Job dependency in PBS submission

PBS Job Submission with Python

PBS job chaining

Common PBS Batch Options

4.5. Python parallelization and speedup

Introduction to mpi4py

Re-evaluating solutions using Python subprocesses

Speed up your Python code with basic shared memory parallelization

Connecting to an iPython HTML Notebook on the Cluster Using an SSH Tunnel

NumPy vectorized correlation coefficient
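If you are curious what the entry point to mpi4py looks like before diving into the posts above, here is the classic hello-world sketch, in which each MPI process reports its own rank:

from mpi4py import MPI

# Each process launched by mpirun gets its own rank within the communicator
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
print("Hello from rank %d of %d" % (rank, size))

Running it with, for example, mpirun -n 4 python hello.py launches four copies of the script, each printing a different rank.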

4.6. Debugging

Debug in Real-time on SLURM

Debugging MPI By Dave Hadka

4.7. File transfer

Globus Connect for Transferring Files Between Clusters and Your Computer


9 basic skills for editing and creating vector graphics in Illustrator

This post intends to provide guidance for editing and creating vector graphics using Adobe Illustrator.  The goal is to learn some of the commonly used features to help you get started on your vectorized journey.  Be it a conceptual diagram, a logo, or cropping people out of your photo, these 9 features (and a fair amount of googling) will help you do the job.  Before we begin, it may be worthwhile to distinguish the main differences between a raster and a vector graphic.  A raster image is composed of a collection of squares or pixels, while vector graphics are based on mathematical formulas that define geometric forms (i.e. polygons, lines, curves, circles and rectangles), which makes them independent of resolution.

The three main advantages of using vector graphics over raster images are illustrated below:

1. Scalability: Vector graphics scale infinitely without losing any image quality. Raster images guess the colors of missing pixels when sizing up, whereas vector graphics simply use the original mathematical equation to create a consistent shape every time.

scalability-01.png

2. Editability: Vector files are not flattened; that is, the original shapes of an image exist separately on different layers.  This provides flexibility for modifying different elements without impacting the entire image.

edibility.png

3. Reduced file size: A vector file only requires four data points to recreate a square, whereas a raster image needs to store many small squares.

reduced_file_size.png


9 key things to know when getting started with Adobe Illustrator:

1. Starting a project

You can start a new project simply by clicking File > New, and the following window will appear.  You can provide a number of specifications for your document before starting, but you can also customize your document at any stage by clicking File > Document Setup (shortcut Alt+Ctrl+P).

pic1.PNG

2. Creating basic shapes

Lines & Arrows

Simply use the line segment tool (A) and remember to hold the Shift key to create perfectly straight lines.  Arrows can be added using the stroke window (Window > Stroke), and (B) will appear; there’s a variety of arrow styles that you can select from and scale (C).  Finally, in the line segment tool you can provide the exact length of your line.

Slide1.PNG

Polygons 

Some shapes are already specified in Illustrator (e.g. rectangles, stars and circles (A)), but many others, such as triangles, need to be specified through the polygon tool.  To draw a triangle, I specify the number of sides = 3, as shown in (B).

Slide1.PNG

Curvatures

To generate curvatures, you can use the pen tool (A).  Specify two points with the pen, hold the click on the second point, and a handle will appear; this handle allows you to shape the curve.  If you want to add more curvatures, draw another point (B) and drag the handle in the opposite direction of the curve.  You can then select the color (C) and the width (D) of your wave.

Slide1.PNG

3. Matching colors

If you need to match the color of an image (A) there are a couple of alternatives:

i) Using the “Eyedrop” tool (B).  Select the component of the image that you want to match, then select the Eyedrop tool and click on the desired color (C).

Slide1.PNG

ii) Using the color picker panel.  Select the image component with the desired color, then double click on the color picker (highlighted in red) and the following panel should appear.  You can see the exact color code, and you can copy and paste it onto the image that you wish to edit.

Slide1.PNG

4. Extracting exact position and dimensions

In the following example, I want the windows of my house to be perfectly aligned.  First, in (A), I click on one of the windows of my house and the control panel automatically provides its x and y coordinates, as well as its width and height.  Since I want to align both windows horizontally, I look up the y coordinate of the first window and copy it onto the y coordinate of the second window, as shown in (B).  The same procedure applies if you want to copy the dimensions from one figure to another.

twohouses.png

5. Free-style drawing and editing 

The pencil tool (A) is one of my favorite tools in Illustrator, since it corrects my shaky strokes and allows me to paint freestyle.  Once I added color and filled the blob that I drew, it started to resemble a tree top (B).  You can edit your figure by right clicking it.  A menu will appear enabling you to rotate, reflect, shear and scale, among other options.  I only wanted to tilt my tree, so I specified a mild rotation (C).

Slide1.PNG

Slide1.PNG

6. Cropping

Cropping in Illustrator requires clipping masks, and I will show a couple of examples using Bone Bone, a fluffy celebrity cat.  Once a .png image is imported into Illustrator, it can be cropped using one of the following three methods:

Method 1.  Using the direct selection tool

Slide1.PNG

Method 2. Using shapes

Slide2.PNG

Method 3. Using the pen tool for a more detailed crop

Slide3.PNG

To revert to the original image, select Object > Clipping Mask > Release (or Alt+Ctrl+7).

7. Customize the artboard size

If you want your image to be saved without extra white space (B), you can adapt the size of the canvas with the Artboard tool (or Shift+8) (A).

Slide1.PNG

8. Using layers

Layers can help you organize artwork, especially when working with multiple components in an image.  If the Layers panel is not already in your tools, you can access it through Window > Layers or through the F7 shortcut, and a panel like the one below should appear.  You can give each layer a descriptive name by double clicking on it.  Note that you can toggle the visibility of each layer on or off, and you can also lock a layer if you want to protect it from further change, like the house layer in the example below.  Each layer is color-coded: my current working layer is coded in red, so when I select an element in that layer it will be highlighted in red.  Layers can also have sub-layers to store individual shapes, like the house layer, which is comprised of a collection of rectangles and a triangle.

layer.png

closelayer.png

9.  Saving vector and exporting raster

Adobe Illustrator naturally allows you to save images in many vector formats, but you can also export raster images such as .png, .jpeg, .bmp, etc.  To export raster images, do File > Export and something like the panel below should show up.  You can specify the resolution and the color of the background.  I usually like a transparent background, since it gives you more flexibility when using your image in programs such as PowerPoint.

background.png

There are many more features available in Illustrator, but these are the ones that I find myself using quite often.  Also, you probably won’t have to generate images from scratch; there are many resources available online, and you can download svg images for free which you can later customize in Illustrator.  You can also complement this post by reading Jon Herman’s Scientific figures in Illustrator.

A simple allocation optimization problem in Platypus

For those of you who are not familiar with Project Platypus, it’s a repository that supports a collection of python libraries for multi-objective optimization, decision making and data analysis.  All the libraries are written in a very intuitive way, and it is just so slick.

In this post I will focus exclusively on the platypus library, which supports a variety of multi-objective evolutionary algorithms (MOEAs), such as NSGA-II, NSGA-III, MOEA/D, IBEA, EpsMOEA, SPEA2, GDE3, OMOPSO and SMPSO, along with a number of analysis tools and performance metrics (which we will be discussing in subsequent posts).

First you can install the entire framework by typing the following commands on your terminal:

git clone https://github.com/Project-Platypus/Platypus.git
cd Platypus
python setup.py develop

If you have trouble with this first step, please feel free to report it.  The library is still under development and it might be infested with some minor bugs.

I wanted to start this post with a classic allocation problem to illustrate how you would implement your own function and optimize it using platypus.  This example, by the way, was inspired by Professor Loucks’ public systems modeling class.

We have the following problem:

  1. We want to allocate coconuts to three different markets.
  2. The goal is to find the allocations that maximizes the total benefits.
  3. We have only 6 coconut trucks to distribute to the three markets.

So, let’s find the best allocations using the NSGAII algorithm supported by platypus:


from platypus.algorithms import NSGAII
from platypus.core import Problem
from platypus.types import Real

class allocation(Problem):

    def __init__(self):
        super(allocation, self).__init__(3, 3, 1)
        self.types[:] = [Real(-10, 10), Real(-10, 10), Real(-10, 10)]
        self.constraints[:] = "<=0"

    def evaluate(self, solution):
        x = solution.variables[0]
        y = solution.variables[1]
        z = solution.variables[2]
        solution.objectives[:] = [6*x/(1+x), 7*y/(1+1.5*y), 8*z/(1+0.5*z)]
        solution.constraints[:] = [x+y+z-6]

algorithm = NSGAII(allocation())
algorithm.run(10000)

for solution in algorithm.result:
    print(solution.objectives)

Assuming that you already have Platypus installed, you should be able to import the classes specified in lines 1-3 of the code above.  The first line, as you can tell, is where you import the MOEA of your choosing; make sure it’s supported by platypus.  In the second line, we import the Problem class that allows you to define your own function.  You also need to import the type class to specify your decision variables’ type; the library supports Real, Binary, Integer, etc.  I define the allocation class in line 5.  In line 8, I specify the number of decision variables (3), objectives (3) and constraints (1).  The bounds of my decision variables are defined in line 9.  The function in line 12 evaluates and stores the solutions.

Note that for solution.objectives, the objective functions are specified in a vector, separated by commas.  You can set constraints in line 17; if you have more than one constraint, they go in the same vector with comma separation.  In line 19, the problem is handed to the NSGAII algorithm, and in line 20 you set the number of function evaluations.  I used 10,000 here; how many you need really depends on the difficulty of your problem and your choice of algorithm.  Finally, I print the objective values in line 23, but you can print the decision variables as well if you wish.  As you can see, the setup of your problem can be extremely easy.  The names of the classes and functions in the library are very intuitive, and you can focus entirely on your problem formulation.
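One caveat worth flagging: platypus minimizes objectives by default, while this allocation problem is stated as a maximization.  My reading of the library is that the Problem class exposes a directions attribute for this, so a minimal sketch of the adjustment (double-check against the Platypus documentation) would be:

# inside allocation.__init__, after the call to super():
# tell platypus to maximize all three benefit objectives
self.directions[:] = [Problem.MAXIMIZE]*3

Alternatively, you can negate the objectives in evaluate and keep the default minimization.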

Customizing color matrices in matplotlib

In this post I intend to pass on some tricks for matplotlib color matrix customization.  I am guilty of beautifying some of my color matrices with Adobe Illustrator in the past: re-arranging labels, titles, colormaps, etc.  However, this time I had to generate way too many of them and I could see the beautifying process becoming extremely painful.  I will simply demonstrate how to generate the following three plots simultaneously with relatively few lines of code, in the hope of providing useful elements for your own plot customization.

plot1a.png

plo2.png

plt3.png

Plots 1-3 were generated with the following script, which I will explain in detail later in this post:

import glob
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

#Listing your files
files = glob.glob('./attainment_matrices/*.out')

#Organizing your files
data_plot1=[np.genfromtxt(f) for f in files[8:12]]
data_plot2=[np.genfromtxt(f) for f in files[0:4]]
data_plot3=[np.genfromtxt(f) for f in files[16:20]]
data=[data_plot1,data_plot2,data_plot3]

#Organizing titles and labels
plot_titles=['Plot 1','Plot 2', 'Plot 3']
subplot_titles= ['Subplot 1','Subplot 2', 'Subplot 3','Subplot 4']
labels= ['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
y_labels= ['Y Title a$\longrightarrow$','Y Title b $\longrightarrow$','Y Title c $\longrightarrow$']
cmap_labels=['Colormap label a$\longrightarrow$', 'Colormap label b$\longrightarrow$', 'Colormap label c$\longrightarrow$']

# Some variables to adjust subplots if necessary
left = 0.125 # the left side of the subplots of the figure
right = 0.9 # the right side of the subplots of the figure
bottom = 0.3 # the bottom of the subplots of the figure
top = 0.82 # the top of the subplots of the figure
wspace = 0.2 # the amount of width reserved for blank space between subplots
hspace = 0.5 # the amount of height reserved for white space between subplots

#Font sizes
plot_fontsize=40
subplot_fontsize=32
tick_label_fontsize=22 # Ticks, colormap, x and y labels use this fontsize

#x-label adjustments
rotation= 45 # rotation of labels
adjust=0 #if you want the x labels to be displayed right at the middle then adjust=0.5

x=np.arange(0,5.5)
y=np.linspace(0,100,1001)

#colormaps
colormap=['Set3_r', 'YlGnBu','Paired']

# the j is the iteration variable for each subplot, and the l is the iteration variable
# for each plot.
for l in range(len(plot_titles)):
    fig, ax = plt.subplots(1, len(subplot_titles), sharey=True)
    plt.subplots_adjust(left=left, bottom=bottom, right=right, top=top, wspace=wspace, hspace=hspace)
    # setting the titles wrapped by a transparent grey box at position=(x,y)
    fig.suptitle(plot_titles[l], fontsize=plot_fontsize,
                 bbox={'facecolor':'grey', 'alpha':0.1, 'pad':12}, position=(0.1827, .95))

    for j in range(len(subplot_titles)):
        a = ax[j].pcolor(x, y, data[l][j], cmap=colormap[l])
        ax[j].set_title(subplot_titles[j], fontsize=subplot_fontsize, y=1.03)
        # Set the y-label only in the first subplot
        ax[0].set_ylabel(y_labels[l], fontsize=tick_label_fontsize)
        ax[j].set_xticks(x + adjust, minor=False)
        #ax[j].set_xlim(left=0, right=5)
        #ax[j].set_ylim(0,100)
        ax[j].set_xticklabels(labels[:], rotation=rotation)
        ax[j].tick_params(labelsize=tick_label_fontsize)

    # colorbar settings:
    leftc = 0.12504
    bottomc = .13
    width_c = .775
    height_c = 0.04
    cbar_ax = fig.add_axes([leftc, bottomc, width_c, height_c])
    #cbar = fig.colorbar(a, cax=cbar_ax, orientation='horizontal')
    cbar = fig.colorbar(a, cax=cbar_ax, ticks=[0, 0.5, 1], orientation='horizontal')
    cbar.ax.set_xticklabels(['Low', 'Medium', 'High'])
    cbar.set_label(cmap_labels[l], fontsize=tick_label_fontsize, labelpad=25)
    cbar.ax.tick_params(labelsize=tick_label_fontsize)

    plt.show()

First, in lines 1 through 4, I specify the required libraries.  I use glob.glob to list the files for the analysis with their full path in line 7.  Then, if you want to see the order in which the files are listed, you can simply run the print command as follows:

print(files[:])

And you should be able to see the order of the files like so:

[‘./data_directory/file1.out’, ‘./data_directory/file10.out’, … ‘./data_directory/file24.out’]

I used the numpy genfromtxt function in lines 10-12 to load the data from the specified files, organizing the data that will be used in plot 1, plot 2 and plot 3.  I then made an array of the previous data on line 13 so I could use it in a loop later on.

I organized the titles of the main plots, the subplots, the x and y labels, as well as the colormap labels in lines 17-21.  All the parameters required to adjust the aspect ratio of the subplots are listed in lines 24 to 29.  If you simply want all of your subplots to be square, you can add the aspect='equal' parameter directly in the plt.subplots() function.

The font sizes for the plots, subplots, ticks and labels are specified in lines 32 to 34.  The x-labels can be adjusted in multiple ways.  In line 37, I set the rotation of the x-labels to 45 degrees.  If you want the labels to be completely vertical, you would do rotation=90; if you want horizontal labels, you don’t need to specify a rotation parameter at all.  Then, I used the adjust variable in line 38 to specify the position of the x-label: adjust=0 means the label will be written starting at the left corner of the bar, while adjust=0.5 centers it.

In line 44, I list the different colormaps to be used by each plot.  The outer loop in line 48 iterates through the 3 plots, while the inner loop in line 55 iterates through the 4 subplots generated in each plot.  In line 49, we specify the number of rows and columns of subplots that will be generated.  I want them to share the y axis, hence sharey=True; if you want your subplots to also share the x axis, you would simply add sharex=True in line 49.  The plt.subplots_adjust function in line 50 allows you to specify the exact aspect ratio of your subplots, including the white space between them and their location in the figure canvas, as detailed in lines 24 to 29.  In line 52, I specify the title of the plot as a whole; since I have three different plots, I loop through each of the different titles.  The title is shown in a grey transparent box at the upper left corner of the canvas, which was specified by position=(x,y).

Lines 56 to 64 show the subplots’ code.  I use the pcolor function to generate the color matrices; however, there are other methods to create them, such as pcolormesh, imshow, contour, etc.  In line 57, I loop through the subplot titles and assign their font size; here, y=1.03 specifies the distance from the subplot title to the plot, so the larger this y value, the greater the distance.  In line 59, I set the y-label.  Since I only want the y-label shown in the leftmost plot, I fix ax[0].set_ylabel(…); if you want each subplot to have its own y-label, you can loop through each of them with the subplot iteration variable j, as in ax[j].set_ylabel(…).  Lines 61 to 62 (commented out) show how you could set the x and y axis limits.  In line 63, I set the x tick labels; similarly, you could set the y tick labels if necessary.  The font size across all the ticks is set in line 64.

The colorbar settings are shown in lines 67 through 76.  Observe how you can specify the position of the left bottom corner of the colorbar, and from there assign its width and height.  Note that there are a couple of ways to specify the colorbar.  The first one is shown in line 72 (commented out) and generates a colorbar with the default ticks.  However, if you want to customize or add text to your colorbar, you would do so as shown in lines 73-74: the ticks parameter in line 73 specifies the positions where the labels written in line 74 are displayed.  You can set the colorbar label with .set_label; I loop through the colormap labels for each plot and assign their font size in line 75.  The labelpad allows you to specify the distance between the colorbar and its label.  Finally, the font size of the colormap ticks is specified in line 76.

I hope you can find some of the previous elements useful when designing your own color matrices ;).


Easy vectorized parallel plots for multiple data sets

I will share a very quick and straightforward solution to generate parallel plots in python for multiple groups of data.  The idea is transitioning from the parallel axis plot tool to a method that enables the plots to be exported as a vectorized image.  You can also take a look at Matt’s python parallel.py code available in github: https://github.com/matthewjwoodruff/parallel.py .

This is the type of figure that you will get:

parallel3.png

The previous figure was generated with the following lines of code:

import numpy as np
import pandas as pd
from pandas.tools.plotting import parallel_coordinates  # in newer pandas versions: from pandas.plotting import parallel_coordinates
import matplotlib.pyplot as plt
import seaborn

data = pd.read_csv('sample_data.csv')

parallel_coordinates(data,'Name', color= ['#225ea8','#7fcdbb','#1d91c0'], linewidth=5, alpha=.8)
plt.ylabel('Direction of Preference $\\rightarrow$', fontsize=12)

plt.savefig('parallel_plot.svg')

Lines 1-4 are the required libraries.  I just threw in the seaborn library to give it the gray background, but it is not necessary.  In the parallel_coordinates function, you need to specify the data, the column that identifies the groups (‘Name’), and the colors of the different groups.  You can substitute the color variable with colormap and specify the colormap that you wish to use (e.g. colormap='YlGnBu').  I also specified an alpha for transparency, to see overlapping lines.  If you want to learn more, you can take a look at the parallel_coordinates source code.  I found this stack overflow link very useful; it shows some examples of editing the source code to enable other capabilities.

Finally, the following snippet shows the format of the input data (the sample_data.csv file that is read in line 7):

sample_data.png

In columns A-G, the different categories to be plotted are specified (e.g. the objectives of a problem), and in column H the names of the different data groups are given.  And there you have it; I hope you find this plotting alternative useful.

Basic implementation of the parallel Borg MOEA

This post walks through the implementation of the different parallel versions of the Borg MOEA: master-slave and multi-master-slave.  The implementation is demonstrated using the DTLZ2 example provided with the multi-master source code.  We will break this post into four sections: the first one describes the main file, where the problem is defined and the required libraries are specified; the second part describes the Makefile to compile and create the executables for the dtlz2 problem; the third section describes the submission file to manage the distribution of jobs in a cluster; and finally we’ll cover a job submission example in the fourth section.

Both of the parallel Borg implementations are described in detail in the following papers:

Hadka, D., and Reed, P.M., “Large-scale Parallelization of the Borg MOEA for Many-Objective Optimization of Complex Environmental Systems”, Environmental Modelling & Software, v69, 353-369, 2015.

Reed, P.M. and Hadka, D., “Evolving Many-Objective Water Management to Exploit Exascale Computing”, Water Resources Research, v50, n10, 8367–8373, 2014.

1. Main file

1.1. Required headers

The main file is the file where we specify the main function: dtlz2_mm.c.  Here dtlz2 is the test problem we are solving, and mm refers to the multi-master implementation.  First, you will need the mpi header (line 5), a message passing library required for parallel computers and clusters.  You will also need to provide the Borg multi-master header file, borgmm.h (line 6); if your working directory is different from the multi-master directory, you will need to specify the path, for instance: "./mm-borg-moea/borgmm.h".

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <mpi.h>
#include "borgmm.h"

1.2. Problem definition

In the following lines the DTLZ2 problem is defined as it would be in the serial version.  The following function is responsible for reading the decision variables and evaluating the problem.  We will be using the 2-objective version of this problem; however, you can scale it up.  The rule is nvars = nobjs + 9.  Hence, if you want to try up to 5 objectives, you can simply change lines 2 and 3 to nvars = 14 and nobjs = 5.

#define PI 3.14159265358979323846
int nvars = 11;
int nobjs = 2;
void dtlz2(double* vars, double* objs, double* consts) {
	int i;
	int j;
	int k = nvars - nobjs + 1;
	double g = 0.0;

	for (i=nvars-k; i<nvars; i++) {
		g += pow(vars[i] - 0.5, 2.0);
	}

	for (i=0; i<nobjs; i++) {
		objs[i] = 1.0 + g;

		for (j=0; j<nobjs-i-1; j++) {
			objs[i] *= cos(0.5*PI*vars[j]);
		}

		if (i != 0) {
			objs[i] *= sin(0.5*PI*vars[nobjs-i-1]);
		}
	}
}

1.3. Enabling multiple seeds

The following lines allow random seeds to be passed as external arguments.  This links your main file to your submission file, where multiple seeds are submitted to the cluster.  This segment is coupled with the submission file discussed in section 3.

int main(int argc, char* argv[]) {
unsigned int seed = atoi(argv[1]);

1.4. Variable declarations

In lines 1-2 of the following code, a couple of loop variables, j and rank, are declared; these will be used later in sections 1.7 and 1.9.  The maximum number of function evaluations is also declared in line 3.  We will print out the runtime file and the file with the Pareto set; hence, the output file names are declared in lines 4-7.

int j;
int rank;
int NFE = 100000;
char outputFilename[256];
FILE* outputFile = NULL;
char runtime[256];
char timing[256];

1.5. Parallel borg parameters and core allocation

In the following segment, we specify some key parameters for the parallel Borg.  First, all multi-master runs need to call startup (line 1).  The variable name argc stands for “argument count”: it contains the number of arguments we pass to the program.  The name argv stands for “argument vector”: these are the arguments that we pass to the program.  The number of islands is specified in line 2; if you want to use the master-slave configuration, you can simply assign one island.  The only difference from the pure master-slave code is that it uses uniform population samples, whereas the multi-master code uses latin hypercube population samples.  Line 3 specifies the maximum wallclock time estimated for your job, and line 4 caps the number of function evaluations.  Finally, global latin hypercube initialization is specified in line 5 to ensure each island gets a well sampled distribution of solutions.

BORG_Algorithm_ms_startup(&argc, &argv);
BORG_Algorithm_ms_islands(2);
BORG_Algorithm_ms_max_time(0.1);
BORG_Algorithm_ms_max_evaluations(NFE);
BORG_Algorithm_ms_initialization(INITIALIZATION_LATIN_GLOBAL);

Keep the core allocation in mind: if you have 32 cores available for a computation, in a master-slave configuration one core will be allocated to the master and the remaining 31 cores will be available for function evaluations.  If you use a two-master configuration, one core will be allocated to the controller and two to the masters (one core per master), leaving 29 cores available for function evaluations.  If you are using a small cluster, I wouldn’t recommend going higher than 4 masters.

1.6. Problem definition

The next segment creates the DTLZ2 problem.  In line 1, the number of decision variables, objectives and constraints are specified; the last argument, dtlz2, references the function that evaluates the DTLZ2 problem shown in section 1.2.  In lines 3-5, the lower and upper bounds for each decision variable are set to 0 and 1.  In lines 7-9, the epsilon values used by the Borg MOEA, which define the problem resolution for each objective, are set to 0.01.

BORG_Problem problem = BORG_Problem_create(nvars, nobjs, 0, dtlz2);

for (j=0; j<nvars; j++) {
BORG_Problem_set_bounds(problem, j, 0.0, 1.0);
}

for (j=0; j<nobjs; j++) {
BORG_Problem_set_epsilon(problem, j, 0.01);
}

1.7. Printing runtime output

We specify the output frequency in line 1; since we specified 100,000 maximum function evaluations, Borg will provide runtime output every 1,000 NFE, i.e. 100 snapshots in total.  Lines 2 and 3 save the Pareto sets and the runtime dynamics to a file.  The %d gets replaced by the index of the seed and the %%d gets replaced by the index of the master; make sure that the sets and runtime folders exist in your working directory.

	BORG_Algorithm_output_frequency((int)NFE/100);
	sprintf(outputFilename, "./sets/DTLZ2_S%d.set", seed);
	sprintf(runtime, "./runtime/DTLZ2_S%d_M%%d.runtime", seed);
	BORG_Algorithm_output_runtime(runtime);

1.8. Parallelizing seeds

The MPI_Comm_rank routine gets the rank of this process. The rank is used to ensure each parallel process uses a different random seed.

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    BORG_Random_seed(37*seed*(rank+1));

1.9. Printing the Pareto set

The optimization is performed by the multi-master Borg MOEA on the problem in line 1.  Only the controller process returns a non-NULL result; the controller aggregates all of the Pareto optimal solutions generated by each master.  We then print the Pareto optimal solutions to a separate file and free any allocated memory.  Lines 13-15 shut down the parallel processes and exit the program.


	BORG_Archive result = BORG_Algorithm_ms_run(problem);

	if (result != NULL) {
		outputFile = fopen(outputFilename, "w");
		if (!outputFile) {
			BORG_Debug("Unable to open final output file\n");
		}
		BORG_Archive_print(result, outputFile);
		BORG_Archive_destroy(result);
		fclose(outputFile);
	}

	BORG_Algorithm_ms_shutdown();
	BORG_Problem_destroy(problem);
	return EXIT_SUCCESS;
}

2. Makefile

The following Makefile compiles the different versions of the Borg MOEA (serial, master-slave and multi-master-slave), as well as the DTLZ2 examples for each version, and generates the executables.  The gcc compiler is used for the serial version, while the mpicc compiler is required for the parallel versions.  From your terminal, access the mm-borg-moea directory and type make to compile the dtlz2 examples.

CC = gcc
MPICC = mpicc
CFLAGS = -O3
LDFLAGS = -Wl,-R,\.
LIBS = -lm
UNAME_S = $(shell uname -s)

ifneq (, $(findstring SunOS, $(UNAME_S)))
	LIBS += -lnsl -lsocket -lresolv
else ifneq (, $(findstring MINGW, $(UNAME_S)))
	# MinGW is not POSIX compliant
else
	POSIX = yes
endif

compile:
	$(CC) $(CFLAGS) $(LDFLAGS) -o dtlz2_serial.exe dtlz2_serial.c borg.c mt19937ar.c $(LIBS)

ifdef POSIX
	$(CC) $(CFLAGS) $(LDFLAGS) -o borg.exe frontend.c borg.c mt19937ar.c $(LIBS)
endif

	$(MPICC) $(CFLAGS) $(LDFLAGS) -o dtlz2_ms.exe dtlz2_ms.c borgms.c mt19937ar.c $(LIBS)
	$(MPICC) $(CFLAGS) $(LDFLAGS) -o dtlz2_mm.exe dtlz2_mm.c borgmm.c mt19937ar.c $(LIBS)
.PHONY: compile


3. Submission file

This is an example of a submission file.  Parallel Borg uses the portable batch system (PBS) to manage the distribution of batch jobs across the available nodes in the cluster.  In the following script, the -N flag is the name of the job and the -l flags list the required resources; in this file we are requesting 1 node with 16 processors per node, and 5 hours of wallclock time.  The -j oe flag joins the output and error streams, and -o sets the path of the output stream.  Line 8 moves to the current working directory.  In lines 10 through 16 we run multiple seeds in a loop and call mpirun with the name of the executable that was generated by compiling the examples in section 2.  The following script is saved as a bash file, for instance mpi-dtlz2.sh.

#!/bin/bash
#PBS -N dtlz2
#PBS -l nodes=1:ppn=16
#PBS -l walltime=5:00:00
#PBS -j oe
#PBS -o output

cd $PBS_O_WORKDIR

NSEEDS=9
SEEDS=$(seq 0 ${NSEEDS})

for SEED in ${SEEDS}
do
  mpirun ./dtlz2_mm.exe ${SEED}
done

4.  Submitting and managing jobs in a cluster

Once our submission file is ready, we make it executable using the following command:

chmod +x ./mpi-dtlz2.sh

Submit it to the cluster as such:

qsub mpi-dtlz2.sh

You can also delete a job using the qdel command and the job identifier:

qdel 24590

You can also hold a job using the qhold command and the job identifier:

qhold 24590

Show status of the jobs:

qstat

Displays information about active, queued or recently completed jobs:

showq