Reed Group’s basic C++ code style conventions

It is always good practice for programmers to adopt some sort of style convention when developing new code. This helps keep the code readable for both authors and collaborators, as well as for people who read your code on online repositories. Here I will set a precedent for a minimal C++ code style for Reed's group, covering the C++ features we normally use and based on the most common practices out there, so that we can more easily help each other with our codes and keep consistency when publishing them. This post may be updated if somebody sets precedents for C++ features I didn't account for (e.g. namespaces).

Naming conventions

  • Classes: Uppercase first letter. If the class name comprises more than one word, write all words together (no dashes, underscores, etc.) and capitalize the first letter of each word. E.g.: MyAwesomeClass.
  • Functions: Lowercase first letter. If the function name comprises more than one word, write all words together (no dashes, underscores, etc.) and capitalize the first letter of each word except the first. E.g.: myGreatFunction. If you are creating a getter or a setter, be sure to follow this standard. E.g. the getter for variable "this_variable" would be "getThisVariable."
  • Variables: All lowercase, with words separated by "_" (underscores). E.g. this_variable.
  • Constants: All letters capitalized and words separated by underscores. E.g. MY_GREAT_CONSTANT.
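Putting these conventions together, below is a minimal sketch showing all four naming styles in one place (the names are made up purely for illustration):

    const double MAX_RESERVOIR_STORAGE = 1000.0;      // constant

    class WaterReservoir {                            // class
    public:
        // getter and setter following the function naming style
        double getCurrentStorage() const { return current_storage; }
        void setCurrentStorage(double new_storage) { current_storage = new_storage; }
    private:
        double current_storage = 0.0;                 // variable
    };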

Other naming rules

Besides naming conventions, there are other good practices when it comes to coming up with names in your code:

  • Do not assign one-letter names, unless they are temporary variables such as i, j, and k used as indexes.
  • Assign informative names to your classes, functions, variables and constants. If you have a variable called “length,” another called “thisLength” and a third one called “realLength” your code will be really hard to follow.
  • Being concise is great (nobody reads code for its poetic variable names) but avoid shortening your names too much. Calling a variable “catchmentFallCreekIthaca” makes it much easier for someone else to know the information contained in that variable than calling it “catfacreith.”
  • We all get really frustrated with our codes at times and want to curse them really badly. It's fine to do that in your office when nobody is listening, but be sure not to let it leak into your code and to keep some decency: e.g. avoid having "this&%$*%DoesNot&$%*#@Work = true" in your code, or anything of the sort.

Other rules

  • Avoid magic numbers (hard-coded numbers). Code like the one below is not only hard to understand but also makes the reader question whether the results of the code are actually right:
    if (312 * evaporation + inflow / 52.14 - 7 * demand) {
        // Do something here
    }
    

    Now imagine if the value 312 is the value of an area and is used in 83 different parts of your code: that's a problem. Instead, declaring those numbers as constants is preferred:

    const double DRYVILLE_RESERVOIR_AREA = 312.0;
    const double NUMBER_OF_WEEKS_IN_YEAR = 52.14;
    const double NUMBER_OF_DAYS_IN_WEEK = 7.0;
    
    // Lots of code here, since constants are normally declared at the top of the code.
    
    if (DRYVILLE_RESERVOIR_AREA * evaporation + inflow / NUMBER_OF_WEEKS_IN_YEAR - NUMBER_OF_DAYS_IN_WEEK * demand) {
        // Do something here
    }
    
  • Keep your .cpp files shorter than 500 lines. If you start approaching 500 lines, it may be the case that your class can be broken into a parent class and multiple child classes, or into two completely different classes (see the sketch after this list).
  • Have only the main.cpp file in the root directory. All other files, if any, should be in subdirectories so that the code is easy to navigate.
  • If there is an issue or simplification to be fixed at some point in the future, use a "//FIXME:" comment to indicate it, as in the code below:
    //FIXME: replace constant area below by storage vs. area curve.
    if (DRYVILLE_RESERVOIR_AREA * evaporation + inflow / NUMBER_OF_WEEKS_IN_YEAR - NUMBER_OF_DAYS_IN_WEEK * demand) {
        // Do something here
    }
    
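As a hypothetical illustration of the class-splitting rule above (none of these names come from an actual code), a long Reservoir class handling several release policies could be broken into a small parent class and one child class per policy:

    #include <algorithm>

    class Reservoir {                                  // parent: shared state
    public:
        virtual ~Reservoir() = default;
        virtual double release(double demand) = 0;     // policy varies per child
    protected:
        double storage = 0.0;
    };

    class WaterSupplyReservoir : public Reservoir {    // child: one release policy
    public:
        double release(double demand) override {
            double released = std::min(storage, demand);
            storage -= released;
            return released;
        }
    };

    // FloodControlReservoir, HydropowerReservoir, etc. would follow the same
    // pattern, each in its own file, keeping every .cpp under the 500-line limit.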

Note that different languages have different standards. If coding in Python or Matlab, for example, be sure to follow the best practices for these languages. Also, if developing code in collaboration with another research group, be sure to negotiate a convention.

Water Programming Blog Guide (Part 2)

This second part of the blog guide, which follows the Water Programming Blog Guide (Part 1), will cover the following topics:

  1. Version control using git
  2. Generating maps and working with spatial data in Python
  3. Reviews on synthetic streamflow and synthetic weather generation
  4. Conceptual posts

1. Version Control using git

If you are developing code, it's worth the time to gain familiarity with git to maintain reliable and stable development. Git allows a group of people to work together on large projects while minimizing the chaos when multiple people edit the same files. It is also valuable for individual projects, as it allows you to keep multiple versions of a project, see the changes you have made over time, and undo those changes if necessary. For a quick introduction to git terminology and functionality, check out Getting Started: Git and GitHub. The Intro to git Part 1: Local version control and Intro to git Part 2: Remote Repositories posts will guide you through your first git project (local or remote) while providing a set of useful commands. Other specialized tips can be found in Git branch in bash prompt and GitHub Pages. And if you are wondering how to use git with PyCharm, you'll find these two posts useful: A Guide to Using Git in PyCharm – Part 1 and A Guide to Using Git in PyCharm – Part 2.

2. Generating maps and working with spatial data in python

To learn more about Python's capabilities on this subject, this lecture summarizes key Python libraries relevant for spatial analysis. Also, Julie and the Jons have documented their efforts when working with spatial data and with Python's basemap, leaving us with some valuable examples:

Working with raster data

Python – Extract raster data value at a point

Python – Clip raster data with a shapefile

Using arcpy to calculate area-weighted averages of gridded spatial data over political units (Part 1), (Part 2)

Generating maps

Making Watershed Maps in Python

Plotting geographic data from geojson files using Python

Generating map animations

Python makes the world go ’round

Making Movies of Time-Evolving Global Maps with Python

3. Reviews on synthetic streamflow and weather generation

We are lucky to have thorough reviews on synthetic weather and synthetic streamflow generation written by our experts Julie and Jon L. The series on synthetic weather generation consists of five parts. Part I and Part II cover parametric and non-parametric methods, respectively. Part III covers multi-site generation. Part IV discusses how to modify both parametric and non-parametric methods to simulate weather with climate change projections, and finally Part V covers how to simulate weather with seasonal climate forecasts:

Synthetic Weather Generation: Part I, Part II, Part III, Part IV, Part V

The synthetic streamflow review provides a historical perspective while answering key questions: "Why do we care about synthetic streamflow generation?", "Why do we use it in water resources planning and management?", and "What are the different methods available?"

Synthetic streamflow generation

4. Conceptual posts

Multi-objective evolutionary algorithms (MOEAs)

We frequently use multi-objective evolutionary algorithms due to their power and flexibility in solving multi-objective problems in water resources applications, so you'll find plenty of documentation in the blog on basic concepts, applications, and performance metrics:

MOEAs: Basic Concepts and Reading

You have a problem integrated into your MOEA, now what?

On constraints within MOEAs

MOEA Performance Metrics

Many Objective Robust Decision Making (MORDM) and Problem framing

The first post listed below discusses the MORDM framework, which combines many-objective evolutionary optimization, robust decision making, and interactive visual analytics to frame and solve many-objective problems under uncertainty. It is a valuable read, along with the references within. The second post provides a systematic way of thinking about problem formulation and defines the key components of a many-objective problem:

Many Objective Robust Decision Making (MORDM): Concepts and Methods

“The Problem” is the Problem Formulation! Definitions and Getting Started

Econometric analysis and handling multi-variate data

To close this second part of the blog guide, I leave you with a couple of selected topics from the Econometrics and Multivariate Statistics courses at Cornell, documented by Dave Gold:

A visual introduction to data compression through Principle Component Analysis

Dealing With Multicollinearity: A Brief Overview and Introduction to Tolerant Methods


Profiling C++ code with Callgrind

Oftentimes we have to write code to perform tasks whose complexity varies from mundane, such as retrieving and organizing data, to highly complex, such as the CFD simulations comprising the spine of a project. In either case, depending on the complexity of the task and the amount of data to be processed, the newborn code may leave us staring at an underscore blinking gracefully on a command prompt for hours during its execution until the results are ready, leading to project schedule delays and shortages of patience. Two standard and preferred approaches to the problem of time-intensive codes are to simplify the algorithm and to make the code more efficient. In order to better select the parts of the code to work on, it is often useful to first find the parts of the code in which the most time is spent, by profiling the code.

In this post, I will show how to use Callgrind, part of Valgrind, and KCachegrind to profile C/C++ codes on Linux. Unfortunately, Valgrind is not available for Windows or Mac, although it can be run on a cluster, from which the results can be downloaded and visualized on Windows with QCachegrind. The first step is to install Valgrind and KCachegrind by typing the following commands in the terminal of a Debian-based distribution, such as Ubuntu (equivalent yum commands are available for Red Hat-based distributions):

$ sudo apt-get install valgrind
$ sudo apt-get install kcachegrind

Now that the required tools are installed, the next step is to compile your code with GCC/G++ (with a makefile, CMake, an IDE, or by running the compiler directly from the terminal), ideally with the -g flag so that the profiling results can later be mapped back to your source code, and then run the following command in a terminal (press Ctrl+Shift+T to open one):

$ valgrind --tool=callgrind path/to/your/compiled/program program_arguments

Callgrind will then run your program with some instrumentation added to its execution to measure time expenditures and cache use by each function in your code. Because of this instrumentation, your code will take considerably longer to run under Callgrind than it typically would, so be sure to run a representative task that is as small as possible when profiling your code. During its execution, Callgrind will output a report similar to the one below in the terminal itself:

==12345== Callgrind, a call-graph generating cache profiler
==12345== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==12345== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==12345== Command: path/to/your/compiled/program program_arguments
==12345==
==12345== For interactive control, run 'callgrind_control -h'.
IF YOUR CODE OUTPUTS TO THE TERMINAL, THE OUTPUT WILL BE SHOWN HERE.
==12345== 
==12345== Events    : Ir
==12345== Collected : 4171789731
==12345== 
==12345== I   refs:      4,171,789,731

The report above shows that Callgrind collected over 4 billion events in order to generate the comprehensive report saved in the file callgrind.out.12345 (here, 12345 is your process ID, shown in the report above). Instead of submerging your soul into a sea of despair by trying to read the output file in a text editor, you should load the file into KCachegrind by typing:

$ kcachegrind callgrind.out.12345

You should now see a screen like the one below:

[Screenshot: KCachegrind's initial view of the profiling results]

The screenshot above shows the profiling results for my code. The left panel shows the functions called by my code, sorted by the total time spent inside each function. Because functions call each other, Callgrind shows two cost metrics as proxies for time spent in each function: "Incl.," showing the total cost of a function including its callees, and "Self," showing the time spent in the function itself, discounting the callees. By clicking on "Self," we sort the functions by the costs of their own code, as shown below:

[Screenshot: the same profiling results sorted by "Self" cost]

Callgrind includes functions that are native to C/C++ in its analysis. If one of them appears in the highest positions of the left panel, it may be worth trying a different function or data structure that performs a similar task in a more efficient way. Most of the time, however, our own functions occupy most of the top positions in the list. In the example above, a possible first step to improve the time performance of my code is to make the function "ContinuityModelROF::shiftStorage" more efficient. A few weeks ago, however, the function "ContinuityModel::continuityStep" was ranked first with over 30% of the cost, followed by a C++ map-related function. I then replaced a map inside that function with a pointer vector, which dropped my function's cost to less than 5% of the total cost of the code.
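As a rough sketch of that kind of change (the names below are hypothetical, not taken from the actual code), replacing a map keyed by dense integer IDs with a vector indexed directly by those IDs avoids tree lookups and improves cache locality:

    #include <map>
    #include <vector>

    struct Reservoir {
        double storage = 0.0;
    };

    // Before: map elements live in scattered tree nodes, so every lookup costs
    // O(log n) and traversals jump around in memory.
    double getStorageMap(const std::map<int, Reservoir *> &reservoirs, int id) {
        return reservoirs.at(id)->storage;
    }

    // After: with dense IDs used as indexes, lookups cost O(1) and the pointers
    // sit contiguously in memory.
    double getStorageVector(const std::vector<Reservoir *> &reservoirs, int id) {
        return reservoirs[id]->storage;
    }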

In case KCachegrind shows that a given function called from multiple places in the code is costly, you may want to know which caller is the main culprit behind the costly calls. To do this, click on the function of interest (in this case, "__memcpy_sse2_unaligned") in the left panel, then click on "Callers" in the upper right panel and on "Call Graph" in the lower right panel. This will show, in list and graph forms, the calls made to the function by other functions, and the associated percent costs. In my case, only the function "ContinuityModelROF::calculateROF" calls "__memcpy_sse2_unaligned," hence the simple graph, but the graph would be more complex if multiple functions made calls to it.

I hope this saves you at least the time spent reading this post!