Parasol Resources

In my last post, I introduced Parasol: an open source, interactive parallel coordinates library for multi-objective decision making. For additional background on Parasol, check out the library’s webpage to view interactive example applications, the GitHub repo for documentation, and the paper [Raseman et al. (2019)] for a more rigorous overview of Parasol.

[Figure: parasol-tutorial]

A demonstration of Parasol features like linked plots and data tables, highlighting individual polylines, and brushing and marking data.

In this post, I will describe how to start making your own interactive visualizations with Parasol.

Parasol Wiki

The best place to look for Parasol resources is the wiki page on the GitHub repo. Here you will find the API documentation, an introduction to web development for Parasol, and step-by-step tutorials taking you through the app development process. Because the wiki is likely to grow over time, I won’t provide links to these resources individually. Instead, I’ll give an overview of these resources here but I encourage you to visit the wiki for the most up-to-date tutorials and documentation.

API

The Application Programming Interface (API) is a list of all the features of the Parasol library. If you want to know everything that Parasol can do, the API is the perfect reference. If you are just getting started, however, this can be overwhelming. If that is the case, go through the tutorials and come back to the API page when you want to take your apps to the next level.

Web Development Basics

Creating Parasol apps requires some level of web development knowledge, but to build simple apps, you only need to understand the basics of HTML, JavaScript, and CSS. Check out the web development tutorial on the wiki for an overview of those concepts.

Tutorials

To start developing right away, go through the tutorials listed on the wiki page. These posts walk the user through creating a minimal Parasol app, touch on the primary features of the API, and cover more advanced topics like using HTML buttons and sliders to create more polished apps.

Creating shareable apps with JSFiddle

[Figure: jsfiddle-tutorial-3-completed]

Using JSFiddle to create shareable Parasol apps (without having to host them on a website)

I recently discovered JSFiddle–a sandbox tool for learning web development. Not only has JSFiddle made web development more accessible and easier for me to learn, but I’ve realized it’s also a great way to share Parasol apps in an efficient, informal way. Check out the tutorial on the wiki for more details and the following links to my own Parasol “fiddles”:

Give it a try!

As I’ve said before, check out Parasol and give us your feedback. If you think this library is valuable, submit an Issue or Star the GitHub repo, or write a comment below. We are open to new ideas and feature requests that would help Parasol better suit the needs of developers.

References

Raseman, William J., Joshuah Jacobson, and Joseph R. Kasprzyk. “Parasol: An Open Source, Interactive Parallel Coordinates Library for Multi-Objective Decision Making.” Environmental Modelling & Software 116 (June 1, 2019): 153–63. https://doi.org/10.1016/j.envsoft.2019.03.005.

Parasol: an open source, interactive parallel coordinates library for multi-objective decision making

For my entire graduate student career, I’ve gravitated toward parallel coordinates plots… in theory. These plots scale well for high-dimensional, multivariate datasets (Figure 1); and therefore, are ideal for visualizing multi-objective optimization solutions. However, I found it difficult to create parallel coordinates plots with the aesthetics and features I wanted.

[Figure: basic]

Figure 1. An example parallel coordinates plot that visualizes the “cars” dataset. Each polyline represents the attributes of a particular car and similar types of cars have the same color polyline.

I was amazed when I learned about Parcoords–a D3-based parallel coordinates library for creating interactive web visualizations. D3 (data-driven documents) is a popular JavaScript library which offers developers total control over their visualizations. Building upon D3, the Parcoords library is capable of creating beautiful, functional, and shareable parallel coordinates visualizations. Bernardo (@bernardoct), David (@davidfgold), and Jan Kwakkel saw the potential of Parcoords, each developing tools (tool #1 and tool #2) and additional features. Exploring these tools, I liked how they linked parallel coordinates plots with interactive data tables using the SlickGrid library. Being able to inspect individual solutions was so powerful. Plus, Parcoords features like interactive brushing, reorderable axes, and easy-to-read axis labels blew my mind. Working with large datasets was suddenly intuitive!

The potential of these visualizations was inspiring, but learning web development (i.e., JavaScript, CSS, and HTML) was daunting. Not only would I have to learn web development, but I would also need to learn multiple libraries and figure out how to link plots and tables together to make these types of tools. It just seemed like too much work for this sort of visualization to catch on… and that was the inspiration behind Parasol. Together with my advisor (Joe Kasprzyk, @jrk301) and Josh Jacobson, we built Parasol to streamline the development of linked, web-based parallel coordinates visualizations.

We’ve published a paper describing the Parasol library in Environmental Modelling & Software, so please refer to that paper for an in-depth discussion:

Raseman, William J., Joshuah Jacobson, and Joseph R. Kasprzyk. “Parasol: An Open Source, Interactive Parallel Coordinates Library for Multi-Objective Decision Making.” Environmental Modelling & Software 116 (June 1, 2019): 153–63. https://doi.org/10.1016/j.envsoft.2019.03.005.

In this post, I’ll provide a brief, informal overview of Parasol’s features and some example applications.

Cleaning up the clutter with Parasol

Parallel coordinates can visualize large, high-dimensional datasets, but at times they become difficult to read when polylines overlap and obscure the underlying data. This issue is known as “overplotting” (see Figures 1 and 2a). That’s why we’ve implemented a suite of “clutter reduction techniques” in Parasol that enable the user to tidy up the data dynamically.

Probably the most powerful clutter reduction technique is brushing (Figure 2b). Using interactive filters, the user can filter out unwanted data and focus on a subset of interest. Users can also alter polyline transparency to reveal density in the data (Figure 2c), assign colors to polylines (Figure 2d), and apply “curve bundling” to group similar data together in space (Figure 2e). These clutter reduction techniques can also be used simultaneously to enhance the overall effect (Figure 2f).

[Figure: figure-1]

Figure 2. Vanilla parallel coordinates plots (a) suffer from “overplotting”, obscuring the underlying data. Interactive filters (b: brushes) and other clutter reduction techniques (c-f) alleviate these issues.

Another cool feature of both Parcoords and Parasol is dynamically reorderable axes (Figure 3). With static plots, the user can only look at pairwise relationships between variables plotted on adjacent axes. With reorderable axes, they are free to explore any pairwise comparison they choose!

[Figure: reorderable]

Figure 3. Users can click and drag axes to interactively reorder them.

API resources

The techniques described above (and many other features) are implemented using Parasol’s application programming interface (API). In the following examples (Figures 4-6), elements of the API are denoted using ps.XXX(). For a complete list of Parasol features, check out the API Reference on the Parasol GitHub repo.

Example applications

The applications (shown in Figures 4-6) demonstrate the library’s ability to create a range of custom visualization tools. To open the applications, click on the URL in the caption below each image and play around with them for yourself. Parallel coordinates plots are meant to be explored!

Example app #1 (Figure 4) illustrates the use of linked parallel coordinates plots. If the user applies a brush on one plot, the changes will be reflected on both linked plots. Furthermore, app developers can embed functionalities of the Parasol API using HTML buttons. For instance, if the user applies multiple brushes to the plots, they can reset all the brushes by clicking on the “Reset Brushes” button, which invokes ps.resetSelections(). Other functionalities allow the user to remove the brushed data from a plot [ps.removeData()], keep only the brushed data [ps.keepData()], or export the brushed data [ps.exportData()].

Example app #2 (Figure 5) demonstrates the utility of linking data tables to parallel coordinates plots. When the user hovers the mouse over a data table row, the corresponding polyline becomes highlighted. This gives the user the ability to fine-tune their search and inspect individual data points–a rare feature in parallel coordinates visualizations.

App #2 also demonstrates how k-means clustering [ps.cluster()] can enhance these visualizations by sorting similar data into groups or clusters. In this example, we denote the clusters using color. Using a slider, users can alter the number of clusters (k) in real-time. Using checkboxes, they can customize which variables are included in the clustering calculation.

Example app #3 also incorporates clustering but encodes the clusters using both color and “curve bundling”. With curve bundling, polylines in the same cluster are attracted to one another in the whitespace between the axes. Bundling is controlled by two parameters: 1) curve smoothness and 2) bundling strength. This app allows the user to play around with these parameters using interactive sliders.

Similar to highlighting, Parasol features “marking” to isolate individual data points. Highlighting is temporary: when the user’s mouse leaves the data table, the highlight disappears. By clicking a checkbox on the data table, the user can “mark” data of interest for later. Although marks are more subtle than highlights, they provide a similar effect.

Lastly, this app demonstrates the weighted sum method [ps.weightedSum()]. Although aggregating objectives is generally discouraged in the multi-objective optimization literature, there are times when assigning weights to different variables and calculating an aggregate score can be useful. In the above example, the user can input different combinations of weights with text input or using sliders.

We want to hear from you!

For the most up-to-date reference on Parasol, see the GitHub repo, and for additional examples, check out the Parasol webpage. We would love to hear from you, especially if you have suggestions for new features or bugs to report. To do so, post on the issues page for the repo.

If you make a Parasol app and want to share it with the world, post a link in the comments below. It would be great to see what people create!

Parasol tutorials

If you aren’t sure how to start developing Parasol applications, don’t worry. In subsequent posts, I’ll take you step-by-step through how to make some simple apps and give you the tools to move on to more complicated, custom applications.

Those tutorials are currently under development and can be found on the Parasol GitHub wiki.

Recommended Software for the Kasprzyk Group

In this post, I provide a list of recommended software for multi-objective optimization research and a bit of context about each item. This is an update of two posts (for Windows users and Mac-users, respectively) that Joe made several years ago and is intended for Windows users. Although this list is catered toward members of the Kasprzyk Group at the University of Colorado Boulder (CU), it should be relevant to most readers of this blog.

Please feel free to make comments and give additional suggestions!

Text Editors

Text editors are a great way to view data (e.g., csv or space-delimited data) or review some code. If you are looking to run code in an interactive manner, you’ll want to get an integrated development environment (IDE) which corresponds to the programming language you are working with. I’ll get into that more in the “Programming Languages” section. If you have a Windows machine, you are probably used to opening things with the default text editor, Notepad. Notepad is the worst.

Notepad++

Notepad++ is infinitely better! The formatting is great, it has a bunch of plug-ins you can download, it improves readability, and runs quickly. Notepad++ is my preferred text editor if you want to look at a couple files. If you are looking to manage a larger project with several files and multiple directories, Atom is the way to go.

Atom

Atom is extremely powerful and customizable. It is made with software developers in mind and is basically the modern version of Notepad++ (which has been around for a while). It integrates easily with Git and GitHub (see “Version Control and Online Repositories” for an explanation of these tools), it has an extensive library of packages–similar to Notepad++ plug-ins–and it’s growing constantly. Atom also blurs the line between text editor and IDE, because with the Atom-IDE packages, it can have IDE-like functionality.

Programming Languages

You never know what languages you might work with in your research, but the main languages we use are Python, R, and C/C++. If Joe asks your preference, tell him that Python >> R and C/C++ >> R. Although if you catch him at a moment of weakness, he might admit that R can do some stats things, I guess.

Python

If you are working in Python (download), it will likely depend on the project whether you are working in 2.7 or 3.X (the X being whatever version they are on). A lot of scientific computing research still uses Python 2.7, but more people are transitioning to Python 3. In fact, they have decided to stop maintaining Python 2.7 at midnight on January 1, 2020. So remember to pour one out for your homie, Python 2.7, on New Year’s 2020. Although there are plenty of other Python IDEs (e.g., Spyder, Rodeo), we generally use PyCharm Community (download) in the group. For ease of installing packages, download Anaconda (download), which installs Python and over 150 scientific packages automatically.

R

If you are working in R (download), most everyone uses RStudio (download the Open Source License version) as an IDE. You will also want to download Rtools, which is helpful to have installed for building packages that require it. Generally, these are packages that need command line tools or compiled code from languages other than R.

C/C++

Although most projects are focused on Python and R, we have some work in C and C++ as well. These closely related languages, unlike Python and R, need to be compiled before you can run the code. To do this, you’ll need to download a compiler. We generally recommend installing MinGW (download). This will allow you to compile programs that will work across different platforms (e.g., Windows, Linux) while working on a Windows machine. Pro tip: if you have Rtools installed, it comes with MinGW, so you don’t need to download it separately! If you want to compile things for POSIX application deployment on Windows, you’ll want Cygwin. If you have no idea what that means, that’s okay. It’s my belief that you shouldn’t download Cygwin unless you need it because it takes up a lot of space on your system. But if you don’t care about storage, then go ahead! For most cases I run into, MinGW does the job.

Why do we even care about compiling things across platforms? Well, you may be running code on a Windows machine, but if you are doing any work with the supercomputer, you will also need to run that code in a Linux environment. So you want to make sure your code will work on both platforms.

Although it isn’t perfect, CodeLite (download) is a nice (and free) IDE for C/C++ work.

Version Control and Online Repositories

Version control is the best way to manage your code. It allows you to track your changes, make notes, and even revert to older versions of your code.

Git

Git (download) is the most popular way to implement version control, but other methods do exist. Learning Git takes a bit of time, but it is an essential skill which will pay dividends in the future. Most IDEs (integrated development environments) interface with Git, and so do some text editors (I’m looking at you, Atom).

GitHub

GitHub is where people post their code and data (bundled into so-called repositories) to share with the world. It is an amazing collaborative environment, and since we do computational research, it is best practice to create a GitHub repository for the code related to papers that you publish. Soon it will be a requirement in most journals!

Both Git and GitHub have stellar documentation and tutorials about how to get started, so you will have lots of support when you start learning the ropes.

Command Line Interfaces, Supercomputing, and File Transfer

If you haven’t used a command line interface (CLI) before, you will definitely learn in this line of work. CLIs can be useful for interacting with Git (I prefer it for most version control tasks), installing open source software, and tasks like copying/moving files or creating/moving directories on your local machine. My apologies in advance if I butcher the language in this section related to Linux, shell, bash, etc!

Git Bash

For these tasks, you can use the Command Prompt, which is a default program in Windows. But more likely, you’ll want to use a CLI which accepts Linux commands (that way you can use the same commands for your local CLI–which runs Windows–and for interacting with the supercomputer–which runs Linux). I like using Git Bash since it gets installed automatically when you install Git. You can also use Cygwin, but like I said before, this is generally overkill for running Linux commands if you already have Git Bash or something similar installed. I’m sure there are many other CLIs to choose from as well that I’m not aware of.

While it’s a nifty skill to know how to use the command line while working on your local computer, it is absolutely essential when working on the supercomputer (i.e., cloud or remote computing). Although I’ve seen some graphical interfaces and interactive environments for using cloud computing resources, it is far more common to perform tasks from the command line. You can even connect to the supercomputer with a CLI using the ‘ssh’ command.

MobaXterm

To make your life a bit easier when connecting to remote computing resources, you can download MobaXterm, or PuTTY and WinSCP. As David explains in his post, MobaXterm is probably the best way to go. It does the job of both PuTTY (connecting to remote computing) and WinSCP (moving files between your local computer and a remote resource).

If you’re doing any work on the supercomputer at CU, check out Research Computing for tutorials and other details about our supercomputers.

Cloud Storage

The University of Colorado has unlimited Google Drive storage which is linked to your CU Gmail account; therefore, it is the cloud storage of choice for the group. Google Drive File Stream allows you to access files stored on your Drive from your local computer without having to download your whole drive. This means it won’t take up a ton of disk space, but it will ‘feel’ like the files have been downloaded onto your computer (as long as you are connected to the internet). If you know you won’t be connected to the internet, you can easily download certain files/folders or even your whole Drive if you would like.

The supercomputer can serve as cloud storage; however, it is best to keep those files backed up locally, if possible. I’ve heard too many horror stories about people storing important data on the supercomputer and it getting erased! Although this can be avoided by storing things in the right place, you might sleep better if you’ve got another copy.

Multi-objective Optimization and Visualization

Borg

You’ve probably heard of the Borg Multi-Objective Evolutionary Algorithm (MOEA). If not, you will soon! There’s no direct download link for Borg, but you can fill out a form on its website to request the source code.

DiscoveryDV

Once you’ve performed an optimization, you will want to visualize the results. You can do this in your favorite programming language, but it is often difficult to interact with the data that way. For an interactive visualization experience, we generally use DiscoveryDV. Just like Borg, DiscoveryDV is not available for download directly, but you can request it on their website.

Open Source Projects

Additionally, here are a few open source projects related to multi-objective optimization, robust decision making, and visualization that may be useful to be familiar with: Project Platypus, OpenMORDM, and Exploratory modeling workbench.

Reference Managers

Reference managers are amazing things. Finding the right one will save you a lot of time and effort in the future. As Jazmin mentions in her post on research workflows, there are a ton to choose from. Check out this comparison table of reference management software, if you want to go down that rabbit hole.

Zotero

Joe’s group uses one called Zotero (download both the standalone and Chrome connector) which is free, easy to use, and integrates well with Microsoft Word.

Graphic Design

There are many ways to create custom figures for papers–PowerPoint is an easy choice because you likely already have Microsoft Office on your computer. However, Adobe Illustrator is much more powerful. Since Illustrator requires a license, ask Joe for more details if you need the software.

Happy downloading!

Multivariate Distances: Mahalanobis vs. Euclidean

Some supervised and unsupervised learning algorithms, such as k-nearest neighbors and k-means clustering, depend on distance calculations. In this post, I will discuss why the Mahalanobis distance is almost always better to use than the Euclidean distance for the multivariate case. There is overlap between the ideas here and David’s post on multicollinearity, so give that one a read too!

Why you should care about multivariate distance

Synthetic time series generation methods are of interest to many of us in systems optimization, and the topic has been covered extensively on this blog. For an overview of that subject, check out Jon’s blog post on synthetic streamflow. It’s a great read and will give you many more resources to explore.

Among synthetic time series generation methods, the k-nearest neighbors (k-NN) bootstrap resampling algorithm, developed by Lall and Sharma (1996), is a popular method for generating synthetic time series data of streamflow and weather variables. The k-NN algorithm resamples the observed data in a way that attempts to preserve the statistics of that data (e.g., mean and standard deviation at each timestep, lag-1 autocorrelation, etc.) but creates new and interesting synthetic records for the user to explore. As the name implies, this algorithm relies on finding k “nearest neighbors” to do its magic, where k is generally set to the nearest integer to √N and N is the number of years of observed data.
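To make that concrete, here is a rough Python sketch of a single k-NN resampling step (hypothetical variable names, a plain Euclidean distance for now, and not the exact formulation from Lall and Sharma; the choice of distance metric is the subject of the rest of this post):

import numpy as np

def knn_resample_step(history, current_state, k=None, rng=None):
    # history: (N, d) array of observed feature vectors (e.g., standardized weather variables)
    # current_state: (d,) conditioning vector for the current timestep
    rng = np.random.default_rng() if rng is None else rng
    N = history.shape[0]
    k = int(round(np.sqrt(N))) if k is None else k      # common heuristic: k ~ sqrt(N)

    # distance from the current state to every historical observation
    dists = np.linalg.norm(history - current_state, axis=1)
    neighbors = np.argsort(dists)[:k]                   # indices of the k nearest neighbors

    # weight closer neighbors more heavily, then resample one of them
    weights = 1.0 / np.arange(1, k + 1)
    weights /= weights.sum()
    return rng.choice(neighbors, p=weights)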

Determining the nearest neighbors

Finding those neighbors is straightforward in the univariate case (when there is only a single variable you want to simulate)—you just calculate the Euclidean distance. The shorter the distance, the “nearer” the neighbor. Well, it gets a bit more complicated in the multivariate case. There, you’ve got different units involved and correlation among variables which throws a wrench in the whole Euclidean distance thing. So, in most cases the Mahalanobis distance is preferred. Let me explain…

Example: how multivariate distance can help buy a car

Say we want to buy a four-wheel drive (4wd) car that will get us up into the mountains. We’ve got our eye set on a dream car, a 4wd Jeep, but we know we should shop around. So, let’s look at other 4wd cars on the market and compare their highway gas mileage and displacement (the total volume of all the cylinders in your engine) to find other cars we might be interested in. In other words, we are looking for the dream car’s nearest neighbors, with respect to those two measures.

[Figure: fig1]

Figure 1. Comparing highway gas mileage with displacement for our dream car and the others available.

Euclidean Distance

By glancing at the plot above, the distance calculation might appear trivial. In fact, you can probably roughly rank which points lie closest to the dream car just by eyeballing it. But when you try to do the calculation for Euclidean distance (equation 1), it will be skewed based on the units for gas mileage and displacement.

d(\overrightarrow{x},\overrightarrow{y})=\sqrt{(\overrightarrow{x}-\overrightarrow{y})^T(\overrightarrow{x}-\overrightarrow{y})}                                          (1)

Where: \overrightarrow{x} represents the attributes of our car and \overrightarrow{y} represents the attributes of another car.

For example, what if instead of miles per gallon, gas mileage was reported in feet per gallon? By changing those units, gas mileage would have multiple orders of magnitude more weight in the distance calculation than displacement. In that case, gas mileage would basically be the only thing that matters, which isn’t fair to poor old displacement. Therefore, when using the Euclidean distance to compare multiple variables we need to standardize the data which eliminates units and weights both measures equally. To do so, we can calculate the z-score (equation 2) for each observation:

z = \frac{x - \mu}{\sigma}                                      (2)

Where: z is the z-score (standardized variable), x is an observation, \mu and \sigma are the mean and standard deviation of the observation variable, respectively.

Visually, this is just like looking at our plot from before with no units at all.

[Figure: fig2]

Figure 2. Scale removed from Figure 1 to show that we need to remove the influence of units on the Euclidean distance calculation.

Now we can calculate the Euclidean distance and find the nearest neighbors!
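As a minimal Python sketch of equations 1 and 2 (with made-up numbers and hypothetical column names, not the actual dataset behind these figures):

import numpy as np
import pandas as pd

# hypothetical data: highway mpg and displacement for the dream car and a few alternatives
cars = pd.DataFrame({"hwy": [25, 24, 28, 20, 22],
                     "displ": [3.6, 4.0, 3.0, 5.3, 4.6]},
                    index=["dream_car", "car_a", "car_b", "car_c", "car_d"])

# equation 2: z-score each variable so the units no longer matter
z = (cars - cars.mean()) / cars.std()

# equation 1: Euclidean distance from the dream car to every other car
diff = z - z.loc["dream_car"]
euclidean = np.sqrt((diff ** 2).sum(axis=1))
print(euclidean.sort_values())   # smaller distance = nearer neighbor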

[Figure: euclidean-distance_cars]

Figure 3. The Euclidean distance and rank assigned to each car, where rank 0 is our “dream car”. If we were interested in a k-nearest neighbor algorithm with k=4, the points in the orange box would be selected as the neighbors.

Take note of the k-nearest neighbors in the orange box. Let’s see whether or not we get the same neighbors with the Mahalanobis distance.

Mahalanobis Distance

The Mahalanobis distance calculation (equation 3) differs only slightly from Euclidean distance (equation 1).

d(\overrightarrow{x},\overrightarrow{y})=\sqrt{(\overrightarrow{x}-\overrightarrow{y})^TS^{-1}(\overrightarrow{x}-\overrightarrow{y})}                                          (3)

Where: \overrightarrow{x} represents the attributes of our car, \overrightarrow{y} represents the attributes of another car, and S is the covariance matrix of the data (so S^{-1} is its inverse).

Unlike the Euclidean distance though, the Mahalanobis distance accounts for how correlated the variables are to one another. For example, you might have noticed that gas mileage and displacement are highly correlated. Because of this, there is a lot of redundant information in that Euclidean distance calculation. By considering the covariance between the points in the distance calculation, we remove that redundancy.

[Figure: mahalanobis-distance_cars]

Figure 4. The Mahalanobis distance and rank assigned to each car, where rank 0 is our “dream car”.

And look! By comparing the ranks in the orange boxes in Figures 3 and 4, we can see that although the ranks are similar between the two distance metrics, they do in fact yield different nearest neighbors. So which points get more weight when using the Mahalanobis distance vs. the Euclidean distance?

To answer that question, I’ve standardized the distance calculations so we can compare them to one another and plotted each on a 1-to-1 line. If the distance metrics were exactly the same, all the points would end up on that line and they would each have a Mahalanobis to Euclidean ratio of 0. However, we see that certain points get more weight (i.e., a larger distance calculated) depending on the distance metric used.

[Figure: mahalanobis-euclidean-distance-ratio]

Figure 5. Mahalanobis to Euclidean distances plotted for each car in the dataset. The points are colored based on the Mahalanobis to Euclidean ratio, where zero means that the distance metrics have equal weight. Purple means the Mahalanobis distance has greater weight than Euclidean and orange means the opposite.

Let’s map the Mahalanobis to Euclidean ratio onto our gas mileage vs. displacement plot.

[Figure: fig6]

Figure 6. The gas mileage vs. displacement of the cars as color-coded by the Mahalanobis to Euclidean ratio from Figure 5.

Notice that many of the points at the top left and bottom right of the plot are orange, meaning that the Euclidean distance calculation would give them more weight. And then there’s that point at the bottom center of the plot. That one gets far more weight when using the Mahalanobis distance. To understand this, let’s look at the axes of greatest variability in the data, also known as principal components. For a primer on that subject, check out David’s post and Rohini’s post on principal component analysis!

[Figure: mahalanobis-euclidean_pca]

When using Mahalanobis, the ellipse shown on the plot is squeezed towards a circle. Along the first principal component axis, there is a lot of work to get it there! The points in the top right and bottom right corners move quite a bit to reach a nice, neat circle. Along the second principal component axis, there is not much squishing to do. The difference between these distance calculations is due to this “squishification” (a term used by the great 3blue1brown, so it must be real). The Mahalanobis distance can be thought of as calculating the Euclidean distance after performing this “squishification”. In fact, when the variables are completely uncorrelated, no squishing happens, and the two calculations are identical (i.e., S^{-1} reduces to the identity matrix for standardized data).

Why you should use Mahalanobis distance (in general)

Which one should I use and when? When in doubt, Mahalanobis it out. When using the Mahalanobis distance, we don’t have to standardize the data like we did for the Euclidean distance; the covariance matrix calculation takes care of this. Also, it removes redundant information from correlated variables. Even if your variables aren’t very correlated, it can’t hurt to use the Mahalanobis distance; the results will just be quite similar to those from the Euclidean distance. You’ll notice that most of the recent k-NN resampling literature uses the Mahalanobis distance, e.g., Yates et al. (2003) and Sharif and Burn (2007).

One issue with the Mahalanobis distance is that it depends on taking the inverse of the covariance matrix. If this matrix is not invertible, no need to fear: you can use the pseudo-inverse instead to calculate the Mahalanobis distance (thanks to Philip Trettner for pointing that out!).
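Here is a comparable Python sketch of equation 3 that uses the pseudo-inverse so it still works when the covariance matrix is singular (again, made-up numbers for illustration, not the code behind the figures or the gist below):

import numpy as np

# hypothetical (hwy mpg, displacement) rows: the dream car first, then the alternatives
X = np.array([[25, 3.6],
              [24, 4.0],
              [28, 3.0],
              [20, 5.3],
              [22, 4.6]], dtype=float)

S = np.cov(X, rowvar=False)        # covariance matrix of the variables
S_inv = np.linalg.pinv(S)          # pseudo-inverse: safe even if S is not invertible

diff = X - X[0]                    # differences from the dream car
# equation 3 applied row by row: sqrt((x - y)^T S^-1 (x - y))
mahalanobis = np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))
print(mahalanobis)                 # smaller distance = nearer neighbor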

Code Availability

For anyone interested in the code used to create the figures in this post, I’ve created a GitHub gist.

References

Lall, Upmanu, and Ashish Sharma. “A Nearest Neighbor Bootstrap For Resampling Hydrologic Time Series.” Water Resources Research 32, no. 3 (March 1, 1996): 679–93. https://doi.org/10.1029/95WR02966.
Sharif, Mohammed, and Donald Burn. “Improved K -Nearest Neighbor Weather Generating Model.” Journal of Hydrologic Engineering 12, no. 1 (January 1, 2007): 42–51. https://doi.org/10.1061/(ASCE)1084-0699(2007)12:1(42).
Yates, David, Subhrendu Gangopadhyay, Balaji Rajagopalan, and Kenneth Strzepek. “A Technique for Generating Regional Climate Scenarios Using a Nearest-Neighbor Algorithm.” Water Resources Research 39, no. 7 (2003). https://doi.org/10.1029/2002WR001769.

Time Series Modeling: ARIMA Notation

A quick note!

If you are looking for more exhaustive resources on time series modeling, check out Forecasting: Principles and Practice and Penn State 510: Applied Time Series Analysis. These have time series theory plus examples of how to implement it in R. (for a more detailed description of these resources, see the ‘References’ section)

Motivation

Hydrological, meteorological, and ecological observations are often a special type of data: a time series. A time series consists of observations (say, streamflow) at equally-spaced intervals over some period of time. Many of us on this blog are interested in running simulation-optimization models which receive time series data as an input. But the time series data from the historical record may be insufficient for our work, so we also want to create synthetic time series data to explore a wider range of scenarios. To do so, we need to fit a time series model. If you are uncertain why we would want to generate synthetic data, check out Jon L.’s post “Synthetic streamflow generation” for some background. If you are interested in some applications, read up on this 2-part post from Julie.

A common time series model is the autoregressive moving average (ARMA) model. This model has many variations including the autoregressive integrated moving average (ARIMA), seasonal ARIMA (SARIMA) models, and ARIMA models with external covariates (ARIMAX and SARIMAX). This class of models is useful but it has its own special notation which can be hard to unpack. Take the SARIMA model for example:
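In one common notation (the exact symbols vary from source to source), a seasonal ARIMA(p,d,q)(P,D,Q)_s model is written as

\Phi_P(B^s)\,\phi_p(B)\,(1-B)^d (1-B^s)^D x_t = \Theta_Q(B^s)\,\theta_q(B)\,\varepsilon_t          (1)

and the plain ARMA(p,q) model can be written compactly as

\phi_p(B)\,(x_t - \mu) = \theta_q(B)\,\varepsilon_t          (2)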

Confused yet? Me too. What are those functions? What does the B stand for? To help figure that out, I’m going to break down some time series notation into bite-sized pieces. In this post, I will unpack the ARMA model (eq. 2). If you are interested in understanding eq. 1, check out Penn State 510: Applied Time Series Analysis – Lesson 4: Seasonal Models.

Autoregressive (AR) and moving average (MA) models

An ARMA model is a generalized form of two simpler models: the autoregressive (AR) and moving average (MA) models. Both the AR (eq. 3) and MA (eq. 4) models have a single parameter, p and q, respectively, which represents the order of the model.
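Written out (using φ’s for the AR weights and θ’s for the MA weights, the convention on the Wikipedia pages this post follows; the worked examples below reuse these symbols), the two models are

x_t = c + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + \varepsilon_t          (3)

x_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}          (4)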

The c and μ are constants, the x’s are the time series observations, the θ’s and φ’s are weighting parameters for the different lagged terms, and ε represents a random error term (i.e., it has a normal distribution with mean zero). You can see already how these equations might get a bit tedious to write out. Using what is known as a backshift operator and defining specific polynomials for each model, we can use less ink to get the same point across.

Backshift operator

The backshift (also known as the lag) operator, B, is used to designate different lags of a particular time series observation. Applying the backshift operator to the observation at the current timestep, x_t, yields the observation from the previous timestep, x_{t-1} (also known as lag 1).
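B\,x_t = x_{t-1}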

It doesn’t save much ink in this simple example, but with more model terms the backshift operator comes in handy. Using this operator, we can represent any lagged term by raising B to the power of the desired lag. Let’s say we want to represent lag 2 of x_t.
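B^2 x_t = x_{t-2}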

Or possibly the lag 12 term.
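B^{12} x_t = x_{t-12}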

Example 1: AR(2) – order two autoregressive model

Let’s apply the backshift operator to the AR(2) model as an example. First, let’s specify the model in our familiar notation.
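x_t = c + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \varepsilon_t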

Now, let’s apply the backshift operator.
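x_t = c + \phi_1 B\,x_t + \phi_2 B^2 x_t + \varepsilon_t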

Notice that x_t shows up a few times in this equation, so let’s rearrange the model and simplify.
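(1 - \phi_1 B - \phi_2 B^2)\,x_t = c + \varepsilon_t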

Once we’ve gotten to this point, we can define a backshift polynomial to further distill this equation down. For order two autoregressive models, this polynomial is defined as
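\phi(B) = 1 - \phi_1 B - \phi_2 B^2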

Combine this with the above equation to get the final form of the AR(2) equation.
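\phi(B)\,x_t = c + \varepsilon_t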

Example 2: MA(1) – order one moving average model

Starting to get the hang of it? Now we’re going to apply the same approach to an MA(1) model.
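x_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1}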

Now let’s apply the backshift operator.
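x_t = \mu + \varepsilon_t + \theta_1 B\,\varepsilon_t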

Rearrange and simplify by grouping the ε_t terms together.
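x_t = \mu + (1 + \theta_1 B)\,\varepsilon_t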

Define a backshift polynomial to substitute for the terms in the parentheses.
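\theta(B) = 1 + \theta_1 B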

Substitute polynomial to reach the compact notation.
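x_t = \mu + \theta(B)\,\varepsilon_t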

Autoregressive moving average (ARMA) models

Now that we’ve had some practice with the AR and MA models, we can move onto ARMA models. As the name implies, the ARMA model is simply a hybrid between the AR and MA models. As a shorthand, AR(p) is equivalent to ARMA(p,0) and MA(q) is the same as ARMA(0,q). The full ARMA(p,q) model is as follows:
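x_t = c + \phi_1 x_{t-1} + \dots + \phi_p x_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}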

Example 3: ARMA(1,2)

For the grand finale, let’s take the ARMA model from its familiar (but really long) form and put it in more compact notation. As an example, we’ll look at the ARMA(1,2) model.
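x_t = c + \phi_1 x_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}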

First, apply the backshift operator.
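x_t = c + \phi_1 B\,x_t + \varepsilon_t + \theta_1 B\,\varepsilon_t + \theta_2 B^2 \varepsilon_t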

Rearrange and simplify by grouping the terms from the current timestep, t. (If you are confused by this step check out “Clarifying Notes #2”)
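(1 - \phi_1 B)\,x_t = c + (1 + \theta_1 B + \theta_2 B^2)\,\varepsilon_t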

Substitute the polynomials defined for AR and MA to reach the compact notation.
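\phi(B)\,x_t = c + \theta(B)\,\varepsilon_t, where here \phi(B) = 1 - \phi_1 B and \theta(B) = 1 + \theta_1 B + \theta_2 B^2.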

And that’s it! Hopefully that clears up ARMA model notation for you.
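If it helps to see the notation connected to code, here is a hedged sketch in Python using statsmodels (the references below work in R’s ‘forecast’ package instead; the data here are just random numbers for illustration):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# placeholder data: in practice this would be a streamflow or weather series
rng = np.random.default_rng(42)
y = rng.normal(size=200)

# ARMA(1,2) is ARIMA with order (p, d, q) = (1, 0, 2), i.e., no differencing
fit = ARIMA(y, order=(1, 0, 2)).fit()
print(fit.summary())    # reports the fitted phi, theta, and constant terms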

Clarifying Notes

  1. There are many different conventions for the symbols used in these equations. For example, the backshift operator (B) is also known as the lag operator (L). Furthermore, sometimes the constants used in AR, MA, and ARMA models are omitted with the assumption that they are centered around 0. I’ve decided to use the form which corresponds to agreement between a few sources with which I’m familiar and is consistent with their Wikipedia pages.
  2. What does it mean for a backshift operator to be applied to a constant? For example, like for μ in equation 2. Based on my understanding, a backshift operator has no effect on constants: Bμ = μ. This makes sense because a backshift operator is time-dependent but a constant is not. I don’t know why some of these equations have constants multiplied by the backshift operator but it appears to be the convention. It seems to be more confusing to me at least.
  3. One question you may be asking is “why don’t we just use summation terms to shorten these equations?” For example, why don’t we represent the AR(p) model like this?
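x_t = c + \sum_{i=1}^{p} \phi_i x_{t-i} + \varepsilon_t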

We can definitely represent these equations with a summation, and for simple models (like the ones we’ve discussed) that might make more sense. However, as these models get more complicated, the backshift operators and polynomials will make things more efficient.

References

Applied Time Series Analysis, The Pennsylvania State University: https://onlinecourses.science.psu.edu/stat510/
Note: This is a nice resource for anyone looking for a more extensive resource on time series analysis. This blogpost was inspired largely by my own attempt to understand Lessons 1 – 4 and apply it to my own research.
Chatfield, Chris. The Analysis of Time Series: An Introduction. CRC press, 2016.
Hyndman, Rob J., and George Athanasopoulos. Forecasting: Principles and Practice. Accessed October 27, 2017. http://Otexts.org/fpp2/.
Note: This is an AWESOME resource for everything time series. It is a bit more modern than the Penn State course and is nice because it is based around the R package ‘forecast’ and has a companion package ‘fpp2’ for access to data. Since it is written by the author of ‘forecast’ (who has a nice blog and is a consistent contributor to Cross Validated and Stack Overflow), it is consistent in its approach throughout the book which is a nice bonus.

Wikipedia: https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model

Debugging: Interactive techniques and print statements

Trying to fix a bug in a program can be a long and tedious process. Learning what techniques to use can save you from headaches and wasted time. Starting out, I figured that a single best tool for debugging must exist. But which was it? Were print statements the answer? Or was it interactive debugging (like that described in “Debugging in Python (using PyCharm)” Parts 1, 2, and 3)?

Speaking to different people and reading forums, I could not find a consensus. Some people would refer to print statements as “a lost art”, while others criticize print statements as “poor substitutes [for interactive debugging] and, in fact, at times dangerous tools.” I’ve read many accounts of experienced programmers who swear by print statements but feel embarrassed to admit it. As if it were taboo to use such a simple technique even if it was effective at solving their problem.

There are strong opinions on either side, but based on my experiences, I believe the answer lies somewhere in the middle. Interactive debugging and print statements are two effective techniques which each have a time and a place. I think this post summed up my opinion on the matter well:

“Print statements and debugger are not mutually exclusive. They are just different tools available to you in order to locate/identify bugs. There are those who will claim how they never touch a debugger and there are those who do not have a single logging/print statement anywhere in the code that they write. My advice is that you do not want to be in either one of those groups.” 

Below, I’ve compiled some opinions which highlight the benefits and drawbacks of each technique. Hopefully, these can serve as a guide for your future debugging needs!

Interactive Debugging

  • Less recompiling: with interactive debugging you can change variable values while the program is running. This gives you the freedom to test new scenarios without the need to recompile.
  • Get custom notifications: set up watch variables which notify you when those variables change
  • Control: step into functions of interest or skip over functions that are not important for debugging. You can also set conditional breakpoints which are only activated when a certain value is triggered.
  • Use an IDE or the command line: debugging can be performed in an IDE (integrated development environment) or from the command line. IDEs are generally preferred, but there are instances—such as when using a command line interface to access a supercomputer—when they cannot be used. In these circumstances, interactive debugging can still be performed from the command line with tools such as GDB. Furthermore, most of these tools can be run with a text user interface (e.g. $gdb --tui).
  • Travel back in time: view the call stack at any time and inspect values up the stack trace. This can be an especially helpful feature when determining why a program has crashed. For example, see “What to Do After a Crash” in this GDB post.
  • Scales well with large projects: Although I don’t have much experience in this area, this post claims that interactive debugging is better suited for large projects or those with parallel code.
  • No clean-up: unlike print statements which often must be cleaned up (deleted or commented out) after debugging is done, there is no trace left behind from interactive debugging.

Print Statements

  • Reproducibility: leaving print statements in your code can help you reproduce past debugging sessions or help collaborators that may need to debug the program in the future. And instead of commenting out or deleting these print statements when not in use, they can be placed within an if-statement with a debugging flag that can be turned on and off (e.g. if (debug == TRUE) print <value>; see the sketch after this list).
  • Consistency between environments: interactive debugging can be done via the command line, but the experience does not generally compare to the same task in an IDE. Print statements, however, give the user a consistent experience across environments.
  • Permanent record-keeping: when an interactive debugging session is over, there is no record of the information that the user came across. On the other hand, print statements can be used to create an extensive diagnostic report of the program. The user can then analyze this report to gain more insight about the program at any time in the future.
  • Easy to use: print statements are simple to use and understand. Although interactive debugging has nice bells and whistles, there is a learning curve for new users.
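As a small illustration of that debugging-flag idea (a Python sketch with hypothetical names; the same pattern works in most languages, or with a proper logging library):

DEBUG = True   # flip to False to silence diagnostic output without deleting it

def route_water(inflow, capacity):
    spill = max(0.0, inflow - capacity)
    if DEBUG:
        print(f"route_water: inflow={inflow}, capacity={capacity}, spill={spill}")
    return min(inflow, capacity), spill

stored, spilled = route_water(120.0, 100.0)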

Thanks for reading and please feel free to edit and add your own thoughts!

Sources:

Compiling Code using Makefiles

For people new to coding or using supercomputers to submit jobs, compiling can be a new concept. In programming, a compiler takes source code (e.g., written in C/C++, Fortran, etc.) and translates it into a lower-level programming language (e.g., assembly language or machine code). When a compiler runs successfully, the source code is converted into an executable program which the computer understands how to run. For more info on compilers, check out this video.

To compile a program, we use the ‘make’ command. When we have multiple source files (which we often do when running complex water management models), a makefile helps organize the directions to give to the compiler. If you are not creating a model from scratch, you may already have an existing makefile, which is conventionally named ‘makefile’ or ‘Makefile’. If that is the case, compiling is easy as long as all the files are in the proper directory. Simply type ‘make’ in the command line interface to compile your code.

If you would like to edit your makefile, create one from scratch, or just want to learn more about the ‘make’ command and makefiles, check out the resources below:

Introduction to ‘make’ and compiler options:

Introductory tutorials for makefiles:

Makefile naming:

Makefile macros:

Compiler flags:

For a list of compiler flags, see the manual or ‘man’ page for the compiler: e.g. $man <compiler_name>. The convention for naming the makefile macros for compiler flags is CFLAGS for gcc and CXXFLAGS for g++ (strictly speaking, CPPFLAGS refers to preprocessor flags, although the example makefile below uses it for the g++ flags).

Example Makefile (C or C++):

The conventional file organization for this work is to create a src (or source) and bin (or binary) directory. The source code will go in /src while the makefile and any input files will go in /bin. Once the executable is created, it will be located in /bin as well. Below is a truncated version of a makefile I made for a water treatment plant model (written in C) based on a makefile I found for LRGV (written in C++). In general, there is little difference between a makefile for a program written in C or C++, so this template file can be used for either language by simply commenting or uncommenting the makefile macros for the language for which you want to compile (e.g. CC, CXX, CFLAGS, CPPFLAGS). One special difference between compiling C and C++ is that the standard math library (declared in math.h) is typically linked automatically for C++ but must be explicitly linked for C. You’ll notice that I’ve explicitly linked it in the code below with -lm, so as long as the C code I’ve written is also valid C++, I should be able to use either compiler, since C is (for the most part) a subset of C++. Enjoy!

MAIN_DIR=./.. #from within /bin go to main directory which contains /bin and /src directories

SOURCE_DIR = $(MAIN_DIR)/src #navigate to directory which contains source code

#some source files are left out of this list for the sake of space
SOURCES = \
    $(SOURCE_DIR)/basin.c    \
    $(SOURCE_DIR)/breakpt.c  \
    $(SOURCE_DIR)/unittype.c \
    $(SOURCE_DIR)/uptable.c  \
    $(SOURCE_DIR)/writewtp.c

OBJECTS=$(SOURCES:.c=.o) #name object files based on .c files
CC=gcc #select the compiler: this one is for C
#CXX=g++
CFLAGS=-c -O3 -Wall -I. -I$(SOURCE_DIR) #set flags for gcc
#CPPFLAGS=-c -O3 -Wall -I. -I$(SOURCE_DIR) #set flags for g++
LIBS=-lm  #link libraries (e.g. math.h with -lm: goo.gl/Bgq75V)
EXECUTABLE=wtp_v2-2_borg-mp #name of the executable file

#note: each indented recipe line below must begin with a tab character
all: $(SOURCES) $(EXECUTABLE)
	rm -rf $(SOURCE_DIR)/*.o

$(EXECUTABLE): $(OBJECTS)
	$(CC) $(OBJECTS) -o $@ $(LIBS)

.c.o:
	$(CC) $(CFLAGS) $< -o $@

clean: #'make clean' will remove all compilation files
	rm -f $(SOURCE_DIR)/*.o $(EXECUTABLE)

Debug in Real-time on SLURM

Debugging a code by submitting jobs to a supercomputer is an inefficient process. It goes something like this:

  1. Submit job and wait in queue
  2. Check for errors/change code
  3. (repeat endlessly until your code works)

Debugging in Real-Time:

There’s a better way to debug that doesn’t require waiting for the queue every time you want to check your code. On SLURM, you can debug in real-time like so:
  1. Request a debugging or interactive node and wait in queue
  2. Check for errors/change code continuously until code is fixed or node has timed out

Example (using Summit supercomputer at University of Colorado Boulder):

  1. Log into terminal (PuTTY, Cygwin, etc.)
  2. Navigate to directory where the file to be debugged is located using ‘cd’ command
  3. Load SLURM
    • $module load slurm
  4. Enter the ‘sinteractive’ command
    • $sinteractive
  5. Wait in line for permission to use the node (you will have a high priority with a debugging QOS so it shouldn’t take long)
  6. Once you are granted permission, the node is yours! Now you can debug to your heart’s content (or until you run out of time).
I’m usually debugging shell scripts on Unix. If you want advice on that topic, check out this link. I prefer the ‘-x’ option (shown below), but there are many others available.

Debugging shell scripts in Unix using the ‘-x’ option:

$bash -x mybashscript.bash

Hopefully this was helpful! Please feel free to edit/comment/improve as you see fit.