Google Earth Pro: the cool, the trade-offs and the limits

This post is an account of my recent experience with Google Earth Pro (GPE), a free online tool meant to make visualization of geographical information intuitive and accessible. This account contains tidbits that I thought others might find useful, but it is not meant to be an unbiased or comprehensive resource.

My goal in using GPE was to set up a short video tour of the reservoirs on the main stem of the Upper Snake River, which has its upper reaches around the Teton Range (WY, USA) before going through much of southern Idaho. Here is the video, produced with instructions from this post from the blog:

 

The cool

GPE is indeed an intuitive way for people that have only a limited experience of geographic information systems (GIS) to put together nice-looking videos in a limited amount of time. It relies on the increasingly large amounts of geographic data available online. For instance, polygons made of points with geographical coordinates, used to delineate political boundaries or catchments among others, can be found increasingly easily, be it under formats specific to GPE (KML, KMZ), or traditional GIS format such as SHP (shapefile format, which can be imported to GPE with… “File => Import”, I told you it was intuitive). This capability to find the right data could be improved by the recent launch of a dataset search engine by Google.

This video, after several tests, superposed three layers on the satellite images of the land surface:

  1. Pins of the dams’ locations. Please refer to this post to learn all you need to know about setting and customizing such new placemarks.
  2. The network of all streams with an average flow of more than 10 m3/s in the Columbia River basin. This is mainly for easy visualization of where the Snake River and its affluents are. That data was obtained from this useful USGS resource, with all the major basins in the US having data in GPE-ready KML format (basin boundaries, tributaries with their average flows, landuse, etc.).
  3. An invisible layer contains the shapes of the reservoirs’ lakes, in order to zoom in on the reservoir and not the dam itself. I got the shapes of waterbodies in the Upper Snake River basin from an amazing resource for water resources practitioners (see “The limits” below to see how to access this resource, but also to understand why it must be handled with care). Hint: since the list of waterbodies is dauntingly large and includes every little puddle in the area, I had to zoom in to the desired reservoir so as to only select its shape, by doing “File => Import”, then when prompted, by choosing to only import the features in my line of sight.

The trade-offs

The trade-offs when doing this kind of short intro video to your study area, are between video quality and required memory. This is especially true if the video is of a rather large area, as the GPE satellite images embed a level of detail that is amazing at the scale of that basin. Therefore, when creating my video, I had to resort to several tries before getting  a correct quality, and this video eats up 500Mo of disk space for 43 seconds (!!!!). Any video with lower resolution just looked downright disgusting, whereas any larger video may have serious problem running on your laptop. Think of it as a Goldilocks zone that you have to find if you want to have a video you can embed in an oral presentation.

(NB: to embed this video in a presentation in a fullproof kind of way, the easiest is to embed a link from the PPT file to the video in the same folder).

The limits

The limits of Google Earth are with its plain refusal to display features it deems too big. To understand this, let us look at this USGS webpage where a sizable of hydrological information can be downloaded. In particular it is possible to download data exactly for the upper Snake area (Subregion 1704 on the picture below). This data can include waterbodies as discussed above (in “the cool”) and that is contained in the NHD data ticket in the picture. By ticking “Watershed Boundary Dataset (WBD)”, one can also download basin and subbasin boundaries.

WBD_data

Why then did I not represent Upper Snake boundaries in my video? Well, the Upper Snake polygon, under SHP format at least, is too big so GPE just… refused to display it. I tried to represent the HU-6 and HU-8 subbasins (smaller than the HU-4-sized Upper Snake under this classification system, and:

  • The HU-6 subbasin just divides the basin into “headwaters” (smaller part upstream of Palisades) and “the rest”; the former feature was smaller and GPE plotted it, but it did not plot the latter.
  • The HU-8 subbasins all are individually smaller features that GPE accepts to plot. So I could have plotted the watershed boundaries by plotting ALL of the HU-8 subbasins, but spoiler alert: this looked horrendous.

Takeaway: GPE only displays features that individually have limited size. So maybe using a more powerful and recent desktop computer than the one I had would have done the trick… but be aware that there will always be a limit. Also note that assuming that GPE is just taking its time loading the data, and that staring blankly at the screen in the meantime is not going to be too painful, is not a good idea. Take a nap instead, or do something else, then admit that the damn thing just did not display.

A solution for this, of course, is to load the larger features on GIS software and making them lighter by eliminating points in the polygon without altering the shape… but that kinda beats the purpose of GPE, which is to avoid having to become a GIS geek, doesn’t it?

Evaluating and visualizing sampling quality

Evaluating and visualizing sampling quality

State sampling is a necessary step for any computational experiment, and the way sampling is carried out will influence the experiment’s results. This is the case for instance, for sensitivity analysis (i.e., the analysis of model output sensitivity to values of the input variables). The popular method of Sobol’ (Sobol’, 2001) relies on tailor-made sampling techniques that have been perfected through time (e.g., Joe and Kuo, 2008; Saltelli et al., 2010). Likewise, the method of Morris (Morris, 1991), less computationally demanding than Sobol’s (Herman et al., 2013) and used for screening (i.e., understanding which are the inputs that most influence outputs), relies on specific sampling techniques (Morris, 1991; Campolongo et al., 2007).

But what makes a good sample, and how can we understand the strengths and weaknesses of the sampling techniques (and also of the associated sensitivity techniques we are using) through quick visualization of some associated metrics?

This post aims to answer this question. It will first look at what makes a good sample using some examples from a sampling technique called latin hypercube sampling. Then it will show some handy visualization tools for quickly testing and visualizing a sample.

What makes a good sample?

Intuitively, the first criterion for a good sample is how well it covers the space from which to sample. The difficulty though, is how we define “how well” it practice, and the implications that has.

Let us take an example. A quick and popular way to generate a sample that covers the space fairly well is latin hypercube sampling (LHS; McKay et al., 1979). This algorithm relies on the following steps for drawing N samples from a hypercube-shaped of dimension p.:

1) Divide each dimension of the space in N equiprobabilistic bins. If we want uniform sampling, each bin will have the same length. Number bins from 1 to N each dimension.

2) Randomly draw points such that you have exactly one in each bin in each dimension.

For instance, for 6 points in 2 dimensions, this is a possible sample (points are selected randomly in each square labelled A to F):

 

lhs1

It is easy to see that by definition, LHS has a good space coverage when projected on each individual axis. But space coverage in multiple dimensions all depends on the luck of the draw. Indeed, this is also a perfectly valid LHS configuration:

lhs2

In the above configuration, it is easy to see that on top of poor space coverage, correlation between the sampled values along both axes is also a huge issue. For instance, if output values are hugely dependent on values of input 1, there will be large variations of the output values as values of input 2 change, regardless of the real impact of input 2 on the output.

Therefore, there are two kinds of issues to look at. One is correlation between sampled values of the input variables. We’ll look at it first because it is pretty straightforward. Then we’ll look at space coverage metrics, which are more numerous, do not look exactly at the same things, and can be sometimes conflicting. In fact, it is illuminating to see that sample quality metrics sometimes trade-off with one another, and several authors have turned to multi-objective optimization to come up with Pareto-optimal sample designs (e.g., Cioppa and Lucas, 2007; De Rainville et al., 2012).

One can look at authors such as Sheikholeslami and Razavi (2017) who summarize similar sets of variables. The goal there is not to write a summary of summaries but rather to give a sense that there is a relationship between which indicators of sampling quality matter, which sampling strategy to use, and what we want to do.

In what follows we note x_{k,i} the kth  sampled value of input variable i, with 1\leq k \leq N and 1\leq i \leq p.

Correlation

Sample correlation is usually measured through the Pearson statistic. For inputs variables i and j among the p input variables, we note x_{k,i} and x_{k,j} the values of these variables i and j in sample k (1\leq k \leq N) have:

\rho_{ij} = \frac{\sum_{k=1}^N (x_{k,i}-\bar{x}_i)(x_{k,j}-\bar{x}_j)}{\sqrt{\sum_{k=1}^N (x_{k,i}-\bar{x}_i)^2 \sum_{k=1}^N (x_{k,j}-\bar{x}_j)^2}}

In the above equation, {\bar{x}_i}   and {\bar{x}_j} are the average sampled values of inputs i and j . 

Then, the indicator of sample quality looks at the maximal level of correlation across all variables:

\rho_{\max} = \max_{1\leq i \leq j \leq N} |\rho_{ij}|

This definition relies on the remark that \rho_{ij} = \rho_{ji}.

Space Coverage

There are different measures of space coverage.

We are best equipped to visualize space coverage via 1D or 2D projections of a sample. In 1D a measure of space coverage is by dividing each dimension in N equiprobable bins, and count the fraction of bins that have at least a point. Since N is the sample size, this measure is maximized when there is exactly one point in each bin — it is a measure that LHS maximizes.

Other measures of space coverage consider all dimension at once. A straightforward measure of space filling is the minimum Euclidean distance between two sampled points X in the generated ensemble:

D = \min_{1\leq k \leq m \leq N} \left\{ d(\textbf{X}^k, \textbf{X}^m) \right\}

Other indicators measure discrepancy which is a concept closely related to space coverage. In simple terms, a low discrepancy means that when we look at a subset of a sampled input space, its volume is roughly proportional to the number of points that are in it. In other words, there is no large subset with relatively few sampled points, and there is no small subset with a relatively large density of sampled points. A low discrepancy is desirable and in fact, Sobol’ sequences that form the basis of the Sobol’ sensitivity analysis method, are meant to minimize discrepancy.

 

Sample visualization

The figures that follow can be easily reproduced by cloning a little repository SampleVis I put together, and by entering on the command line python readme.py &> output.txt. That Python routine can be used with both latin hypercube and Sobol’ sampling (using the SAlib sampling tool; SAlib is a Python library developed primarily by Jon Herman and Will Usher, and which is extensively discussed in this blog.)

In what follows I give examples using a random draw of latin hypercube sampling with 100 members and 7 sampled variables.

Correlation

No luck, there is statistically significant pairwise correlation between in three pairs of variables: x1 and x4, x4 and x6, and x5 and x6. Using LHS, it can take some time to be lucky enough until the drawn sample is correlation-free (alternatively, methods to minimize correlations have been extensively researched over the years, though no “silver bullet” really emerges).

Pearson_significance

This means any inference that works for both variables in any of these pairs may be suspect. The SampleVis toolbox contains also tools to plot whether these correlations are positive or negative.

Space coverage

The toolbox enables to plot several indicators of space coverage, assuming that the sampled space is the unit hypercube of dimension p (p=7 in this example). It computes discrepancy and minimal distance indicators. Ironically, my random LHS with 7 variables and 100 members has a better discrepancy (here I use an indicator called L2-star discrepancy) than a Sobol’ sequence with as many variables and members. The minimal Euclidean distance as well is better than for Sobol’ (0.330 vs. 0.348). This means that if for our experiment, space coverage is more important than correlation, the drawn LHS is pretty good.

To better grasp how well points cover the whole space, it is interesting to plot the distance of the point that is closest to each point, and to represent that in growing order:Distances

This means that some points are not evenly spaced, and some are more isolated than others. When dealing with a limited number of variables, it can also be interesting to visualize 2D projections of the sample, like this one:

2D-projections

This again goes to show that the sample is pretty-well distributed in space. We can compare with the same diagram for a Sobol’ sampling with 100 members and 7 variables:

2D-projections

It is pretty clear that the deterministic nature of Sobol’ sampling, for so few points, leaves more systematic holes in the sampled space. Of course, this sample is too small for any serious Sobol’ sensitivity analysis, and holes are plugged by a larger sample. But again, this comparison is a visual heuristic that tells a similar story as the global coverage indicator: this LHS draw is pretty good when it comes to coverage.

 

References

Campolongo, F., Cariboni, J. & Saltelli, A. (2007). An effective screening design for sensitivity analysis of large models. Environmental Modelling & Software, 22, 1509 – 1518.

Cioppa, T. M. & Lucas, T. W. (2007). Efficient Nearly Orthogonal and Space-Filling Latin Hypercubes. Technometrics, 49, 45-55.

De Rainville, F.-M., Gagné, C., Teytaud, O. & Laurendeau, D. (2012). Evolutionary Optimization of Low-discrepancy Sequences. ACM Trans. Model. Comput. Simul., ACM, 22, 9:1-9:25.

Herman, J. D., Kollat, J. B., Reed, P. M. & Wagener, T. (2013). Technical Note: Method of Morris effectively reduces the computational demands of global sensitivity analysis for distributed watershed models. Hydrology and Earth System Sciences, 17, 2893-2903.

Joe, S. & Kuo, F. (2008). Constructing Sobol Sequences with Better Two-Dimensional Projections. SIAM Journal on Scientific Computing, 30, 2635-2654.

McKay, M.D., Beckman R.J. & Conover, W.J. (1979).A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics, 21(2), 239-245.

Morris, M. D. (1991). Factorial Sampling Plans for Preliminary Computational Experiments. Technometrics, 33, 161-174.

Saltelli, A., Annoni, P., Azzini, I., Campolongo, F., Ratto, M. & Tarantola, S. (2010). Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Computer Physics Communications, 181, 259 – 270.

Sheikholeslami, R. & Razavi, S. (2017). Progressive Latin Hypercube Sampling: An efficient approach for robust sampling-based analysis of environmental models. Environmental Modelling & Software, 93, 109 – 126.

Sobol’, I. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55, 271 – 280.

 

Using HDF5/zlib compression in NetCDF4, part 2: testing the compression settings

There has been a previous post, courtesy of Greg Garner, on why HDF5/zlib compression matters for NetCDF4. That post featured a plot that showed how much you could compress your data when increasing the compression level. But the fine print also acknowledged that this data was for a pretty idealized dataset. So how much should you compress your data in a real-world application? How can you test what your trade-off really is between compression and computing time?

Follow this 4-step process to find out!

I’ll be illustrating this post using my own experience with the Water Balance Model (WBM), a model developed at the University of New Hampshire and that has served for several high-profile papers over the years (including Nature and Science). This is the first time that this model, written in Perl, is being ported to another research group, with the goal of exploring its behavior when running large ensembles of inputs (which I am starting to do! Exciting, but a story for another post).

Step 1. Read the manual

There is a lot of different software for creating NetCDF data. Depending on the situation, you may have a say on which to use, or be already using the tool that comes with the software suite you are working with. Of course, in the latter case, you can always change the tools. But reasonable a first step before that is to test them. Ergo, look up the documentation for the software you are using, to see how you can control compression on them.

Here, WBM uses the PDL::NetCDF Perl library, which has useful functions for adding data to a NetCDF file after every time step the model runs. Contrary to Greg’s post that uses C and where there are two flags (“shuffle” and “deflate”) and a compression level parameter (“deflate_level”), for PDL::NetCDF there are only two parameters. The SHUFFLE flag is the equivalent in Perl of the “shuffle” flag in C. The DEFLATE Perl parameter ihas integer values from 0 to 9, with a value 0 being equivalent to the C-flag “deflate” being turned off, and any value from 1 to 9 being equivalent to the “deflate”C-flag being on, the value of DEFLATE being then equivalent to the value of the “deflate_level” parameter in Greg’s post. Therefore, the DEFLATE variable from the PDL::NetCDF library in Perl lumps together the parameters “deflate” and “deflate_level” used in C.

I then located the DEFLATE and SHUFFLE variables within the auxiliary functions of the WBM. In the function write_nc, the following two lines of codes are key:

 my $deflate = set_default($$options{DEFLATE},1); # NetCDF4 deflate (compression) parameter</pre>
my $shuffle = set_default($$options{SHUFFLE},0); # NetCDF4 shuffle parameter 

Step 2. Set up a test protocol

This builds on Greg’s idea of recording time and resulting file size for all compression level. Here we are interested in these quantities for full-scale model runs, and not just for the generation of a single NetCDF dataset.

In this case therefore, we want to contrast the default setting above with stronger compression settings, for ensemble runs of WBM on the Cube (the local HPC cluster). For a better comparison, let us place ourselves in the conditions in which ensemble runs will be made. Runs will use all 16 cores of a Cube node, therefore for each compression setting, this experiment runs 16 instances of the WBM on a single node. Each of the 16 instances runs on a single core. All WBM runs are identical so the only differences between run times and result file size come from compression settings.

Compression settings for (SHUFFLE,DEFLATE) are (0,1) by default, and we compare that with all settings from (1,1) to (1,9).

Step 3. Run experiment, get results

Here are the results from this experiment. Results consider 47 output fields for WBM runs with a daily time-step for 8 years (2009-2016), plus 5 years of warmup (this is pretty common for hydrological models). All this in a spatial mesh of 148,500 grid cells. A folder containing binaries for a single input variable, for this time span and spatial coverage, has a size of 3.1GB. Therefore, the expected size for 47 variables in binary format is 146Go. Let us compare with our results:

netcdf_expe_results

As one can see the presence of the shuffle flag or the value of the deflate parameter have little influence on the size of the results files. Compressed results are 3 to 4 time smaller than binaries, which highlights the interest of compressing, but also means we do not have the order(s) of magnitude differences reported by Greg’s blog post. This is mainly because the binary format used for WBM inputs is much more efficient than the uncompressed ASCII that Greg used in his experiment. For a deflate parameter of 9, there is an apparent problem within the PDL library, and no output (note that a single-core run with shuffle=0 and deflate=9 did not lead to a similar problem).

Step 4. Conclude on compression parameters

Here the epxerimental setup has shown that carefully selecting the output fields will save more space than fine-tuning NetCDF compression parameters. For instance, some of the 47 output fields above are fully redundant with others. Others are residual fields, and the only interest of looking them up is to verify that a major development within the WBM code did not mess up with the overall water balance.

More generally, the effects of compression are situation-specific and are not as great when there is no obvious regularity in the data (as is often the case with outputs from large models), or when the binary format used is already much better than ASCII. This said, NetCDF still occupies much less space than binaries, and is much easier to handle: WBM outputs are contained in one file per year (8 files total) with very useful metadata info…


 

 

What is resilience in water resources systems?

Ah, resilience. Everybody talks about it, everybody has an idea (and sometimes a precise one) about what it means, but nobody seems to agree about what it is that it means. This post is about providing an overview of that topic for water resources systems: why people talk about it, how we can understand it, why it matters for adaptation and why it is more of an exciting research topic than a buzzword. I prepared a video to answer all those four questions in just over 4 minutes. Here it is.

An additional question would be: what are the connections with other topics abundantly described in this blog, such as robustness? Clues to the answer to this are in the video (a.k.a you should definitely watch it), and the answer to this is that resilience can be thought of as being robustness, or recovery if we are not robust. Academics and practitioners alike tend to focus on one of these aspects depending on the characteristics of the problem(s) they focus on.

That short answer can look a bit too easy, but it works whatever it is we are assessing robustness or resilience to: events, long-term changes, surprises (the state of the world is not what we thought), etc.

In practice that means that when we are concerned with resilience, we are concerned with three problems at once: 1) robustness within a safe operating space that we define according to our preferences, perceptions, the state of the system, etc., 2) a recovery problem outside of that safe space, and 3) where is our limit? Everything that has been done (in the Reed group and beyond) on robustness still applies when we operate our system in our safe space! Part of my work now is eliciting what is similar and what is different in the recovery problem, compared with the robustness problem. Another part is understanding what shapes the limit of a safe operating space.

More to follow on that soon… in the meantime, whoever is interested in how resilience amounts to solving two separate problems depending on the state of our system, can look up that work from my PhD thesis (yes, it even contains an application of the shallow lake problem!).

Geeky appendix

How did I make this video? I used the free and open-source Kazam software to record myself, screenshot and voice. Then I edited the video a bit with OpenShot (you have to leave your main screen to stop th recording with Kazam, so I had to edit that out), an equally free and open-source software. Both were easy to use and I would recommend them if you want to make this kind of video. One last thing: it is definitely interesting to record oneself talking! It is not easy to hold off the hesitations and quirks in your speech for 4 minutes…

Animations (2/2)

Animations (2/2)

In the second part of this two-part post we’ll learn to use different tools and techniques to visualize and save animations (first part here). All the code discussed here is available on a GitHub repository, here.

This part focuses on the moviepy Python library, and all the neat things one can do with it. There actually are some nice tutorials for when we have a continuous function t -> f(t) to work with (see here). Instead, we are often working with data structures that are indexed on time in a discrete way.

Moviepy could be used from any data source dependent on time, including netCDF data such as the one manipulated by VisIt in the first part of this post. But in this second part, we are instead going to focus on how to draw time -dependent trajectories to make sense of nonlinear dynamical systems, then animate them in GIF. I will use the well-known shallow lake problem, and go through a first example with detailed explanation of the code. Then I’ll finish with a second example showing trajectories.

Part I: using state trajectories to understand the concept of stable equilibria

The shallow lake problem is a classic problem in the management of coupled human and natural system. Some human (e.g. agriculture) produce phosphorus that eventually end up in water bodies such as lakes. Too much phosphorus in lake causes a processus called eutrophication which usually destroys lakes’ diverse ecosystems (no more fish) and lower water quality. A major problem with that is that eutrophication is difficult or even sometimes impossible to reverse: lowering phosphorus inputs to what they were pre-eutrophication simply won’t work. Simple nonlinear dynamics, first proposed by Carpenter et al. in 1999 (see here) describe the relationship between phosphorus inputs (L) and concentration (P). The first part of the code (uploaded to GitHub as movie1.py) reads:

 import attractors
import numpy as np
from moviepy.video.io.bindings import mplfig_to_npimage
from moviepy.video.VideoClip import DataVideoClip
import matplotlib.pyplot as plt
import matplotlib.lines as mlines

# Lake parameters
b = 0.65
q = 4

# One step dynamic (P increment rate)
# arguments are current state x and lake parameters b,q and input l
def Dynamics(x, b, q, l):
    dp = (x ** q) / (1 + x ** q) - b * x + l
    return dp 

Where the first 6 lines contain the usual library imports. Note that I am importing an auxiliary Python function “attractors” to enable me to plot the attractors (see attractors.pyon the GitHub repository). The function “Dynamics” correspond to the evolution of P given L and lake parameters b and q, also given in this bit of code. Then we introduce the time parameters:

# Time parameters
dt = 0.01 # time step
T = 40 # final horizon
nt = int(T/dt+1E-6) # number of time steps

To illustrate that lake phosphorus dynamics depend not only on the phosphorus inputs L but also on initial phosphorus levels, we are going to plot P trajectories for different constant values of L, and three cases regarding the initial P. We first introduce these initial phosphorus levels, and the different input levels, then declare the arrays in which we’ll store the different trajectories

# Initial phosphorus levels
pmin = 0
pmed = 1
pmax = 2.5

# Inputs levels
l = np.arange(0.001,0.401,0.005)

# Store trajectories
low_p = np.zeros([len(l),nt+1]) # Correspond to pmin
med_p = np.zeros([len(l),nt+1]) # Correspond to pmed
high_p = np.zeros([len(l),nt+1]) # Correspond to pmax 

Once that is done, we can use the attractor import to plot the equilibria of the lake problem. This is a bit of code that is the GitHub repository associated to this post, but that I am not going to comment on further here.

After that we can generate the trajectories for P with constant L, and store them to the appropriate arrays:

# Generating the data: trajectories
def trajectory(b,q,p0,l,dt,T):
# Declare outputs
time = np.arange(0,T+dt,dt)
traj = np.zeros(len(time))
# Initialize traj
traj[0] = p0
# Fill traj with values
for i in range(1,len(traj)):
traj[i] = traj[i-1] + dt * Dynamics(traj[i-1],b,q,l)
    return traj
# Get them!
for i in range(len(l)):
    low_p[i,:] = trajectory(b,q,pmin,l[i],dt,T)
    med_p[i, :] = trajectory(b, q, pmed, l[i], dt, T)
    high_p[i,:] = trajectory(b,q,pmax,l[i],dt,T)

Now we are getting to the interesting part of making the plots for the animation. We need to declare a figure that all the frames in our animation will use (we don’t want the axes to wobble around). For that we use matplotlib / pyplot libraries:

# Draw animated figure
fig, ax = plt.subplots(1)
ax.set_xlabel('Phosphorus inputs L')
ax.set_ylabel('Phosphorus concentration P')
ax.set_xlim(0,l[-1])
ax.set_ylim(0,pmax)
line_low, = ax.plot(l,low_p[:,0],'.', label='State, P(0)=0')
line_med, = ax.plot(l,med_p[:,0],'.', label='State, P(0)=1')
line_high, = ax.plot(l,high_p[:, 0], '.', label='State, P(0)=2.5')

Once that is done, the last things we need to do before calling the core moviepy functions are to 1) define the parameters that manage time, and 2) have a function that makes frames for the instant that is being called.

For 1), we need to be careful because we are juggling with different notions of time, a) time in the dynamics, b) the index of each instant in the dynamics (i.e., in the data, the arrays where we stored the trajectories), and c) time in the animation. We may also want to have a pause at the beginning or at the end of the GIF, rather than watch with tired eyes as the animation is ruthlessly starting again before we realized what the hell happened. So here is how I declared all of this:

# Parameters of the animation
initial_delay = 0.5 # in seconds, delay where image is fixed before the animation
final_delay = 0.5 # in seconds, time interval where image is fixed at end of animation
time_interval = 0.25 # interval of time between two snapshots in the dynamics (time unit or non-dimensional)
fps = 20 # number of frames per second on the GIF
# Translation in the data structure
data_interval = int(time_interval/dt) # interval between two snapshots in the data structure
t_initial = -initial_delay*fps*data_interval
t_final = final_delay*fps*data_interval
time = np.arange(t_initial,low_p.shape[1]+t_final,data_interval) # time in the data structure

Now for 2), the function to make the frames resets the parts of the plot that change for different time indexes(“t” below is the index in the data). If we don’t do that, the plot will keep the previous plotted elements, and will grow messier at the animation goes on.

# Making frames
def make_frame(t):
    t = int(t)
    if t<0:
        return make_frame(0)
    elif t>nt:
        return make_frame(nt)
    else:
        line_low.set_ydata(low_p[:,t])
        line_med.set_ydata(med_p[:,t])
        line_high.set_ydata(high_p[:, t])
        ax.set_title(' Lake attractors, and dynamics at t=' + str(int(t*dt)), loc='left', x=0.2)
       if t > 0.25*nt:
            alpha = (t-0.25*nt) / (1.5*nt)
            lakeAttBase(eqList, 0.001, alpha=alpha)
            plt.legend(handles=[stable, unstable], loc=2)
        return mplfig_to_npimage(fig) 

In the above mplfig_to_npimage(fig) is a moviepy function that turns a figure into a frame of our GIF. Now we just have to call the function to do frames using the data, and to turn it into a GIF:

# Animating
animation = DataVideoClip(time,make_frame,fps=fps)
animation.write_gif("lake_attractors.gif",fps=fps)

Where the moviepy function DataVideoClip takes as arguments the sequences of indexes defined by the vector “time” defined in the parameters of the animation, the “make_frame” routine we defined, and the number of frame per second we want to output. The last lines integrates each frame to the GIF that is plotted below:

lake_attractors

Each point on the plot represent a different world (different constant input level, different initial phosphorus concentration), and the animation shows how these states converge towards an stable equilibriun point. The nonlinear lake dynamics make the initial concentration important towards knowing if the final concentration is low (lower set of stable equilibria), or if the lake is in a eutrophic state (upper set of stable equilibria).

Part II: plotting trajectories in the 2-D plane

Many trajectories can be plotted at the same time to understand the behavior of attractors, and visualize system dynamics for fixed human-controlled parameters (here, the phosphorus inputs L). Alternatively, if one changes the policy, trajectories evolve depending on both L and P. This redefines how trajectories are defined.

I did a similar bit of code to show how one could plot trajectories in the 2D plane. It is also uploaded on the GitHub repository (under movie2.py), and is similar in its structure to the code above. The definition of the trajectories and where to store them change. We define trajectories where inputs are lowered at a constant rate, with a minimum input of 0.08. For three different initial states, that gives us the following animation that illustrates how the system’s nonlinearity leads to very different trajectories even though the starting positions are close and the management policy, identical:

lake_trajectories

 

This could easily be extended to trajectories in higher dimensional planes, with and without sets of equilibria to guide our eyes.

Animations (1/2)

Animations (1/2)

In the first part of this two-part post we’ll learn to use different tools and techniques to visualize and save animations (second part here). This first part focuses on VisIt, an open-source visualization software developed by the Lawrence Livermore Laboratory – and many others (…that’s an advantage of being open-source). There are a lot of cool things about VisIt, and you can read about that here or here, and probably in future posts. But for today, we’ll just learn how to use VisIt to play and save animations. VisIt supports over a hundred file formats, and this post focuses on reading spatially distributed and time-varying data stored in the netCDF format (more about netCDF here).

Let us assume we have VisIt installed and the VisIt path has been added to our bash profile. Then we just need to cd into the directory where our data are stored, and type visit in the command line to launch the program. VisIt comes with two windows, one for manipulating files and the data stored in them, and another where plots are drawn (it is possible to draw more plots in additional windows, but in this tutorial we’ll stick to one window). Both windows are visible in the screenshot below.

visit

It may be difficult to see, but the left-hand window is the space for managing files and plots. The active “Plot” is also displayed. Here the active file is “air.2m.2000.nc” where nc is th extension for netCDF files. The plot is a “Pseudocolor” of that data, the equivalent of “contourf” in Matlab or Python – Matplotlib. It is plotted in the right-hand window. It is the worldwide distribution of daily air temperature (in Kelvin) with a 1 degree resolution for January 1, 2000 on the left pane (Data from NCEP). Note the toolbar above the plot: it contains “play / stop” buttons that can allow us to play the 365 other days of year 2000 (the cursor is on “stop” in the screenshot). We are going to save the animation of the whole year with the VisIt movie wizard.

But first, let us change the specifications of our plot. In the left-hand window, we go to PlotAtts – > Pseudocolor (recall that is the plot type that we have). We can change the colorbar, set minima and maxima so that the colorbar does not change for each day of the year when we save the movie, and even make the scale skewed so we see more easily the differences in temperatures in non-polar regions where people live (on the image above the temperature in the middle of the colorbar is 254.9 K, which corresponds to -18.1C or roughly -1F). The screenshot below shows our preferences. We can click “Apply” for the changes to take effect, and exit the window.

preferences

Now we can save the movie. We go to File – > Save Movie and follow the VisIt Movie Wizard. You will find it very intuitive and easy. We save a “New Simple Movie” in DVIX (other formarts are available), with 20 frames (i.e. 20 days) per second. The resulting movie is produced within 2 minutes, but since WordPress does not support DIVX unless I am willing to pay for premium access, I had to convert the result as a GIF. To be honest, this is a huge letdown as the quality suffers compared to a video (unless I want to upload a 1GB monster). But the point of this post is that VisIt is really easy to use, not that you should convert its results into GIF.

blogpost

In this animation it is easy to see the outline of land masses as they get either warmer or cooler than the surrounding oceans! Videos would look neater: you can try them for yourself! (at home, wherever you want).

It is actually possible to command VisIt using Python scripts, but I haven’t mastered that yet so it will be a tale for another post.

Introduction to Docker

In this post we’ll learn the principles of Docker, and how to use Docker with large quantities of data in input / output.

1. What is Docker?

Docker is a way to build virtual machines from a file called the Docker file. That virtual machine can be built anywhere with the help of that Docker file, which makes Docker a great way to port models and the architecture that is used to run them (e.g., the Cube: yes, the Cube can be ported in that way, with the right Docker file, even though that is not the topic of this post). Building it creates an image (a file), and a container is a running instance of that image, where one can log on and work. By definition, containers are transient and removing does not affect the image.

2. Basic Docker commands

This part assumes that we already have a working Docker file. A docker file runs a series of instructions to build the container we want to work in.

To build a container for the WBM model from a Docker file, let us go to the folder where the Docker file is and enter:

docker build -t myimage -f Dockerfile .

The call docker build means that we want to run a Docker file; -t means that we name, or “tag” our image, here by giving it the name of “myimage”; -f specifies which Docker file we are using, in case there are several in the current folder, and “.” says that we run the Docker file and build the container in the current folder. Options -t and -f are optional in theory, but the tag -t is very important as it gives a name to your built image. If we don’t do that, we’ll have to go through the whole build every time we want to run a Docker container from the Docker file. This would waste a lot of time.

Once the Docker image is built, we can run it. In other words, have a virtual machine running on the computer / cluster / cloud where we are working. To do that, we enter:

docker run -dit myimage

The three options are as follows: -d means that we do not directly enter the container, and instead have it running in the background, while the call returns the containers hexadecimal ID. -i means that we keep the standard input open. Finally, -t is our tag, which is the name of the docker image (here, “myimage”).

We can now check that the image is running by listing all the running images with:

docker ps

In particular, this lists displays a list of hexadecimal IDs associated to each running image. After that, we can enter the container by typing:

 docker exec -i -t hexadecimalID /bin/bash 

where -i is the same as before, but -t now refers to the hexadecimal ID of the tagged image (that we retrieved with docker ps). The second argument /bin/bash simply sets the directory of the shell in a standard way.

Once in the container, we can run all the processes we want. Once we are ready to exit the container, we can exit it by typing… exit.

Once outside of the container, we can re-enter it as long as it still runs. If we want it to stop running, we use the following command to “kill” it (not my choice of words!):

 docker kill hexadecimalID 

A short cut to calling all these commands in succession is to use the following version of docker run:

 docker run -it myimage /bin/bash 

This command logs us onto the image as if we had typed run and exec at the same time (using the shell /bin/bash). Note that option -d is not used in this call. Also note that upon typing exit, we will not only exit the container, but also kill the running Docker image. This means that we don’t have to retrieve its hexadecimalID to log on to the image, nor to kill it.

Even if the container is not running any more, it can be re-started and re-entered by retrieving its hexadecimal ID. The docker ps command only lists running containers, so to list all the containers, including those that are no longer running, we type:

 docker ps -a

We can then restart and re-enter the container with the following commands:


docker restart hexadecimalID

docker exec -it hexadecimalID /bin/bash

Note the absence of options for docker restart. Once we are truly done with a container, it can be removed from the lists of previously running containers by using:

 docker rm hexadecimalID 

Note that you can only remove a container that is not running.

3. Working with large input / output data sets.

Building large quantities of data directly into the container when calling docker build has three major drawbacks. First, building the docker image will take much more time because we will need to transfer all that data every time we call docker build. This will waste a lot of time if we are tinkering with the structure of our container and are running the Docker file several times. Second, every container will take up a lot of space on the disk, which can prove problematic if we are not careful and have many containers for the same image (it is so easy to run new containers!). Third, output data will be generated within the container and will need to be copied to another place while still in the container.

An elegant workaround is to “mount” input and output directories to the container, by calling these folders with the -v option as we use the docker run command:

 docker run -it -v path/to/inputs -v path/to/outputs myimage /bin/bash 

or

 docker run -dit -v path/to/inputs -v path/to/outputs myimage 

The -v option is abbreviation for “volume”. This way, the inputs and outputs directories (set on the same host as the container) are used directly by the Docker image. If new outputs are produced, they can be added directly to the mounted output directory, and that data will be kept in that directory when exiting / killing the container. It is also worth noting that we don’t need to call -v again if we restart the container after killing it.

A side issue with Docker is how to manage user permissions on the outputs a container produces, but 1) that issue arises whether or not we use the -v option, and 2) this is a tale for another post.

Acknowledgements: thanks to Julie Quinn and Bernardo Trindade from this research group, who started exploring Docker right before me, making it that much easier for me to get started. Thanks also to the Cornell-based IT support of the Aristotle cloud, Bennet Wineholt and Brandon Baker.