# Map making in Matlab

Greetings,

This weeks post will cover the basics of generating maps in Matlab.  Julie’s recent post showed how to do some of this in Python, but, Matlab is also widely used by the community.  You can get a lot done with Matlab, but in this post we’ll just cover a few of the basics.

We’ll start off by plotting a map of the continental United States, with the states.  We used three  this with three commands: usamap, shaperead, and geoshow.  usamap creates an empty map axes having the Lambert Projection covering the area of the US, or any state or collection of states.  shaperead reads shapefiles (duh) and returns a Matlab geographic data structure, composed of both geographic data and attributes.  This Matlab data structure then interfaces really well with various Matlab functions (duh).  Finally, geoshow plots geographic data, in our case on the map axes we defined.  Here’s some code putting it all together.

```hold on
figure1 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
tightmap
hold off
```

Note that ‘usastatehi’ is a shapefile containing the US states (duh) that’s distributed with Matlab. The above code generates this figure:

Now, suppose we wanted to plot some data, say a precipitation forecast, on our CONUS map.  Let’s assume our forecast is being made at many points (lat,long).  To interpolate between the points for plotting we’ll use Matlab’s griddata function.  Once we’ve done this, we use the Matlab’s contourm command.  This works exactly like the normal contour function, but the ‘m’ indicates it plots map data.

```xi = min(x):0.5:max(x);
yi = min(y):0.5:max(y);
[XI, YI] = meshgrid(xi,yi);
ZI = griddata(x,y,V,XI,YI);

hold on
figure2 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])

contourm(YI,-1*XI,ZI)
tightmap
hold off
```

Here x, y, and V are vectors of long, lat, and foretasted precipitation respectively.  This code generates the following figure:

Wow!  Louisiana is really getting hammered!  Let’s take a closer look.  We can do this by changing the entry to usamap to indicate we want to consider only Louisiana.  Note, usamap accepts US postal code abbreviations.

```ax = usamap('LA');
```

Making that change results in this figure:

Neat!  We can also look at two states and add annotations.  Suppose, for no reason in particular, you’re interested in the location of Tufts University relative to Cornell.  We can make a map to look at this with the textm and scatterm functions.  As before, the ‘m’ indicates the functions  plot on a map axes.

```hold on
figure4 = figure;
ax = usamap({'MA','NY'});

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
scatterm(42.4075,-71.1190,100,'k','filled')
textm(42.4075+0.2,-71.1190+0.2,'Tufts','FontSize',30)

scatterm(42.4491,-76.4842,100,'k','filled')
textm(42.4491+0.2,-76.4842+0.2,'Cornell','FontSize',30)
tightmap
hold off
```

This code generates the following figure.

Cool! Now back to forecasts.  NOAA distributes short term Quantitative Precipitation Forecasts (QPFs) for different durations every six hours.  You can download these forecasts in the form of shapefiles from a NOAA server.  Here’s an example of a 24-hour rainfall forecast made at 8:22 AM UTC on April 29.

Wow, that’s a lot of rain!  Can we plot our own version of this map using Matlab!  You bet!  Again we’ll use usamap, shaperead, and geoshow.  The for loop, (0,1) scaling, and log transform are simply to make the color map more visually appealing for the post.  There’s probably a cleaner way to do this, but this got the job done!

```figure5 = figure;
ax = usamap('conus');

set(ax, 'Visible', 'off')
latlim = getm(ax, 'MapLatLimit');
lonlim = getm(ax, 'MapLonLimit');
'UseGeoCoords', true, 'BoundingBox', [lonlim', latlim']);
geoshow(ax, states, 'FaceColor', [0.5 0.5 0.5])
p = colormap(jet);

N = max(size(S));
d = zeros(N,1);
for i = 1:N
d(i) = log(S(i).QPF);
end

y=floor(((d-min(d))/range(d))*63)+1;
col = p(y,:);
for i = 1:N
geoshow(S(i),'FaceColor',col(i,:),'FaceAlpha',0.5)%,'SymbolSpec', faceColors)
end
```

This code generates the following figure:

If you are not plotting in the US, Matlab also has a worldmap command.  This works exactly the same as usamap, but now for the world (duh).  Matlab is distibuted with a shapefile ‘landareas.shp’ which contains all of the land areas in the world (duh).  Generating a global map is then trivial:

```figure6 = figure;

worldmap('World')
geoshow(land, 'FaceColor', [0.15 0.5 0.15])
```

Which generates this figure.

Matlab also comes with a number of other included that might be of interest.  For instance, shapefiles detailing the locations of major world cities, lakes, and rivers.  We can plot those with the following code:

```figure7 = figure;

worldmap('World')
geoshow(land, 'FaceColor', [0.15 0.5 0.15])
geoshow(lakes, 'FaceColor', 'blue')
geoshow(rivers, 'Color', 'blue')
geoshow(cities, 'Marker', '.', 'Color', 'red')
```

Which generates the figure:

But suppose we’re interested in one country or a group of countries.  worldmap works in the same usamap does.  Also, you can plot continents, for instance Europe.

```worldmap('Europe')
```

Those are the basics, but there are many other capabilities, including 3-D projections. I can cover this in a later post if there is interest.

That’s it for now!

# Solving non-linear problems using linear programming

This week’s post comes from recent conversations we’ve had around the Reed group concerning tools to quickly solve (approximately) non-linear programming problems.  First, some context.

As part of a simulation model our group is building, a drinking water allocation sub-problem must be solved.   Figure 1 is a simplified example of the sort of problem we are solving.

Figure 1: Mock water distribution network

There are three utilities that each have a demand ($d_{1}$, $d_{2}$, and $d_{3}$). The utilities are connected via some infrastructure, as shown in Figure 1.  When our total available water ($R$) is in excess of the demand ($d_{1}+d_{2}+d_{3}$), no rationing is needed.  When we do need to ration, we want to allocate the water to minimize the percent supply deficits across the three utilities:

Equation 1

Subject to:

The last constraint here describes the a limitation of the distribution network.  The real problem is much more complicated, but we needn’t detail that here.

This problem needs to be solved thousands, or hundreds of thousands of times in each simulation, so we want any solution technique to be fast.  The natural solution is linear programming (LP), which can solve problems with tens of thousands of variables and constraints nearly instantaneously.

We won’t discuss LP in great detail here, except to say that LP requires an objective and constraints that are linear with respect to the decision variables.  These restrictive requirements significantly reduce the number of potential optimal solutions that must be searched.  By systematically testing and pivoting between these potential optimal solutions, the popular Simplex Algorithm quickly converges to the optimal solution.

As stated in equation 1, our rationing scheme is indifferent to imposing small deficits across all three utilities, or imposing one large deficit to a single utility.  For example, the objective value in equation 1 is the same, whether each utility has a deficit of 5%, or if utility 1 has a deficit of 15%, and utilities 2 and 3 have no deficit.  In reality, many small deficits are likely preferable to one large one.  So what are we to do?

We could square our deficits.  In that case, our rationing scheme will prefer small distributed deficits over one large deficit:

Equation 2

BUT, we can’t use LP to solve this problem, as our objective is now non-linear! There are non-linear programming algorithms that are relatively fast, but perhaps not fast enough.  Instead we could linearize our non-linear objective, as shown in Figure 2.

The strategy here is to divide a single allocation,  $x_{1}$ for instance, into many decision variables, representing different ranges of the actual allocation $x_{1}$.  In each range, a linear segment approximates the actual quadratic objective function.  Any actual release $x_{1}$ can be achieved by assigning the appropriate values to the new decision variables ($k_{1}$, $k_{2}$, and $k_{3}$), and the contribution to the objective function from that release can be approximated by:

Equation 3

Subject to:

If a more accurate description is needed, the range of $x_{1}$ can be divided into more segments.  For our purposes just a few segments are probably sufficient.  A similar strategy can be adopted for $x_{2}$ and $x_{3}$.  Of course the constraints from the original optimization problem would need to be translated into terms of the new decision variables.

Now we are adding many more decision variables and constraints, but this is unlikely to slow a modern LP algorithm too much; we are still solving a relatively simple problem.  BUT, how does the LP algorithm know to increase $k_{1}$ to its maximum threshold before applying $k_{2}$?  Do we need to add a number of conditional constraints to ensure this is done properly?

It turns out we don’t!  Because our squared deficit curve in Figure 2 is monotonic and convex, we know that slope of the linear segments making up the approximation are increasing (becoming less negative).  Thus, in a minimization problem, the marginal improvement in the objective is highest for the $k_{1}$ segment, followed by the $k_{2}$ segment, followed by the $k_{3}$ segment, and so on.  In other words $a < b < c$.For this reason, the algorithm will increase $k_{1}$ to its maximum threshold before assigning a non-zero value to $k_{2}$, and so forth.  No need for complicated constraints!

Now this is not always the case.  If the function were not monotonic, or if it were convex for a maximization, or concave for a minimization, this would not work.  But, this trick works for a surprising number of applications in water resources systems analysis!

If nothing else this simple example serves as a reminder that a little bit of thought in formulating problems can save a lot of time later!

# Synthetic streamflow generation

A recent research focus of our group has been the development and use of synthetic streamflow generators.  There are many tools one might use to generate synthetic streamflows and it may not be obvious which is right for a specific application or what the inherent limitations of each method are.  More fundamentally, it may not be obvious why it is desirable to generate synthetic streamflows in the first place.  This will be the first in a series of blog posts on the synthetic streamflow generators in which I hope to sketch out the various categories of generation methods and their appropriate use as I see it.  In this first post we’ll focus on the motivation and history behind the development of synthetic streamflow generators and broadly categorize them.

### Why should we use synthetic hydrology?

The most obvious reason to use synthetic hydrology is if there is little or no data for your system (see Lamontagne, 2015).  Another obvious reason is if you are trying to evaluate the effect of hydrologic non-stationarity on your system (Herman et al. 2015; Borgomeo et al. 2015).  In that case you could use synthetic methods to generate flows reflecting a shift in hydrologic regime.  But are there other reasons to use synthetic hydrology?

In water resources systems analysis it is common practice to evaluate the efficacy of management or planning strategies by simulating system performance over the historical record, or over some critical period.  In this approach, new strategies are evaluated by asking the question:  How well would we have done with your new strategy?

This may be an appealing approach, especially if some event was particularly traumatic to your system. But is this a robust way of evaluating alternative strategies?  It’s important to remember that any hydrologic record, no matter how long, is only a single realization of a stochastic process.  Importantly, drought and flood events emerge as the result of specific sequences of events, unlikely to be repeated.  Furthermore, there is a 50% chance that the worst flood or drought in an N year record will be exceeded in the next N years.  Is it well advised to tailor our strategies to past circumstances that will likely never be repeated and will as likely as not be exceeded?  As Lettenmaier et al. [1987] reminds us “Little is certain about the future except that it will be unlike the past.”

Even under stationarity and even with long hydrologic records, the use of synthetic streamflow can improve the efficacy of planning and management strategies by exposing them to larger and more diverse flood and drought than those in the record (Loucks et al. 1981; Vogel and Stedinger, 1988; Loucks et al. 2005).  Figure 7.12 from Loucks et al. 2005 shows a typical experimental set-up using synthetic hydrology with a simulation model.  Often our group will wrap an optimization model like Borg around this set up, where the system design/operating policy (bottom of the figure) are the decision variables, and the system performance (right of the figure) are the objective(s).

(Loucks et al. 2005)

### What are the types of generators?

Many synthetic streamflow generation techniques have been proposed since the early 1960s.  It can be difficult for a researcher or practitioner to know which method is best suited to the problem at hand.  Thus, we’ll start with a very broad characterization of what is out there, then proceed to some history.

Broadly speaking there are two approaches to generating synthetic hydrology: indirect and direct.  The indirect approach generates streamflow by synthetically generating the forcings to a hydrologic model.  For instance one might generate precipitation and temperature series and input them to a hydrologic model of a basin (e.g. Steinschneider et al. 2014).  In contrast, direct methods use statistical techniques to generate streamflow timeseries directly.

The direct approach is generally easier to apply and more parsimonious because it does not require the selection, calibration, and validation of a separate hydrologic model (Najafi et al. 2011).  On the other hand, the indirect approach may be desirable.  Climate projections from GCMs often include temperature or precipitation changes, but may not describe hydrologic shifts at a resolution or precision that is useful.  In other cases, profound regime shifts may be difficult to represent with statistical models and may require process-driven models, thus necessitating the indirect approach.

Julie’s earlier series focused on indirect approaches, so we’ll focus on the direct approach.  Regardless of the approach many of the methods are same.  In general generator methods can be divided into two categories: parametric and non-parametricParametric methods rely on a hypothesized statistical model of streamflow whose parameters are selected to achieve a desired result (Stedinger and Taylor, 1982a).  In contrast non-parametric methods do not make strong structural assumptions about the processes generating the streamflow, but rather rely on re-sampling from the hydrologic record in some way (Lall, 1995).  Some methods combine parametric and non-parametric techniques, which we’ll refer to as semi-parametric (Herman et al. 2015).

Both parametric and non-parametric methods have advantages and disadvantages.  Parametric methods are often parsimonious, and often have analytical forms that allow easy parameter manipulation to reflect non-stationarity.  However, there can be concern that the underlying statistical models may not reflect the hydrologic reality well (Sharma et al. 1997).  Furthermore, in multi-dimensional, multi-scale problems the proliferation of parameters can make parametric models intractable (Grygier and Stedinger, 1988).  Extensive work has been done to confront both challenges, but they may lead a researcher to adopt a non-parametric method instead.

Because many non-parametric methods ‘re-sample’ flows from a record, realism is not generally a concern, and most re-sampling schemes are computationally straight forward (relatively speaking).  On the other hand, manipulating synthetic flows to reflect non-stationarity may not be as straightforward as a simple parameter change, though methods have been suggested (Herman et al. 2015Borgomeo et al. 2015).  More fundamentally, because non-parametric methods rely so heavily on the data, they require sufficiently long records to ensure there is enough hydrologic variability to sample.  Short records can be a concern for parametric methods as well, though parametric uncertainty can be explicitly considered in those methods (Stedinger and Taylor, 1982b).  Of course, parametric methods also have structural uncertainty that non-parametric models largely avoid by not assuming an explicit statistical model.

In the coming posts we’ll dig into the nuances of the different methods in greater detail.

### A historical perspective

The first use of synthetic flow generation seems to have been by Hazen [1914].  That work attempted to quantify the reliability of a water supply by aggregating the streamflow records of local streams into a 300-year ‘synthetic record.’  Of course the problem with this is that the cross-correlation between concurrent flows rendered the effective record length much less than the nominal 300 years.

Next Barnes [1954] generated 1,000 years of streamflow for a basin in Australia by drawing random flows from a normal distribution with mean and variance equal to the sample estimates from the observed record.  That work was extended by researchers from the Harvard Water Program to account for autocorrelation of monthly flows (Maass et al., 1962; Thomas and Fiering, 1962).  Later work also considered the use of non-normal distributions (Fiering, 1967), and the generation of correlated concurrent flows at multiple sites (Beard, 1965; Matalas, 1967).

Those early methods relied on first-order autoregressive models that regressed flows in the current period on the flows of the previous period (see Loucks et al.’s Figure 7.13  below).  Box and Jenkins [1970] extended those methods to autoregressive models of arbitrary order, moving average models of arbitrary order, and autoregressive-moving average models of arbitrary order.  Those models were the focus of extensive research over the course of the 1970s and 1980s and underpin many of the parametric generators that are widely used in hydrology today (see Salas et al. 1980; Grygier and Stedinger, 1990; Salas, 1993; Loucks et al. 2005).

(Loucks et al. 2005)

By the mid-1990s, non-parametric methods began to gain popularity (Lall, 1995).  While much of this work has its roots in earlier work from the 1970s and 1980s (Yakowitz, 1973, 1979, 1985; Schuster and Yakowitz, 1979; Yakowitz and Karlsson, 1987; Karlson and Yakowitz, 1987), improvements in computing and the availability of large data sets meant that by the mid-1990s non-parametric methods were feasible (Lall and Sharma, 1996).  Early examples of non-parametric methods include block bootstrapping (Vogel and Shallcross, 1996), k-nearest neighbor (Lall and Sharma, 1996), and kernel density methods (Sharma et al. 1997).  Since that time extensive research has made improvement to these methods, often by incorporating parametric elements.  For instance, Srinivas and Srinivasan (2001, 2005, and 2006) develop a hybrid autoregressive-block bootstrapping method designed to improve the bias in lagged correlation and to generate flows other than the historical, for multiple sites and multiple seasons.  K-nearest neighbor methods have also been the focus of extensive research (Rajagopalan and Lall, 1999; Harrold et al. 2003; Yates et al. 2003; Sharif and Burn, 2007; Mehortra and Sharma, 2006; Prairie et al. 2006; Lee et al. 2010, Salas and Lee, 2010, Nowak et al., 2010), including recent work by our group  (Giuliani et al. 2014).

Emerging work focuses on stochastic streamflow generation using copulas [Lee and Salas, 2011; Fan et al. 2016], entropy theory bootstrapping [Srivastav and Simonovic, 2014], and wavelets [Kwon et al. 2007; Erkyihun et al., 2016], among other methods.

In the following posts I’ll address different challenges in stochastic generation [e.g. long-term persistence, parametric uncertainty, multi-site generation, seasonality, etc.] and the relative strengths and shortcomings of the various methods for addressing them.

### Works Cited

Barnes, F. B., Storage required for a city water supply, J. Inst. Eng. Australia 26(9), 198-203, 1954.

Beard, L. R., Use of interrelated records to simulate streamflow, J. Hydrol. Div., ASCE 91(HY5), 13-22, 1965.

Borgomeo, E., Farmer, C. L., and Hall, J. W. (2015). “Numerical rivers: A synthetic streamflow generator for water resources vulnerability assessments.” Water Resour. Res., 51(7), 5382–5405.

Y.R. Fan, W.W. Huang, G.H. Huang, Y.P. Li, K. Huang, Z. Li, Hydrologic risk analysis in the Yangtze River basin through coupling Gaussian mixtures into copulas, Advances in Water Resources, Volume 88, February 2016, Pages 170-185.

Fiering, M.B, Streamflow Synthesis, Harvard University Press, Cambridge, Mass., 1967.

Giuliani, M., J. D. Herman, A. Castelletti, and P. Reed (2014), Many-objective reservoir policy identification and refinement to reduce policy inertia and myopia in water management, Water Resour. Res., 50, 3355–3377, doi:10.1002/2013WR014700.

Grygier, J.C., and J.R. Stedinger, Condensed Disaggregation Procedures and Conservation Corrections for Stochastic Hydrology, Water Resour. Res. 24(10), 1574-1584, 1988.

Grygier, J.C., and J.R. Stedinger, SPIGOT Technical Description, Version 2.6, 1990.

Harrold, T. I., Sharma, A., and Sheather, S. J. (2003). “A nonparametric model for stochastic generation of daily rainfall amounts.” Water Resour. Res., 39(12), 1343.

Hazen, A., Storage to be provided in impounding reservoirs for municipal water systems, Trans. Am. Soc. Civ. Eng. 77, 1539, 1914.

Herman, J.D., H.B. Zeff, J.R. Lamontagne, P.M. Reed, and G. Characklis (2016), Synthetic Drought Scenario Generation to Support Bottom-Up Water Supply Vulnerability Assessments, Journal of Water Resources Planning & Management, doi: 10.1061/(ASCE)WR.1943-5452.0000701.

Karlsson, M., and S. Yakowitz, Nearest-Neighbor methods for nonparametric rainfall-runoff forecasting, Water Resour. Res., 23, 1300-1308, 1987.

Kwon, H.-H., U. Lall, and A. F. Khalil (2007), Stochastic simulation model for nonstationary time series using an autoregressive wavelet decomposition: Applications to rainfall and temperature, Water Resour. Res., 43, W05407, doi:10.1029/2006WR005258.

Lall, U., Recent advances in nonparametric function estimation: Hydraulic applications, U.S. Natl. Rep. Int. Union Geod. Geophys. 1991- 1994, Rev. Geophys., 33, 1093, 1995.

Lall, U., and A. Sharma (1996), A nearest neighbor bootstrap for resampling hydrologic time series, Water Resour. Res. 32(3), pp. 679-693.

Lamontagne, J.R. 2015,Representation of Uncertainty and Corridor DP for Hydropower 272 Optimization, PhD edn, Cornell University, Ithaca, NY.

Lee, T., J. D. Salas, and J. Prairie (2010), An enhanced nonparametric streamflow disaggregation model with genetic algorithm, Water Resour. Res., 46, W08545, doi:10.1029/2009WR007761.

Lee, T., and J. Salas (2011), Copula-based stochastic simulation of hydrological data applied to Nile River flows, Hydrol. Res., 42(4), 318–330.

Lettenmaier, D. P., K. M. Latham, R. N. Palmer, J. R. Lund and S. J. Burges, Strategies for coping with drought Part II: Planning techniques for planning and reliability assessment, EPRI P-5201, Final Report Project 2194-1, June 1987.

Loucks, D.P., Stedinger, J.R. & Haith, D.A. 1981, Water Resources Systems Planning and Analysis, 1st edn, Prentice-Hall, Englewood Cliffs, N.J.

Loucks, D.P. et al. 2005, Water Resources Systems Planning and Management: An Introduction to Methods, Models and Applications, UNESCO, Delft, The Netherlands.

Maass, A., M. M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair,

Design of Water Resource Systems, Harvard University Press, Cambridge, Mass., 1962.

Matalas, N. C., Mathematical assessment of synthetic hydrology, Water Resour. Res. 3(4), 937-945, 1967.

Najafi, M. R., Moradkhani, H., and Jung, I. W. (2011). “Assessing the uncertainties of hydrologic model selection in climate change impact studies.” Hydrol. Process., 25(18), 2814–2826.

Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily

streamﬂow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530.

Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily

streamﬂow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530.

Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily

streamﬂow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530.

Nowak, K., J. Prairie, B. Rajagopalan, and U. Lall (2010), A nonparametric stochastic approach for multisite disaggregation of annual to daily streamflow, Water Resour. Res., 46, W08529, doi:10.1029/2009WR008530.

Prairie, J. R., Rajagopalan, B., Fulp, T. J., and Zagona, E. A. (2006). “Modified K-NN model for stochastic streamflow simulation.” J. Hydrol. Eng., 11(4), 371–378.

Rajagopalan, B., and Lall, U. (1999). “A k-nearest-neighbor simulator for daily precipitation and other weather variables.” Water Resour. Res., 35(10), 3089–3101.

Salas, J. D., J. W. Deller, V. Yevjevich and W. L. Lane, Applied Modeling of Hydrologic Time Series, Water Resources Publications, Littleton, Colo., 1980.

Salas, J.D., 1993, Analysis and Modeling of Hydrologic Time Series, Chapter 19 (72 p.) in The McGraw Hill Handbook of Hydrology, D.R. Maidment, Editor.

Salas, J.D., T. Lee. (2010). Nonparametric Simulation of Single-Site Seasonal Streamflow, J. Hydrol. Eng., 15(4), 284-296.

Schuster, E., and S. Yakowitz, Contributions to the theory of nonparametric regression, with application to system identification, Ann. Stat., 7, 139-149, 1979.

Sharif, M., and Burn, D. H. (2007). “Improved K-nearest neighbor weather generating model.” J. Hydrol. Eng., 12(1), 42–51.

Sharma, A., Tarboton, D. G., and Lall, U., 1997. “Streamflow simulation: A nonparametric approach.” Water Resour. Res., 33(2), 291–308.

Srinivas, V. V., and Srinivasan, K. (2001). “A hybrid stochastic model for multiseason streamflow simulation.” Water Resour. Res., 37(10), 2537–2549.

Srinivas, V. V., and Srinivasan, K. (2005). “Hybrid moving block bootstrap for stochastic simulation of multi-site multi-season streamflows.” J. Hydrol., 302(1–4), 307–330.

Srinivas, V. V., and Srinivasan, K. (2006). “Hybrid matched-block bootstrap for stochastic simulation of multiseason streamflows.” J. Hydrol., 329(1–2), 1–15.

Roshan K. Srivastav, Slobodan P. Simonovic, An analytical procedure for multi-site, multi-season streamflow generation using maximum entropy bootstrapping, Environmental Modelling & Software, Volume 59, September 2014a, Pages 59-75.

Stedinger, J. R. and M. R. Taylor, Sythetic streamflow generation, Part 1. Model verification and validation, Water Resour. Res. 18(4), 909-918, 1982a.

Stedinger, J. R. and M. R. Taylor, Sythetic streamflow generation, Part 2. Parameter uncertainty,Water Resour. Res. 18(4), 919-924, 1982b.

Steinschneider, S., Wi, S., and Brown, C. (2014). “The integrated effects of climate and hydrologic uncertainty on future flood risk assessments.” Hydrol. Process., 29(12), 2823–2839.

Thomas, H. A. and M. B. Fiering, Mathematical synthesis of streamflow sequences for the analysis of river basins by simulation, in Design of Water Resource Systems, by A. Maass, M. Hufschmidt, R. Dorfman, H. A. Thomas, Jr., S. A. Marglin and G. M. Fair, Harvard University Press, Cambridge, Mass., 1962.

Vogel, R.M., and J.R. Stedinger, The value of stochastic streamflow models in over-year reservoir design applications, Water Resour. Res. 24(9), 1483-90, 1988.

Vogel, R. M., and A. L. Shallcross (1996), The moving block bootstrap versus parametric time series models, Water Resour. Res., 32(6), 1875–1882.

Yakowitz, S., A stochastic model for daily river flows in an arid region, Water Resour. Res., 9, 1271-1285, 1973.

Yakowitz, S., Nonparametric estimation of markov transition functions, Ann. Stat., 7, 671-679, 1979.

Yakowitz, S. J., Nonparametric density estimation, prediction, and regression for markov sequences J. Am. Stat. Assoc., 80, 215-221, 1985.

Yakowitz, S., and M. Karlsson, Nearest-neighbor methods with application to rainfall/runoff prediction, in Stochastic  Hydrology, edited by J. B. Macneil and G. J. Humphries, pp. 149-160, D. Reidel, Norwell, Mass., 1987.

Yates, D., Gangopadhyay, S., Rajagopalan, B., and Strzepek, K. (2003). “A technique for generating regional climate scenarios using a nearest-neighbor algorithm.” Water Resour. Res., 39(7), 1199.

# Plotting geographic data from geojson files using Python

Hi folks,

I’m writing today about plotting geojson files with Matplotlib’s Basemap.  In a previous post I laid out how to plot shapefiles using Basemap.

geojson is an open file format for representing geographical data based on java script notation.  They are composed of points, lines, and polygons or ‘multiple’ (e.g. multipolygons composed of several polygons), with accompanying properties.  The basic structure is one of names and vales, where names are always strings and values may be strings, objects, arrays, or logical literal.

The geojson structure we will be considering here is a collection of features, where each feature contains a geometry and properties.  Each geojson feature must contain properties and geometry.  Properties could be things like country name, country code, state, etc.  The geometry must contain a type (point, line, polygons, etc.) and coordinates (likely an array of lat-long). Below is an excerpt of a geojson file specifying Agro-Ecological Zones (AEZs) within the various GCAM regions.

```{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },

"features": [
{ "type": "Feature", "id": 1, "properties": { "ID": 1.000000, "GRIDCODE": 11913.000000, "CTRYCODE": 119.000000, "CTRYNAME": "Russian Fed", "AEZ": 13.000000, "GCAM_ID": "Russian Fed-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ 99.5, 78.5 ], [ 98.33203125, 78.735787391662598 ], [ 98.85723876953125, 79.66796875 ], [ 99.901641845703125, 79.308036804199219 ], [ 99.5, 78.5 ] ] ] ] } },
{ "type": "Feature", "id": 2, "properties": { "ID": 2.000000, "GRIDCODE": 11913.000000, "CTRYCODE": 119.000000, "CTRYNAME": "Russian Fed", "AEZ": 13.000000, "GCAM_ID": "Russian Fed-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ 104.5, 78.0 ], [ 104.0, 78.0 ], [ 99.5, 78.0 ], [ 99.5, 78.5 ], [ 100.2957763671875, 78.704218864440918 ], [ 102.13778686523437, 79.477890968322754 ], [ 104.83050537109375, 78.786871910095215 ], [ 104.5, 78.0 ] ] ] ] } },
{ "type": "Feature", "id": 3, "properties": { "ID": 3.000000, "GRIDCODE": 2713.000000, "CTRYCODE": 27.000000, "CTRYNAME": "Canada", "AEZ": 13.000000, "GCAM_ID": "Canada-13" }, "geometry": { "type": "MultiPolygon", "coordinates": [ [ [ [ -99.5, 77.5 ], [ -100.50860595703125, 77.896504402160645 ], [ -101.76053619384766, 77.711499214172363 ], [ -104.68202209472656, 78.563323974609375 ], [ -105.71781158447266, 79.692866325378418 ], [ -99.067413330078125, 78.600395202636719 ], [ -99.5, 77.5 ] ] ] ] } }
}
```

Now that we have some understanding of the geojson structure, plotting the information therein should be as straightforward as traversing that structure and tying geometries to data.  We do the former using the geojson python package and the latter using pretty basic python manipulation.  To do the actual plotting, we’ll use PolygonPatches from the descartes library and recycle most of the code from my previous post.

We start by importing the necessary libraries and then open the geojson file.

```import geojson
from descartes import PolygonPatch
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np

with open("aez-w-greenland.geojson") as json_file:
```

We then define a MatplotLib Figure, and generate a Basemap object as a ‘canvas’ to draw the geojson geometries on.

```plt.clf()

m = Basemap(projection='robin', lon_0=0,resolution='c')
m.drawmapboundary(fill_color='white', zorder=-1)
m.drawparallels(np.arange(-90.,91.,30.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)
m.drawmeridians(np.arange(0., 360., 60.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)
m.drawcoastlines(color='0.6', linewidth=1)
```

Next, we iterate over the nested features in this file and pull out the coordinate list defining each feature’s geometry (line 2).  In lines 4-5 we also pull out the feature’s name and AEZ, which I can tie to GCAM data.

```for i in range(2799):
coordlist = json_data.features[i]['geometry']['coordinates'][0]
if i < 2796:
name = json_data.features[i]['properties']['CTRYNAME']
aez =  json_data.features[i]['properties']['AEZ']

for j in range(len(coordlist)):
for k in range(len(coordlist[j])):
coordlist[j][k][0],coordlist[j][k][1]=m(coordlist[j][k][0],coordlist[j][k][1])

poly = {"type":"Polygon","coordinates":coordlist}#coordlist

ax.axis('scaled')
plt.draw()
plt.show()
```

Line 9 is used to convert the coordinate list from lat/long units to meters.  Depending on what projection you’re working in and what units your inputs are in, you may or may not need to do this step.

The final lines are used to add the polygon to the figure, and to make the face color of each polygon green and the border dark green. Which generates the figure:

To get a bit more fancy, we could tie the data to a colormap and then link that to the facecolor of the polygons.  For instance, the following figure shows the improvement in maize yields over the next century in the shared socio-economic pathway 1 (SSP 1), relative to a reference scenario (SSP 2).

# Python’s template class

Many analyses require the same model to be run many times, but with different inputs.  For instance a Sobol sensitivity analysis requires thousands (or millions) of model runs corresponding to some strategic sampling of the factor space.  Depending on how complicated your model is, facilitating hundreds or thousands of runs may or may not be straightforward.  Some models require a unique configuration file, so performing a Sobol analysis is not as simple as changing a vector of numbers passed to an executable.

A very simple solution suggested by Jon Herman in an earlier post is to use Python string templates.  It is such a handy tool, I thought it deserved its own post.  We’ll use Python’s string module’s Template class.  The Template class has two methods: substitute and safe_substitute.  The difference between substitute and safe_substitute is that substitute will throw an exception if there is some problem filling the template, where as safe_substitute will not.  These two methods work essentially as standard \$-based substitutions in Python, but rather than altering a single string, we can alter an entire document, and can then save it with a unique name.

Let’s consider a simple example first where we modify a single string:

```from string import Template
s = Template('\$who is from \$where')

d = {}
d['who'] = 'Bill'
d['where'] = 'Boston'

p = s.substitute(d)
```

Which returns the string Bill is from Boston…lucky Bill.  Now we can get a bit fancier with lists of people and places:

```from string import Template
s = Template('\$who is from \$where')

people = ['Bill','Jim','Jack']
places = ['Boston','London','LA']

p = {}
cnt = int(0)
for person in people:
for place in places:
d = {}
d['who'] = person
d['where'] = place
p[cnt] = s.substitute(d)
cnt = cnt+1
```

Which returns a p as a dictionary of every combination of people and places:

Bill is from Boston
Bill is from London
Bill is from LA
Jim is from Boston
Jim is from London
Jim is from LA
Jack is from Boston
Jack is from London
Jack is from LA

Of course this is a silly example, but this sort of exercise proved really useful for some recent factorial experiments where we wanted to test a model performance for every combination of input factors (specified by filename strings).

Getting a bit more complex, let’s consider a long configuration file needed to run your model.  For example GCAM, an integrated assessment model I’ve previously discussed, uses a configuration xml file that’s about 100 lines long. We’ll consider a pared down version:

```<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
<Files>
<Value name="xmlInputFileName">../input/gcam-data-system/xml/modeltime-xml/modeltime.xml</Value>
<Value name="BatchFileName">batch_ag.xml</Value>
<Value name="policy-target-file">../input/policy/forcing_target_4p5.xml</Value>
<Value name="xmldb-location">../output/database_basexdb</Value>
</Files>
<ScenarioComponents>
<Value name = "climate">../input/climate/magicc.xml</Value>
</ScenarioComponents>
</Configuration>
```

Now, suppose we want to vary the cost of solar power inside the model over a number of levels, and we want each model run to print to a unique output directory.  Our first step is to make a template xml file with a \$-place holder where we want to vary the configuration file:

```<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
<Files>
<Value name="xmlInputFileName">../input/gcam-data-system/xml/modeltime-xml/modeltime.xml</Value>
<Value name="BatchFileName">batch_ag.xml</Value>
<Value name="policy-target-file">../input/policy/forcing_target_4p5.xml</Value>
<Value name="xmldb-location">../output/database_basexdb_\$RN&</Value>
</Files>
<ScenarioComponents>
<Value name = "climate">../input/climate/magicc.xml</Value>
<!-- SOLAR -->
\$SOLAR
</ScenarioComponents>
</Configuration>
```

We can utilize the template xml file using Python’s template class as follows:

```with open(template_name,'r') as T:
SOLAR_1 = ['<Value name="solar">../input/gcam-data-system/xml/energy-xml/solar_low.xml</Value>']
SOLAR_2 = ['']
SOLAR = [SOLAR_1,SOLAR_2,SOLAR_3]
for i in range(3):
d = {}
d['SOLAR']=SOLAR[i]
d['RN']=str(i)
S1 = template.safe_substitute(d)
with open('./configuration_' + str(i) + '.xml','w') as f1:
f1.write(S1)
```

Here we are looping over experimental particles, defined by a unique setting of the solar power level in our experimental design.  For each particle a GCAM, the solar level and run number are substituted in (see S1), and S1 is written to a unique XML file.  If we open configuration_0.xml we get see that the substitution has worked!

```<?xml version="1.0" encoding="UTF-8">
<Configuration>
<Files>
<Value name="xmlInputFileName">../input/gcam-data-system/xml/modeltime-xml/modeltime.xml</Value>
<Value name="BatchFileName">batch_ag.xml</Value>
<Value name="policy-target-file">../input/policy/forcing_target_4p5.xml</Value>
<Value name="xmldb-location">../output/database_basexdb_0</Value>
</Files>
<ScenarioComponents>
<Value name = "climate">../input/climate/magicc.xml</Value>
<!-- SOLAR -->
<Value name="solar">../input/gcam-data-system/xml/energy-xml/solar_low.xml</Value>
</ScenarioComponents>
</Configuration>
```

Of course this is a very simple example, but it has proven incredibly useful in our ongoing work.

That’s all for now!

# Making Movies of Time-Evolving Global Maps with Python

Hi All,

These past few months I’ve been working with the Global Change Assessment Model (GCAM) which is an integrated assessment model (IAM) that combines models of the global climate and economic systems. I wrote an earlier post on compiling GCAM on a Unix cluster.  This post discusses some visualization tools I’ve developed for GCAM output.

GCAM models energy and agriculture systems at a regional level, where the world is composed of 32 regions.  We’re interested in tracking statistics (like the policy cost of stabilization) over time and across regions.  This required three things:

1. The ability to draw a global map.
2. The ability to shade individual political units on that map.
3. The ability to animate this map.

Dr. Jon Herman has already posted a good example of how to do (1) in python using matplotlib’s Basemap.  We’ll appropriate some of his example for this example.  The Basemap has the option to draw coastlines and boundaries, but these boundaries are not tied to shapes, meaning that you can’t assign different colors to individual countries (task (2) above).  To do that, we need a shapefile containing information about political boundaries.  You can find these for free from a number of sources online, but I like Natural Earth.  They provide data on many different scales. For this application I downloaded their coarsest data set.  To give each country a shade which is tied to data, we use matplotlib’s color map.  The basic plan is to generate a colored map for each time-step in our data, and then to animate the maps using the convert linux command.

Now that we’ve described roughly how we’ll proceed, a word about the data we’re dealing with and how I’ve handled it.  GCAM has 32 geo-political regions, some of which are individual countries (like the USA or China), while others are groups of countries (like Australia & New Zealand). I stored this information using a list of lists (i.e. a 32-element list, where each element is a list of countries in that region). I’ve creatively named this variable list_list in this example (see code below). For each of the regions GCAM produces a time series of policy costs as a fraction of GDP every 5 years from 2020-2100. I’ve creatively named this variable data. We want to tie the color of a country in each time to its policy cost relative to costs across countries and times.  To do this, I wrote the following (clumsy!) Python function, which I explain below.

```
def world_plot(data,idx,MN,MX):
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
import matplotlib.cm as cm
import matplotlib as mpl
import numpy as np

norm = mpl.colors.Normalize(vmin=MN, vmax=MX)
cmap = cm.coolwarm
colors=cm.ScalarMappable(norm=norm, cmap=cmap)
colors.set_array(data)
a = np.zeros([32,4])
a = colors.to_rgba(data)

fig = plt.figure(figsize=(10,10))

m = Basemap(projection='robin', lon_0=0,resolution='c')
m.drawmapboundary(fill_color='white', zorder=-1)
m.drawparallels(np.arange(-90.,91.,30.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)
m.drawmeridians(np.arange(0., 360., 60.), labels=[1,0,0,1], dashes=[1,1], linewidth=0.25, color='0.5',fontsize=14)

year = [1990,2005,2010,2015,2020,2025,2030,2035,2040,2045,2050,2055,2060,2065,2070,2075,2080,2085,2090,2095,2100]
GCAM_32 = ['PRI','USA','VIR']
GCAM_1 = ['BDI','COM','DJI','ERI','ETH','KEN','MDG','MUS','REU','RWA','SDS','SDN','SOM','UGA','SOL']
GCAM_2 = ['DZA','EGY','ESH','LBY','MAR','TUN','SAH']
GCAM_3 = ['AGO','BWA','LSO','MOZ','MWI','NAM','SWZ','TZA','ZMB','ZWE']
GCAM_4 = ['BEN','BFA','CAF','CIV','CMR','COD','COG','CPV','GAB','GHA','GIN','GMB','GNB','GNQ','LBR','MLI','MRT','NER','NGA','SEN','SLE','STP','TCD','TGO']
GCAM_6 = ['AUS','NZL']
GCAM_7 = ['BRA']
GCAM_8 = ['CAN']
GCAM_9 = ['ABW','AIA','ANT','ATG','BHS','BLZ','BMU','BRB','CRI','CUB','CYM','DMA','DOM','GLP','GRD','GTM','HND','HTI','JAM','KNA','LCA','MSR','MTQ','NIC','PAN','SLV','TTO','VCT']
GCAM_10 = ['ARM','AZE','GEO','KAZ','KGZ','MNG','TJK','TKM','UZB']
GCAM_11 = ['CHN','HKG','MAC']
GCAM_13 = ['BGR','CYP','CZE','EST','HUN','LTU','LVA','MLT','POL','ROM','SVK','SVN']
GCAM_14 = ['AND','AUT','BEL','CHI','DEU','DNK','ESP','FIN','FLK','FRA','FRO','GBR','GIB','GRC','GRL','IMN','IRL','ITA','LUX','MCO','NLD','PRT','SHN','SMR','SPM','SWE','TCA','VAT','VGB','WLF']
GCAM_15 = ['BLR','MDA','UKR']
GCAM_16 = ['ALB','BIH','HRV','MKD','MNE','SCG','SRB','TUR','YUG']
GCAM_17 = ['CHE','ISL','LIE','NOR','SJM']
GCAM_18 = ['IND']
GCAM_19 = ['IDN']
GCAM_20 = ['JPN']
GCAM_21 = ['MEX']
GCAM_22 = ['ARE','BHR','IRN','IRQ','ISR','JOR','KWT','LBN','OMN','PSE','QAT','SAU','SYR','YEM']
GCAM_23 = ['PAK']
GCAM_24 = ['RUS']
GCAM_25 = ['ZAF']
GCAM_26 = ['GUF','GUY','SUR','VEN']
GCAM_27 = ['BOL','CHL','ECU','PER','PRY','URY']
GCAM_28 = ['AFG','ASM','BGD','BTN','LAO','LKA','MDV','NPL']
GCAM_29 = ['KOR']
GCAM_30 = ['BRN','CCK','COK','CXR','FJI','FSM','GUM','KHM','KIR','MHL','MMR','MNP','MYS','MYT','NCL','NFK','NIU','NRU','PCI','PCN','PHL','PLW','PNG','PRK','PYF','SGP','SLB','SYC','THA','TKL','TLS','TON','TUV','VNM','VUT','WSM']
GCAM_31 = ['TWN']
GCAM_5 = ['ARG']
GCAM_12 = ['COL']

list_list = [GCAM_1,GCAM_2,GCAM_3,GCAM_4,GCAM_5,GCAM_6,GCAM_7,GCAM_8,GCAM_9,GCAM_10,GCAM_11,GCAM_12,GCAM_13,GCAM_14,GCAM_15,GCAM_16,GCAM_17,GCAM_18,GCAM_19,GCAM_20,GCAM_21,GCAM_22,GCAM_23,GCAM_24,GCAM_25,GCAM_26,GCAM_27,GCAM_28,GCAM_29,GCAM_30,GCAM_31,GCAM_32]
num = len(list_list)
for info, shape in zip(m.comarques_info,m.comarques):
for i in range(num):
patches1 = []
patches1.append( Polygon(np.array(shape), True) )
ax.set_title('Policy Cost',fontsize=25,y=1.01)#GDP Adjusted Policy Cost#Policy Cost#Policy Cost Reduction from Technology
plt.annotate('%s'%year[idx],xy=(0.1,0.2),xytext=(0.1,0.2),xycoords='axes fraction',fontsize=30)
cb = m.colorbar(colors,'right')
cb.ax.tick_params(labelsize=14)
filename = &amp;quot;out/map_%s.png&amp;quot; %(str(idx).rjust(3,&amp;quot;0&amp;quot;))
plt.show()
fig.savefig(filename)
return
```

The function’s name is world_plot and it’s inputs are:

1. The raw data for a specific time step.
2. The index of the time step for the map we are working with (e.g. idx=0 for 2020).
3. The minimum and maximum of the data across countries and time.

(1) is plotted, (2) is used to name the resulting png figure (line 73), and (3) is used to scale the color colormap (line 11).  On lines 2-8 we import the necessary Python packages, which in this case are pretty standard Matplotlib packages and numpy.  On lines 11-16 we generate a numpy array which contains the rgba color code for each of the data points in data.  In lines 18-19 we create the pyplot figure object.

On lines 21-24 we create and format the Basemap object.  Note that on line 21 I’ve selected the Robinson projection, but that the Basemap provides many options.

Lines 26-60 are specific for this application, and certainly could have been handled more compactly if I wanted to invest the time.  year is a list of time steps for our GCAM experiment, and lines 27-58 contain lists of three letter ID codes for each GCAM region, which are assembled into a list of lists (creatively called list_list) on line 60.

On line 61 we read the data from the shapefile database which was downloaded from Natural Earth. From lines 63-68 we loop through the info and shape attributes of the shapefile database, and determine which of the GCAM geo-political units each of the administrative units in the database is associated with.  Once this is determined, the polygon associated with that administrative unit is given the correct color (lines 66-68).

Lines 69-72 are doing some final formatting and labeling, and in lines 73-75 we are giving the file a unique name (tied to the time step plotted) and saving the images to some output directory.

When we put this function into a loop over time, we generate a sequence of figures looking something like this:

To convert this sequence of PNGs to a gif file, we use the convert command in linux (or in my case Cygwin).  So, we go to the command line and cd into the directory where we’ve saved our figures and type:

```convert -delay 45 -loop 0 *.png globe_Cost_Reduction_faster.gif
```

Here the delay flag controls the framerate of the gif (in milliseconds), the loop flag controls whether the gif repeats, next I’m using a wildcat to include all of the pngs in the output directory, and the final input is the resulting name of the gif. The final product:

# Solving Analytical Algebra/Calculus Expressions with Matlab

Hi All,

This week a few of our colleagues were grinding through some tough calculus for a homework assignment and bemoaning the fact they’d gotten a bit rusty since undergrad.  For help one might turn to Maple or Wolfram, but Matlab is also a useful tool.

Here we want to make use of Matlab’s symbolic variable class.  Mathworks has some nice tutorials on how to create symbolic variables etc. and a nice .  I’ll go through a really brief tutorial here.

First let’s create a symbolic number.  We do this with the sym function.  To see the difference between a symbolic number and floating-point number try the following code:

```sym(1/3)
1/3```

ans =
1/3

ans =
0.3333

Neat, but how are they different?  Stealing from Mathwork’s tutorial for a moment consider sin(π).  We all learned in middle school that sin(π) = 0. Of course π is irrational so numerically evaluating sin(π) may not return exactly zero.  Let’s see what happens when we evaluate sin(π) with both symbolic and floating-point pi’s:

```sin(sym(pi))
sin(pi)```

ans =
0

ans =
1.2246e-16

Cool!  Now, you’ve probably guessed that if we can create symbolic variables, we can also create symbolic functions.  Let’s create a simple symbolic function:

```syms x a b c
f = symfun(3*x^2+2*x+1,x);```

f(x) =
3x^2 + 2x + 1

Note that we could also use symbolic variables a, b, and c for the coefficients in f.  For now, let’s stick with 1, 2, and 3.

We can substitute any constant (say 3) or symbolic (say sin(y)) value of x we want (say x=3) and get the resulting value of f:

```Test1 = f(3)

syms y
Test2 = f(sin(y))```

Test1 =
34

Test2 =
2sin(y) + 3sin(y)^2 + 1

We can also multiply our function f by a constant or by another variable:

```Test3 = 6*f

Test4 = symfun(f*y,[x y])```

Test3 (x)=
18x^2 + 12x + 6

Test4(x, y) =
y(3x^2 + 2*x + 1)

Now, getting back to f, let’s do some calculus.  For starters, let’s take the first derivative of f with respect to x:

`df = diff(f,x)`

df(x) =
6*x + 2

And we can integrate as well:

`F = int(f,x)`

F(x) =
x*(x^2 + x + 1)

Again recall that we can derive the more general form of the expression using symbolic coefficients.  Of course, this is a simple example, but I’ve used it for much, much more complicated functions.

If we’re having trouble with calculus, maybe you could use a bit of help on the linear algebra front as well.  Let’s start by instantiating a symbolic matrix, X:

`X = sym('X',[2,2])`

X =
[ X1_1, X1_2]
[ X2_1, X2_2]

Alternatively we could instantiate X as a symbolic function with more readable symbolic elements:

```syms a b c d
X=symfun([a,b;c,d],[a b c d])```

X(a, b, c, d) =
[ a, b]
[ c, d]

As with the previous examples, we can operate on X.  Let’s take the inverse:

`Xinv = inv(X)`

Xinv(a, b, c, d) =

Break out your linear algebra book to confirm that this is right.  Suppose we want to do something a bit more complicated: linear regression!  We instantiate a a Y vector of observations:

```syms y1 y2 y3
Y = [y1;y2;y3]```

Y =
y1
y2
y3

Now, suppose we want to fit a constant model, we instantiate a new X matrix (or in this case a vector) of ones:

`X = sym(ones(3,1))`

X =
1
1
1

Now, we fit our model (y=bx+e) using the ordinary least squares (OLS) estimator (b=(XTX)^-1XTY):

`b = inv(X'*X)*X'*Y`

b =
y1/3 + y2/3 + y3/3

So we have our answer, and we find that the OLS estimator of the constant model is simply the sample mean of the data.

All of these examples were super easy, but should be helpful if you’re stuck on econometrics, statistics, or hydrology homework.  If there is interest I can make later posts on taking the partial derivative of piece-wise polynomial structures in Matlab.

Jon

# Power Spectral Density Estimation in Matlab

Hi all,

This post is about power spectral density estimation (PSDE) using Matlab.  PSDE is often used in signal processing and fluid mechanics, so you may or may not be familiar with it.  The idea is to determine the frequency content of sampled time series data.  This means that we can determine the amount of the variance in the time series that is contained at different frequencies (or periodicities).  Let’s take a look at a simple example.

Suppose we are sampling a signal (in this case a sign wave) for about 3 minutes with a frequency of 100 hertz, meaning we take 100 measurements every second.  Let’s generate our signal and plot our observations using Matlab.

```x = (0:0.01:60*pi);
y = sin(x);

figure(1)
plot(x,sin(x))
title('Signal with no Noise')
xlabel('Seconds')
ylabel('Observed Signal')
ylim([-1.5,1.5])
```

Sine wave, with no noise, sampled at 100 Hertz

From observing the data directly it’s pretty clear what the frequency and period of the data are here (1/(2*pi) and pi respectively): no need for PSDE.  However, real data is rarely so well behaved.  To simulate this, let’s add random noise to our original signal.  To make it interesting, let’s add a lot of noise: normal distributed noise with standard deviation 10 times larger than the magnitude of the signal!

```for i = 1:max(size(x))
y(i) = y(i)+10*norminv(rand);
end

figure(2)
plot(x,y)
title('Signal with Noise')
xlabel('Seconds')
ylabel('Observed Signal')```

Sine wave, with random normal noise, sampled at 100 Hertz

Now the noise completely washes out the signal, and it is not at all clear what the period or frequency of the underlying data are.  So let’s compute and plot the power spectrum of the data.  I leave it to the reader to review all of the math underlying computing the spectrum, but I’ll provide a very brief description of what we are doing here.  PSDE involves taking the discrete Fourier transform of the data.  This transforms our data from the temporal or spatial domain to the frequency domain.  The power spectral density (PSD) function is simply the magnitude of the squared Fourier transformed data (often scaled). It tells us how much of the variance in our signal is explained in each discrete frequency band.  A spike in the PSD function tells us that there’s a lot of variance explained in that frequency band.  The frequency resolution of our analysis (width of the frequency bands) is dictated by the length of our recorded observations.  Interestingly, if we integrate the PSD function over the frequency domain, we obtain the sample variance of our data.  Now, let’s look at the code.

```fs = (0.01)^-1;
N =size(y,2);

Y = fft(y);
f=fs*linspace(0,1,N);
G = abs(Y.^2)*(1/(fs*N));

figure(3)
loglog(f(2:floor(N/2)),G(2:floor(N/2)))
title('Amplitude of Spectrum of y(t)')
xlabel('Frequency (Hz)')
ylabel('S(f)')```

The first two lines are defining the sampling frequency (fs), the number of observations (N).  Next we use Matlab’s fft command to take the discrete Fourier transformation of the data.  We then compute the frequency bins we are considering, then we compute and plot the PSD function.  Let’s take a look at the result

There is a lot of noise, but there is a distinct peak at 0.1592, which is 1/(2*pi), indicates the frequency of the underlying signal.  This is neat, even though the signal contains a huge amount of noise, we’ve distinguished the correct frequency (and inversely the period) of the signal!

So how might we as systems analysts use this?

In my PhD thesis we used PSDE to determine the critical time-dynamics of a hydropower system’s operation.  This turned out to be a critical diagnostic tool.  In the hydropower system I considered, both inflows and energy prices are stochastic, so it can be difficult to determine if  the derived ‘optimal’ policy makes sense by only observing simulation results: the release in any one time step (here 6-hrs) can appear noisy and erratic in response to price or hydrologic shocks.  The power of PSDE is the ability to cut through the noise and reveal the underlying signal.

As an example, I’ve derived an implicit ‘optimal’ policy using sampling stochastic dynamic programming.  To evaluate the quality of the derived policy, I’ve simulated the system operation with this policy over the range of historical inflow series.  For simplicity, we’ll assume that the energy price profile (ranging from high on-peak to low off-peak prices) is identical on every day.  Now, let’s look at the PSD function for the reservoir release.

We can immediately pick out elements of the PSD function that make sense.  First, there is a lot of variability at very low frequencies, corresponding to periodicity on the order of months or seasons.  This is likely due to seasonal hydrologic variability between summer, spring, and winter inflows: releases are higher in the spring and early summer because more water is available, and generally lower in the winter and late summer when inflows to the reservoir are lower.  Similarly the peak at frequency 1/day makes sense, because we still have a daily price cycle that the system is responding to, put we are harder pressed to explain the other peaks between frequencies 0.1 and 1 (roughly corresponding to weekly and within week periodicities).  We have fixed the price profile, but hydrology shouldn’t be cycling during the week, so what is going on?

As it turns out, I had a bug in my algorithm (bad ending conditions for the SSDP) which was causing the ‘optimal’ policy to behave a bit myopically on a weekly basis.  This error was propagating through the model, but was not easy to see by simply observing the simulated results (see chapter 6 of the thesis).  Correcting the problem, I got the following spectrum.  The unexplainable spikes have gone away, and we now see that daily price cycling and seasonal hydrology explain the majority of the systems variability (as we expect).

For our Cornell-based readers our new CEE faculty member Gregory McLaskey is offering a new class on time series analysis (including PSDE) which is highly recommended.  PSDE is also used extensively (with real data) in Todd Cowen‘s experimental fluid mechanics class which also recommend because you learn a lot and get to play with lasers which, let’s be honest, is why we all became engineers.

# Porting GCAM onto a UNIX cluster

Hi All,

One of my recent research tasks has been to port GCAM (Global Change Assessment Model) on to the Cube, our groups HPC cluster, and some of the TACC resources.  This turned out to be more challenging than I had anticipated, so I wanted to share my experiences in the hopes that it will save you all some time in the future.  This post might be a bit pedestrian for some readers, but I hope it’s helpful to folks that are new to C++ and Linux.

Before getting started, some background on GCAM.  GCAM was developed by researches at the Joint Global Change Research Institute (JGCRI) at the Pacific Northwest National Lab (PNNL).  GCAM is an integrated assesment model (IAM), which means it pairs a climate model with a global economic model.  GCAM is unique from many IAMs in that it has a fairly sophisticated representation of various sectors of the global economy modeled on a regional level…meaning the model is a beast compared to other IAMs (like DICE). I’ll be writing a later post on IAMs more generally.  GCAM is coded in C++ and makes extensive use of xml databases for both input and output.  The code is open source (available here), has a Wiki (here), and a community listserve where researches can pose questions to their peers.

There are three flavors of the model available: one for Windows users, one for Mac users, and one for Unix users.  The Windows version comes compiled, with a nice user-interface for post-processing the results.  The Unix version comes as uncompiled code, and requires the installation of some third party C++ libraries.  For those who’d like to sniff around GCAM the Windows or Mac versions are a good starting point, but if you’re going to be doing HPC heavy lifting, you’ll need to work with the Unix code.

The GCAM Wiki and the detailed user manual provide excellent documentation about running GCAM the Model Interface tools, but are a bit limited when describing how to compile the Unix version of the code.  One helpful page on the Wiki can be found here.  GCAM comes with a version of the Boost C++ library in the standard Unix download (located in Main_User_Workspace/libs).  GCAM also uses the Berkeley DBXML library, which can be downloaded here.  You’ll want to download version 2.5.16 to be sure it runs well with GCAM.

Once you’ve downloaded DBXML, you’ll need to build the library.  As it turns out, this is fairly easy.  Once you’ve ported the DBXML library onto your Unix cluster, simply navigate into the main directory and run the script buildall.sh.  The buildall.sh script accepts flags that allow you to customize your build.  We’re going to use the -b flag to build the 64-bit version of DBXML:

```sh buildall.sh -b 64
```

once you’ve build the DBXML library, you are nearly ready to build GCAM, but must first export the paths to the Boost library and to the DBXML include and lib directories.  It seems that the full paths must be declared.  In other words, they should read something like:

```export BOOST_INCLUDE=/home/fs02/pmr82_0001/jrl276/GCAM_Haewon_trans/libs/boost_1_43_0/
export DBXML_INCLUDE=/home/fs02/pmr82_0001/jrl276/dbxml-2.5.16/install/include/
export DBXML_LIB=/home/fs02/pmr82_0001/jrl276/dbxml-2.5.16/install/lib/
```

One tricky point here is that the full path must be typed out instead of using ‘~’ as a short-cut (I’m not sure why). Once this is done, you are ready to compile gcam.  Navigate into the ‘Main_User_Workspace’ directory and type the command:

```
make gcam

```

Compiling GCAM takes several minutes, but at the end there should be a gcam.exe executable file located in the ‘/exe/’  folder of the ‘Main_User_Workspace’.  It should be noted that there is a multi-thread version of gcam (more info here), which I’ll be compiling in the next few days.  If there are any tricks to that process, I’ll post again with some tips.

That’s it for now.

# New Federal Flood Frequency Analysis Guidelines

This post is a bit off topic for this blog, but I think it should be interesting to our readers.  The current procedure used by Federal agencies (FEMA, Bureau of Reclamation, USGS, Army Corps, etc.) to assess flood risk at gauged sites is described in Bulletin 17B.  That procedure is used to estimate flood risk for things like FEMA flood maps, to set flood insurance rates, and to design riparian structures like levees.  Given how important the procedure is, many people are surprised to learn that it has not been updated since 1982, despite improvements in statistical techniques and computational resources, and the additional data that is now available.  This post wont speculate as to why change has taken so long, but I will briefly describe the old procedures and the proposed updates.

Bulletin 17B Procedure:

Dr. Veronica Webster has written an excellent brief history on the evolution of  national flood frequency standards in the US dating back to the 1960s.  For this post, we focus on the procedures adopted in 1982.  In short, Bulletin 17B recommends fitting the log-Pearson Type III distribution to the annual peak flow series at a gauged location, using the log-space method of moments.  In other words, one takes the logarithms of the flood series and then uses the mean, variance, and skew coefficient of the transformed series to fit a Pearson Type III distribution.  The Pearson Type III distribution is a shifted Gamma distribution.  When the population skew coefficient is positive the distribution is bounded on the low end, and when the population skew is negative the distribution is bounded on the high end.  When the population skew is zero, the Pearson Type III distribution becomes a normal distribution.  For those wanting more information, Veronica and Dr. Jery Stedinger have written a nice three-part series on the log-Pearson Type III, starting with a paper on the distribution characteristics.

Unfortunately data is not always well behaved and the true distribution of floods is certainly not log-Person Type III, so the Bulletin takes measures to make the procedure more robust.  One such measure is the use of a regional skew coefficient to supplement the sample skew coefficient computed from the transformed flood record.  The idea here is that computing the skew coefficient from short samples is difficult because the skew coefficient can be noisy, so one instead uses a weighted average of the sample skew and a regional skew.  The relative weights are proportional to the mean squared error of each skew, with the more accurate skew given more weight.  The Bulletin recommends obtaining a regional skew from either an average of nearby and hydrologically similar records, a model of skew based on basin characteristics (like drainage area), or the use of the National Skew Map (see below), based on a 1974 study.

National Skew Map provided in Bulletin 17B [IAWCD, 1982, Plate I]

A second concern is that small observations in a flood record might exert unjustifiable influence in the analysis and distort the estimates of the likelihood of the large floods of interest.  Often such small observations are caused by different hydrologic processes than those that cause the largest floods.  In my own experience, I’ve encountered basins wherein the maximum annual discharge is zero, or nearly zero in some years.  The Bulletin recommends censoring such observations, meaning that one removes them from the computation of the sample moments, then applies a probability adjustment to account for their presence.  The Bulletin provides an objective outlier detection test to identify potential nuisance observations, but ultimately leaves censoring decisions to the subjective judgement of the analyst.  The proposed objective test is the Grubbs-Beck test, which is a 10% significance test of whether the smallest observation in a normal sample is unusually small.

The Hydrologic Frequency Analysis Work Group (HFAWG) is charged with updating the Bulletin.  Their recommendations can be found on-line, as well as a testing report which compares the new methods to the old Bulletin.  Bulletin 17C is currently being written.  Like Bulletin 17B, the new recommendations also fit the log-Pearson Type III distribution to the annual maximum series from a gaged site, but a new fitting technique is used: the expected moments algorithm (EMA).  This method is related to the method of moments estimators previously used, but allows for the incorporation of historical/paleo flood information, censored observations, and regional skew information in a unified, systematic methodology.

High water mark of 1530 flood of the Tiber River at Rome

Historical information might include non-systematic records of large floods: “The town hall has been there since 1760 and has only been flooded once,”  thus providing a threshold flow which has only been crossed once in 256 years.  EMA can include that sort of valuable community experience about flooding in a statistically rigorous way! For the case that no historical information is available, no observations are censored, and no regional skew is used, the EMA moment estimators are exactly the same as the Bulletin 17B method of moments estimators.  See Dr. Tim Cohn’s website for EMA software.

The EMA methodology also has correct quantile confidence intervals (confidence intervals for the 100-year flood, etc.), which are more accurate than the ones used in Bulletin 17B.

Another update to the Bulletin involves the identification of outlier observations, which are now termed Potentially Influential Low Flows (PILFs).  A concern with the old detection test was that it rarely identified multiple outliers, even when several very small observations are present.  In fact, the Grubbs-Beck test, as used in the Bulletin, is only designed to test the smallest observation and not the second, third, or kth smallest observations.  Instead, the Bulletin adopts a generalization of the Grubbs-Beck test designed to test for multiple PILFs (see this proceedings, full paper to come).  The new test identifies PILFs more often and will identify more PILFs than the previous test, but we’ve found that use of a reasonable outlier test actually results in improved quantile efficiency when fitting log-Pearson Type III data with EMA  (again, full paper in review).

The final major revision is the retirement of the skew map (see above), and the adoption of language recommending more modern techniques, like Bayesian Generalized Least Squares (BGLS).  In fact, Dr. Andrea Veilleux, along with her colleagues at the USGS have been busy using iterations of BGLS to generate models of regional skew across the country, including California, Iowa, the Southeast, the Northwest, MissouriVermont, Alaska, Arizona, and California rainfall floods.  My masters work was in this area, and I’d be happy to write further on Bayesian regression techniques for hydrologic data, if folks are interested!

If folks are interested in software that implements the new techniques, the USGS has put out a package with good documentation.

That’s it for now!

Jon