Synthetic Weather Generation: Part V

Conditioning Synthetic Weather Generation on Seasonal Climate Forecasts

This is the final blog post in a five-part series on synthetic weather generators. See Parts I and II for a description of single-site parametric and non-parametric generators, respectively, and Part III for a description of multi-site generators of both types.

In my previous post, Part IV, I discussed how parametric and non-parametric weather generators can be modified to produce weather that is consistent with climate change projections for use in long-term planning. In the shorter term, water managers may be able to exploit mid-range climate forecasts to inform seasonal reservoir operations (see e.g., Kim and Palmer (1997), Chiew et al. (2003), Block (2011), Block and Goddard (2012), Anghileri et al. (2016)). For such analyses, it could be useful to tailor management plans to simulated weather conditions consistent with these probabilistic forecasts. Here I discuss how one can condition weather generators on seasonal climate forecasts for such purposes.

Two major forecasting groups, the International Research Institute (IRI) for Climate and Society at Columbia University and the Climate Prediction Center (CPC) of the U.S. National Centers for Environmental Prediction, issue tercile seasonal forecasts that specify the probabilities of observing above normal (pA), near normal (pN) and below normal (pB) precipitation and temperature. Forecasts are issued each month for the upcoming three months (see those from IRI and CPC). While these forecasts are derived from dynamical and statistical models that include a variety of physically-based processes, most of the forecast skill can be explained by the effects of the El Niño-Southern Oscillation (ENSO) on the climate system (Barnston et al., 1994). As most of you probably know, ENSO refers to the quasi-periodic cycling of sea surface temperatures (SSTs) in the tropical eastern Pacific Ocean. The warm phase is known as El Niño and the cool phase as La Niña. These are quantified as five consecutive three-month periods with SST anomalies in the Niño 3.4 region of the Pacific > +0.5°C (El Niño) or < -0.5°C (La Niña). All other periods are considered neutral (Climate Prediction Center, 2016). Because much of seasonal climate forecast skill is derived from ENSO, current or forecasted SST anomalies in the Niño 3.4 region are sometimes used by themselves as proxy seasonal forecasts. Here I will discuss techniques for conditioning weather generator parameters on either tercile forecasts or current or projected ENSO conditions.
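To make the ENSO definition above concrete, here is a minimal Python sketch that labels a series of overlapping three-month Niño 3.4 SST anomalies. The function name classify_enso, its arguments and the anomaly values are my own illustrative assumptions, not CPC code or data. The sketch also assumes that every period inside a qualifying run of five or more receives the El Niño or La Niña label.

```python
def classify_enso(anomalies, threshold=0.5, min_run=5):
    """Label each three-month period as 'El Nino', 'La Nina' or 'Neutral'."""
    labels = ["Neutral"] * len(anomalies)
    for sign, name in ((1, "El Nino"), (-1, "La Nina")):
        start = None
        # A trailing 0.0 sentinel closes any run that reaches the end of the series.
        for k, value in enumerate(list(anomalies) + [0.0]):
            exceeds = sign * value > threshold
            if exceeds and start is None:
                start = k  # a candidate warm (or cool) run begins here
            elif not exceeds and start is not None:
                if k - start >= min_run:  # only runs of at least min_run periods count
                    labels[start:k] = [name] * (k - start)
                start = None
    return labels

# Hypothetical anomalies (deg C): a five-period warm run, then a four-period cool run.
anomalies = [0.6, 0.7, 0.9, 0.8, 0.6, 0.2, -0.7, -0.8, -0.6, -0.6, 0.0]
labels = classify_enso(anomalies)
print(labels)
```

In this made-up series the four-period cool run is too short to qualify, so only the warm run is labeled.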

Parametric Weather Generators

Wilks (2002) presents a method for conditioning parametric weather generators of the Richardson (1981) type on tercile seasonal climate forecasts using a network of sites in New York State. The key idea, derived from Briggs and Wilks (1996), is to estimate the weather generator parameters not from the entire historical record, but from a weighted bootstrapped sample of the historical record consistent with the forecast. This is similar to the method of streamflow generation used by Herman et al. (2016) to increase the frequency of droughts of a given magnitude. As an illustrative example, Herman et al. (2016) empirically estimate the quantile of a noteworthy drought from the historical record and show the system impacts of droughts of that magnitude or worse becoming n times more likely. This is done by adapting the semi-parametric streamflow generator developed by Kirsch et al. (2013) to sample years with historical droughts of at least that magnitude n times more often.

While Herman et al. (2016) take a fully non-parametric approach by estimating quantiles empirically, Briggs and Wilks (1996) estimate terciles parametrically by fitting a normal distribution to the historical mean seasonal temperatures and a Gamma distribution to the historical seasonal precipitation totals at each site. After estimating terciles, Briggs and Wilks (1996) classify each year in the historical record as below normal, near normal or above normal in terms of temperature and precipitation in the season of interest. Because terciles are estimated parametrically, each category will not necessarily contain an equal number of years, even if the record length is a multiple of three. It should be noted, however, that like Herman et al. (2016), IRI defines terciles empirically from the most recent 30-yr record (see more here and here) and CPC from the most recent 30-yr record updated only every 10 years, i.e., the current reference frame is 1981-2010 (see more here). IRI provides a discussion of the advantages and disadvantages of parametric and non-parametric tercile estimation methods. For consistency, it may be best to take the same approach as the agency whose forecasts are being used.
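The difference between parametric and empirical terciles is easy to see in a small sketch. The Python below uses only the standard library; the function names and the nine seasonal mean temperatures are hypothetical, and I only show the normal (temperature) case because the standard library has no Gamma quantile function for the precipitation case.

```python
from statistics import NormalDist, mean, quantiles, stdev

def normal_tercile_bounds(values):
    """Tercile boundaries from a normal distribution fit by sample moments."""
    dist = NormalDist(mean(values), stdev(values))
    return dist.inv_cdf(1 / 3), dist.inv_cdf(2 / 3)

def empirical_tercile_bounds(values):
    """Tercile boundaries taken directly from the sample quantiles."""
    lower, upper = quantiles(values, n=3)
    return lower, upper

def classify_years(value_by_year, bounds):
    """Map each year to 'B' (below), 'N' (near) or 'A' (above normal)."""
    lower, upper = bounds
    return {year: "B" if x < lower else "A" if x > upper else "N"
            for year, x in value_by_year.items()}

# Hypothetical seasonal mean temperatures (deg C) with one very warm year.
temps = {1981: 10.1, 1982: 11.2, 1983: 10.6, 1984: 14.5, 1985: 10.9,
         1986: 11.0, 1987: 10.4, 1988: 12.1, 1989: 11.3}

parametric = classify_years(temps, normal_tercile_bounds(list(temps.values())))
empirical = classify_years(temps, empirical_tercile_bounds(list(temps.values())))
for label in ("B", "N", "A"):
    print(label, list(parametric.values()).count(label),
          list(empirical.values()).count(label))
```

With these made-up values the fitted normal puts 3, 4 and 2 years in the below, near and above normal categories, while the empirical terciles give 3 years each, which illustrates why the parametric categories need not be equal in size.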

Once terciles have been defined, weather generator parameters can be estimated from a bootstrapped sample of the historical record consistent with the forecast (Wilks, 2002). Consider a historical record with NB years classified as being below normal, NN as near normal, and NA as above normal. Then the expected value of a given seasonal statistic, X, can be estimated from a bootstrapped sample of L years from the historical record in which the below normal, near normal and above normal years are sampled with probabilities pB, pN, and pA, respectively. Representing the statistic of interest in the ith below, near and above normal year as xi(B), xi(N) and xi(A), respectively, then the expected value of X is the following:

(1) E\left[X\right] = \lim_{L\to\infty} \frac{1}{L}\left[\sum_{i=1}^{N_B} \frac{p_BL}{N_B}x_{i}^{\left(B\right)} + \sum_{i=1}^{N_N} \frac{p_NL}{N_N}x_{i}^{\left(N\right)} + \sum_{i=1}^{N_A} \frac{p_AL}{N_A}x_{i}^{\left(A\right)}\right]

That is, the forecast-conditional value of X is a weighted sum of its average in the below normal, near normal and above normal years, where the weights are equal to the forecast probabilities for each of the three respective categories (Wilks, 2002). Note that X is a seasonal statistic, not a parameter, so one cannot simply estimate a weather generator parameter as a weighted average of its value in each of the terciles unless the parameter is itself a simple statistic.
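As a quick check on Equation 1, the following sketch compares the weighted sum of class averages with the mean of a large weighted bootstrap sample. The function names, the precipitation totals and the forecast probabilities are all hypothetical choices of mine for illustration.

```python
import random
from statistics import mean

CLASSES = ("B", "N", "A")

def forecast_conditional_mean(x_by_class, probs):
    """Limit of Equation 1: forecast-weighted sum of the class averages."""
    return sum(probs[c] * mean(x_by_class[c]) for c in CLASSES)

def weighted_bootstrap(x_by_class, probs, n_samples, rng):
    """Draw a class with the forecast probabilities, then a year uniformly within it."""
    drawn = rng.choices(CLASSES, weights=[probs[c] for c in CLASSES], k=n_samples)
    return [rng.choice(x_by_class[c]) for c in drawn]

# Hypothetical seasonal precipitation totals (mm), grouped by tercile class.
totals = {"B": [180.0, 210.0, 195.0],
          "N": [260.0, 275.0],
          "A": [330.0, 360.0, 345.0, 365.0]}
forecast = {"B": 0.2, "N": 0.3, "A": 0.5}

exact = forecast_conditional_mean(totals, forecast)
sample = weighted_bootstrap(totals, forecast, 20000, random.Random(1))
print(exact, mean(sample))
```

The bootstrap mean should approach the weighted sum as the sample size L grows, which is all Equation 1 says.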

Fortunately, for some of the weather generator parameters this is the case. Recall from Part I that the first step of the Richardson generator is to generate a sequence of daily rainfall occurrences from a first-order Markov chain. This chain is defined by the probabilities of transitioning from a dry day to a wet day, p01, or from a wet day to another wet day, p11. As discussed in Part IV, these two parameters together define the unconditional probability of a wet day, π, and the first-order autocorrelation of the occurrences, d, where π = p01/(1 + p01 − p11) and d = p11 − p01 (Katz, 1983). The unconditional probability of a wet day is a simple statistic. Therefore, since pN = 1 − pA − pB, π can be estimated each month as a function of pB, pA and the average portion of wet days in below normal, near normal and above normal years for that month (x̅(B), x̅(N) and x̅(A)):

(2) \pi = p_B\bar{x}^{\left(B\right)} + \left(1 - p_B - p_A\right)\bar{x}^{\left(N\right)} + p_A\bar{x}^{\left(A\right)}.

More generically, π = b0 + bBpB + bApA, where b0 = x̅(N), bB = x̅(B) − x̅(N) and bA = x̅(A) − x̅(N). Wilks (2002) recommends that the parameter π be estimated separately at each site and for each month, but that the below normal, near normal and above normal years be defined based on the total precipitation at that site in the entire three-month season. This is because the forecast is for the entire season, but the portion of wet days varies on a shorter time scale.
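Here is a small sketch of how the forecast-conditional π could be turned back into Markov chain transition probabilities. It assumes the Katz (1983) relationships π = p01/(1 + p01 − p11) and d = p11 − p01, which invert to p01 = π(1 − d) and p11 = d + p01. The function names and the wet-day fractions are hypothetical.

```python
def conditional_wet_day_probability(wet_fraction, p_below, p_above):
    """pi = b0 + bB*pB + bA*pA, with coefficients built from the class averages."""
    b0 = wet_fraction["N"]
    b_below = wet_fraction["B"] - wet_fraction["N"]
    b_above = wet_fraction["A"] - wet_fraction["N"]
    return b0 + b_below * p_below + b_above * p_above

def transition_probabilities(pi, d):
    """Invert pi = p01 / (1 + p01 - p11) and d = p11 - p01."""
    p01 = pi * (1.0 - d)
    p11 = d + p01
    return p01, p11

# Hypothetical average wet-day fractions for one month in B, N and A years.
wet_fraction = {"B": 0.20, "N": 0.30, "A": 0.40}

pi = conditional_wet_day_probability(wet_fraction, p_below=0.2, p_above=0.5)
p01, p11 = transition_probabilities(pi, d=0.25)  # d held at an assumed climatological value
print(pi, p01, p11)
```

Holding d at its climatological value while only π changes with the forecast follows the simplification Wilks (2002) reports for his sites.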

Like the unconditional probability of a wet day, the persistence parameter d can also be estimated as a function of pB, pA and the value of d in below normal, near normal and above normal years. Wilks (2002) shows that d can be represented by a quadratic function of pA and pB:

(3) d = b_0 + b_Bp_B + b_{BB}p_B^2 + b_Ap_A + b_{AA}p_A^2 + b_{BA}p_Bp_A,

but finds that variations in d across forecasts are small such that one can reasonably assume the climatological estimate of d for all sites and months, regardless of the forecast.

The remaining weather generator parameters related to precipitation are those defining the distribution of precipitation amounts. Because these parameters (α and β if fitting a Gamma distribution, and α, λ1 and λ2 if fitting a mixed exponential distribution) are estimated iteratively in an MLE approach, they cannot be estimated as a function of the forecast probabilities like the occurrence parameters can. Instead, Wilks (2002) suggests using the Briggs and Wilks (1996) approach of bootstrapping a large sample of years from the historical record consistent with the forecast and fitting separate probability distributions to the precipitation amounts in each month’s weighted sample. When performing this estimation using a mixed exponential distribution to model precipitation amounts, Wilks (2002) found the estimates of the mixing parameter, α, to be the least consistent across the investigated sites and chose to hold it constant across sites and forecasts. Thus, only λ1 and λ2 were re-estimated for each month and site as a function of the seasonal forecast.

One drawback to the sampling scheme employed by Briggs and Wilks (1996) is that all historical years within each tercile have an equal probability of being sampled: pB/NB, pN/NN and pA/NA for below, near and above normal years. In reality, years similar to those in the tail are less likely to occur than years similar to those near the median. An alternative, more precise sampling scheme called the pdf-ratio method suggested by Stedinger and Kim (2010) assigns each year i an un-normalized probability of selection, qi = (1/N)*f1(xi)/f0(xi) where N is the number of years in the historical record and f1 and f0 are pdfs of the statistic X under forecast and climatological conditions, respectively. The qi are then normalized such that they sum to 1. f1 and f0 can be analytical or empirical distributions.
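The pdf-ratio weights are simple to compute once f1 and f0 are chosen. The sketch below assumes, purely for illustration, that both are normal distributions with the forecast shifted warm; the function name, anomalies and distribution parameters are hypothetical.

```python
from statistics import NormalDist

def pdf_ratio_weights(x, forecast_pdf, climatology_pdf):
    """Normalized selection probabilities q_i = (1/N) f1(x_i) / f0(x_i)."""
    n = len(x)
    q = [(1.0 / n) * forecast_pdf(xi) / climatology_pdf(xi) for xi in x]
    total = sum(q)
    return [qi / total for qi in q]  # rescale so the weights sum to 1

# Hypothetical standardized seasonal temperature anomalies, one per historical year.
anomalies = [-1.6, -0.9, -0.3, 0.1, 0.4, 1.0, 1.8]
climatology = NormalDist(0.0, 1.0)  # assumed f0
forecast = NormalDist(0.5, 1.0)     # assumed f1, shifted warm

weights = pdf_ratio_weights(anomalies, forecast.pdf, climatology.pdf)
for xi, wi in zip(anomalies, weights):
    print(f"{xi:5.1f} {wi:.3f}")
```

Unlike the tercile scheme, each year gets its own weight, so in this example warmer years are sampled progressively more often rather than in three equal-probability blocks.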

After estimating parameters of the precipitation amounts distributions using either the approach of Briggs and Wilks (1996) or Stedinger and Kim (2010), one must estimate the forecast-conditional temperature parameters. Recall that in the Richardson generator, separate harmonics are fit to the eight time series of means and standard deviations of minimum and maximum temperature on wet and dry days. Historical residuals from these fits are determined by first subtracting the predicted mean and then dividing by the predicted standard deviation. Finally, the residuals are modeled by an order-one vector autoregression, or VAR(1) model. Because forecast-conditional weather generators are only applied three months at a time, Wilks (2002) suggests instead fitting quadratic functions to these eight time series within the season of interest.

Like the parameters describing precipitation occurrences, the parameters of the quadratic functions of time describing the mean temperature on wet and dry days can be estimated as a weighted average of fits in each of the three terciles. As shown in Wilks (2002), the four mean temperature functions (minimum and maximum on wet and dry days), µ(t), at each site are specified by the function:

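(4) \mu\left(t\right) = \left(\beta_0 + \beta_Bp_B + \beta_Ap_A\right) + \left(\gamma_0 + \gamma_Bp_B + \gamma_Ap_A\right)t + \left(\delta_0 + \delta_Bp_B + \delta_Ap_A\right)t^2

where

(5) \bar{\mu}_t^{\left(N\right)} = \beta_0 + \gamma_0t + \delta_0t^2,

(6) \bar{\mu}_t^{\left(B\right)} - \bar{\mu}_t^{\left(N\right)} = \beta_B + \gamma_Bt + \delta_Bt^2,

(7) \bar{\mu}_t^{\left(A\right)} - \bar{\mu}_t^{\left(N\right)} = \beta_A + \gamma_At + \delta_At^2,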

t is the day and µ̅t(B), µ̅t(N) and µ̅t(A) are the mean temperature statistics of concern on each day of below normal, near normal and above normal years in the historical record, respectively.
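Because the coefficients in Equation 4 are linear in pB and pA, the forecast-conditional mean is just the near normal curve plus probability-weighted departures. A minimal sketch, with hypothetical function names and made-up coefficients rather than values fitted to any record:

```python
def quadratic(coeffs, t):
    """Evaluate beta + gamma*t + delta*t^2."""
    beta, gamma, delta = coeffs
    return beta + gamma * t + delta * t ** 2

def conditional_mean_temperature(t, near, below_diff, above_diff, p_below, p_above):
    """Equation 4: blend the three quadratic fits using the forecast probabilities."""
    coeffs = [n + b * p_below + a * p_above
              for n, b, a in zip(near, below_diff, above_diff)]
    return quadratic(coeffs, t)

# Made-up (beta, gamma, delta) triples for one temperature series at one site.
near = (10.0, 0.12, -0.0005)      # near normal years, Equation 5
below_diff = (-1.5, 0.01, 0.0)    # below minus near normal, Equation 6
above_diff = (1.2, -0.005, 0.0)   # above minus near normal, Equation 7

for day in (1, 45, 90):
    print(day, conditional_mean_temperature(day, near, below_diff, above_diff, 0.2, 0.5))
```

Setting pB = 1 recovers the below normal curve and pB = pA = 0 recovers the near normal curve, which is a handy sanity check on fitted coefficients.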

Finally, Wilks (2002) shows that the standard deviations of minimum and maximum temperature on wet and dry days can be estimated by an extension of Equation 1:

(8) s\left(t\right) = \left[\frac{p_B}{N_B}\sum_{i=1}^{N_B}\left[x_{i}^{\left(B\right)}\left(t\right) - \mu\left(t\right)\right]^2 + \frac{p_N}{N_N}\sum_{i=1}^{N_N}\left[x_{i}^{\left(N\right)}\left(t\right) - \mu\left(t\right)\right]^2 + \frac{p_A}{N_A}\sum_{i=1}^{N_A}\left[x_{i}^{\left(A\right)}\left(t\right) - \mu\left(t\right)\right]^2\right]^{1/2}

where µ(t) is defined as in Equation 4. Once again, the forecast-conditional standard deviations of the four temperature series, σ(t), can then be estimated by quadratic functions of time, conditional on the forecast probabilities pA and pB:

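(9) \sigma\left(t\right) = \left(\beta_0 + \beta_Bp_B + \beta_{BB}p_B^2 + \beta_Ap_A + \beta_{AA}p_A^2 + \beta_{BA}p_Bp_A\right) + \left(\gamma_0 + \gamma_Bp_B + \gamma_{BB}p_B^2 + \gamma_Ap_A + \gamma_{AA}p_A^2 + \gamma_{BA}p_Bp_A\right)t + \left(\delta_0 + \delta_Bp_B + \delta_{BB}p_B^2 + \delta_Ap_A + \delta_{AA}p_A^2 + \delta_{BA}p_Bp_A\right)t^2.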
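Equation 8 can be evaluated directly for a single day t. The sketch below is a minimal illustration with hypothetical temperatures. For simplicity it takes µ as the probability-weighted class mean on that day rather than the quadratic fit of Equation 4.

```python
import math

CLASSES = ("B", "N", "A")

def conditional_std(x_by_class, probs, mu):
    """Equation 8: forecast-weighted root mean squared deviation about mu."""
    total = 0.0
    for c in CLASSES:
        values = x_by_class[c]
        total += probs[c] / len(values) * sum((x - mu) ** 2 for x in values)
    return math.sqrt(total)

def weighted_class_mean(x_by_class, probs):
    """Stand-in for mu(t): probability-weighted average of the class means."""
    return sum(probs[c] * sum(x_by_class[c]) / len(x_by_class[c]) for c in CLASSES)

# Hypothetical maximum temperatures (deg C) on one calendar day, grouped by class.
temps = {"B": [0.0, 2.0], "N": [4.0, 6.0], "A": [8.0, 10.0]}

equal_odds = {"B": 1 / 3, "N": 1 / 3, "A": 1 / 3}
mu = weighted_class_mean(temps, equal_odds)
s = conditional_std(temps, equal_odds, mu)

warm = {"B": 0.1, "N": 0.3, "A": 0.6}
s_warm = conditional_std(temps, warm, weighted_class_mean(temps, warm))
print(mu, s, s_warm)
```

With equal forecast probabilities and equal class sizes, as in the first case here, the result reduces to the ordinary population standard deviation of the pooled record.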

For the VAR(1) model of temperature residuals, Wilks (2002) found that variations in the estimates of these parameters as a function of the forecast, like d, were minor for the investigated sites in New York. For this reason, the VAR(1) model was fit separately for each month and site based on the entire historical record and these estimates were unchanged with the forecast. Finally, Wilks (2002) found that the spatial correlation of temperature and precipitation also did not change significantly between climatic terciles, and so they too were assumed independent of the forecast. Correlations in temperature were included in the VAR(1) model as described in Part III. Correlations in precipitation occurrences ω, and amounts, ζ, between sites k and l were approximated for all site pairs each month as a function of the horizontal distance, c, between them (see Equations 10 and 11, respectively, for which parameters θ1, θ2 and θ3 were estimated).

(10) \omega = \left(1 + \frac{c_{k,l}}{\theta_1}\right)^{\theta_2}

(11) \zeta = \exp\left(-\theta_3c_{k,l}\right)
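For reference, Equations 10 and 11 translate into a few lines of code. The site names, coordinates and θ values below are hypothetical placeholders, not the estimates from Wilks (2002).

```python
import math

def occurrence_correlation(distance, theta1, theta2):
    """Equation 10: correlation of precipitation occurrences at separation c."""
    return (1.0 + distance / theta1) ** theta2

def amount_correlation(distance, theta3):
    """Equation 11: correlation of precipitation amounts at separation c."""
    return math.exp(-theta3 * distance)

def correlation_tables(coords, theta1, theta2, theta3):
    """Occurrence and amount correlations for every ordered pair of sites."""
    occurrence, amount = {}, {}
    for k in coords:
        for l in coords:
            c = math.dist(coords[k], coords[l])  # horizontal distance between sites
            occurrence[k, l] = occurrence_correlation(c, theta1, theta2)
            amount[k, l] = amount_correlation(c, theta3)
    return occurrence, amount

# Hypothetical site coordinates in km and made-up parameter values.
coords = {"site_1": (0.0, 0.0), "site_2": (60.0, 80.0), "site_3": (150.0, 20.0)}
occurrence, amount = correlation_tables(coords, theta1=100.0, theta2=-0.5, theta3=0.002)
print(occurrence["site_1", "site_2"], amount["site_1", "site_2"])
```

A negative θ2 and positive θ3, as assumed here, make both correlations decay from 1 as the distance between sites increases.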

If one received a tercile ENSO forecast, the same approach could be used as in Wilks (2002), except the season of interest in each historical year would be classified as La Niña, Neutral or El Niño instead of below normal, near normal or above normal.

Non-parametric Weather Generators

The key idea of weighted sampling from Briggs and Wilks (1996) has also been applied in non-parametric weather generators to condition synthetic weather series on seasonal climate forecasts. For example, Apipattanavis et al. (2007) modify their semi-parametric k-nn generator to find and probabilistically select neighbors, not from the entire historical record, but from a bootstrapped sample of the historical record consistent with the forecast. Again, this can be applied using tercile forecasts of either the {Below Normal, Near Normal, Above Normal} type or {La Niña, Neutral, El Niño} type.

Clark et al. (2004a) develop a more innovative approach that combines ideas from the non-parametric Schaake Shuffle method used to spatially correlate short-term precipitation and temperature forecasts (Clark et al., 2004b) with a parametric approach to weighted resampling presented by Yates et al. (2003) for the k-nn generator of Rajagopalan and Lall (1999). The Schaake Shuffle, originally devised by Dr. J. Schaake of the National Weather Service Office of Hydrologic Development, is a method of reordering ensemble precipitation and temperature forecasts to better capture the spatial and cross correlation of these spatial fields (Clark et al., 2004b).

Traditionally, model output statistics (MOS) from a Numerical Weather Prediction (NWP) model, such as temperature, humidity and wind speed at different pressure levels, are used as predictors in a regression model to forecast daily temperature and precipitation at a number of sites. To generate an ensemble of predictions for each forecasted day, normal random variables with mean 0 and variance σ_ε^2 are added back to the mean prediction, where σ_ε^2 is the variance of the regression residuals. However, these regressions are generally developed independently for each variable at each site and therefore do not reproduce the spatial or temporal correlation between the variables across sites and time (Clark et al., 2004b). To better capture these correlations, the Schaake Shuffle, illustrated in Figure 2 from Clark et al. (2004a) for a 10-member ensemble, re-orders the ensemble members each day in order to preserve the Spearman-rank correlation in the temperature and precipitation variables between sites.

[Figure 2 from Clark et al. (2004a): Tables A to D illustrating the Schaake Shuffle for a 10-member ensemble]

The Schaake Shuffle proceeds as follows. For a particular day, the original ensemble members for each variable at each station are ranked from lowest to highest, as shown in Table A of Figure 2 above. Next, a set of historical observations of the same size is generated by randomly selecting days from the historical record within a window of 7 days before and after the forecast date (Table B). Third, the historical observations are sorted from lowest to highest for each variable at each site, as shown in Table C. Finally, the original ensemble members in Table A are re-shuffled to form the final, spatially correlated ensembles in Table D in the following way:

  1. The rank of the data in the first historical observation (shown with dark circles in Tables B and C) is determined at each site.
  2. At each site, the member of the original ensemble with the same rank as the first historical observation for that site becomes the first member of the final, correlated ensemble (see dark circles in Table A and location in Table D).
  3. Steps 1 and 2 are repeated for every historical observation/ensemble member.

As stated earlier, this process reproduces the Spearman rank correlation of the observations across sites (Clark et al., 2004b). In order to preserve the temporal correlation for each variable, instead of re-generating a random set of historical observations to use for shuffling the next day’s forecast, the observations from the days following those used for the previous time step are used. While the Schaake Shuffle does not guarantee reproduction of the spatial correlation in the observations, just in their rank, the results presented in Clark et al. (2004b) indicate that the method does reasonably well for both, and significantly improves upon the un-shuffled forecasts.
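The re-ordering steps for a single day can be sketched in a few lines of Python. This is my own minimal illustration of the procedure described above, not code from Clark et al. (2004b). The function names and the five-member temperatures are hypothetical, and the historical values at both sites are assumed to come from the same randomly selected dates, in the same order.

```python
def schaake_shuffle(ensemble, historical):
    """Reorder one site's ensemble so its rank order matches the historical sample."""
    n = len(ensemble)
    if len(historical) != n:
        raise ValueError("ensemble and historical sample must be the same size")
    ranked_ensemble = sorted(ensemble)                       # Table A
    # by_rank[r] is the position of the historical observation with rank r (Table C).
    by_rank = sorted(range(n), key=lambda j: historical[j])
    shuffled = [None] * n
    for rank, position in enumerate(by_rank):
        # The member with this rank takes the position of the matching observation.
        shuffled[position] = ranked_ensemble[rank]           # Table D
    return shuffled

def shuffle_all_sites(ensembles, historicals):
    """Apply the shuffle site by site, using the same historical dates at every site."""
    return {site: schaake_shuffle(ensembles[site], historicals[site])
            for site in ensembles}

# Hypothetical forecasts and same-date historical observations (deg C) at two sites.
ensembles = {"site_1": [15.3, 11.2, 8.8, 11.9, 7.5],
             "site_2": [9.3, 9.1, 7.2, 10.0, 13.5]}
historicals = {"site_1": [10.7, 9.3, 6.8, 11.3, 12.2],
               "site_2": [10.9, 9.1, 7.2, 10.7, 13.1]}

final = shuffle_all_sites(ensembles, historicals)
print(final)
```

Each site keeps exactly the same set of forecast values; only their order changes, so the marginal distributions from the regressions are untouched while the rank structure across sites follows the observations.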

In the weather generator presented by Clark et al. (2004a), the same approach is used to simulate weather sequences except the ensembles in Table A are not generated by MOS regressions but by independently sampling historical observations within +/- 7 days of the simulated day at each site. To condition this weather generator on seasonal climate forecasts, the unshuffled ensembles are formed by preferential selection of different years from the historical record following an approach inspired by Yates et al. (2003). The first step in this approach is to sort all N historical years in terms of their similarity to a climate index, such as current SSTs in the Niño 3.4 region. The most similar year is given rank i = 1 and the least similar i = N. Next, a standard uniform random variable u is drawn and the year of rank i is chosen as an ensemble member, where i = INT(u^λ N/α) + 1. Here INT(·) is the integer operator, λ is a weighting parameter, and α a selection parameter. Values of λ greater (less) than 1 increase (decrease) the probability of selecting years ranked more similar to the climate index. Values of α greater than 1 restrict the number of sampled years such that α = 5, for example, results in only the most similar 1/5 of years being selected (Clark et al., 2004a).
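The rank-selection rule is easy to experiment with. The sketch below reads the formula as i = INT(u^λ N/α) + 1, that is, u raised to the power λ, which is the reading consistent with larger λ favoring the most similar years. The function names and the list of years are hypothetical, and the check that α is at least 1 is my own assumption to keep the rank within 1 to N.

```python
import random

def select_rank(u, n_years, lam, alpha):
    """Rank of the chosen year: i = INT(u**lam * N / alpha) + 1."""
    if alpha < 1 or lam <= 0:
        raise ValueError("this sketch assumes alpha >= 1 and lam > 0")
    return int(u ** lam * n_years / alpha) + 1

def sample_years(ranked_years, lam, alpha, k, rng):
    """Draw k years from a list ordered from most to least similar to the index."""
    n = len(ranked_years)
    return [ranked_years[select_rank(rng.random(), n, lam, alpha) - 1]
            for _ in range(k)]

# Hypothetical record of 50 years, already sorted by similarity to the climate index.
ranked_years = list(range(1951, 2001))
rng = random.Random(42)

uniform = sample_years(ranked_years, lam=1.0, alpha=1.0, k=1000, rng=rng)
focused = sample_years(ranked_years, lam=2.0, alpha=5.0, k=1000, rng=rng)
print(len(set(uniform)), len(set(focused)))
```

With λ = 1 and α = 1 every year is equally likely, while α = 5 confines the draws to the 10 most similar of the 50 years and λ = 2 further concentrates them toward rank 1.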

Yates et al. (2003) apply a simplified version of this method with only one parameter, λ, in a scenario discovery-type approach, investigating the effects of e.g. warmer-drier springs and cooler-wetter summers. Clark et al. (2004a) first take this approach by ranking the historical years according to their similarity to the current Niño 3.4 index and exploring the effects of different choices of λ and α on the skill of the generated weather sequences in forecasting total winter precipitation at Petrified Forest in Arizona, measuring skill by the ranked probability skill score (RPSS). Interestingly, they find that high values of both λ and α, where years more similar to the climate index at the beginning of the season are selected, result in negative forecast skill. This highlights the importance of not being overconfident by only sampling years closest to current or forecast conditions. They note that the values of λ and α should depend on the strength of the Niño 3.4 index, and therefore should be re-optimized for different values of the index in order to maximize the RPSS.
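To see how overconfidence can produce negative skill, here is a small sketch of the RPSS for tercile forecasts, using the standard definition of the ranked probability score as the sum of squared differences between cumulative forecast and cumulative observed probabilities. The function names and forecasts are hypothetical. Some definitions divide the score by the number of categories minus one, which cancels in the skill score.

```python
from statistics import mean

def ranked_probability_score(probs, observed):
    """RPS for one forecast; categories are 0 = below, 1 = near, 2 = above normal."""
    cum_forecast = 0.0
    score = 0.0
    for k, p in enumerate(probs):
        cum_forecast += p
        cum_observed = 1.0 if k >= observed else 0.0  # step function at the outcome
        score += (cum_forecast - cum_observed) ** 2
    return score

def rpss(forecasts, observations, climatology=(1 / 3, 1 / 3, 1 / 3)):
    """Skill relative to always forecasting climatological odds."""
    rps_forecast = mean(ranked_probability_score(f, o)
                        for f, o in zip(forecasts, observations))
    rps_climatology = mean(ranked_probability_score(climatology, o)
                           for o in observations)
    return 1.0 - rps_forecast / rps_climatology

# Three hypothetical forecasts for a season that turned out below normal (category 0).
perfect = rpss([(1.0, 0.0, 0.0)], [0])
hedged = rpss([(0.5, 0.3, 0.2)], [0])
overconfident = rpss([(0.0, 0.0, 1.0)], [0])
print(perfect, hedged, overconfident)
```

A confident forecast that lands in the wrong tercile scores far worse than climatology, so an ensemble built only from the years closest to the index can be penalized heavily when the season does not follow that analog.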

All of these approaches could prove informative for seasonal water resources planning, if the forecasts being used are reliable. In the case of tercile forecasts, this means that, on average, when a given climate state is forecast to occur with probability p, it does in fact occur with that probability. Given that past diagnostic assessments of IRI and CPC forecasts have found biases and overconfidence in some locations (Wilks and Godfrey, 2002; Wilks, 2000), water managers should proceed with caution in using them for seasonal planning. At a minimum, one should perform an analysis of the forecast value for the system of concern (Anghileri et al., 2016) before changing system operations. Fortunately, these forecasts continue to improve over time and several studies have already found value in using them to inform seasonal operations (e.g. Kim and Palmer (1997), Block (2011), Block and Goddard (2012), Anghileri et al. (2016)), indicating promise in their use for water resources planning.

Works Cited

Anghileri, D., Voisin, N., Castelletti, A., Pianosi, A., Nijssen, B., & Lettenmaier, D. P. (2016). Value of long-term streamflow forecasts to reservoir operations for water supply in snow-dominated river catchments. Water Resources Research, 52(6), 4209-4225.

Apipattanavis, S., Podestá, G., Rajagopalan, B., & Katz, R. W. (2007). A semiparametric multivariate and multisite weather generator. Water Resources Research, 43(11).

Barnston, A. G., van den Dool, H. M., Rodenhuis, D. R., Ropelewski, C. R., Kousky, V. E., O’Lenic, E. A., et al. (1994). Long-lead seasonal forecasts-Where do we stand? Bulletin of the American Meteorological Society, 75(11), 2097-2114.

Block, P. (2011). Tailoring seasonal climate forecasts for hydropower operations. Hydrology and Earth System Sciences, 15, 1355-1368.

Block, P., & Goddard, L. (2012). Statistical and dynamical climate predictions to guide water resources in Ethiopia. Journal of Water Resources Planning and Management, 138(3), 287-298.

Briggs, W. M., & Wilks, D. S. (1996). Extension of the Climate Prediction Center long-lead temperature and precipitation outlooks to general weather statistics. Journal of Climate, 9(12), 3496-3504.

Chiew, F. H. S., Zhou, S. L., & McMahon, T. A. (2003). Use of seasonal streamflow forecasts in water resources management. Journal of Hydrology, 270(1), 135-144.

Clark, M. P., Gangopadhyay, S., Brandon, D., Werner, K., Hay, L., Rajagopalan, B., & Yates, D. (2004a). A resampling procedure for generating conditioned daily weather sequences. Water Resources Research, 40(4).

Clark, M., Gangopadhyay, S., Hay, L., Rajagopalan, B., & Wilby, R. (2004b). The Schaake shuffle: A method for reconstructing space-time variability in forecasted precipitation and temperature fields. Journal of Hydrometeorology, 5(1), 243-262.

Climate Prediction Center (2016). ENSO: Recent Evolution, Current Status and Predictions. National Oceanic and Atmospheric Administration, pp. 19-20.

Herman, J. D., Zeff, H. B., Lamontagne, J. R., Reed, P. M., & Characklis, G. W. (2016). Synthetic drought scenario generation to support bottom-up water supply vulnerability assessments. Journal of Water Resources Planning and Management, 04016050.

Katz, R. W. (1983). Statistical procedures for making inferences about precipitation changes simulated by an atmospheric general circulation model. Journal of the Atmospheric Sciences, 40(9), 2193-2201.

Kim, Y., & Palmer, R. (1997). Value of seasonal flow forecasts in Bayesian stochastic programming. Journal of Water Resources Planning and Management, 123(6), 327-335.

Kirsch, B. R., Characklis, G. W., & Zeff, H. B. (2013). Evaluating the impact of alternative hydro-climate scenarios on transfer agreements: Practical improvement for generating synthetic streamflows. Journal of Water Resources Planning and Management, 139(4), 396-406.

Rajagopalan, B., & Lall, U. (1999). A k‐nearest‐neighbor simulator for daily precipitation and other weather variables. Water Resources Research, 35(10), 3089-3101.

Richardson, C. W. (1981). Stochastic simulation of daily precipitation, temperature and solar radiation. Water Resources Research, 17, 182-190.

Stedinger, J. R., & Kim, Y. O. (2010). Probabilities for ensemble forecasts reflecting climate information. Journal of Hydrology, 391(1), 9-23.

Wilks, D. S. (2000). Diagnostic verification of the Climate Prediction Center long-lead outlooks, 1995-1998. Journal of Climate, 13, 2389-2403.

Wilks, D. S. (2002). Realizations of daily weather in forecast seasonal climate. Journal of Hydrometeorology, 3(2), 195-207.

Wilks, D. S. & Godfrey, C. M. (2002). Diagnostic verification of the IRI net assessment forecasts, 1997-2000. Journal of Climate, 15(11), 1369-1377.

Yates, D., Gangopadhyay, S., Rajagopalan, B., & Strzepek, K. (2003). A technique for generating regional climate scenarios using a nearest-neighbor algorithm. Water Resources Research, 39(7).

Getting started with C and C++

I’ve been learning C and C++ recently and I thought I’d share my experience learning these languages through a post. Prior to learning C and C++, I had experience in Python and Matlab, but this was my first foray into lower level languages. In my attempts to learn each language I made my way through several courses and books available online; some I found very helpful, others not so much. In this post I’ll detail my experiences with each resource and provide some tips that may help those planning on learning these languages.

Main takeaways

To learn the languages, I used four main resources: online courses from Lynda.com, a book titled Learn C the Hard Way, a book commonly known as K&R 2, and tutorials from cplusplus.com. For those who do not have the time or desire to read this whole post, I found the following resources to be the most useful:

For C:

Learn C the Hard Way

K&R 2

For C++:

Foundations of Programming: Object-Oriented Design (you need a lynda.com login to access)

Up and Running with C++ (again, you need a lynda.com login to access)

cplusplus.com

Everyone’s learning style is different, but I found that I learned the languages much faster by coding examples myself, rather than watching someone walk through a script. I also found that courses that taught straight from the command line were more effective than courses that taught through an IDE. When using an IDE, I often found myself spending more time working through glitches or nuances within the IDE than learning the languages themselves.

I’ll detail my experiences with each resource below. I’ll start with C resources, then discuss C++.

Resources for Learning C

C Essential Training – From Lynda.com

I started my training by taking a course on Lynda.com titled “C Essential Training”. Lynda.com is an online educational website with thousands of videos, many of which focus on programming. The service is free to Cornell students and graduate students (though I checked and unfortunately neither PSU, UC Davis nor CU Boulder have agreements with the site). I found the course to be well structured and I felt that the instructor presented the material clearly and understandably. Despite this, I do not feel that the course did an effective job teaching me C. The main problem I had with the course was its reliance on the Eclipse IDE. Eclipse seems like a fine IDE, but I don’t plan to use an IDE when writing code and I didn’t want to spend the time taking a separate course to learn its intricacies (though Lynda.com does have a full course devoted to Eclipse). Throughout the course, I kept finding myself having small Eclipse problems (e.g. not being able to change the project I was working on or having compiler errors) that were not hard to solve, but were never discussed in the lectures. I was able to solve each problem by doing some research online, but each little problem took me time to resolve and was mentally taxing. After spending 30 minutes looking up an Eclipse fix, I was not in the mood to go troubleshooting interesting C questions. Another problem with using Eclipse is that the user is never forced to write their own makefiles, an omission that seems like it could really hurt someone who plans to run C programs through the command line. In summary, I would not recommend taking this course unless you are either thoroughly versed in Eclipse or plan to write all of your code through Eclipse.

Learn C the Hard Way

The next resource I used to learn C was a book that Jazmin pointed me to called Learn C the Hard Way by Zed A. Shaw (after some poking around I found this had been mentioned previously on this blog). The book is laid out as a tutorial, where each chapter teaches a new C concept (it’s really a C course in book form). The author takes a slightly nontraditional teaching approach in that he makes you write the code first, then explains in detail what you just wrote. I found this very hands-on teaching method extremely helpful. When I wrote the code myself, I was forced to focus on every detail of the code (something that is very important in a language like C). I also was able to learn which concepts were genuinely challenging for me and which concepts I needed more work on. When I watched the Lynda.com lectures, I’d often believe I understood a concept, only to find out later that I had misinterpreted the instructor’s lesson.

The book does not use an IDE; instead, you write code in a text editor (I used Sublime Text) and run it on the Unix command line. The author provides a succinct introduction to makefiles and how to use them, which was refreshing after the Eclipse-based course that never mentioned makefiles or compilers.

Overall I found the teaching method employed by the book to be very effective, and I would highly recommend it to new C users. I should note, however, that there seems to be some controversy surrounding the book. If you google “Learning C the hard way” you’ll find some very heated exchanges between the author and a blogger who criticized the book’s teaching methodology. The blogger had two main criticisms of the book: first, that it oversimplified and inaccurately presented key C concepts, and second, that the author failed to teach accepted programming standards for the C language. Mr. Shaw’s rebuttal was that the book’s purpose was to get people comfortable with C and begin actually coding with it, then, once they are literate, have them go back and learn more about the language’s nuances. I personally agree with Mr. Shaw on this point, though I don’t have a background in computer science so my opinion is only that of a beginner. Many of the criticisms of the book seemed to come from the perspective of an advanced coder who is unable to see the language through the eyes of a beginner. Mr. Shaw’s explanations might be oversimplified, but they do a good job demystifying many of the most foreign aspects of C. I think that use of this book should be supplemented with other sources, especially documents on accepted C coding standards, but if you’re looking for a quick way to get on your feet with C and gain some confidence, then the book is a great resource.

I used a free beta version of the book which can be found here: http://c.learncodethehardway.org/book/ but you can also purchase the book from the author here: https://www.amazon.com/Learn-Hard-Way-Practical-Computational/dp/0321884922

I found the beta version to be just fine, but there were some minor errors and some sections were clearly under construction.

The blog post criticizing the book can be found here: http://hentenaar.com/dont-learn-c-the-wrong-way

K&R 2

A resource that I discovered through reading the exchanges between Mr. Shaw and his critics was “The C Programming Language” by Brian W. Kernighan and Dennis M. Ritchie (commonly referred to as K&R 2, which is what I’ll call it for the rest of the post). One of the authors of this book, Dennis Ritchie, actually created the C language, and the book is regarded as the go-to authority on all matters C. Mr. Shaw devoted a whole chapter of “Learning C the Hard Way” to bashing this book, but I found its layout and explanations quite accessible and useful. I did not find the tutorials as direct as those in “Learning C the Hard Way”, but I found it to be a helpful supplement.

 

Resources for Learning C++

Foundations of Programming: Object-Oriented Design – From Lynda.com

A main difference between C and C++ is that C++ supports object-oriented programming. I had some very basic experience with object-oriented programming, but was looking for a refresher before learning C++. “Foundations of Programming: Object-Oriented Design” was an excellent course that taught me all I wanted to know about object-oriented programming and more. The course is purely conceptual and does not teach any actual code or contain any practice problems. It presents the concepts in a simple yet comprehensive manner that I found very helpful. I would highly recommend this course to anyone hoping to learn or brush up on how object-oriented programming works.

Up and Running with C++ – From Lynda.com

This course was very similar in layout to the C course from Lynda.com, and I have the same criticisms. The entire course used Eclipse, and I kept having minor problems that were never addressed by the lectures but prevented me from executing my code. I did feel like I was able to learn the basic tools I needed from the lectures, but I would have gotten much more out of the class if it had been taught through the command line. I also felt that the course was sparse on exercises and heavy on lectures. I found that I got much less out of watching the instructor write code than being forced to write out the code myself (as Learning C the Hard Way forces you to do).

cplusplus.com

This resource is linked often in older posts on this blog, and I found it helpful in answering C++ questions I had after finishing the Lynda.com courses. I did not find that its tutorial offered the most helpful narrative for learning C++ from scratch, but it has very succinct definitions of many C++ components and was helpful as a reference. I think this site is the one I will look to most when troubleshooting future C++ code.

Final thoughts

I’d like to note that I found WRASEMAN’s post on makefiles a few weeks back to be quite helpful. From my limited experience, ensuring that your code compiles correctly can be one of the most challenging parts of using a lower-level language, and the post has some excellent resources that explain what makefiles are and how they can be used.

I know there are a lot of contributors and readers of this blog who are much more versed in C and C++ than I am, so if you’d like to chime in on useful resources, please do so in the comments.

 

Importing, Exporting and Organizing Time Series Data in Python – Part 2

This blog post is Part 2 of a multi-part series of posts intended to introduce options available in Python for (1) reading (importing) data (with particular emphasis on time series data, and how to handle .csv, .xls and .xlsx files); (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.

Part 1 of the series focused on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets.

This blog post presents a flexible Python function I developed to import time series outputs from a simulation model into a dictionary of 3D pandas DataFrames (DFs). This function is part of a data importing and exporting module included in the PySedSim simulation model I have been developing over the last year, but you can use this function for time series outputs from any simulation model, provided the data are stored in .csv files and organized in the format shown in Fig. 2.

The value of this function is it creates a single dictionary of 3D pandas DFs storing time series outputs from multiple simulation “scenarios”; for multiple system locations (e.g., Reservoir 1, Reservoir 2…); multiple state variables (reservoir water storage, suspended sediment load, etc.); and multiple realizations, or stochastic simulation ensemble members (e.g., Realization 1, Realization 2, …, Realization 1000). I have found the 3D DFs to have more reliable functionality than 4D DFs, so I have elected not to use a fourth pandas panel dimension in this function.
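
To make the layout concrete, here is a minimal sketch of the kind of nested structure described above. It assumes, based on how the function below is written, that each system location maps to a dictionary of state variables, and that each state variable holds a table with dates as rows and realizations as columns. All names, dates, and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical names and sizes, chosen only for illustration.
scenarios = ['Scenario 1', 'Scenario 2']
locations = ['Reservoir 1', 'River Channel 1']
variables = ['water_storage', 'inflow_rate']
dates = pd.date_range('2016-01-01', periods=5, freq='D')
realizations = ['Realization 1', 'Realization 2', 'Realization 3']

rng = np.random.default_rng(0)

# Build scenario -> location -> variable -> DataFrame (rows: time, columns: realizations).
TSID = {
    scen: {
        loc: {
            var: pd.DataFrame(rng.random((len(dates), len(realizations))),
                              index=dates, columns=realizations)
            for var in variables
        }
        for loc in locations
    }
    for scen in scenarios
}

# Pull out one table and summarize it across realizations.
storage = TSID['Scenario 1']['Reservoir 1']['water_storage']
mean_across_realizations = storage.mean(axis=1)  # one value per date
print(storage.shape)
print(mean_across_realizations.head())
```

Indexing by scenario, then location, then state variable returns a single table, so any pandas operation (resampling, quantiles across realizations, and so on) can be applied to one location and variable at a time.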

In a future post, I will then present some additional modules I have developed that use Pandas functionality to evaluate “performance” of different system locations (e.g., reservoirs) with respect to a diverse set of temporal resampling strategies and distribution statistics.

Below is the Python function, which you are welcome to use and/or modify for your own purposes. Please note that I had trouble reproducing Python formatting in the blog post (and some html characters get incorrectly inserted), and apparently I cannot post a link to a .py file, so it will eventually be on Github in late 2016.

# Import method-specific libraries
from __future__ import division  # __future__ imports must come before all other imports
from copy import deepcopy
import pandas as pd
import os
import platform

def Import_Simulation_Output(Sims_to_Import, Locations_to_Import, var_sub_list, file_dir, proc_num = ''):
    '''
    Purpose: Imports time series outputs from a simulation run into a 3-dimensional pandas dataframe (DF).

    More detail: Module intended to import .csv files that have been produced as the result of a PySedSim simulation.
    Each .csv file should contain time series outputs for a single state variable (e.g., reservoir water storage) and
    system location (e.g., reservoir 1).

    The purpose of using this file may either be to (1) import the output into a dictionary of pandas structures so
    that simulation performance measures can be evaluated and plotted, or (2) to import all the data produced by
    separate processors into a single data structure that can then be exported into a .csv file that contains aggregated
    output for each system location/variable.

    DF details (a 3D DF exists for each system location (e.g., reservoir)):
    Axis 0: State variable (e.g., Water storage, Energy production) for system location
    Axis 1: Time (e.g., dates over simulation horizon)
    Axis 2: Realization Number (e.g., stochastic simulation ensemble members)

    :param Sims_to_Import: List, containing strings of simulation scenario names (these must be directories in the
    specified output file directory that have these names). Example: ["Alternative Scenario 7A"]

    :param Locations_to_Import: Dictionary, where keys are scenario names (which must be in the Sims_to_Import list)
    and values are lists of strings representing simulation element names (e.g., Reservoir 1).
    Example: {"Alternative Scenario 7A": ["Reservoir 1"]}

    :param var_sub_list: List, containing strings of PySedSim state variable names for which .csv output files exist
    for the scenarios in the Sims_to_Import list. Example: ['water_surface_elevation', 'capacity_active_reservoir']

    :param file_dir: String, directory in which output files to be imported are located.
    Example: r'E:\PySedSim\ModelFiles\Output_Storage'

    :param proc_num: Optional. Integer, number appended to the output .csv file representing the processor that
    produced the file (e.g., the number 3 for the file 'water_surface_elevation_3.csv')

    :return TSID: Dictionary, where keys are scenario names. Key stores sub_dictionary, where sub_dictionary keys are
    system locations storing 3D pandas DF for each system location.
    :return Num_Realizations: Dictionary, where keys are scenario names, storing number of stochastic realizations for
    scenario.
    :return Num_Years: Dictionary, where keys are scenario names, storing number of years in a simulation realization
    for scenario
    '''

    # Get operator (/ or \) for changing directory based on operating system.
    os_fold_op = Op_Sys_Folder_Operator()

    # This function reads back in previously exported simulation data so performance measure analysis can be conducted.
    if proc_num != '':
        cluster_loop = '_' + str(proc_num-1) # Subtract 1 as first file ends with "0".
        cluster_sub_folder = 'cluster_output'
    else:
        cluster_loop = ''
        cluster_sub_folder = ''

    # Initialize various data structures
    TSID = {} # Main dictionary to export
    TSID_Temp = {} # Use to temporarily load each processor's output sheet for location/variable, if applicable
    Num_Realizations = {} # For each scenario, stores number of realizations for that scenario
    Num_Years = {} # For each scenario, stores number of years in each realization for that scenario
    counter = {} # Temporary counter

    # Main data import loop. Intent is to import data into Time Series Import Dictionary (TSID)
    for sims in Sims_to_Import:
        counter[sims] = 0
        TSID[sims] = {} # Sub dict for each simulation will store locations.
        TSID_Temp[sims] = {} # Sub dict for element/variable output for a given processor in cluster.
        sim_import_loc = file_dir + os_fold_op + sims # This folder needs to already exist.
        for sys_locs in Locations_to_Import[sims]:
            TSID[sims][sys_locs] = {} # Sub dictionary for each location will store variables.
            TSID_Temp[sims][sys_locs] = {} # Sub dict for location will store a variable for each processor.
            loc_sub_folder = os_fold_op + cluster_sub_folder + os_fold_op + sys_locs
            # Requires that all the locations you are searching have all the variables listed in var_sub_list, which
            # won't always be the case (for junctions vs. reservoirs, for example).
            for vars in var_sub_list:
                file_path = sim_import_loc + loc_sub_folder # File path reflecting new folder
                if os.path.exists(os.path.join(file_path, vars + cluster_loop + '.csv')) == True:
                    # This variable exists as a file name in the specified file path, so import it.
                    if proc_num == '':
                        # User is not importing output files produced on a cluster by various processors. Proceed
                        # linearly (there are not different files from different processors that need to be combined).
                        # Import this csv file into a dataframe.
                        TSID[sims][sys_locs][vars] = pd.read_csv(os.path.join(file_path, vars + cluster_loop + '.csv'),
                        index_col=0)

                        # Force each dataframe to have datetime objects as dates rather than strings.
                        TSID[sims][sys_locs][vars].set_index(pd.to_datetime(TSID[sims][sys_locs][vars].index),
                        inplace=True)
                        # Determine number of realizations (ensembles). Only do this calculation once per simulation
                        # realization (on first pass through loop).
                        if counter[sims] == 0:
                            Num_Realizations[sims] = len(TSID[sims][sys_locs][vars].columns)
                            Num_Years[sims] = TSID[sims][sys_locs][vars].index[-1].year - \
                            TSID[sims][sys_locs][vars].index[0].year + 1
                            counter[sims] += 1
                    else:
                        # User wishes to use this processor to create a dictionary for the particular
                        # location/variable of interest. This processor will therefore read in all output .csv files
                        # produced by other processors.
                        for csv_sheet in range(proc_num):
                            # Import this processor's csv file into a temporary dataframe.
                            TSID_Temp[sims][sys_locs][vars] = pd.read_csv(
                            os.path.join(file_path, vars + '_' + str(csv_sheet) + '.csv'), index_col=0)
                            # Make each dataframe have datetime objects as dates rather than strings.
                            TSID_Temp[sims][sys_locs][vars].set_index(
                            pd.to_datetime(TSID_Temp[sims][sys_locs][vars].index), inplace=True)
                            # Store data from this processor in the master dictionary, for the current
                            # location/variable only, so that other entries are not overwritten or duplicated.
                            if csv_sheet == 0:
                                TSID[sims][sys_locs][vars] = deepcopy(TSID_Temp[sims][sys_locs][vars])
                            else:
                                # Add this new set of realizations from this DF into the main DF
                                TSID[sims][sys_locs][vars] = pd.concat(
                                [TSID[sims][sys_locs][vars], TSID_Temp[sims][sys_locs][vars]], axis=1)
    print("Data Import is completed.")
    # Return is conditional. Number of realizations/years cannot be provided if the TSID only represents one of many
    # ensemble members of a stochastic simulation:
    if proc_num != '':
        return TSID
    else:
        return TSID, Num_Realizations, Num_Years
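
The function above calls a helper, Op_Sys_Folder_Operator(), that is not shown in this post. The following is a hypothetical stand-in, written only on the assumption that the helper returns the folder separator for the current operating system (which is what the comment in the function suggests). The actual PySedSim helper may differ.

```python
import os
import platform


def Op_Sys_Folder_Operator():
    '''
    Hypothetical stand-in for the helper called by Import_Simulation_Output.
    Returns the folder separator for the current operating system.
    '''
    if platform.system() == 'Windows':
        return '\\'
    return '/'


# os.sep from the standard library provides the separator without a custom helper.
print(Op_Sys_Folder_Operator() == os.sep)
```

If you would rather not maintain a helper like this, building every path with os.path.join (as the function already does for the file names) avoids the need to handle separators by hand.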

The following is an example of how to make use of this function. Suppose you have 2 directories corresponding to two different simulation runs (or “scenarios”). Let’s call those Scenario 1 and Scenario 2.

Within each of those scenario directories, you have separate directories for each system location. In this example, the locations are “Reservoir 1” and “River Channel 1”, so there are 2 sub-directories.

Within each of those location directories, you have .csv files for the following state variables: water_storage, susp_sediment_inflow, inflow_rate. Each .csv file stores time series output for a particular scenario, location, and state variable across all realizations (ensemble members).

Figure 1 (below) shows an example of how these files might be stored for “River Channel 1”, within a directory titled “Simulation Output”.

Fig 2 - file location

Figure 2 (below) shows an example of how these files might be stored for “Reservoir 1”, within a directory titled “Simulation Output”.

Fig 2A - file location

The figure below shows an example of how one .csv file is laid out. (The PySedSim model automatically creates the directory structure and file formatting shown here.)

Fig 1 - csv file layout
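
If you do not have PySedSim output on hand, the sketch below writes a small set of synthetic files in a directory structure like the one described above, so there is something to experiment with. The file layout (dates in the first column, one column per realization) is an assumption inferred from how the import function reads each file. The temporary directory, realization names, and random values are all made up.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Scenario, location, and variable names follow the example in the text.
scenarios = ['Scenario 1', 'Scenario 2']
locations = ['Reservoir 1', 'River Channel 1']
variables = ['water_storage', 'susp_sediment_inflow', 'inflow_rate']

# Assumed file layout: dates in the first column, one column per realization.
dates = pd.date_range('2016-01-01', '2017-12-31', freq='D')
columns = ['Realization {}'.format(i + 1) for i in range(4)]
rng = np.random.default_rng(1)

file_dir = tempfile.mkdtemp()  # stands in for the "Simulation Output" directory
for scen in scenarios:
    for loc in locations:
        folder = os.path.join(file_dir, scen, loc)
        os.makedirs(folder)
        for var in variables:
            df = pd.DataFrame(rng.random((len(dates), len(columns))),
                              index=dates, columns=columns)
            df.to_csv(os.path.join(folder, var + '.csv'))

# Read one file back the same way the import function does.
path = os.path.join(file_dir, 'Scenario 1', 'Reservoir 1', 'water_storage.csv')
check = pd.read_csv(path, index_col=0)
check.index = pd.to_datetime(check.index)

num_realizations = len(check.columns)
num_years = check.index[-1].year - check.index[0].year + 1
print(num_realizations, num_years)
```

Passing this file_dir to the import function, together with matching scenario, location, and variable lists, should then let it find the files.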

The following is an example function call:


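# Assumes the function above is saved in a file named Import_Simulation_Output.py
from Import_Simulation_Output import Import_Simulation_Output

Sims_to_Import = ['Scenario 1', 'Scenario 2']

Locations_to_Import = {'Scenario 1': ['Reservoir 1', 'River Channel 1'], 'Scenario 2': ['Reservoir 1', 'River Channel 1']}

var_sub_list = ['inflow_rate', 'susp_sediment_inflow', 'water_storage']

file_dir = r'C:\Users\tbw32\Documents\Reed Group\Blog - Water Programming\2016\July 2016\Simulation Output'

# With no proc_num given, the function returns the data dictionary plus the realization and year counts.
TSID, Num_Realizations, Num_Years = Import_Simulation_Output(Sims_to_Import, Locations_to_Import, var_sub_list, file_dir)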

In the next post I will demonstrate what to do with the dictionary that has been created.