SALib v0.7.1: Group Sampling & Nonuniform Distributions

This post discusses the changes to the Python library SALib in version 0.7.1, with some examples of how to use the new capabilities. The two major additions in this version are group sampling for Sobol’ sensitivity analysis and the ability to specify nonuniform distributions for input variables.

Sobol’ Indices Group Sampling

Previous versions of SALib allowed one to calculate the first-order, total-order, and second-order indices for individual input parameters. These same indices can also be defined for groups of input parameters (see Saltelli (2002) for more discussion). The main change is the addition of an item called ‘groups’ to the problem dictionary, which specifies the group to which each parameter belongs. Here is some example code; notice the ‘groups’ entry in the problem definition.


from SALib.sample import saltelli
from SALib.analyze import sobol
from SALib.util import read_param_file
import numpy as np

# example function
def sampleModel(x):
    y = x[:,0]**1.5 + x[:,1] + 2*x[:,2] + x[:,3] + np.exp(0.3*x[:,4]) + x[:,5] \
         + 2*x[:,1]*x[:,4] + (x[:,0]*x[:,5])**2
    return y

# problem definition
prob_gps_code = {
    'num_vars': 6,
    'names': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
    'bounds': [[0.0, 1.0], [2.0, 3.0], [0.5, 1.0], [0.0, 5.0], [-0.5, 0.5], [0.0, 1.0]],
    'groups': ['group1', 'group2', 'group1', 'group1', 'group2', 'group3']
}

# generating parameter values
param_vals_gps_code = saltelli.sample(prob_gps_code, 10000, calc_second_order=True)

# calculating model output values
Y_gps_code = sampleModel(param_vals_gps_code)

# completing Sobol' sensitivity analysis
Si_gps_code = sobol.analyze(prob_gps_code, Y_gps_code, calc_second_order=True, print_to_console=True)

The output from this code is given below. In this case, the first-order index (S1) reported for a group is the closed sensitivity index for that group. The S1 for group1 (P1, P3, and P4) is equivalent to summing: S1 for P1, P3, and P4; S2 for (P1 & P3), (P1 & P4), and (P3 & P4); and S3 for (P1 & P3 & P4). All of the equations used in calculating the sensitivity indices are the same, but they are now applied to groups of variables.

Group S1 S1_conf ST ST_conf
group1 0.471121 0.081845 0.472901 0.011769
group2 0.498600 0.078950 0.497005 0.013081
group3 0.030502 0.019188 0.031736 0.001041

Group_1 Group_2 S2 S2_conf
group1 group2 0.000618 0.159951
group1 group3 0.002170 0.161403
group2 group3 -0.003324 0.155224

Note: You can also use the read_param_file() function to define the problem. For the above example the problem file would look like:

P1,0.0,1.0,group1
P2,2.0,3.0,group2
P3,0.5,1.0,group1
P4,0.0,5.0,group1
P5,-.5,.5,group2
P6,0.0,1.0,group3

Nonuniform Distributions

Often the variables in a sensitivity analysis are assumed to be distributed uniformly over some interval. In the updated version of SALib it is possible to specify whether each input parameter follows a triangular, normal, lognormal, or uniform distribution. Each of these distributions interprets the ‘bounds’ in the problem dictionary differently, as listed below.

  • Triangular, “triang” (assumed lower bound of 0)
    • first “bound” is width of distribution (scale, must be greater than 0)
    • second “bound” is location of peak as a fraction of the scale (must be on [0,1])
  • Normal, “norm”
    • first “bound” is the mean (location)
    • second “bound” is the standard deviation (scale, must be greater than 0)
  • Lognormal, “lognorm” (natural logarithms, assumed lower bound of 0)
    • first “bound” is the ln-space mean
    • second “bound” is the ln-space standard deviation (must be greater than 0)
  • Uniform, “unif”
    • first “bound” is the lower bound
    • second “bound” is the upper bound (must be greater than lower bound)

Triangular and lognormal distributions with a non-zero lower bound can be obtained by adding the lower bound to the generated parameters before sending the input data to be evaluated by the model.
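For example, here is a minimal sketch of that shift. The problem definition below is hypothetical (separate from the example that follows) and simply illustrates adding a constant to the generated columns:

from SALib.sample import saltelli

# hypothetical two-parameter problem: a triangular and a lognormal variable,
# both of which are generated with a lower bound of zero
prob_shift = {
    'num_vars': 2,
    'names': ['T1', 'L1'],
    'bounds': [[1.0, 0.5], [0.0, 0.2]],
    'dists': ['triang', 'lognorm']
}
param_vals_shift = saltelli.sample(prob_shift, 1000, calc_second_order=False)

# shift each column by its desired (non-zero) lower bound before evaluating the model
param_vals_shift[:, 0] += 2.0   # triangular samples now span [2.0, 3.0]
param_vals_shift[:, 1] += 5.0   # lognormal samples now have a lower bound of 5.0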

Building on the same example as above, the problem dictionary and related analysis would be completed as follows.

# problem definition
prob_dists_code = {
    'num_vars': 6,
    'names': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
    'bounds': [[0.0, 1.0], [1.0, 0.75], [0.0, 0.2], [0.0, 0.2], [-1.0, 1.0], [1.0, 0.25]],
    'dists': ['unif', 'triang', 'norm', 'lognorm', 'unif', 'triang']
}

# generating parameter values
param_vals_dists_code = saltelli.sample(prob_dists_code, 10000, calc_second_order=True)

# calculating model output
Y_dists_code = sampleModel(param_vals_dists_code)

# completing Sobol' sensitivity analysis
Si_dists_code = sobol.analyze(prob_dists_code, Y_dists_code, calc_second_order=True, print_to_console=True)

The output from this analysis is given below, which is consistent with the format in previous versions of SALib.

Parameter S1 S1_conf ST ST_conf
P1 0.106313 0.030983 0.110114 0.003531
P2 0.037785 0.027335 0.085197 0.003743
P3 0.128797 0.029834 0.128702 0.003905
P4 0.034284 0.016997 0.034193 0.001141
P5 0.579715 0.071509 0.627896 0.017935
P6 0.062791 0.021743 0.065357 0.002221
Parameter_1 Parameter_2 S2 S2_conf
P1 P2 0.001783 0.060174
P1 P3 0.001892 0.060389
P1 P4 0.001753 0.060155
P1 P5 0.001740 0.062130
P1 P6 0.004774 0.060436
P2 P3 -0.003539 0.051611
P2 P4 -0.003500 0.051186
P2 P5 0.044591 0.054837
P2 P6 -0.003585 0.051388
P3 P4 -0.000562 0.058972
P3 P5 -0.000533 0.059584
P3 P6 -0.000480 0.059923
P4 P5 -0.000364 0.034382
P4 P6 -0.000191 0.034301
P5 P6 -0.001293 0.137576

Note 1: You can also use the read_param_file() function to define the problem. The one catch is when you want to use nonuniform distributions without grouping the variables. In this case the fourth column in the input file (column for ‘groups’) must be the parameter name repeated from the first column. For the above example the problem file would look like:

P1,0.0,1.0,P1,unif
P2,1.0,.75,P2,triang
P3,0.0,.2,P3,norm
P4,0.0,.2,P4,lognorm
P5,-1.0,1.0,P5,unif
P6,1.0,.25,P6,triang

Note 2: If you are unsure whether the distribution transformations yielded the desired results, especially since the ‘bounds’ are interpreted differently by each distribution, you can check by plotting histograms of the sampled data. The histograms of the data used in the example are shown below. (The data were actually saved to a .txt file and imported into R to plot these histograms, but matplotlib’s pyplot.hist() can produce the same plots directly in Python.)

[Figure: histograms of the sampled values for each of the six parameters]
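If you would rather do the check directly in Python, a quick sketch with matplotlib, reusing the param_vals_dists_code array and prob_dists_code dictionary from the example above, might look like this:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(10, 6))
for i, ax in enumerate(axes.flat):
    ax.hist(param_vals_dists_code[:, i], bins=50)
    ax.set_title(prob_dists_code['names'][i])
plt.tight_layout()
plt.show()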

References

Saltelli, Andrea (2002). Making best use of model evaluations to compute sensitivity indices. Computer Physics Communications 145(2):280-297. doi:10.1016/S0010-4655(02)00280-1

Importing, Exporting and Organizing Time Series Data in Python – Part 1


This blog post is Part 1 of a multi-part series of posts (see here for Part 2) intended to introduce options in Python available for (1) reading (importing) data (with particular emphasis on time series data and on how to handle Excel spreadsheets); (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.

In modeling water resources and environmental systems, we frequently must import and export large quantities of data (particularly time series data), both to run models and to process model outputs. I will describe the approaches and tools I have found to be most useful for managing data in these circumstances.

This blog post focuses on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets. Future posts will cover the pandas data management/export component.

Through the example that follows, I hope to convey two main lessons in this post:

  1. If you can store data, especially time series data, in text (.txt) or comma-separated value (.csv) files, the time required for importing and exporting data will be vastly reduced compared to attempting the same thing with an Excel workbook (.xlsx file). Hence, if you must work with Excel workbooks, try to save each worksheet as a separate .csv file and use that instead. A .csv file can still be viewed in the MS Excel interface, but is much faster to import. I’ll show below how much time this can save you.
  2. There are many ways to analyze the data once loaded into Python, but my suggestion is you take advantage of pandas, an open-source data analysis library in Python. This is why my example below makes use of built-in pandas features for importing data. (I will have some future posts with some useful pandas code for doing time series analysis/plotting. There are other previous pandas posts on our blog as well).

It is important at this point to note that Microsoft Excel files (.xlsx or .xls) are NOT an ideal means of storing data. I have had the great misfortune of working with Excel (and Visual Basic for Applications) intensively for many years, and it can be somewhat of a disaster for data management. I am using Excel only because the input data files for the model I work with are stored as .xlsx files, and because this is the data storage interface with which others using my Python-based model are most comfortable.

An Example:

Suppose that you wish to read in and analyze time series of 100 years of simulated daily reservoir releases from 2 reservoirs (“Reservoir_1” and “Reservoir_2”) for 100 separate simulations (or realizations). Let’s assume data for the reservoirs is stored in two separate files, either .csv or .xlsx files (e.g., “Reservoir_1.xlsx” and “Reservoir_2.xlsx”, or “Reservoir_1.csv” and “Reservoir_2.csv”).

The figure below shows a screenshot of an example layout of the file in either case (.csv or .xlsx). Feel free to make your own such file, with column names that correspond to the title of each simulation realization.

[Figure: screenshot of the example worksheet layout]

There are two import options for these data.

Option 1: Import your data directly into a dictionary of two pandas DataFrame objects, where the keys of the Python dictionary are the two reservoir names. Here is some code that does this. All you need to do is run this script in a directory where you also have your input files saved.


# Imports
import pandas as pd
from datetime import datetime
from datetime import timedelta
import time

start_time = time.time()  # Keep track of import time

start_date = datetime(1910, 1, 1)
simulation_dates = pd.date_range(start_date, start_date + timedelta(365*100 - 1))

# Specify File Type (Choices: (1) 'XLSX', or (2) 'CSV')
File_Type = 'CSV'

location_list = ['Reservoir_1', 'Reservoir_2']  # Names of data files
start = 0  # starting column of input file from which to import data
stop = 99  # ending column of input file from which to import data. Must not exceed number of columns in input file.

# Create a dictionary where keys are items in location_list, and content associated with each key is a pandas dataframe.
if File_Type == 'XLSX':
    extension = '.xlsx'
    # read the first worksheet of each workbook (assumes the data is on the first sheet)
    Excel_data_dictionary = {name: pd.read_excel(name + extension, sheet_name=0,
                                                 usecols=[i for i in range(start, stop)])
                             for name in location_list}
    # Reset indices as desired dates
    for key in Excel_data_dictionary:
        Excel_data_dictionary[key] = Excel_data_dictionary[key].set_index(simulation_dates)
    print('XLSX Data Import Complete in %s seconds' % (time.time() - start_time))
elif File_Type == 'CSV':
    extension = '.csv'
    CSV_data_dictionary = {name: pd.read_csv(name + extension, usecols=[i for i in range(start, stop)])
                           for name in location_list}
    # Reset indices as desired dates
    for key in CSV_data_dictionary:
        CSV_data_dictionary[key] = CSV_data_dictionary[key].set_index(simulation_dates)
    print('CSV Data Import Complete in %s seconds' % (time.time() - start_time))

When I ran this code, the .csv data import took about 1.5 seconds on my office desktop computer (not bad), while the .xlsx import took about 100 seconds (awful!). This is why you should avoid storing data in Excel workbooks.

You can visualize the pandas DataFrame object per the image below, with a major axis that corresponds to dates, and a minor axis that corresponds to each simulation run (or realization). Note that I added code that manually sets the major axis index to date values, as my input sheet had no actual dates listed in it.

[Figure: structure of the pandas DataFrame, with dates along the major axis and realizations along the minor axis]

Now, if I want to see the data for a particular realization or date, pandas makes this easy. For example, the following would access the DataFrame column corresponding to realization 8 for Reservoir_1:

CSV_data_dictionary['Reservoir_1']['Realization8']
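Since the index was set to dates, you can also slice by date with .loc. For example (a small sketch reusing the names from the code above; the specific dates are arbitrary):

# Releases for realization 8 at Reservoir_1 during January 1950
CSV_data_dictionary['Reservoir_1'].loc['1950-01-01':'1950-01-31', 'Realization8']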

Option 2: If you must work with .xlsx files and need to import specific sheets and cell ranges from those files, I recommend that you use the open-source Python library OpenPyXL. (There are other options, some of which are reviewed here). OpenPyXL is nice because it can both read and write from and to Excel, it can loop through worksheets and cells within those worksheets, and does not require that you have Microsoft Office if you are working in Windows. Indeed it does not require Windows at all (it works well on Linux, though I have not tried it out in OSX). OpenPyXL documentation is available here, and includes numerous examples. This page also provides some nice examples of using OpenPyXL.
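As a quick illustration of the kind of thing OpenPyXL can do, here is a minimal sketch of reading a block of cells from a worksheet. The file name, sheet name, and cell range below are placeholders, not references to the model files discussed above:

from openpyxl import load_workbook

# data_only=True returns cached cell values rather than the formulas themselves
wb = load_workbook('Reservoir_1.xlsx', data_only=True)
ws = wb['Sheet1']   # or wb.active for the first worksheet

# loop through a rectangular cell range and collect the values row by row
values = []
for row in ws['A2':'J11']:
    values.append([cell.value for cell in row])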

Some additional tips regarding OpenPyXL:

  • Make sure you are working with the latest version. If you install openpyxl through an IDE, be sure it’s updated to at least version 2.2.4 or 2.2.5. I’ve had problems with earlier versions.
  • I believe OpenPyXL only works with .xlsx files, not .xls files.

Synthetic Weather Generation: Part III

Multi-Site Weather Generators

This is the third part of a blog series on synthetic weather generators. In Parts I and II, I described common parametric and non-parametric methods of weather generation, respectively. Both of these blogs only discussed how to use these methods to generate daily weather simulations at one site. However, we are often interested in generating weather at multiple sites simultaneously, for instance, if we are using the weather as input to a distributed watershed model. If there is a high degree of spatial heterogeneity within the system being modeled, applying average weather conditions everywhere will not provide representative simulations of the hydrology, nor will independently generating weather at multiple sites within the system. Fortunately, several techniques have been developed to correlate synthetic weather sequences generated at multiple nearby sites that are consistent with observed data.

This blog will focus on two methods of generating correlated weather data: 1) driving multiple single-site parametric weather generators with spatially correlated random numbers (Wilks, 1998; Wilks 1999; Wilks, 2008; Wilks, 2009; Srikanthan and Pegram, 2009; Baigorria and Jones, 2010) and 2) resampling concurrent weather data from multiple sites with a non-parametric weather generator (Buishand and Brandsma, 2001; Beersma and Buishand, 2003; Apipattanavis et al., 2007; Steinschneider and Brown, 2013). Other methods exist, such as jointly sampling weather at multiple sites from a copula (Bardossy and Pegram, 2009; Serinaldi, 2009, Aghakouchak et al., 2010a; Aghakouchak et al., 2010b), modeling rainfall occurrence and amounts processes with generalized linear models (GLMs) (Kigobe et al., 2011), and disaggregating spatially averaged synthetic rainfall with a nonhomogeneous random cascade process (Jothityangkoon et al., 2000), but will not be discussed here.

Method 1

The first method, driving multiple single-site weather generators with spatially correlated random numbers, is described by Wilks (1998) for use with parametric generators of the Richardson type described in Part I. In the single-site formulation of this model, the first step is to generate precipitation occurrences at each site. Normally, this is done by generating a uniform random number, u ~ U[0,1], and comparing it to the critical probability, pc, of transitioning from the current state (wet or dry) to a wet state. If u ≤ pc, the state transitions to a wet state; otherwise it transitions to a dry state.

An alternative to generating uniform random numbers for the transition computation is to generate standard normal random numbers, w ~ N(0,1). In this case, the state transitions from the current state to a wet state if w ≤ Φ⁻¹(pc), where Φ⁻¹(·) is the inverse of the standard normal CDF. The benefit of using this alternative is that it is easy to generate correlated normally distributed random variables (Wilks, 2011, pp. 499-500), and we will need to correlate the random number streams generated at each of K sites to ensure realistic weather simulations.

Let Xt(k) and Xt(l) be the precipitation states on day t at sites k and l, respectively, where X = 0 corresponds to a dry day and X = 1 corresponds to a wet day. In order to simulate realistic sequences of precipitation states at these two sites, one must find the correlation between the normally distributed random variables driving the transitions, ωo(k,l) = ρ(wt(k), wt(l)), that reproduces the historical correlation between the precipitation states, ξo(k,l) = ρ(Xt(k), Xt(l)). Unfortunately, this relationship must be calculated empirically by simulating precipitation states at two sites for different values of ω and calculating the corresponding values of ξ. This must be done for every one of the K(K-1)/2 station pairs. Figure 1 shows one such relationship for the July transition parameters at Ithaca and Freeville from Wilks (1998).

[Figure 1: relationship between ω and ξ for the July transition parameters at Ithaca and Freeville, from Wilks (1998)]

Once the relationship between ξ and ω has been mapped for each station pair, one can determine the correlation ωo(k,l) that should be used for the weather generator. In the case of Ithaca and Freeville, the historical correlation in precipitation states between the two sites, ξo, is 0.800. The dotted line in Figure 1 shows that ξo = 0.8 corresponds to a value of ωo = 0.957. Once all of the K(K-1)/2 pairs of correlations ωo(k,l) have been determined, one can generate a daily vector of standard normal random variables, wt, from a multi-variate normal distribution with mean vector 0 and correlation matrix [Ω], whose elements are the correlations ωo(k,l). One drawback to this method is that, while it does a good job replicating the lag 0 correlations between sites, it tends to under-represent lag 1 correlations between them (Wilks, 1998).
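To illustrate, here is a rough numpy/scipy sketch of how the ω–ξ relationship for a single station pair could be mapped empirically and inverted for a target ξo. This is not code from Wilks (1998); the transition probabilities below are made-up values, and the critical probability pc is assumed to depend only on whether the previous day was wet or dry:

import numpy as np
from scipy.stats import norm

def simulated_xi(omega, p01, p11, n_days=50000, seed=1):
    """Correlation of precipitation states at two sites whose transitions are
    driven by standard normal variates with correlation omega."""
    rng = np.random.default_rng(seed)
    w = rng.multivariate_normal([0.0, 0.0], [[1.0, omega], [omega, 1.0]], size=n_days)
    X = np.zeros((n_days, 2), dtype=int)
    for t in range(1, n_days):
        for k in range(2):
            pc = p11[k] if X[t - 1, k] == 1 else p01[k]    # critical probability
            X[t, k] = 1 if w[t, k] <= norm.ppf(pc) else 0  # wet if w <= inverse-CDF(pc)
    return np.corrcoef(X[:, 0], X[:, 1])[0, 1]

# made-up dry-to-wet (p01) and wet-to-wet (p11) transition probabilities
p01, p11 = [0.3, 0.35], [0.6, 0.65]
omegas = np.linspace(0.1, 0.99, 15)
xis = [simulated_xi(om, p01, p11) for om in omegas]

# pick the omega that reproduces a target historical state correlation xi_o
xi_o = 0.8
omega_o = np.interp(xi_o, xis, omegas)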

As in the single site weather generator, after simulating whether or not rain occurs at a particular site each day, the next step is to determine the depth of rain on wet days. Not surprisingly, the conditional distribution of rainfall amounts at a particular site depends on whether or not it is raining at nearby sites. A convenient way to accommodate this is to generate the precipitation amounts at wet sites from the mixed exponential distribution (Wilks, 1998).

Recall that the mixed exponential distribution, f(x) = αλ1exp(-λ1x) + (1–α) λ2exp(-λ2x), has three parameters: two rate parameters, λ1 and λ2 where λ1 < λ2, and a weight parameter, α. Thus, with probability α the rainfall amounts are generated from an exponential distribution with rate λ1, and with probability 1–α they are generated from an exponential distribution with rate λ2. Since the mean of the exponential distribution is 1/λ and λ1 < λ2, rainfall amounts generated from an exponential distribution with rate λ1 will have a greater mean than those generated from an exponential distribution with rate λ2. Therefore, rainfall amounts at sites neighboring dry sites are more likely to have come from an exponential distribution with rate λ2, while rainfall amounts at wet sites neighboring other wet sites are more likely to have come from an exponential distribution with rate λ1.

Keeping this in mind, one can conditionally generate rainfall amounts at a given site, k, from one of these two exponential distributions. As stated earlier, if w(k) ≤ Φ⁻¹(pc(k)), or equivalently if Φ(w(k)) ≤ pc(k), then rainfall is generated at site k. Because the random variables driving the transitions at nearby sites are correlated, if Φ(w(k)) << pc(k), then it is very likely that nearby sites also transitioned to a wet state, so rainfall amounts are likely high. Therefore, if Φ(w(k))/pc(k) ≤ α(k), rainfall at the site should be generated from the exponential distribution with the greater mean, that with rate λ1. On the contrary, if Φ(w(k)) is less than, but close to, pc(k), it is very likely that nearby sites did not transition to a wet state, so rainfall amounts are likely lower. Therefore, if Φ(w(k))/pc(k) > α(k), rainfall at the site should be generated from the exponential distribution with the smaller mean, that with rate λ2 (Wilks, 1998).

This is a coarse method that can be used to decide on the rate parameter for rainfall generation; Wilks (1998) also describes a method to continuously vary the smaller rate parameter to improve the simulation of precipitation fields. Regardless of which method is chosen, once one has decided on the rate parameter, the rainfall amount itself can be generated from the exponential distribution with this parameter. Just as the precipitation state can be determined by generation of a standard normal random variable, so too can the precipitation amount. Letting the rainfall amount at a particular site k equal r(k), the depth of precipitation is simulated as r(k) = rmin − (1/λ)·ln(Φ(v(k))), where rmin is the minimum threshold below which a day is recorded as dry and λ is the rate parameter determined earlier.
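A compact sketch of the amount-generation rule just described (this is illustrative, not Wilks’ code; the parameter values would come from fitting the mixed exponential at each site):

import numpy as np
from scipy.stats import norm

def rainfall_depth(w, v, p_c, alpha, lam1, lam2, r_min=0.1):
    """Depth at one site given its occurrence variate w and amount variate v."""
    if norm.cdf(w) > p_c:                     # no transition to a wet state
        return 0.0
    # small Phi(w)/p_c -> neighbors likely wet -> larger-mean component (rate lam1)
    lam = lam1 if norm.cdf(w) / p_c <= alpha else lam2
    return r_min - np.log(norm.cdf(v)) / lam  # r = r_min - (1/lam) * ln(Phi(v))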

The final step in precipitation simulation then is to generate a daily vector of correlated standard normal random variables, vt, that reproduces the historical spatial correlation structure in precipitation amounts across sites. This procedure is analogous to that used to reproduce the historical spatial correlation structure in precipitation occurrences across sites. However, Wilks (1998) found that this procedure did not always produce positive definite correlation matrices [Z] for generation of the multivariate normal random variables, vt. To overcome this problem, the correlations ζo(k,l) making up [Z] can be determined as smooth functions of the distance between station pairs. See Wilks (1998) for further details.

After generating spatially correlated precipitation, one must then generate spatially correlated minimum temperature, maximum temperature and solar radiation. Recall from Part I that in single site weather generators of the Richardson type, these three variables are generated from a first order vector auto-regressive (VAR(1)) model. As Wilks (1999) points out, this method can easily be extended by fitting a VAR(1) model to the simultaneous minimum temperature, maximum temperature and solar radiation across all sites, rather than just at one site. This increases the dimensionality from 3 for the 3 variables at one site, to 3K for the 3 variables at K sites. However, Wilks cautions that this may be difficult due to limited solar radiation data at many sites and inconsistent correlations if the record lengths are different at the K sites. See Wilks (1999) for details on how one can adjust the observed historical correlations such that they are mutually consistent.

The method just described will generate weather at multiple sites with long enough historical records to fit the weather generator parameters. One can also interpolate these parameters using locally weighted regressions (Wilks, 2008) to generate weather at arbitrary locations between stations with historical weather records. Wilks (2009) shows how to do just that to generate gridded weather data, which is particularly useful for driving distributed hydrologic models.

Method 2

The second method, resampling concurrent weather data from multiple sites with a non-parametric weather generator, is a simpler method and a natural extension of the non-parametric single site weather generator. As discussed in Part II, most non-parametric weather generators probabilistically re-sample weather data from the historical record based on how “near” the most recently generated daily weather characteristics are to previous days in the historical record. “Nearness” is typically measured using Euclidean distance.

Buishand and Brandsma (2001) modify the single-site approach to generate weather at multiple sites by describing nearness in terms of Euclidean distance of spatially averaged temperature and precipitation standardized anomalies. The simulation begins with a randomly selected day from the historical record. The temperature and precipitation on that day are standardized by subtracting the historical daily mean and dividing by the historical daily standard deviation. The standardized temperature and precipitation anomalies at each site are then averaged and compared to the spatially averaged standardized anomalies on all other historical days within some window of the current day using Euclidean distance. One of the k nearest neighbors (k-nn) is then probabilistically selected using the Kernel density estimator derived by Lall and Sharma (1996), and the weather at all sites on the following day in the historical record is chosen to be the weather on the next simulated day.
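As a rough illustration of this resampling step, here is a numpy sketch (not the authors’ code) under the assumptions noted in the comments:

import numpy as np

def knn_resample_day(sim_anom, hist_anom, k=None, rng=None):
    """Pick a historical day index whose spatially averaged standardized
    anomalies (temperature, precipitation) are near the current simulated day."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(hist_anom)
    k = k if k is not None else int(round(np.sqrt(n)))   # common heuristic for k
    dist = np.linalg.norm(hist_anom - sim_anom, axis=1)  # Euclidean distances
    nearest = np.argsort(dist)[:k]                       # k nearest historical days
    weights = 1.0 / np.arange(1, k + 1)                  # Lall and Sharma (1996) kernel
    weights /= weights.sum()
    return rng.choice(nearest, p=weights)

# the weather at all sites on the day *after* the selected index then becomes
# the weather on the next simulated day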

While this non-parametric method of multi-site weather generation is much simpler than the parametric method of Wilks (1998), Mehrotra et al. (2006) found that the Wilks multi-site weather generator was better able to accommodate varying orders of serial dependence at each site and still maintain the historical correlation structure across sites. Similarly, Young (1994) found that the k-nn weather generation approach tended to under-simulate the lengths of wet and dry spells. For this reason, Apipattanavis et al. (2007) adopt a semi-parametric multi-site weather generation approach that combines elements of each method.

The Apipattanavis generator, also employed by Steinschneider and Brown (2013), first generates spatially averaged precipitation states with a Markov chain model. To begin the simulation, the historical precipitation record is first converted to a time series of spatially averaged precipitation. If the spatially averaged rainfall on the first simulated day is below 0.3 mm, it is classified as a dry day. If it is greater than the 80th percentile of spatially averaged daily precipitation in the simulated month, it is classified as extremely wet. Otherwise it is simply wet. A three-state, first-order Markov chain is then used to simulate the next day’s spatially averaged precipitation state. One of the k-nearest neighbors from the historical record with the same transition (e.g. wet to extremely wet) is selected as in Buishand and Brandsma (2001). The weather at each site on the next simulated day is then chosen to be the weather at each site on the day following the selected neighbor. The main difference between these two methods then, is that Buishand and Brandsma (2001) consider all historical neighbors within some window of the simulated day, while Apipattanavis et al. (2007) only consider those with the same transition in precipitation states as is simulated.
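A small sketch of the precipitation-state step of this semi-parametric approach is below. The thresholds follow the text, but computing the 80th percentile over wet days in each month is my assumption, and the code is only illustrative:

import numpy as np

def classify_states(avg_precip, months, dry_thresh=0.3, pct=80):
    """0 = dry, 1 = wet, 2 = extremely wet, classified per calendar month."""
    states = np.ones(len(avg_precip), dtype=int)
    states[avg_precip < dry_thresh] = 0
    for m in np.unique(months):
        wet = (months == m) & (states != 0)
        cutoff = np.percentile(avg_precip[wet], pct)
        states[wet & (avg_precip > cutoff)] = 2
    return states

def fit_transition_matrix(states, n_states=3):
    """First-order Markov chain transition probabilities between the three states."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)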

All of the above described methods can reasonably reproduce the spatial and temporal dependence structure of weather at multiple sites, while maintaining the moments at each site. Additionally, they can be modified conditional on short-term, seasonal or long-term forecasts of climate change. That will be the subject of the next blog in this five-part series.

Works Cited

Aghakouchak, A., Bárdossy, A. & Habib, E. (2010a). Conditional simulation of remotely sensed rainfall data using a non-Gaussian v-transformed copula. Advances in Water Resources, 33, 624-634.

—. (2010b). Copula-based uncertainty modelling: application to multisensor precipitation estimates. Hydrological Processes, 24, 2111-2124.

Apipattanavis, S., Podesta, G., Rajagopalan, B., & Katz, R. W. (2007). A semiparametric multivariate and multisite weather generator. Water Resources Research, 43(11).

Baigorria, G. A., & Jones, J. W. (2010). GiST: A stochastic model for generating spatially and temporally correlated daily rainfall data. Journal of Climate, 23(22), 5990-6008.

Bárdossy, A., & Pegram, G. G. S. (2009). Copula based multisite model for daily precipitation simulation. Hydrology and Earth System Sciences, 13, 2299-2314.

Beersma, J. J., & Buishand, T. A. (2003). Multi-site simulation of daily precipitation and temperature conditional on the atmospheric circulation. Climate Research, 25, 121-133.

Buishand, T. A., & Brandsma, T. (2001). Multisite simulation of daily precipitation and temperature in the Rhine basin by nearest-neighbor resampling. Water Resources Research, 37(11), 2761-2776.

Jothityangkoon, C., Sivapalan, M., & Viney, N. R. (2000). Tests of a space-time model of daily rainfall in southwestern Australia based on nonhomogeneous random cascades. Water Resources Research, 36(1), 267-284.

Kigobe, M., McIntyre, N., Wheater, H., & Chandler, R. (2011). Multi-site stochastic modelling of daily rainfall in Uganda. Hydrological Sciences Journal, 56(1), 17-33.

Lall, U., & Sharma, A. (1996). A nearest neighbor bootstrap for resampling hydrologic time series. Water Resources Research, 32(3), 679-693.

Mehrotra, R., Srikanthan, R., & Sharma, A. (2006). A comparison of three stochastic multi-site precipitation occurrence generators. Journal of Hydrology, 331(1-2), 280-292.

Richardson, C. W. (1981). Stochastic simulation of daily precipitation, temperature and solar radiation. Water Resources Research, 17, 182-190.

Serinaldi, F. (2009). A multisite daily rainfall generator driven by bivariate copula-based mixed distributions. Journal of Geophysical Research: Atmospheres, 114(D10).

Srikanthan, R., & Pegram, G. G. S. (2009). A nested multisite daily rainfall stochastic generation model. Journal of Hydrology, 371, 142-153.

Steinschneider, S., & Brown, C. (2013). A semiparametric multivariate, multisite weather generator with low-frequency variability for use in climate risk assessments. Water Resources Research, 49, 7205-7220.

Wilks, D. S. (1998). Multisite generalization of a daily stochastic precipitation generation model. Journal of Hydrology, 210, 178-191.

Wilks, D. S. (1999). Simultaneous stochastic simulation of daily precipitation, temperature and solar radiation at multiple sites in complex terrain. Agricultural and Forest Meteorology, 96, 85-101.

Wilks, D. S. (2008). High-resolution spatial interpolation of weather generator parameters using local weighted regressions. Agricultural and Forest Meteorology, 148, 111-120.

Wilks, D. S. (2009). A gridded multisite weather generator and synchronization to observed weather data. Water Resources Research, 45(10).

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences (Vol. 100). Academic press.

Young, K. C. (1994). A multivariate chain model for simulating climatic parameters from daily data. Journal of Applied Meteorology, 33(6), 661-671.

Collecting Borg’s operator dynamics

Once you have collected your Borg runtime dynamics, you can also retrieve Borg’s operator dynamics using the MOEAFramework by typing the following command:


java -cp MOEAFramework-2.4-Executable.jar org.moeaframework.analysis.sensitivity.ExtractData -d 6 -i myproblem.runtime -o operator_dynamics.txt -s "," NFE ElapsedTime SBX DE PCX SPX UNDX UM


where -d is the dimension of your problem (the number of objectives), -i stands for input (here you indicate the name of your runtime output file), -o is the output file name, and -s is the delimiter for the output file. After these flags, you may list the operators that you want to collect, separated by spaces. This is a snippet of what I obtained after running the previous command:

[Figure: operators.png — snippet of the operator dynamics output]
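If you would like to visualize these operator probabilities over the run, a quick pandas/matplotlib sketch might look like the following. I am assuming the first row of operator_dynamics.txt holds the field names listed in the command above; adjust the read_csv arguments if your file differs:

import pandas as pd
import matplotlib.pyplot as plt

ops = ['SBX', 'DE', 'PCX', 'SPX', 'UNDX', 'UM']
df = pd.read_csv('operator_dynamics.txt', sep=',')

for op in ops:
    plt.plot(df['NFE'], df[op], label=op)
plt.xlabel('NFE')
plt.ylabel('Operator probability')
plt.legend()
plt.show()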

Survival Function Plots in R


A survival function (aka survivor function or reliability function) is a function often used in risk management for visualizing system failure points. For example, it can be used to show the frequency of a coastal defense structure failure (such as a breach in a levee) in a future state of the world.

The function itself is quite simple. For a distribution of events, the survival function (SF) is 1-CDF where CDF is the cumulative distribution function. If you’re deriving the distribution empirically, you can substitute the CDF with the cumulative frequency. It is often plotted on a semi-log scale which makes tail-area analysis easier.

I’ve written some R code that creates a primitive survival function plot from a vector of data. Below is the function. (Note: you can find the code and an example of its usage on Bitbucket: https://bitbucket.org/ggg121/r_survival_function.git)

plot.sf <- function(x, xlab=deparse(substitute(x)), left.tail=F,
  ylab=ifelse(left.tail, "SF [Cum. Freq.]", "SF  [1 - Cum. Freq.]"),
  make.plot=T, ...)
{
  num.x <- length(x)
  num.ytics <- floor(log10(num.x))
  sf <- seq(1,1/num.x,by=-1/num.x)
  
  if(left.tail){
    order.x <- order(x, decreasing=T)
    order.sf <- sf[order(order.x)]
    
  }  else {
    order.x <- order(x)
    order.sf <- sf[order(order.x)]
  }
  
  if(make.plot) {
    plot(x[order.x], sf, log="y", xlab=xlab, ylab=ylab, yaxt="n", ...)
    axis(2, at=10^(-num.ytics:0), 
         label=parse(text=paste("10^", -num.ytics:0, sep="")), las=1)
  }
  invisible(order.sf)
}

Download and source the code at the start of your R script and you’re good to go. The function, by default, creates a plot in the current plotting device and invisibly returns the survival function values corresponding to the vector of data provided. The parameter left.tail sets the focus on the left tail of the distribution (essentially plotting the CDF on a semi-log scale). By default, the function puts the focus on the right tail (left.tail = FALSE). The make.plot parameter allows you to toggle plotting of the survival function (default is on, or make.plot = TRUE). This is useful when you simply need the survival function values for further calculations or custom plots. Additional parameters are passed to the plot() function. Below is an example (which is also available in the repository).

# Source the function
source("plot_sf.r")

# Set the seed
set.seed(1234)

# Generate some data to use
my.norm <- rnorm(10000, 10, 2)
my.unif <- runif(10000)
my.weib <- rweibull(10000, 20, 5)
my.lnorm <- rlnorm(10000, 1, 0.5)


# Make the plots ----------------------
par(mfrow=c(2,2), mar=c(5,4,1,1)+0.1)

# Default plot settings
plot.sf(my.norm)

# Function wraps the standard "plot" function, so you can pass
# the standard "plot" parameters to the function
plot.sf(my.unif, type="l", lwd=2, col="blue", bty="l",
        ylab="Survival", xlab="Uniform Distribution")

# If the parameter "left.tail" is true, the plot turns into 
# a cumulative frequency plot (kind of like a CDF) that's plotted
# on a log scale.  This is good for when your data exhibits a left or
# negative skew.
plot.sf(my.weib, type="l", left.tail=T, xlab="Left-tailed Weibull Dist.")

# The function invisibly returns the survival function value.
lnorm.sf <- plot.sf(my.lnorm, type="l")
points(my.lnorm, lnorm.sf, col="red")
legend("topright", bty="n", 
       legend=c("Function Call", "Using returned values"), 
       lty=c(1,NA), pch=c(NA,1), col=c("black", "red") )

# The 'make.plot' parameter toggles plotting.
# Useful if you just want the survival function values.
norm.sf <- plot.sf(my.norm, make.plot=F)

And here’s the resulting figure from this example:

[Figure: survival function plots for the normal, uniform, Weibull, and lognormal examples]

Now you can easily show tail-area frequencies of events. For example, below is a survival function plot of a normal distribution:

[Figure: survival function plot of a normal distribution]

For this example, we can imagine this as a distribution of flood heights (x-axis would be flood height – note that a real distribution of flood heights would likely look drastically different from a normal distribution). With this visualization, we can easily depict the “1 in 10” or the “1 in 1,000” flood height by following the appropriate survival function value over to the corresponding flood height on the plot. Alternatively, you can determine the return period of a given flood height by following the flood height up to the plot and reading off the survival function value. Comparing multiple distributions together on a single plot (think deep uncertainty) can produce interesting decision-relevant discussion about changes in return periods for a given event or the range of possible events for a given return period.

I hope this post is useful. Survival function plots are incredibly versatile and informative…and I’ve only started to scratch the surface!