Evaluating and visualizing sampling quality

State sampling is a necessary step for any computational experiment, and the way sampling is carried out will influence the experiment’s results. This is the case for instance, for sensitivity analysis (i.e., the analysis of model output sensitivity to values of the input variables). The popular method of Sobol’ (Sobol’, 2001) relies on tailor-made sampling techniques that have been perfected through time (e.g., Joe and Kuo, 2008; Saltelli et al., 2010). Likewise, the method of Morris (Morris, 1991), less computationally demanding than Sobol’s (Herman et al., 2013) and used for screening (i.e., understanding which are the inputs that most influence outputs), relies on specific sampling techniques (Morris, 1991; Campolongo et al., 2007).

But what makes a good sample, and how can we understand the strengths and weaknesses of the sampling techniques (and also of the associated sensitivity techniques we are using) through quick visualization of some associated metrics?

This post aims to answer this question. It will first look at what makes a good sample using some examples from a sampling technique called latin hypercube sampling. Then it will show some handy visualization tools for quickly testing and visualizing a sample.

What makes a good sample?

Intuitively, the first criterion for a good sample is how well it covers the space from which to sample. The difficulty though, is how we define “how well” it practice, and the implications that has.

Let us take an example. A quick and popular way to generate a sample that covers the space fairly well is latin hypercube sampling (LHS; McKay et al., 1979). This algorithm relies on the following steps for drawing N samples from a hypercube-shaped of dimension p.:

1) Divide each dimension of the space in N equiprobabilistic bins. If we want uniform sampling, each bin will have the same length. Number bins from 1 to N each dimension.

2) Randomly draw points such that you have exactly one in each bin in each dimension.

For instance, for 6 points in 2 dimensions, this is a possible sample (points are selected randomly in each square labelled A to F):

It is easy to see that by definition, LHS has a good space coverage when projected on each individual axis. But space coverage in multiple dimensions all depends on the luck of the draw. Indeed, this is also a perfectly valid LHS configuration:

In the above configuration, it is easy to see that on top of poor space coverage, correlation between the sampled values along both axes is also a huge issue. For instance, if output values are hugely dependent on values of input 1, there will be large variations of the output values as values of input 2 change, regardless of the real impact of input 2 on the output.

Therefore, there are two kinds of issues to look at. One is correlation between sampled values of the input variables. We’ll look at it first because it is pretty straightforward. Then we’ll look at space coverage metrics, which are more numerous, do not look exactly at the same things, and can be sometimes conflicting. In fact, it is illuminating to see that sample quality metrics sometimes trade-off with one another, and several authors have turned to multi-objective optimization to come up with Pareto-optimal sample designs (e.g., Cioppa and Lucas, 2007; De Rainville et al., 2012).

One can look at authors such as Sheikholeslami and Razavi (2017) who summarize similar sets of variables. The goal there is not to write a summary of summaries but rather to give a sense that there is a relationship between which indicators of sampling quality matter, which sampling strategy to use, and what we want to do.

In what follows we note $x_{k,i}$ the kth  sampled value of input variable i, with $1\leq k \leq N$ and $1\leq i \leq p$.

Correlation

Sample correlation is usually measured through the Pearson statistic. For inputs variables i and j among the p input variables, we note $x_{k,i}$ and $x_{k,j}$ the values of these variables i and j in sample k $(1\leq k \leq N)$ have:

$\rho_{ij} = \frac{\sum_{k=1}^N (x_{k,i}-\bar{x}_i)(x_{k,j}-\bar{x}_j)}{\sqrt{\sum_{k=1}^N (x_{k,i}-\bar{x}_i)^2 \sum_{k=1}^N (x_{k,j}-\bar{x}_j)^2}}$

In the above equation, ${\bar{x}_i}$  and ${\bar{x}_j}$ are the average sampled values of inputs i and j .

Then, the indicator of sample quality looks at the maximal level of correlation across all variables:

$\rho_{\max} = \max_{1\leq i \leq j \leq N} |\rho_{ij}|$

This definition relies on the remark that $\rho_{ij} = \rho_{ji}$.

Space Coverage

There are different measures of space coverage.

We are best equipped to visualize space coverage via 1D or 2D projections of a sample. In 1D a measure of space coverage is by dividing each dimension in N equiprobable bins, and count the fraction of bins that have at least a point. Since N is the sample size, this measure is maximized when there is exactly one point in each bin — it is a measure that LHS maximizes.

Other measures of space coverage consider all dimension at once. A straightforward measure of space filling is the minimum Euclidean distance between two sampled points X in the generated ensemble:

$D = \min_{1\leq k \leq m \leq N} \left\{ d(\textbf{X}^k, \textbf{X}^m) \right\}$

Other indicators measure discrepancy which is a concept closely related to space coverage. In simple terms, a low discrepancy means that when we look at a subset of a sampled input space, its volume is roughly proportional to the number of points that are in it. In other words, there is no large subset with relatively few sampled points, and there is no small subset with a relatively large density of sampled points. A low discrepancy is desirable and in fact, Sobol’ sequences that form the basis of the Sobol’ sensitivity analysis method, are meant to minimize discrepancy.

Sample visualization

The figures that follow can be easily reproduced by cloning a little repository SampleVis I put together, and by entering on the command line python readme.py &amp;&gt; output.txt. That Python routine can be used with both latin hypercube and Sobol’ sampling (using the SAlib sampling tool; SAlib is a Python library developed primarily by Jon Herman and Will Usher, and which is extensively discussed in this blog.)

In what follows I give examples using a random draw of latin hypercube sampling with 100 members and 7 sampled variables.

Correlation

No luck, there is statistically significant pairwise correlation between in three pairs of variables: x1 and x4, x4 and x6, and x5 and x6. Using LHS, it can take some time to be lucky enough until the drawn sample is correlation-free (alternatively, methods to minimize correlations have been extensively researched over the years, though no “silver bullet” really emerges).

This means any inference that works for both variables in any of these pairs may be suspect. The SampleVis toolbox contains also tools to plot whether these correlations are positive or negative.

Space coverage

The toolbox enables to plot several indicators of space coverage, assuming that the sampled space is the unit hypercube of dimension p (p=7 in this example). It computes discrepancy and minimal distance indicators. Ironically, my random LHS with 7 variables and 100 members has a better discrepancy (here I use an indicator called L2-star discrepancy) than a Sobol’ sequence with as many variables and members. The minimal Euclidean distance as well is better than for Sobol’ (0.330 vs. 0.348). This means that if for our experiment, space coverage is more important than correlation, the drawn LHS is pretty good.

To better grasp how well points cover the whole space, it is interesting to plot the distance of the point that is closest to each point, and to represent that in growing order:

This means that some points are not evenly spaced, and some are more isolated than others. When dealing with a limited number of variables, it can also be interesting to visualize 2D projections of the sample, like this one:

This again goes to show that the sample is pretty-well distributed in space. We can compare with the same diagram for a Sobol’ sampling with 100 members and 7 variables:

It is pretty clear that the deterministic nature of Sobol’ sampling, for so few points, leaves more systematic holes in the sampled space. Of course, this sample is too small for any serious Sobol’ sensitivity analysis, and holes are plugged by a larger sample. But again, this comparison is a visual heuristic that tells a similar story as the global coverage indicator: this LHS draw is pretty good when it comes to coverage.

References

Campolongo, F., Cariboni, J. & Saltelli, A. (2007). An effective screening design for sensitivity analysis of large models. Environmental Modelling & Software, 22, 1509 – 1518.

Cioppa, T. M. & Lucas, T. W. (2007). Efficient Nearly Orthogonal and Space-Filling Latin Hypercubes. Technometrics, 49, 45-55.

De Rainville, F.-M., Gagné, C., Teytaud, O. & Laurendeau, D. (2012). Evolutionary Optimization of Low-discrepancy Sequences. ACM Trans. Model. Comput. Simul., ACM, 22, 9:1-9:25.

Herman, J. D., Kollat, J. B., Reed, P. M. & Wagener, T. (2013). Technical Note: Method of Morris effectively reduces the computational demands of global sensitivity analysis for distributed watershed models. Hydrology and Earth System Sciences, 17, 2893-2903.

Joe, S. & Kuo, F. (2008). Constructing Sobol Sequences with Better Two-Dimensional Projections. SIAM Journal on Scientific Computing, 30, 2635-2654.

McKay, M.D., Beckman R.J. & Conover, W.J. (1979).A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics, 21(2), 239-245.

Morris, M. D. (1991). Factorial Sampling Plans for Preliminary Computational Experiments. Technometrics, 33, 161-174.

Saltelli, A., Annoni, P., Azzini, I., Campolongo, F., Ratto, M. & Tarantola, S. (2010). Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Computer Physics Communications, 181, 259 – 270.

Sheikholeslami, R. & Razavi, S. (2017). Progressive Latin Hypercube Sampling: An efficient approach for robust sampling-based analysis of environmental models. Environmental Modelling & Software, 93, 109 – 126.

Sobol’, I. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55, 271 – 280.